j96w / DexCap

[RSS 2024] "DexCap: Scalable and Portable Mocap Data Collection System for Dexterous Manipulation" code repository
MIT License
216 stars · 24 forks

The gap is pretty large when validating with training data #3

Closed · oym1994 closed this issue 3 months ago

oym1994 commented 5 months ago

Hi, thanks for your great work! I have trained this model on my own dataset (only a third-view RGBD camera, an arm, and a gripper; 100 demos). I followed the steps and trained with batch_size=2000 and steps=500; the final loss is about 0.014. Then I fed the processed training data (using the same training data loader as in train.py) back into the model to check its performance, and unfortunately the gap between predicted and ground-truth actions is pretty large. Can you give some advice on how to figure out the problem? What final loss or other training results did you get?

Here is the code used for this testing:

import os
import numpy as np
import torch
import robomimic.utils.file_utils as FileUtils
import robomimic.utils.obs_utils as ObsUtils

def load_model(self, config, infer_device):

    # Resolve the algorithm name and the raw checkpoint dict from the saved file
    algo_name, ckpt_dict = FileUtils.algo_name_from_checkpoint(ckpt_path=self._checkpoint_path)
    if self._args.dp_eval_steps is not None:
        # Patch the number of diffusion inference steps stored in the checkpoint config
        tmp_config, _ = FileUtils.config_from_checkpoint(ckpt_dict=ckpt_dict)
        with tmp_config.values_unlocked():
            if tmp_config.algo.ddpm.enabled:
                tmp_config.algo.ddpm.num_inference_timesteps = self._args.dp_eval_steps
            elif tmp_config.algo.ddim.enabled:
                tmp_config.algo.ddim.num_inference_timesteps = self._args.dp_eval_steps
            else:
                raise Exception("should not reach here")
        ckpt_dict['config'] = tmp_config.dump()
    # Rebuild the policy from the (possibly patched) checkpoint dict
    self._model, self._ckpt_dict = FileUtils.policy_from_checkpoint(ckpt_dict=ckpt_dict, device=infer_device, verbose=True)
    self._action_normalization_stats = self._ckpt_dict.get("action_normalization_stats")

def test_by_training_set(self):

    import robomimic.utils.train_utils as TrainUtils
    from torch.utils.data import DataLoader

    config = self._config
    ObsUtils.initialize_obs_utils_with_config(config)
    eval_dataset_cfg = config.train.data[0]
    dataset_path = os.path.expandvars(os.path.expanduser(eval_dataset_cfg["path"]))
    ds_format = config.train.data_format

    if not os.path.exists(dataset_path):
        raise Exception("Dataset at provided path {} not found!".format(dataset_path))

    # Read action/observation shape metadata so the loader matches training exactly
    shape_meta = FileUtils.get_shape_metadata_from_dataset(
        dataset_path=dataset_path,
        action_keys=config.train.action_keys,
        all_obs_keys=config.all_obs_keys,
        ds_format=ds_format,
        verbose=True)

    trainset, validset = TrainUtils.load_data_for_training(
        config, obs_keys=shape_meta["all_obs_keys"])
    train_sampler = trainset.get_dataset_sampler()

    obs_normalization_stats = None
    if config.train.hdf5_normalize_obs:
        obs_normalization_stats = trainset.get_obs_normalization_stats()
    # Use the action normalization stats stored in the checkpoint rather than the
    # freshly computed dataset stats, so un-normalization matches training
    dataset_action_normalization_stats = trainset.get_action_normalization_stats()
    trainset.set_action_normalization_stats(self._action_normalization_stats)

    train_loader = DataLoader(
        dataset=trainset,
        sampler=train_sampler,
        batch_size=config.train.batch_size,
        shuffle=(train_sampler is None),
        num_workers=config.train.num_data_workers,
        drop_last=True)

    expected_traj = []
    output_traj = []
    data_loader_iter = iter(train_loader)
    num_steps = len(train_loader)

    with torch.no_grad():
        for i in range(num_steps):
            batch = next(data_loader_iter)
            # Drop the leading batch dimension before feeding the policy
            obs = {k: torch.squeeze(v, dim=0) for k, v in batch['obs'].items()}
            output_action_numpy = np.asarray(self._model(obs).cpu())
            output_action_numpy = np.squeeze(output_action_numpy)
            # Split the flat action vector back into its named components
            action_shapes = {"eef_position": 3, "eef_quaternion": 4, "gripper": 1}
            action_keys = ["eef_position", "eef_quaternion", "gripper"]
            action_dict = self.vector_to_action_dict(output_action_numpy, action_shapes, action_keys)
            expected_traj.append(np.squeeze(np.asarray(batch['action'])))
            output_traj.append(action_dict['eef_position'])

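The loop above only collects the two trajectories; for reference, here is a minimal sketch of how they could be compared numerically afterwards (illustrative code of mine, assuming both lists hold equal-shape numpy arrays and that the first three action dimensions are the end-effector position):

    import numpy as np

    # Stack per-batch results into (N, action_dim) / (N, 3) arrays
    expected = np.stack(expected_traj)
    predicted = np.stack(output_traj)

    # Assumption: expected[:, :3] is eef_position in the same (un-normalized)
    # space as the policy output; otherwise un-normalize with the checkpoint's
    # action_normalization_stats first.
    pos_err = np.linalg.norm(expected[:, :3] - predicted, axis=1)
    print("mean eef position error: {:.4f}".format(pos_err.mean()))
    print("max eef position error:  {:.4f}".format(pos_err.max()))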
Thanks for your attention; looking forward to your kind response!

oym1994 commented 5 months ago

One more thing: when saving the checkpoint the loss is 0.14, but after loading it and training for one more epoch the loss is 0.1. Is that valid or not?

j96w commented 5 months ago

Hi @oym1994 , before going down to the loss gap between the training and validation sets, could you double-check your batch_size=2000? In this repo we use batch_size=16, as Link and Link. About the loss gap: I assume you are saying you get a "0.014" loss when saving the checkpoint but receive a "0.1" loss after loading it again. There must be a bug in these checkpoint/dataset loading steps. From the code, it seems you load the action_normalization_stats properly. Have you double-checked whether your obs_normalization_stats is also correct after loading the checkpoint?
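One quick way to check that (an illustrative sketch, not code from this repo; it assumes robomimic-style stats, i.e. nested dicts of numpy arrays keyed by action/observation name) is to diff the stats stored in the checkpoint against the ones recomputed from the dataset:

    import numpy as np
    import robomimic.utils.file_utils as FileUtils

    # Stats saved inside the checkpoint
    _, ckpt_dict = FileUtils.algo_name_from_checkpoint(ckpt_path="[YOUR_CHECKPOINT].pth")
    ckpt_stats = ckpt_dict.get("action_normalization_stats")

    # Stats recomputed from the dataset (trainset built exactly as in train.py)
    dataset_stats = trainset.get_action_normalization_stats()

    for key in ckpt_stats:
        for stat_name in ckpt_stats[key]:
            a = np.asarray(ckpt_stats[key][stat_name])
            b = np.asarray(dataset_stats[key][stat_name])
            if not np.allclose(a, b):
                print("mismatch at {}/{}: max diff {}".format(key, stat_name, np.abs(a - b).max()))

If your checkpoint also stores obs_normalization_stats, the same loop applies to it.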

oym1994 commented 5 months ago

Hi, thanks for your kind response! I set the batch size to 2000 for a much more powerful GPU, not the default 16. I have tested many learning rates from 0.01 to 0.0001 and found that the maximum learning rate that still converges is 0.00025 (it does not increase with the batch size; is that normal?). After training for 2 days, the loss has converged to 0.0029, but the gap is still pretty large (better than before). So I wonder what your final loss was.

The above code loads data using the same method as train.py to guarantee the data processing is identical. The model loading and inference follow run_trained_agent.py. The observation normalization stats are all None (both from the checkpoint and from the config).

More info: the camera is an Azure Kinect, the point cloud is downsampled to 10000 points, and we use a UR3 arm with a parallel gripper instead of a dexterous hand. When collecting the data, we keep the initial states of the arm and the manipulated object the same, while the motion in between varies freely.

I also ran train.py to save and then load a trained checkpoint without any modification, and found the loss after loading from the checkpoint is much bigger than the one at saving time. It really confuses me.

Thanks for your attention and response again!!!

oym1994 commented 5 months ago

[Four plots: ground-truth (red) vs. predicted (green) end-effector trajectories]

Here are some pictures where the red curves are ground-truth trajectories and the green curves are predicted ones. When the trajectory is a straight line, the prediction looks reasonable (though some difference remains). This experiment was conducted with 100 episodes of data, batch_size=16, lr=0.0001, and epoch=1360.

j96w commented 4 months ago

Hi @oym1994 , we did a quick test with our codebase and found there is no gap between the training and validation loss after loading the checkpoints. You can find our full training logs and saved checkpoints here. Could you double-check your setup and make sure the checkpoints are loaded correctly?

[image: training/validation loss plot]

I also added a validation script valid.py to this codebase for your reference. This script loads a trained checkpoint and does a validation run through the training dataset, which is how we got the plot shown here. To use it, simply run:

    python scripts/valid.py --config training_config/diffusion_policy_pcd_packaging_1-20.json --resume '[YOUR_CHECKPOINT].pth'

oym1994 commented 4 months ago

Thanks for sharing! Can you also upload the training dataset ("/media/jeremy/cde0dfff-70f1-4c1c-82aa-e0d469c14c62/image_demo.hdf5") used above, which I will use to validate the predicted trajectory? Thanks again.

One more question: as an imitation learning method, is it necessary to transform the point cloud into the robot arm's base link frame?

j96w commented 4 months ago

It can be any of our released processed hdf5 datasets; here I'm just using the one for the wiping task.

Yep, the transformation between the camera frame and the robot frame needs to be calibrated. Moving the point cloud into the robot frame lets you figure out whether a point is outside the robot's reachability. We have example code for this.
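For reference, a minimal sketch of applying such a calibrated transform (illustrative code, assuming a hand-eye-calibrated 4x4 camera-to-base matrix T_base_cam and an (N, 3) point cloud in the camera frame; the names are mine, not from this repo):

    import numpy as np

    def camera_to_robot_frame(points_cam, T_base_cam):
        """Transform an (N, 3) point cloud from camera frame to robot base frame."""
        n = points_cam.shape[0]
        # Append a ones column to form homogeneous coordinates, then apply the transform
        points_h = np.hstack([points_cam, np.ones((n, 1))])   # (N, 4)
        return (T_base_cam @ points_h.T).T[:, :3]             # (N, 3)

Points that land far outside the arm's workspace after this transform can then be cropped before downsampling.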

oym1994 commented 4 months ago

Thanks again for your kind and patient response. Currently I haven't transformed the point cloud from the third-view camera into the robot frame; I don't know whether that is the reason for the unexpected performance.
I will dig more deeply into your work. Hope for your continued support!