Closed — oym1994 closed this issue 3 months ago
One more thing: when saving the checkpoint the loss is 0.14, but after loading it and training for one epoch the loss is 0.1. Is that expected or not?
Hi @oym1994 , before digging into the loss gap between the training and validation sets, could you double-check your batch_size=2000? In this repo we use batch_size=16, as Link and Link. About the loss gap, I assume you mean you get a loss of "0.014" when saving the checkpoint but "0.1" after loading it again. There may be a bug in the checkpoint/dataset loading steps. From the code, it seems you load action_normalization_stats properly. Have you also double-checked that your obs_normalization_stats is correct after loading the ckpt?
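One quick way to sanity-check this is to round-trip the stats through serialization and compare them element-wise. This is a minimal sketch, not the repo's actual checkpoint format — the stats dict layout here is a hypothetical stand-in:

```python
import pickle

import numpy as np

# Hypothetical normalization stats, shaped like {key: {"mean": ..., "std": ...}}.
stats = {"obs": {"mean": np.array([0.1, 0.2]), "std": np.array([1.0, 2.0])}}

# Simulate saving inside a checkpoint and loading it back.
ckpt_bytes = pickle.dumps({"obs_normalization_stats": stats})
loaded = pickle.loads(ckpt_bytes)["obs_normalization_stats"]

for key in stats:
    for field in ("mean", "std"):
        assert np.allclose(stats[key][field], loaded[key][field]), (key, field)
print("obs_normalization_stats round-trip OK")
```

If the arrays differ after a real save/load cycle, the bug is in the checkpointing path rather than in training itself.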
Hi, thanks for your kind response! I set the batch size to 2000 for a much more powerful GPU, not the default of 16. I tested learning rates from 0.01 to 0.0001 and found that the maximum learning rate that still converges is 0.00025 (it does not increase with the batch size — is that normal?). After training for two days the loss has converged to 0.0029, but the gap is still pretty large (better than before). So I wonder what your final loss was.
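For reference, the common linear-scaling heuristic would predict a much larger learning rate at batch_size=2000 than what turned out to be stable here — in practice the usable learning rate often saturates well below the scaled value. A tiny sketch of the heuristic (the base values are this repo's defaults; the rule itself is only a rough guide, not something the repo prescribes):

```python
def scaled_lr(base_lr, base_batch, batch_size):
    """Linear scaling heuristic: lr grows proportionally with batch size.
    In practice the maximum stable lr often saturates far below this."""
    return base_lr * batch_size / base_batch

# Default recipe here: lr=1e-4 at batch_size=16.
print(scaled_lr(1e-4, 16, 2000))  # heuristic suggests 0.0125
```

The observed stable maximum of 0.00025 is ~50x smaller than the heuristic's 0.0125, which is consistent with the rule breaking down at very large batches.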
The code above loads data using the same method as train.py, to guarantee the data processing is identical. The model loading and inference follow run_trained_agent.py. The observation normalization stats are all None (both in the checkpoint and in the config).
More info: the camera is an Azure Kinect, the point cloud has been downsampled to 10,000 points, and the robot is a UR3 arm with a parallel gripper instead of a dexterous one. When collecting the data, we kept the initial state of the arm and the manipulated object the same, while the procedure itself was casual.
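For context, downsampling to a fixed point budget can be done several ways; a minimal sketch of uniform random subsampling with numpy (the repo may use farthest-point sampling or voxel downsampling instead — this is just an illustration):

```python
import numpy as np

def downsample(points, n=10000, seed=0):
    """Randomly subsample an (N, 3) point cloud to at most n points."""
    rng = np.random.default_rng(seed)
    if len(points) <= n:
        return points
    idx = rng.choice(len(points), size=n, replace=False)
    return points[idx]

cloud = np.random.rand(50000, 3)
print(downsample(cloud).shape)  # (10000, 3)
```

Random subsampling is fast but can thin out small objects; farthest-point sampling preserves coverage better at the cost of speed.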
I ran train.py to save and then load a trained checkpoint without any modification, and found that the loss after loading the checkpoint is much larger than it was when saved. It really confuses me.
Thanks for your attention and response again!!!
Here are some pictures where the red curves are ground-truth trajectories and the green curves are predictions. When the trajectory is a straight line, the predicted results look good (though there are still some differences). This experiment was conducted with 100 episodes of data, batch_size=16, lr=0.0001, and epoch=1360.
Hi @oym1994 , we did a quick test with our codebase and found there is no gap between the training and validation loss after loading the checkpoints. You can find our full training logs and saved checkpoints here. Could you double-check your setup and make sure the checkpoints are loaded correctly?
I also added a validation script, valid.py, to this codebase for your reference. It loads a trained checkpoint and runs a validation pass over the training dataset, which is how we produced the plot shown here. To use it, simply run:
```
python scripts/valid.py --config training_config/diffusion_policy_pcd_packaging_1-20.json --resume '[YOUR_CHECKPOINT].pth'
```
Thanks for sharing! Could you also upload the training dataset ("/media/jeremy/cde0dfff-70f1-4c1c-82aa-e0d469c14c62/image_demo.hdf5") mentioned above? I will use it to validate the predicted trajectories. Thanks again.
One more question: as an imitation learning method, is it necessary to transform the point cloud into the robot arm's base-link coordinate frame?
It can be any of our released processed hdf5 datasets; here I'm just using the one for the wiping task.
Yep, the transformation between the camera frame and the robot frame needs to be calibrated. Moving the point cloud into the robot frame lets you figure out whether a point is outside the robot's reachability. We have an example code
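The frame change itself is a single homogeneous transform. A minimal sketch (not the repo's example code — the 4x4 extrinsic matrix here is a made-up illustration; in practice it comes from hand-eye calibration):

```python
import numpy as np

def camera_to_robot(points_cam, T_robot_cam):
    """Transform an (N, 3) point cloud from the camera frame to the robot
    base frame using a 4x4 homogeneous transform from extrinsic calibration."""
    homo = np.hstack([points_cam, np.ones((len(points_cam), 1))])  # (N, 4)
    return (T_robot_cam @ homo.T).T[:, :3]

# Example: camera origin sitting 1 m above the robot base, axes aligned.
T = np.eye(4)
T[2, 3] = 1.0
print(camera_to_robot(np.zeros((1, 3)), T))  # [[0. 0. 1.]]
```

Once points are in the base frame, a reachability check reduces to comparing them against the arm's workspace limits.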
Thanks again for your kind and patient response.
Currently I haven't transformed the point cloud from the third-view camera into the robot frame; I don't know whether that is the reason for the unexpected performance.
I will also look more deeply into your work. Hoping for your continued support!
Hi, thanks for your great work! I have trained this model on my own dataset (only a third-view RGBD camera, an arm, and a gripper; 100 demos). I followed the steps and trained with batch_size=2000 and step=500; the final loss is about 0.014. Then I fed the processed training data (using the same training data loader as train.py) back into the model to check performance, but unfortunately it is not OK — the gap is pretty large. Can you give some advice on how to track down the problem? What was your final loss, or could you share some more training results?
Here is the code used for this test (function bodies omitted):

```python
def load_model(self, config, infer_device):
    ...

def test_by_training_set(self):
    ...
```
Thanks for your attention; looking forward to your kind response!