@mling2024 Hi, thank you for the issue. For the dataset, if you go with manual splits, you need to split the data yourself. Regarding the second point, that DDP didn't work correctly, you're actually right: I realized that I didn't support DDP correctly.
To fix this issue, I made a change in the last commit https://github.com/takuseno/d3rlpy/commit/8e579b6b44b74ee4c4b8dc972a7cf1e5a2607e1c and confirmed that it fixes the issue. Due to this change, the enable_ddp flag moves from the fit(...) method to the create(...) method. Please check this updated example: https://github.com/takuseno/d3rlpy/blob/8e579b6b44b74ee4c4b8dc972a7cf1e5a2607e1c/examples/distributed_offline_training.py#L30
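For reference, here is a minimal sketch of the updated usage when launched with torchrun. The algorithm, dataset, and process-group setup below are illustrative assumptions; the linked distributed_offline_training.py example is the authoritative reference.

import os

import torch.distributed as dist

import d3rlpy


def main():
    # torchrun sets RANK/WORLD_SIZE; initialize the process group first.
    # (The actual example may use a different backend or a d3rlpy helper.)
    dist.init_process_group("gloo")
    rank = int(os.environ["RANK"])

    # Placeholder dataset; replace with your own MDPDataset.
    dataset, env = d3rlpy.datasets.get_cartpole()

    # enable_ddp is now passed to create(), not fit().
    # device=False keeps training on CPU; use f"cuda:{rank}" for one GPU per process.
    dqn = d3rlpy.algos.DQNConfig().create(device=False, enable_ddp=True)

    dqn.fit(
        dataset,
        n_steps=10000,
        n_steps_per_epoch=1000,
    )

    dist.destroy_process_group()


if __name__ == "__main__":
    main()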
You can try the fixed version by installing d3rlpy from source:
git clone https://github.com/takuseno/d3rlpy
cd d3rlpy
pip install -e .
Once you confirm it's working on your side as well, I'll release a new version to include this fix.
Hi Takuma, thank you very much for the quick response and fix! I will give it a try in the next few days and get back to you.
I noticed that since version 2.6.0, MDPDataset is constructed differently from my version, 2.5.0, so I have to change my dataset construction along with some other code. I hope the transition to the new version goes smoothly.
Thanks, Meng
Hi Takuma, I confirm that it works on my side. Please go ahead. Thanks, Meng
Thank you for the test! I will release the new version by the end of today in Japan.
The new version, 2.6.1, has been released now. Please let me close this issue.
Dear Takuma,
I have a large dataset and used DDP for training with 16 CPU processes. I tried two methods:
1. Loaded the entire dataset. I noticed that memory consumption is roughly 16 times that of single-CPU training, so I guess each process loaded the entire dataset. This would crash my system. Isn't the DDP or torchrun process supposed to split the dataset into 16 parts so that each process gets 1/16 of the data and memory?
2. Split the dataset into 16 parts. I split the dataset into 16 parts, generated an MDPDataset for each part, and fed one to each process, roughly as in the sketch below.
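A minimal illustrative sketch of this per-rank split, with dummy arrays standing in for the real data (names, shapes, and episode lengths are assumptions, not the original code):

import os

import numpy as np

import d3rlpy

# torchrun sets RANK and WORLD_SIZE in the environment.
rank = int(os.environ.get("RANK", "0"))
world_size = int(os.environ.get("WORLD_SIZE", "16"))

# Stand-ins for the full arrays loaded from disk.
observations = np.random.random((16000, 8)).astype(np.float32)
actions = np.random.random((16000, 2)).astype(np.float32)
rewards = np.random.random(16000).astype(np.float32)
terminals = np.zeros(16000, dtype=np.float32)
terminals[999::1000] = 1.0  # episode boundary every 1000 steps

# Take the rank-th contiguous chunk of each array; a real split should
# respect episode boundaries (terminal flags) so no episode is cut in half.
def shard(array):
    return np.array_split(array, world_size)[rank]

# Each process builds an MDPDataset from its own 1/16 of the data only.
dataset = d3rlpy.dataset.MDPDataset(
    observations=shard(observations),
    actions=shard(actions),
    rewards=shard(rewards),
    terminals=shard(terminals),
)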
This worked fine and solved the memory problem, but the trained models from the individual processes (I evaluated all ranks) varied drastically in performance, i.e., they were different models and DDP did not seem to work. Am I doing something wrong here?
Any responses will be greatly appreciated! Thanks, Meng