@mling2024 Hi, thank you for the issue. For the dataset, if you go with manual splits, you need to split the data yourself. Regarding the second point, that DDP didn't work correctly, you're actually right: I realized that I didn't support DDP correctly.
To fix this issue, I made a change in the last commit https://github.com/takuseno/d3rlpy/commit/8e579b6b44b74ee4c4b8dc972a7cf1e5a2607e1c and confirmed that it fixes the issue. Due to this change, the enable_ddp flag moves from the fit(...) method to the create(...) method. Please check this updated example: https://github.com/takuseno/d3rlpy/blob/8e579b6b44b74ee4c4b8dc972a7cf1e5a2607e1c/examples/distributed_offline_training.py#L30
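For reference, here is a minimal sketch of the updated usage when launched with torchrun. The algorithm, dataset, and process-group setup below are illustrative assumptions; the linked distributed_offline_training.py example is the authoritative reference.

import os

import torch.distributed as dist

import d3rlpy


def main():
    # torchrun sets RANK/WORLD_SIZE; initialize the process group first.
    # (The actual example may use a different backend or a d3rlpy helper.)
    dist.init_process_group("gloo")
    rank = int(os.environ["RANK"])

    # Placeholder dataset; replace with your own MDPDataset.
    dataset, env = d3rlpy.datasets.get_cartpole()

    # enable_ddp is now passed to create(), not fit().
    # device=False keeps training on CPU; use f"cuda:{rank}" for one GPU per process.
    dqn = d3rlpy.algos.DQNConfig().create(device=False, enable_ddp=True)

    dqn.fit(
        dataset,
        n_steps=10000,
        n_steps_per_epoch=1000,
    )

    dist.destroy_process_group()


if __name__ == "__main__":
    main()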
You can try the fixed version by installing d3rlpy from source:
git clone https://github.com/takuseno/d3rlpy
cd d3rlpy
pip install -e .
Once you confirm it's working on your side as well, I'll release a new version to include this fix.
Hi Takuma, thank you very much for the quick response and fix! I will give it a try in the next few days and get back to you.
I noticed that since version 2.6.0, MDPDataset is constructed differently from my version, 2.5.0, so I have to change my dataset construction along with some other code. I hope the transition to the new version goes smoothly.
Thanks, Meng
Hi Takuma, I confirm that it works on my side. Please go ahead. Thanks, Meng
Thank you for the test! I will release the new version by the end of today in Japan.
The new version, 2.6.1, has been released now. Please let me close this issue.
Dear Takuma,
I have a large dataset and used DDP for training with 16 CPU processes. I tried two methods:
1. Loaded the entire dataset. I noticed that memory consumption is roughly 16 times that of single-CPU training, so I guess each process loaded the entire dataset. This would crash my system. Isn't the DDP or torchrun process supposed to split the dataset into 16 parts so that each process gets 1/16 of the data and memory?
2. Split the dataset into 16 parts. I split the dataset into 16 parts, generated an MDPDataset for each part, and fed one to each process, roughly as in the sketch below.
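A minimal illustrative sketch of this per-rank split, with dummy arrays standing in for the real data (names, shapes, and episode lengths are assumptions, not the original code):

import os

import numpy as np

import d3rlpy

# torchrun sets RANK and WORLD_SIZE in the environment.
rank = int(os.environ.get("RANK", "0"))
world_size = int(os.environ.get("WORLD_SIZE", "16"))

# Stand-ins for the full arrays loaded from disk.
observations = np.random.random((16000, 8)).astype(np.float32)
actions = np.random.random((16000, 2)).astype(np.float32)
rewards = np.random.random(16000).astype(np.float32)
terminals = np.zeros(16000, dtype=np.float32)
terminals[999::1000] = 1.0  # episode boundary every 1000 steps

# Take the rank-th contiguous chunk of each array; a real split should
# respect episode boundaries (terminal flags) so no episode is cut in half.
def shard(array):
    return np.array_split(array, world_size)[rank]

# Each process builds an MDPDataset from its own 1/16 of the data only.
dataset = d3rlpy.dataset.MDPDataset(
    observations=shard(observations),
    actions=shard(actions),
    rewards=shard(rewards),
    terminals=shard(terminals),
)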
This worked fine and solved the memory problem, but the trained models from the individual processes (I evaluated all ranks) varied drastically in performance, i.e., they were different models and DDP did not seem to work. Am I doing something wrong here?
Any responses will be greatly appreciated! Thanks, Meng