mosaicml / streaming

A Data Streaming Library for Efficient Neural Network Training
https://streaming.docs.mosaicml.com
Apache License 2.0
1.09k stars · 136 forks

Suboptimal usage of 8xH100 GPUs - Streaming dataloader speed significantly fluctuates across batches #686

Open VSehwag opened 4 months ago

VSehwag commented 4 months ago

Setup

This issue is related to #643 but concerns a more subtle problem with Streaming datasets. Over the course of training, we observe that the streaming dataloader's speed randomly takes a dip. This happens predominantly between epochs but also at random steps. It holds for all model sizes we tested (from a single layer to a 1B-parameter model).

Our current setup includes:

Our overall setup is the same as the diffusion training from https://github.com/mosaicml/diffusion, i.e., we launch the script using composer run.py --config-path yamls/hydra-yamls --config-name SD-2-base-256.yaml

The epoch size is 652 batches, and at the epoch boundary the dataloader gets stuck and takes a long time. However, the drop in throughput also appears at random steps within an epoch. This test was done on a relatively small dataset with 1.3M (652x2048) samples.

[Charts: training throughput vs. training steps (left) and vs. wall-clock time (right)]

We also tested two other datasets with disk sizes of 1T and 2T (and corresponding epoch sizes of 2k and 5k batches) and observed the same drops in throughput. The plot on the right shows that the dataloader often hangs for more than a minute. Two subtle issues are happening here:

We have tried a prefetch factor of up to 8 (with eight workers) and 2 (with two workers) for both datasets and didn't observe any resolution of the drops. We also ablated the FSDP mode between full_shard and no_shard, but the dips in throughput persist. We used a 140M-parameter model, but the dips are also present with a 1B model. We created the MDS dataset by processing our raw data with 8 processes and then merging the index.json files.
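For reference, the conversion roughly follows the pattern below (a minimal sketch, not our actual script; the schema, output path, and load_partition helper are placeholders, and merge_index is assumed to be available in the installed streaming version):

```python
import os
from multiprocessing import Pool

from streaming import MDSWriter
from streaming.base.util import merge_index  # assumed available in the installed streaming version

OUT_ROOT = '/data/mds'                          # placeholder output root
COLUMNS = {'image': 'bytes', 'caption': 'str'}  # placeholder schema


def write_partition(rank: int) -> None:
    # Each process writes to its own subdirectory to avoid shard-name collisions.
    with MDSWriter(out=os.path.join(OUT_ROOT, str(rank)), columns=COLUMNS) as out:
        for sample in load_partition(rank):     # load_partition is a placeholder
            out.write(sample)


if __name__ == '__main__':
    with Pool(processes=8) as pool:
        pool.map(write_partition, range(8))
    # Merge the per-process index.json files into a single OUT_ROOT/index.json.
    merge_index(OUT_ROOT, keep_local=True)
```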

Somehow the issue is less severe in single-GPU training, where we only observe the drop in throughput at the end of an epoch (2k batches). In all previous tests we used 8-GPU DDP (with the grad_sharding FSDP config). Note that we don't observe a perfectly linear speedup (8.5 batch/s on 8 GPUs vs 1.4 batch/s on 1 GPU), which suggests we are I/O bound and could contribute to the worse dips in the multi-GPU setting.

Overall there are three puzzling questions:

snarayan21 commented 4 months ago

Thanks for raising this issue @VSehwag! So we have seen drops in throughput between epochs, but given that your data is residing locally, the time between epochs shouldn't be very long. Could you make sure that drop_last=True in your DataLoader, and could you also try setting persistent_workers=True for the DataLoader? This will keep workers alive between epochs and should help decrease the inter-epoch downtime.
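Concretely, something along these lines (a sketch; the local path, batch size, and worker counts are placeholders to adjust for your setup):

```python
from streaming import StreamingDataset
from torch.utils.data import DataLoader

# Placeholder values throughout -- adjust for your setup.
dataset = StreamingDataset(
    local='/data/mds',   # data already on local disk
    remote=None,
    shuffle=True,
    batch_size=256,      # per-device batch size
)

dataloader = DataLoader(
    dataset,
    batch_size=256,
    num_workers=8,
    prefetch_factor=2,
    drop_last=True,           # keeps per-rank batch counts identical
    persistent_workers=True,  # keeps workers alive between epochs
)
```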

As for throughput drops during training, what is the memory usage of your GPUs during training? PyTorch has a known issue (a probable memory leak), so disabling automatic garbage collection and only running it once every N steps has resolved this sort of throughput issue in the past. You can also increase the StreamingDataset predownload argument to allow the dataset to download more samples in advance. Lastly, the Streaming Simulator can help you estimate what sort of training throughput you should expect, given your dataset and model iteration time.
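For the garbage-collection suggestion, the core idea is just this (a sketch; the interval and training step are placeholders):

```python
import gc

gc.disable()       # turn off automatic collection
GC_INTERVAL = 100  # placeholder; tune for your workload

for step, batch in enumerate(dataloader):
    train_step(batch)  # placeholder for your training step
    if step % GC_INTERVAL == 0:
        gc.collect()   # collect explicitly every N steps
```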

Hope this helps.

VSehwag commented 4 months ago

Thanks for looking into it. I am already using persistent workers and dropping the last batch, i.e., both drop_last=True and persistent_workers=True. I've observed that setting persistent workers to False increases the wait time between epochs. The major concern for us is the wait time at random steps during the epoch.

The memory usage of each GPU is ~35/80 GB. I am not sure any memory leak is happening, as we don't see GPU memory increasing over steps. Just to clarify, the data fully resides locally (remote is set to None), so the predownload setting and the cache limits and other bottlenecks identified by the simulator aren't applicable in this case.

snarayan21 commented 4 months ago

@VSehwag Ah right, I forgot your data was local. Since you're using Composer, you can refer to the scheduled GC callback we have in LLM Foundry, which should work for your training setup as well. Here's the documentation on how to enable callbacks. Even though GPU memory may not seem to increase, we have seen that only doing garbage collection at specified intervals resolves throughput fluctuation/degradation over the course of training.
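The core of that callback is roughly the following (a sketch, not the actual LLM Foundry code; the interval is a placeholder):

```python
import gc

from composer.core import Callback, State
from composer.loggers import Logger


class ScheduledGC(Callback):
    """Disable automatic GC and collect on a fixed batch interval instead."""

    def __init__(self, interval_batches: int = 100):
        self.interval_batches = interval_batches

    def fit_start(self, state: State, logger: Logger) -> None:
        gc.disable()

    def batch_end(self, state: State, logger: Logger) -> None:
        if state.timestamp.batch.value % self.interval_batches == 0:
            gc.collect()

    def fit_end(self, state: State, logger: Logger) -> None:
        gc.enable()


# Pass it to the Trainer (or wire it up via the callbacks section of your hydra yaml):
# trainer = Trainer(..., callbacks=[ScheduledGC(interval_batches=100)])
```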

To narrow down whether this issue is with Streaming, could you try training without Streaming and check the throughput?
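One quick way to do that is to swap in an in-memory synthetic dataset of the same shape, so any remaining throughput dips come from the model/trainer rather than the dataloader (a sketch; shapes and sizes are placeholders for whatever your model consumes):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder shapes: 2048 fake images plus pre-computed text embeddings.
images = torch.randn(2048, 3, 256, 256)
text_embeds = torch.randn(2048, 77, 1024)

baseline_loader = DataLoader(
    TensorDataset(images, text_embeds),
    batch_size=256,
    num_workers=8,
    shuffle=True,
    drop_last=True,
    persistent_workers=True,
)
```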

karan6181 commented 3 months ago

@VSehwag Wondering if you have gotten a chance to try @snarayan21's suggestion?

snarayan21 commented 1 month ago

Hey @VSehwag --

@XiaohanZhangCMU recently merged the above fix that addresses inter-epoch hangs. I'm wondering if it might also be affecting your runs. We're going to be cutting a new release soon, but in the meantime, feel free to retry from main!

vsehwag-sony commented 1 month ago

Thanks a lot for following up on the thread. Let me pull the latest changes and check if it resolves the issue.

XiaohanZhangCMU commented 3 weeks ago

Hey @VSehwag, I wanted to follow up here to see if the fix from the recent release works for your workloads.