mosaicml / streaming

A Data Streaming Library for Efficient Neural Network Training
https://streaming.docs.mosaicml.com
Apache License 2.0
1.1k stars 137 forks

Memory leak using download_file with DDP or FSDP #758

Closed nagadit closed 1 month ago

nagadit commented 1 month ago

Environment

To reproduce

Steps to reproduce the behavior:

  1. Use this dataset class
  2. Set remote path to video or images
  3. Start training with N_GPUS > 8 using FSDP or DDP; set batch_size > 128 and multiple dataloader workers (n_workers)


mvpatel2000 commented 1 month ago

Just to confirm, do you not observe this if you avoid using streaming?

XiaohanZhangCMU commented 1 month ago

@nagadit thanks for bringing up the issue.

AFAICT, the peak memory going from 13.75 GB to ~14 GB is normal: a roughly 2% increase. One possibility is that the dataset is not fully shuffled, so some samples are larger and require more memory at runtime; PyTorch allocates that memory and does not free it until garbage collection. Overall, it is expected for peak memory to accumulate during training.

Can you provide more details? For example: whether this has already been isolated as a streaming issue, which trainer you use, a code snippet showing how your StreamingDataset instance is created, and whether the memory leak continues to grow. Thanks.

nagadit commented 1 month ago

> @nagadit thanks for bringing up the issue.
>
> AFAICT, the peak memory going from 13.75 GB to ~14 GB is normal: a roughly 2% increase. One possibility is that the dataset is not fully shuffled, so some samples are larger and require more memory at runtime; PyTorch allocates that memory and does not free it until garbage collection. Overall, it is expected for peak memory to accumulate during training.
>
> Can you provide more details? For example: whether this has already been isolated as a streaming issue, which trainer you use, a code snippet showing how your StreamingDataset instance is created, and whether the memory leak continues to grow. Thanks.

I apologize; I uploaded the wrong screenshot by mistake. It has been corrected.

nagadit commented 1 month ago

> Just to confirm, do you not observe this if you avoid using streaming?

There is still a leak without streaming, but it is much smaller.

nagadit commented 1 month ago

For example, you can create a dataloader as shown below, wrapping the dataloader and model in DDP.

from composer import Trainer  # assuming the Composer Trainer, given max_duration="2ep"

# StreamingOutsideGIWebVid is the reporter's custom StreamingDataset-based dataloader.
s3_dataloader = StreamingOutsideGIWebVid(batch_size=360, extra_local="path/to/local", extra_remote="s3://")

trainer = Trainer(
    model=Model(),
    train_dataloader=s3_dataloader,
    max_duration="2ep",
)
trainer.fit()

You will get a big memory leak when working with images or video files.
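One way to tell a genuine leak apart from one-off allocator warm-up is to sample heap growth per batch while iterating the dataloader. The helper below is an illustrative stdlib-only sketch, not part of streaming or Composer; it traces only Python-level allocations, not CUDA memory.

```python
import tracemalloc

def measure_growth(batches, n_steps=100):
    """Record Python heap growth (bytes) after each consumed batch.

    A real leak shows growth that keeps climbing across epochs;
    warm-up allocations flatten out after the first pass.
    """
    tracemalloc.start()
    baseline, _ = tracemalloc.get_traced_memory()
    growth = []
    for step, _batch in enumerate(batches):
        if step >= n_steps:
            break
        current, _ = tracemalloc.get_traced_memory()
        growth.append(current - baseline)
    tracemalloc.stop()
    return growth
```

Feeding a real dataloader through this for a couple of epochs and comparing the curves makes the "minimal leak without streaming" claim above quantifiable.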

AugustDev commented 1 month ago

Any confirmation on this, seems like deal breaker if there's a memory leak in MosaicML?

mvpatel2000 commented 1 month ago

What's the size of your dataset? Note that streaming does not evict downloaded shards by default, since you might need them for multiple passes, but if your dataset is large you can limit the cache size: https://docs.mosaicml.com/projects/streaming/en/stable/dataset_configuration/shard_retrieval.html#cache-limit
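Per the shard-retrieval docs linked above, the cap is set when constructing the dataset; a minimal config sketch (the remote and local paths here are placeholders):

```python
from streaming import StreamingDataset

# Evict least-recently-used shards once the local cache exceeds the limit.
dataset = StreamingDataset(
    remote="s3://my-bucket/dataset",   # placeholder remote path
    local="/tmp/streaming_cache",      # placeholder local cache dir
    cache_limit="100gb",               # documented cache_limit argument
)
```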

nagadit commented 1 month ago

Hello everyone! The memory leak problem has been solved (a custom boto3 session handler has been written for multiprocessing and multithreading). You can learn more about the problem here: https://github.com/boto/boto3/issues/1670
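The boto3 issue linked above boils down to sharing one boto3 session/client across dataloader worker threads: sessions are not thread-safe and can accumulate memory. A generic sketch of the per-thread-client pattern follows; the name `PerThreadClient` is illustrative, not the actual handler written for this fix.

```python
import threading

class PerThreadClient:
    """Lazily create and cache one client per thread.

    With boto3 you would pass e.g.
    ``lambda: boto3.session.Session().client("s3")`` as the factory,
    so each worker thread gets its own session and client instead of
    sharing one process-wide instance (see boto/boto3#1670).
    """

    def __init__(self, factory):
        self._factory = factory
        self._local = threading.local()  # one cached client slot per thread

    def get(self):
        if not hasattr(self._local, "client"):
            self._local.client = self._factory()
        return self._local.client
```

Repeated calls to `get()` from the same thread return the same client, while distinct threads get distinct clients, so no session state is shared across concurrent downloads.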

This issue can be closed, or kept as the main reference for future searches on this problem.

mvpatel2000 commented 1 month ago

Nice! Glad to see it wasn't a bug on our end :). Thanks for hunting it down and flagging it.