@nagadit thanks for bringing up the issue.
afaict, the peak memory going from 13.75 to ~14 is normal, a ~2% increase. One possibility is that the dataset is not fully shuffled, so some samples are larger and require more memory at runtime; PyTorch allocates that memory and does not free it until garbage collection. Overall, it is expected that peak memory accumulates over the course of training.
Can you provide more details? For example: whether this has already been isolated as a streaming issue, which trainer you use, a code snippet showing how your StreamingDataset instance is created, and whether the memory leak keeps growing. Thanks.
I apologize, I uploaded the wrong screenshot by mistake. Corrected.
Just to confirm, do you not observe this if you avoid using streaming?
Without streaming there is still a leak, but it is minimal.
For example, you can create a dataloader as shown below, wrapping the dataloader and model in DDP.
```python
s3_dataloader = StreamingOutsideGIWebVid(  # custom StreamingDataset-based loader
    batch_size=360,
    extra_local="path/to/local",
    extra_remote="s3://",
)

trainer = Trainer(
    model=Model(),
    train_dataloader=s3_dataloader,
    max_duration="2ep",
)
trainer.fit()
```
You will get a big memory leak when working with images or video files.
Any confirmation on this? It seems like a deal breaker if there's a memory leak in MosaicML.
What's the size of your dataset? Note that streaming does not evict data by default, since you might need it for multiple passes, but if your dataset is large you can limit the cache size: https://docs.mosaicml.com/projects/streaming/en/stable/dataset_configuration/shard_retrieval.html#cache-limit
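For reference, a minimal sketch of capping the shard cache, assuming a plain StreamingDataset; the remote/local paths and the 10 GB limit are placeholders (see the linked docs for exact semantics):

```python
from streaming import StreamingDataset

# With cache_limit set, least-recently-used shards are evicted once
# the local on-disk cache grows past the limit.
dataset = StreamingDataset(
    remote="s3://my-bucket/my-dataset",  # placeholder remote path
    local="/tmp/my-dataset-cache",       # placeholder local cache dir
    cache_limit="10gb",                  # cap the on-disk shard cache
    batch_size=360,
)
```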
Hello everyone! The memory leak problem has been solved (a custom boto3 session handler was written for multiprocessing and multithreading). You can learn more about the problem here: https://github.com/boto/boto3/issues/1670
This issue can be closed, or kept as the canonical one for future searches on this problem.
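For anyone who lands here with the same symptom, here is a minimal sketch of the workaround, assuming S3 reads happen inside forked dataloader workers and threads; the `SessionHandler` name and its `client` method are illustrative, not the exact code used here. boto3 sessions/clients are not safe to share across processes or threads (see the issue linked above), so create one per process and per thread:

```python
import os
import threading

import boto3


class SessionHandler:
    """Hand out one boto3 client per (process, thread).

    Sharing a single boto3 session/client across forked workers or
    threads can corrupt its internal state and leak memory, per
    https://github.com/boto/boto3/issues/1670.
    """

    def __init__(self):
        self._local = threading.local()

    def client(self, service: str = "s3"):
        pid = os.getpid()
        # Rebuild the client if this thread has none yet (fresh
        # thread-local), or if we are in a freshly forked process.
        if getattr(self._local, "pid", None) != pid:
            self._local.pid = pid
            self._local.client = boto3.session.Session().client(service)
        return self._local.client


# Usage inside a dataset/worker: call handler.client() each time
# instead of caching a client created in the parent process.
handler = SessionHandler()
# obj = handler.client().get_object(Bucket="my-bucket", Key="shard.mds")
```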
Nice! Glad to see it wasn't a bug on our end :). Thanks for hunting it down and flagging it.