When training on the complete dataset (about 800K objects), host memory usage (MiB Mem) keeps increasing until OOM. Is there any solution?

liuyuan-pal / SyncDreamer

[ICLR 2024 Spotlight] SyncDreamer: Generating Multiview-consistent Images from a Single-view Image

https://liuyuan-pal.github.io/SyncDreamer/

MIT License

When training on the complete dataset (about 800K objects), host memory usage (MiB Mem) keeps increasing until OOM. Is there any solution? #42

Open wyg-okk opened 1 year ago

wyg-okk commented 1 year ago

When I train on the dataset of about 800K objects, the memory usage circled in the screenshot keeps increasing as the number of training steps increases. My configs/syncdreamer-train.yaml is the same as the one provided by the author (https://github.com/liuyuan-pal/SyncDreamer/blob/main/configs/syncdreamer-train.yaml), except for the data path.

liuyuan-pal commented 1 year ago

Hi, the training is based on pytorch_lightning, which is supposed to manage resources correctly. The dataset class at https://github.com/liuyuan-pal/SyncDreamer/blob/eb41a0c73748cbb028ac9b007b11f8be70d09e48/ldm/data/sync_dreamer.py#L57 simply loads the data and is not supposed to cause increasing memory usage. You could check whether memory usage still grows when you run the dataset on its own.
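A minimal sketch of such a check follows, using only the Python standard library. It is not code from the repository: `sample_memory` and `LeakyToyData` are hypothetical names introduced for illustration, and the toy class only stands in for a real dataset. Note that `tracemalloc` sees only Python-level allocations made in the current process, so this assumes the data is iterated without worker processes (for example, a DataLoader with `num_workers=0`); memory held by worker processes or by native libraries would need a process-level tool instead.

```python
import tracemalloc


def sample_memory(iterable, max_steps=1000, log_every=100):
    """Iterate over `iterable` and sample traced Python memory.

    Returns a list of (step, bytes_currently_allocated) tuples, one
    sample every `log_every` steps, for at most `max_steps` steps.
    """
    samples = []
    tracemalloc.start()
    try:
        for step, _item in enumerate(iterable):
            if step >= max_steps:
                break
            if step % log_every == 0:
                current, _peak = tracemalloc.get_traced_memory()
                samples.append((step, current))
    finally:
        tracemalloc.stop()
    return samples


class LeakyToyData:
    """Hypothetical stand-in for a dataset that leaks memory.

    It keeps a reference to every item it yields, so traced memory
    grows with the number of steps.
    """

    def __init__(self, n):
        self.n = n
        self.kept = []

    def __iter__(self):
        for _ in range(self.n):
            item = bytes(10_000)    # about 10 KB per item
            self.kept.append(item)  # the leak: items are never released
            yield item


if __name__ == "__main__":
    data = LeakyToyData(500)
    for step, used in sample_memory(data, max_steps=500, log_every=100):
        print(f"step {step:4d}: {used / 2**20:.2f} MiB traced")
```

To apply this to the real training data, one would pass the repository's dataset object (or a DataLoader wrapping it) in place of the toy class. Roughly flat samples would suggest the dataset itself is not the source of the growth; steadily rising samples would point back at the data loading code.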

wyg-okk commented 1 year ago

Thank you very much. We think this is a problem with our configured environment. I am checking with the docker environment and will report back if there is any result.

rgxie commented 10 months ago

I have the same problem. Have you found a solution yet? The rate at which memory grows toward OOM is proportional to the number of workers (num_workers).

wyg-okk commented 10 months ago

When I ran the code in the docker image provided by the author, the problem was solved. You can try running it with the author's docker image.

rgxie commented 10 months ago

Thank you for the information. I am also using exactly that docker env, so this may indeed be a docker environment problem.