Open · int11 opened this issue 1 month ago
Thanks for your PR for these problems.
Regarding the memory leak problem: if the evaluation operation is not performed during training, there is no continuous memory growth. So I wonder whether it is really a dataloader problem rather than a coco evaluation problem?
It's a multiprocessing problem when the dataloader's num_workers > 0. Check out this PyTorch issue: https://github.com/pytorch/pytorch/issues/13246
As a result, both the train and evaluation datasets cause unnecessary memory usage. The evaluation dataset also causes unnecessary memory growth, but not as much as the train dataset. The important thing is that, as far as I know, the memory stops growing once the workers have accessed the entire dataset.
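To make the failure mode concrete, here is a minimal, self-contained sketch (not RT-DETR code; the dataset class is made up for illustration) of the pattern that triggers it: annotations stored as plain Python lists/dicts, iterated with num_workers > 0. Merely reading an item in a forked worker updates its refcount, which dirties the copy-on-write page, so each worker's RSS keeps growing until it has visited the whole list once.

# Minimal illustration (not RT-DETR code): a dataset backed by plain Python
# lists/dicts. With num_workers > 0, each forked worker that reads these
# objects writes their refcounts, dirtying the copy-on-write pages, so RSS
# grows in every worker until the whole list has been visited once.
import torch
from torch.utils.data import DataLoader, Dataset


class PyObjectDataset(Dataset):
    def __init__(self, n=100_000):
        # a large list of small CPython objects -> many refcounted pages
        self.items = [{"id": i, "boxes": [[0.0, 0.0, 10.0, 10.0]]} for i in range(n)]

    def __len__(self):
        return len(self.items)

    def __getitem__(self, idx):
        # touching the dict/list is enough to copy the page it lives on
        return torch.tensor(self.items[idx]["boxes"], dtype=torch.float32)


if __name__ == "__main__":
    loader = DataLoader(PyObjectDataset(), batch_size=64, num_workers=4)
    for _ in loader:  # watch per-worker RSS grow during the first pass
        pass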
It sounds strange that there is no problem without the evaluation operation. I already checked the huge memory increase in the coco dataset; it has nothing to do with the train/evaluation datasets or operations. Any dataset that holds CPython objects has this problem.
Even the memory_check.py code that I provided performs neither the train nor the evaluation operation. I only do a fake read of the data using pickle.dumps.
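For readers without the script at hand, the fake-read idea can be approximated like this (a rough sketch, not the actual memory_check.py; the helper names and the use of psutil are my own assumptions):

# Rough sketch of a "fake read" memory probe; not the actual memory_check.py.
# It iterates the dataset with workers, serializes each sample with
# pickle.dumps (no train/eval logic at all), and reports per-process
# RSS/PSS/USS via psutil. probe_memory/fake_read are hypothetical names.
import os
import pickle

import psutil
from torch.utils.data import DataLoader


def identity_collate(batch):
    # keep raw samples; we only want to touch the data, not batch it
    return batch


def probe_memory():
    parent = psutil.Process(os.getpid())
    procs = [parent] + parent.children(recursive=True)  # main + dataloader workers
    total_pss = 0
    for p in procs:
        m = p.memory_full_info()
        total_pss += m.pss
        print(f"pid={p.pid} rss={m.rss / 2**30:.2f}G pss={m.pss / 2**30:.2f}G "
              f"uss={m.uss / 2**30:.2f}G")
    print(f"total pss : {total_pss / 2**30:.3f}GB")


def fake_read(dataset, num_workers=2, log_every=100):
    loader = DataLoader(dataset, batch_size=1, num_workers=num_workers,
                        collate_fn=identity_collate)
    for i, batch in enumerate(loader):
        pickle.dumps(batch)  # touch the data, discard the result
        if i % log_every == 0:
            probe_memory()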
From what I see, the problem starts at the synchronization and accumulation part in det_solver.py. CocoDetection_share_memory doesn't help me:
print("Averaged stats:", metric_logger)
if coco_evaluator is not None:
coco_evaluator.synchronize_between_processes()
# accumulate predictions from all images
if coco_evaluator is not None:
coco_evaluator.accumulate()
coco_evaluator.summarize()
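For what it's worth, that synchronization step is inherently memory-hungry in distributed evaluation: conceptually, every rank ends up holding a copy of every other rank's predictions before accumulate() runs. A generic sketch of that gathering pattern (an illustration only, not the actual CocoEvaluator.synchronize_between_processes code):

# Illustration only; not the actual CocoEvaluator code. Gathering per-rank
# predictions means each process materializes world_size copies of the
# prediction lists before COCO accumulation, so peak RAM scales with the
# number of processes.
import torch.distributed as dist


def gather_predictions(local_preds):
    """local_preds: list of per-image prediction dicts produced on this rank."""
    if not (dist.is_available() and dist.is_initialized()):
        return local_preds
    world_size = dist.get_world_size()
    gathered = [None] * world_size
    dist.all_gather_object(gathered, local_preds)  # every rank receives everything
    merged = []
    for part in gathered:
        merged.extend(part)
    return merged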
@VladKatsman Show me the memory usage of the two dataset classes in your environment. In my case, overall memory efficiency improves by about 1.5x when I use CocoDetection_share_memory.
main(dataset_class=CocoDetection, range_num=30000)
time PID rss pss uss shared shared_file
------ ------- ----- ----- ------ -------- -------------
55122 1993233 3.2G 1.7G 892.9M 2.3G 44.2M
55122 1993491 2.9G 1.5G 835.0M 2.1G 50.4M
55122 1993496 3.4G 2.1G 1.4G 2.0G 50.4M
total pss : 5.367GB
iteration : 920 / 937, time : 10.729
time PID rss pss uss shared shared_file
------ ------- ----- ----- ------ -------- -------------
55133 1993233 3.1G 1.6G 763.3M 2.3G 44.2M
55133 1993491 2.9G 1.5G 833.1M 2.0G 50.4M
55133 1993496 3.1G 1.8G 1.0G 2.0G 50.4M
total pss : 4.899GB
iteration : 930 / 937, time : 10.298
main(dataset_class=CocoDetection_share_memory, share_memory=False, range_num=30000)
time PID rss pss uss shared shared_file
------ ------- ----- ----- ------ -------- -------------
57902 2003746 3.2G 1.7G 899.0M 2.3G 43.9M
57902 2004024 2.9G 1.5G 864.9M 2.0G 50.8M
57902 2004029 3.2G 1.9G 1.2G 2.0G 50.8M
total pss : 5.138GB
iteration : 910 / 937, time : 11.612
time PID rss pss uss shared shared_file
------ ------- ----- ----- ------ -------- -------------
57914 2003746 3.1G 1.6G 792.5M 2.3G 43.9M
57914 2004024 2.8G 1.5G 852.8M 2.0G 50.8M
57914 2004029 2.9G 1.6G 864.9M 2.0G 50.8M
total pss : 4.706GB
iteration : 920 / 937, time : 11.366
time PID rss pss uss shared shared_file
------ ------- ----- ----- ------ -------- -------------
57925 2003746 3.2G 1.8G 932.8M 2.3G 43.9M
57925 2004024 2.9G 1.6G 903.0M 2.0G 50.8M
57925 2004029 2.8G 1.5G 854.4M 2.0G 50.8M
total pss : 4.880GB
iteration : 930 / 937, time : 11.577
main(dataset_class=CocoDetection_share_memory, share_memory=True, range_num=30000)
time PID rss pss uss shared shared_file
------ ------- ----- ------- ------ -------- -------------
58961 2010117 2.1G 1.6G 1.3G 745.5M 44.9M
58961 2010422 1.5G 1010.2M 765.1M 764.0M 51.1M
58961 2010427 1.5G 1010.5M 765.8M 763.3M 51.2M
total pss : 3.550GB
iteration : 910 / 937, time : 10.558
time PID rss pss uss shared shared_file
------ ------- ----- ------- ------ -------- -------------
58972 2010117 1.8G 1.3G 1.1G 745.5M 44.9M
58972 2010422 1.5G 1010.2M 765.1M 764.0M 51.1M
58972 2010427 1.5G 1010.5M 765.8M 763.3M 51.2M
total pss : 3.254GB
iteration : 920 / 937, time : 11.179
time PID rss pss uss shared shared_file
------ ------- ----- ------- ------ -------- -------------
58982 2010117 2.1G 1.6G 1.3G 745.5M 44.9M
58982 2010422 1.6G 1.1G 900.5M 764.0M 51.1M
58982 2010427 1.5G 1010.5M 765.8M 763.3M 51.2M
total pss : 3.682GB
iteration : 930 / 937, time : 9.704
As long as your dataset holds CPython objects, I'm guessing that memory efficiency will definitely increase. When you test memory, you must take swap memory and the garbage collector into account, so you need enough free memory while testing.
This only reduces memory usage; it does not eliminate it altogether. If you don't have enough memory, it may look like it doesn't help at all.
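For reference, the general workaround discussed in the PyTorch issue linked above is to pack the per-image annotations into flat numpy buffers instead of nested CPython objects, so forked workers can read them without copy-on-read page duplication (torch tensors can additionally be placed in shared memory with .share_memory_()). A minimal sketch of that idea, assuming a list of annotation dicts; this is not the actual CocoDetection_share_memory implementation:

# Sketch of the workaround pattern from pytorch/pytorch#13246, not the actual
# CocoDetection_share_memory class: serialize all annotations into one flat
# numpy byte buffer plus an offset index. The buffer is not made of refcounted
# CPython objects, so reading it in forked dataloader workers does not trigger
# copy-on-read page duplication.
import pickle

import numpy as np
from torch.utils.data import Dataset


class PackedAnnotationDataset(Dataset):
    def __init__(self, annotations):
        # annotations: list of per-image dicts, loaded once in the main process
        blobs = [pickle.dumps(a, protocol=pickle.HIGHEST_PROTOCOL) for a in annotations]
        self._addr = np.cumsum([0] + [len(b) for b in blobs])
        self._buf = np.frombuffer(b"".join(blobs), dtype=np.uint8)

    def __len__(self):
        return len(self._addr) - 1

    def __getitem__(self, idx):
        start, end = self._addr[idx], self._addr[idx + 1]
        target = pickle.loads(self._buf[start:end].tobytes())
        # image loading and transforms would go here
        return target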
Or, as you said, the det_solver.py problem may also exist at the same time. Please provide example code and a test script that reproduce it, so the det_solver.py problem can be resolved.
I am sorry, I will reply from a high-level point of view, without code. My training and validation datasets are about the same size (80k .jpg images, 640x640 each). During training I use a single machine with 3 GPUs and a batch size of 24 on each (72 total).
This is the command I use (before the training params): CUDA_VISIBLE_DEVICES=0,1,2 torchrun --nproc_per_node=3
It takes about 21 GB of the 24 GB of memory on each GPU and about 20 GB of RAM. During evaluation, the number rises above 128 GB of RAM (which is my total RAM size). Your updated code did not solve that problem either; there is still a SEGFAULT error.
I've evaluated the model using 1 GPU and 1 process, and it took about 50 GB of RAM for evaluation, which is a huge number as well. I don't know where to start looking for the problem; it looks like the evaluation code itself is somehow not memory efficient.
If we choose to use your project, I will be happy to debug it and commit fixes and changes.
Currently, there is a problem with memory exploding in the coco dataset class.
https://github.com/lyuwenyu/RT-DETR/issues/93 https://github.com/lyuwenyu/RT-DETR/issues/172 https://github.com/lyuwenyu/RT-DETR/issues/207
The cause is copy-on-read of forked CPython objects. If you want to explore this problem, check out this blog post: Demystify-RAM-Usage-in-Multiprocess-DataLoader
The CocoDetection_share_memory class uses less total PSS memory than the current repository's coco dataset class. This can be verified with memory_check.py.