lyuwenyu / RT-DETR

[CVPR 2024] Official RT-DETR (RTDETR paddle pytorch), Real-Time DEtection TRansformer, DETRs Beat YOLOs on Real-time Object Detection. 🔥 🔥 🔥
Apache License 2.0

Fix import error and copy-on-read overhead (reported as a "memory leak" in repository issues), and slightly refactor dist_utils.py for improved readability #418

Open int11 opened 1 month ago

int11 commented 1 month ago

Currently, there is a problem with memory usage exploding in the COCO dataset class.

https://github.com/lyuwenyu/RT-DETR/issues/93 https://github.com/lyuwenyu/RT-DETR/issues/172 https://github.com/lyuwenyu/RT-DETR/issues/207

The cause is copy-on-read of forked CPython objects. If you want to explore this problem, check this blog post: Demystify-RAM-Usage-in-Multiprocess-DataLoader

The CocoDetection_share_memory class uses less total PSS memory than the current repository's COCO dataset class. This can be verified with memory_check.py.
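
To make the idea behind the fix concrete, here is a minimal sketch of the usual workaround for copy-on-read (illustrative only; the class name PackedAnnotations is made up and this is not the exact CocoDetection_share_memory implementation in this PR). The per-image annotation objects are pickled once into a single numpy byte buffer, so forked workers read plain memory pages instead of bumping the refcounts of millions of small CPython objects:

    import pickle

    import numpy as np

    class PackedAnnotations:
        """Store a list of Python objects as two numpy arrays (bytes + offsets).

        After a DataLoader worker is forked, indexing into this structure only
        reads numpy memory pages; it never touches per-object CPython refcounts,
        so the parent's pages are not copied on read.
        """
        def __init__(self, items):
            blobs = [pickle.dumps(x, protocol=pickle.HIGHEST_PROTOCOL) for x in items]
            self._offsets = np.cumsum([0] + [len(b) for b in blobs]).astype(np.int64)
            self._buffer = np.frombuffer(b"".join(blobs), dtype=np.uint8)

        def __len__(self):
            return len(self._offsets) - 1

        def __getitem__(self, i):
            start, end = self._offsets[i], self._offsets[i + 1]
            return pickle.loads(self._buffer[start:end].tobytes())

    # Hypothetical usage inside a COCO-style dataset:
    #   self.targets = PackedAnnotations(per_image_annotation_lists)  # in __init__
    #   target = self.targets[idx]                                    # in __getitem__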

lyuwenyu commented 1 month ago

Thanks for your PR addressing these problems.

Regarding the memory leak problem:

If the evaluation step is not performed during training, there is no continuous memory growth. So I wonder whether this is really a problem with the dataloader, and not with the COCO evaluation?

int11 commented 1 month ago

It's a multiprocessing problem that occurs when the dataloader's num_workers > 0; see this PyTorch issue: https://github.com/pytorch/pytorch/issues/13246

As a result, both the train and evaluation datasets cause unnecessary memory usage. The evaluation dataset also causes unnecessary memory growth, just not as much as the train dataset. The important point is that, as far as I know, memory stops growing once the workers have accessed the entire dataset.
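
For reference, a rough, self-contained reproduction of this effect (my own sketch, not code from this repository) looks like the following; with num_workers > 0 the workers' resident memory keeps growing until every item has been read once:

    # Assumes Linux, where DataLoader workers are forked: a dataset that holds a
    # large list of small Python objects.  Each worker's resident memory grows as
    # it touches items for the first time, because reading an object updates its
    # refcount and copies the parent's page.
    import os

    import psutil
    import torch
    from torch.utils.data import DataLoader, Dataset

    class ListDataset(Dataset):
        def __init__(self, n=2_000_000):
            # Millions of small dicts, similar in shape to COCO annotation records.
            self.items = [{"id": i, "bbox": [i, i, i + 1, i + 1]} for i in range(n)]

        def __len__(self):
            return len(self.items)

        def __getitem__(self, idx):
            # Merely touching the Python object triggers copy-on-read in the worker.
            return torch.tensor(self.items[idx]["bbox"], dtype=torch.float32)

    if __name__ == "__main__":
        loader = DataLoader(ListDataset(), batch_size=256, shuffle=True, num_workers=4)
        main_proc = psutil.Process(os.getpid())
        for step, _ in enumerate(loader):
            if step % 1000 == 0:
                workers = sum(c.memory_info().rss for c in main_proc.children(recursive=True))
                print(f"step {step}: main rss {main_proc.memory_info().rss / 2**30:.2f} GiB, "
                      f"worker rss {workers / 2**30:.2f} GiB")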

It sounds strange that there is no problem without the evaluation step. I already confirmed the huge memory increase in the COCO dataset class itself. This has nothing to do with whether the dataset is used for training or evaluation; any dataset that stores CPython objects has this problem.

Even the memory_check.py code that I provided performs neither training nor evaluation; it only does a fake read of the data using pickle.dumps.
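
Conceptually, that fake read plus measurement looks something like this sketch (hypothetical helper names, not the actual memory_check.py; it assumes Linux, where psutil exposes PSS):

    # The workers only pickle.dumps each sample, which is enough to "read" every
    # Python object in it, while the parent sums PSS over itself and its workers,
    # i.e. the kind of per-iteration total reported in the tables below.
    import os
    import pickle

    import psutil
    from torch.utils.data import DataLoader

    def fake_read_collate(batch):
        # Runs inside the worker process: serialize the raw samples instead of
        # building tensors, which forces every object in the sample to be read.
        for sample in batch:
            pickle.dumps(sample)
        return len(batch)

    def total_pss_gib():
        main_proc = psutil.Process(os.getpid())
        procs = [main_proc] + main_proc.children(recursive=True)
        return sum(p.memory_full_info().pss for p in procs) / 2**30

    def measure(dataset, num_workers=2, report_every=10):
        # Assumes a map-style dataset so len(loader) is defined.
        loader = DataLoader(dataset, batch_size=32, num_workers=num_workers,
                            collate_fn=fake_read_collate)
        for it, _ in enumerate(loader):
            if it % report_every == 0:
                print(f"iteration : {it} / {len(loader)}, total pss : {total_pss_gib():.3f}GB")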

VladKatsman commented 1 month ago

From what I see, the problem starts at the synchronization and accumulation part in det_solver.py. CocoDetection_share_memory doesn't help me:

    print("Averaged stats:", metric_logger)
    if coco_evaluator is not None:
        coco_evaluator.synchronize_between_processes()

    # accumulate predictions from all images
    if coco_evaluator is not None:
        coco_evaluator.accumulate()
        coco_evaluator.summarize()

int11 commented 1 month ago

> From what I see, the problem starts at the synchronization and accumulation part in det_solver.py. CocoDetection_share_memory doesn't help me.

@VladKatsman Please show the memory usage of both dataset classes in your environment. In my case, overall memory efficiency improves by about 1.5x when I use CocoDetection_share_memory.

    main(dataset_class=CocoDetection, range_num=30000)

      time      PID  rss    pss    uss     shared    shared_file
    ------  -------  -----  -----  ------  --------  -------------
     55122  1993233  3.2G   1.7G   892.9M  2.3G      44.2M
     55122  1993491  2.9G   1.5G   835.0M  2.1G      50.4M
     55122  1993496  3.4G   2.1G   1.4G    2.0G      50.4M
    totle pss : 5.367GB
    iteration : 920 / 937, time : 10.729
      time      PID  rss    pss    uss     shared    shared_file
    ------  -------  -----  -----  ------  --------  -------------
     55133  1993233  3.1G   1.6G   763.3M  2.3G      44.2M
     55133  1993491  2.9G   1.5G   833.1M  2.0G      50.4M
     55133  1993496  3.1G   1.8G   1.0G    2.0G      50.4M
    totle pss : 4.899GB
    iteration : 930 / 937, time : 10.298

    main(dataset_class=CocoDetection_share_memory, share_memory=False, range_num=30000)

      time      PID  rss    pss    uss     shared    shared_file
    ------  -------  -----  -----  ------  --------  -------------
     57902  2003746  3.2G   1.7G   899.0M  2.3G      43.9M
     57902  2004024  2.9G   1.5G   864.9M  2.0G      50.8M
     57902  2004029  3.2G   1.9G   1.2G    2.0G      50.8M
    totle pss : 5.138GB
    iteration : 910 / 937, time : 11.612
      time      PID  rss    pss    uss     shared    shared_file
    ------  -------  -----  -----  ------  --------  -------------
     57914  2003746  3.1G   1.6G   792.5M  2.3G      43.9M
     57914  2004024  2.8G   1.5G   852.8M  2.0G      50.8M
     57914  2004029  2.9G   1.6G   864.9M  2.0G      50.8M
    totle pss : 4.706GB
    iteration : 920 / 937, time : 11.366
      time      PID  rss    pss    uss     shared    shared_file
    ------  -------  -----  -----  ------  --------  -------------
     57925  2003746  3.2G   1.8G   932.8M  2.3G      43.9M
     57925  2004024  2.9G   1.6G   903.0M  2.0G      50.8M
     57925  2004029  2.8G   1.5G   854.4M  2.0G      50.8M
    totle pss : 4.880GB
    iteration : 930 / 937, time : 11.577

    main(dataset_class=CocoDetection_share_memory, share_memory=True, range_num=30000)

      time      PID  rss    pss      uss     shared    shared_file
    ------  -------  -----  -------  ------  --------  -------------
     58961  2010117  2.1G   1.6G     1.3G    745.5M    44.9M
     58961  2010422  1.5G   1010.2M  765.1M  764.0M    51.1M
     58961  2010427  1.5G   1010.5M  765.8M  763.3M    51.2M
    totle pss : 3.550GB
    iteration : 910 / 937, time : 10.558
      time      PID  rss    pss      uss     shared    shared_file
    ------  -------  -----  -------  ------  --------  -------------
     58972  2010117  1.8G   1.3G     1.1G    745.5M    44.9M
     58972  2010422  1.5G   1010.2M  765.1M  764.0M    51.1M
     58972  2010427  1.5G   1010.5M  765.8M  763.3M    51.2M
    totle pss : 3.254GB
    iteration : 920 / 937, time : 11.179
      time      PID  rss    pss      uss     shared    shared_file
    ------  -------  -----  -------  ------  --------  -------------
     58982  2010117  2.1G   1.6G     1.3G    745.5M    44.9M
     58982  2010422  1.6G   1.1G     900.5M  764.0M    51.1M
     58982  2010427  1.5G   1010.5M  765.8M  763.3M    51.2M
    totle pss : 3.682GB
    iteration : 930 / 937, time : 9.704

Unless you avoid storing CPython objects entirely, I expect memory efficiency to increase. When you test memory, you must account for swap memory and the garbage collector, so make sure you have enough free memory while testing.
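
As a small illustration of those caveats (a sketch only; the helper name is made up):

    # Force a garbage-collection pass so dead Python objects do not inflate the
    # numbers, and warn if the machine is swapping, since resident-memory figures
    # under-report real usage once pages are swapped out.
    import gc

    import psutil

    def pre_measurement_check():
        gc.collect()
        swap = psutil.swap_memory()
        if swap.used > 0:
            print(f"warning: {swap.used / 2**30:.2f} GiB of swap in use; "
                  "RSS/PSS will under-report actual memory usage")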

CocoDetection_share_memory only reduces memory usage; it does not eliminate it altogether. If you don't have enough memory, it may look as though it isn't helping at all.

Or, as you said, a separate det_solver.py problem may exist at the same time. Please provide example code and a test script that can help resolve the det_solver.py problem.

VladKatsman commented 1 month ago

I am sorry, I will reply from a high-level point of view, without code. My training and validation datasets are about the same size (80k .jpg images, 640x640 each). During training I use a single machine with 3 GPUs and a batch size of 24 each (72 total).

This is the command prefix I used before the training params: CUDA_VISIBLE_DEVICES=0,1,2 torchrun --nproc_per_node=3

Training takes about 21 GB out of 24 GB of memory on each GPU and about 20 GB of RAM. During evaluation, usage rises above 128 GB of RAM (which is my total RAM size). Your updated code did not solve that problem either; there is still a SEGFAULT error.

I also evaluated the model using 1 GPU and 1 process, and it took about 50 GB of RAM for evaluation, which is a huge number as well. I don't know where to start looking for the problem; it looks like the evaluation code itself is somehow not memory efficient.

If we choose to use your project, I will be happy to debug it and commit fixes and changes.