lyuwenyu / RT-DETR

[CVPR 2024] Official RT-DETR (RTDETR paddle pytorch), Real-Time DEtection TRansformer, DETRs Beat YOLOs on Real-time Object Detection. 🔥 🔥 🔥
Apache License 2.0
2.31k stars, 259 forks

PyTorch training code may have a memory leak #207

Open DrRyanHuang opened 7 months ago

DrRyanHuang commented 7 months ago

I encountered a memory overflow on another server that froze the whole system, which may lead to the following problems:
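To check whether memory really grows from epoch to epoch before the machine runs out, one option is to log the process's peak resident set size after every epoch. The sketch below uses only the standard-library resource module (Unix only); peak_rss_mib and log_memory are hypothetical helper names, and the loop at the bottom is a stand-in for train_one_epoch, not RT-DETR code. Note that RUSAGE_SELF covers only the main process, so memory held by DataLoader worker processes is not included.

```python
import resource
import sys


def peak_rss_mib() -> float:
    """Peak resident set size of the current process in MiB (Unix only)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux but in bytes on macOS
    if sys.platform == "darwin":
        return peak / (1024 * 1024)
    return peak / 1024


def log_memory(epoch: int) -> float:
    """Hypothetical helper: print and return the peak RSS after an epoch."""
    mib = peak_rss_mib()
    print(f"epoch {epoch}: peak RSS {mib:.1f} MiB")
    return mib


if __name__ == "__main__":
    # Stand-in for the training loop: each "epoch" keeps another 50 MiB
    # alive, imitating a leak, so the logged peak should climb each time.
    leaked = []
    for epoch in range(3):
        leaked.append(bytearray(50 * 1024 * 1024))
        log_memory(epoch)
```

Because the value is a peak, it never goes down; a leak shows up as a peak that keeps rising epoch after epoch instead of levelling off after the first one.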

lyuwenyu commented 7 months ago

Related issues: https://github.com/lyuwenyu/RT-DETR/issues/93, https://github.com/lyuwenyu/RT-DETR/issues/172

Could you run more tests locally and try to track down the problem?

DrRyanHuang commented 7 months ago

Two days ago I used gc to analyze the memory leak. It seemed that the dataset was not released after each epoch of training/eval, but I'm not at all sure, because I didn't have enough time to look into it properly.
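One way to test the hypothesis that the dataset is not released after an epoch is to compare tracemalloc snapshots taken after consecutive epochs: source lines whose allocations keep growing point at what is being retained. A minimal sketch, assuming a hypothetical run_epoch that deliberately keeps every sample alive; with the real code, the two take_snapshot() calls would go around the actual train/eval loop. tracemalloc only sees memory allocated through Python's allocators in the current process, so tensor storage, CUDA memory and DataLoader workers are outside its view.

```python
import tracemalloc


def run_epoch(store: list) -> None:
    """Hypothetical epoch that leaks: it keeps a reference to every sample."""
    for i in range(10_000):
        store.append({"image_id": i, "bbox": [0.0, 0.0, 1.0, 1.0]})


def top_growth(n_epochs: int = 3, limit: int = 3) -> list:
    """Return the source lines whose allocations grew most between epochs."""
    tracemalloc.start()
    retained: list = []
    run_epoch(retained)                  # first epoch, used as the baseline
    before = tracemalloc.take_snapshot()
    for _ in range(n_epochs - 1):
        run_epoch(retained)              # later epochs
    after = tracemalloc.take_snapshot()
    tracemalloc.stop()
    # StatisticDiff entries, sorted by absolute size difference (largest first)
    return after.compare_to(before, "lineno")[:limit]


if __name__ == "__main__":
    for stat in top_growth():
        print(stat)
```

If the dataset really were retained, the top entries would point at the lines in the dataset or data-loading code that build the samples.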

I hope this helps you solve the problem. I added this code after train_one_epoch:

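    import gc

    # Added after train_one_epoch. `cuda_empty_cache`, `metric_logger` and
    # `cycles` come from the surrounding code, which is not shown here.
    if cuda_empty_cache:
        del metric_logger
        gc.collect()
        # torch.cuda.empty_cache()

    print(f"Number of objects in gc.garbage: {len(gc.garbage)}")

    # Keep only annotation-like dicts (COCO annotations have a 'bbox' key)
    ann = []
    for cycle in cycles:
        if isinstance(cycle, dict) and 'bbox' in cycle:
            ann.append(cycle)

    # Print what still references the first such dict
    for obj in ann:
        referrers = gc.get_referrers(obj)
        print(f"Referrers of {obj}: {referrers}")
        break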
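The snippet does not show where cycles comes from. One plausible reading, and it is only an assumption since the full file was not posted, is that it is gc.garbage after a collection with gc.DEBUG_SAVEALL enabled: with that flag, every unreachable object the collector finds is appended to gc.garbage instead of being freed, so the list holds the objects that only reference cycles were keeping alive. A minimal sketch of that reading, with a hypothetical helper find_leaked_annotations and a self-referencing dict standing in for a COCO annotation:

```python
import gc


def find_leaked_annotations() -> list:
    """Return unreachable dicts that look like COCO annotations (have 'bbox').

    Assumption: `cycles` in the snippet above is gc.garbage collected with
    DEBUG_SAVEALL; the poster's full training file was not shared.
    """
    old_flags = gc.get_debug()
    gc.set_debug(old_flags | gc.DEBUG_SAVEALL)  # save unreachable objects instead of freeing them
    gc.collect()
    gc.set_debug(old_flags)                     # restore the previous debug flags
    cycles = list(gc.garbage)                   # this copy keeps the saved objects alive
    gc.garbage.clear()
    return [obj for obj in cycles if isinstance(obj, dict) and "bbox" in obj]


if __name__ == "__main__":
    # Hypothetical leak: an annotation dict that refers to itself
    ann = {"bbox": [0, 0, 10, 10]}
    ann["self"] = ann
    del ann
    for obj in find_leaked_annotations():
        print("unreachable annotation:", obj["bbox"])
        # Same inspection as in the snippet above
        print("number of referrers:", len(gc.get_referrers(obj)))
```

Objects that are still referenced from live code are not unreachable, so they never show up in gc.garbage; for those, calling gc.get_referrers(obj) on an object found some other way (for example by scanning gc.get_objects()) is the more direct tool.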
meirashaf commented 3 months ago

Hello, would you mind providing the full file? I'm confused about how to use your solution. For example, I don't understand what the cycles variable contains.