Do you mean RAM or GPU memory? I train with rather large RAM and GPU memory. A tip to save memory while training is to disable validation using --skip_validate.
I used an A100 40 GB with the code release from about a month ago and it never crashed. Now I have switched PCs to an RTX 3090 24 GB plus a Tesla T4 15 GB, and I tried using both GPUs as well as just the 3090. Eventually the training causes the PC to reboot with no error in the Ubuntu log or the NVIDIA log. I tried limiting memory usage to 85% by putting a limiter in front of main(), and I also cleared the cache before main() like this:
```python
torch.cuda.set_per_process_memory_fraction(0.85)
torch.cuda.empty_cache()
main()
```
but this still caused it to crash, although it took a few hours longer.
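One thing worth double-checking in the snippet above: written as `set_per_process_memory_fraction(0,85)`, the comma makes Python pass two arguments (fraction=0, device=85) instead of a fraction of 0.85. A minimal stub with the same `(fraction, device=0)` signature as `torch.cuda.set_per_process_memory_fraction` makes the mistake visible without needing a GPU:

```python
# Stub with the same signature as torch.cuda.set_per_process_memory_fraction,
# used here only to show how `(0,85)` is parsed; it does not touch the GPU.
def set_per_process_memory_fraction(fraction, device=0):
    return fraction, device

print(set_per_process_memory_fraction(0,85))   # -> (0, 85): fraction 0, device 85
print(set_per_process_memory_fraction(0.85))   # -> (0.85, 0): the intended 85% cap
```

With the real PyTorch call, the cap also applies per device, so on a multi-GPU setup it would need to be set once for each device index.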
I tried batch sizes 4 and 6, but the memory usage doesn't seem to change much; it tries to allocate as much as possible after ~2 epochs and clears the memory randomly mid-epoch now and then. It must be a problem with the training, because it never crashes outside of training. --skip_validate made the first training last longer as well, but after that it seemed to have no effect anymore.
Can you share the training log file?
Here is a log of the training that stopped last night.
I have also attached the previous crash of the same training:
When I check the log, the maximum memory usage is only 6.3 GB.
I do have to mention I am using a custom dataset, which might contain larger point clouds compared to S3DIS; I'll check that in a moment. Although, whenever the system runs out of memory, I would expect PyTorch to raise an error because it is trying to allocate too much memory?
I don't think it is a memory leak problem. Your problem may be due to large data with a large batch size.
I swapped the Areas from S3DIS for a train folder containing 155 folders of point clouds with a max size of ~100 MB per folder. The biggest S3DIS Area contains 69 folders with a max size of ~160 MB per folder. By that calculation there should be at most a ~35% increase in data size, but the GPU memory tops out at 21 GB on the 3090 and 9 GB on the Tesla, so that's strange. Besides that, do you think the crash is caused by the memory? I'll create a log of the GPU utilization and send the training log in a moment.
Memory seems to be fine, but it is still using a lot compared to your training, as you mentioned before.
Which batch size are you using?
For this training I used the following config:
I saw another issue talking about "input_conv"; does that have anything to do with the crashing, maybe?
I think you should use a smaller batch size. I don't think the issue is related to input_conv.
You use 4 Titan Xs with 12 GB each, right? Could you try running it on 3 of them to mimic my total GPU memory (~38.5 GB) and check the max batch size you can run with? For me it crashes anywhere between 30 minutes and 3 hours with batch size 6 and num_workers 6.
I used an RTX 8000 48 GB with batch size 4.
Was that the maximum you were able to run on that card with S3DIS? I am no AI expert by any means, but won't reducing the batch size from 6 to 3 or 2 reduce performance greatly?
I also see that whenever I start the training it returns the correct number of test scans (10), but it returns a lot of train data as well. In the image I attached it says 1000 scans, even though I only gave it 50 labeled point clouds; that basically means 50 situations or rooms, in the S3DIS sense, that contain labels. I also see that when I keep training without restarting, it allocates more memory and ignores set_per_process_memory_fraction(0.85).
We use a data repeat factor to avoid data loading time between epochs. https://github.com/thangvubk/SoftGroup/blob/d8665970f91aaf6ef1a0361c70103bc9aa67084d/configs/softgroup_s3dis_fold5.yaml#L36
Related to your training memory issue, I suggest checking whether the issue occurs on the standard dataset provided in the README. If no issue happens there, you can work on the modified data.
When I am training with batch size 4 and num_workers 4, my training starts off at ~8 GB. During the epochs it climbs in steps of ~4 GB. Eventually Ubuntu crashes, maybe because of another issue, but the growing memory within an epoch is quite strange. I have not changed anything in the training or the SoftGroup model. Is anyone aware of this issue, and how can I fix the growing memory? I tried num_workers 0 and increasing the batch size, but it keeps crashing, even in the first epoch, compared to roughly every 4th epoch with batch size 4 and num_workers 4.
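One way to pin down whether this is steady growth or just allocator noise is to record the per-epoch peak and check the trend. The sketch below keeps the check in plain Python so it runs anywhere; in a real run, the `peaks` values would come from `torch.cuda.max_memory_allocated()` read at the end of each epoch, after calling `torch.cuda.reset_peak_memory_stats()` at the epoch start (the threshold of 512 MB is an arbitrary assumption):

```python
# Rough leak heuristic: does per-epoch peak memory keep climbing by more
# than `tolerance_mb` on average? Feed it peaks collected with
# torch.cuda.max_memory_allocated() / torch.cuda.reset_peak_memory_stats().
def is_growing(peaks, tolerance_mb=512):
    """Return True if successive peak readings (in MB) trend upward
    by more than `tolerance_mb` per epoch on average."""
    if len(peaks) < 2:
        return False
    deltas = [b - a for a, b in zip(peaks, peaks[1:])]
    return sum(deltas) / len(deltas) > tolerance_mb

# Example readings (MB), mirroring the ~4 GB steps described above:
print(is_growing([8000, 12000, 16000, 20000]))  # -> True (leak-like growth)
print(is_growing([8000, 8100, 7900, 8050]))     # -> False (normal jitter)
```

If the peaks grow monotonically like the first example, a common culprit is accumulating tensors that still hold the autograd graph (e.g. summing losses without `.item()` or `.detach()`), which would match memory climbing across epochs while the model itself is unchanged.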