sovit-123 / fasterrcnn-pytorch-training-pipeline

PyTorch Faster R-CNN Object Detection on Custom Dataset
MIT License

RuntimeError: CUDA out of memory. #25

Open uysalfurkan opened 2 years ago

uysalfurkan commented 2 years ago

Hi, I am working on Kaggle with a custom dataset. I got a RuntimeError and have no idea how to solve it. Can you help me?

RuntimeError: CUDA out of memory. Tried to allocate 2.44 GiB (GPU 0; 15.90 GiB total capacity; 9.95 GiB already allocated; 1.70 GiB free; 13.44 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

sovit-123 commented 2 years ago

@uysalfurkan Can you try reducing the batch size? I think that should solve the issue.

uysalfurkan commented 2 years ago

@sovit-123 Firstly, thank you for your answer. Yes, it helps, but I need to compare YOLOv5 and Faster R-CNN in my task.

In YOLOv5, training with batch_size=64 gave me the best results, so I need to complete this process with batch_size=64. Is it possible?

sovit-123 commented 2 years ago

@uysalfurkan Try a smaller batch size, in my opinion. Faster RCNN models are generally larger than YOLOv5 models. From my experience, Faster RCNN models give good results with batch size 4 as well. Please try that. And if you like the library, I would surely love to get some feedback from you.
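As an aside, if an effective batch size of 64 is needed mainly for a fair comparison, gradient accumulation can emulate it with a small per-step batch. A minimal sketch, not part of this library; the model, optimizer, and train_loader here are placeholders:

    import torch
    import torchvision

    # placeholder model/optimizer; the pipeline builds its own
    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights=None, num_classes=2)
    model.train()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

    accum_steps = 16  # e.g. 4 images per step x 16 steps = effective batch of 64
    optimizer.zero_grad()
    for i, (images, targets) in enumerate(train_loader):  # train_loader: your DataLoader
        loss_dict = model(images, targets)  # detection models return a loss dict in train mode
        loss = sum(loss_dict.values()) / accum_steps  # scale so accumulated gradients average out
        loss.backward()  # gradients accumulate across iterations
        if (i + 1) % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad()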

uysalfurkan commented 2 years ago

Thanks for your advice. Yes, I like it; you have prepared a very nice and clear library. I clearly understand the code even though I am new to object detection.

I need to add some new lines to your source code to obtain a more detailed results (CSV) file that contains performance metrics for both val and train. I will contact you if I face a problem.

Thank you!

sovit-123 commented 2 years ago

Thank you for the feedback, @uysalfurkan. A CSV file with the mAP metric is already saved. Let me know what other things you want to track in the CSV file, and I will add them as part of the library.

uysalfurkan commented 2 years ago

My purpose is to visualize the metrics (mAP [0.5 and 0.5:0.9], recall, precision, loss [object and box]) of both validation and training per epoch on the same plot. In order to do that, I need a CSV file which contains these metrics and the epoch number.

Additionally, it would be perfect if we had a result_info.txt file that is generated at the end of the training process and contains hyperparameters such as the learning rate, optimizer name, and batch size, and model information such as the pre-trained model and backbone version.

I would appreciate it if you could add all these things as part of the library. I am trying to add them myself, but I'm having trouble figuring out in which file the performance results are generated and what the variable names are.

Thank you!

sovit-123 commented 2 years ago

@uysalfurkan Hi, some of these, like an opt.yaml file containing all the hyperparameters and model names, are already saved to the results directory. Others, like the validation losses, are a bit difficult to add, as the PyTorch Faster RCNN models don't output any loss values in eval() mode. A lot of things apart from the validation loss are already saved; I will try to add the rest, but it may take some time as I am the only person working on this project.

In the meantime, I can add the mAP values and all the training losses to the CSV file.

sovit-123 commented 2 years ago

@uysalfurkan I have also updated the WandB logging to plot everything per epoch instead of per iteration, which is a bit easier to interpret.

uysalfurkan commented 2 years ago

@sovit-123 Hi, thank you for your interest. I am waiting for the updates. I have attached an example of the CSV I need to get: results_csv_example.csv

By the way, can I modify the train.py file from the notebook? For example, I may change the learning rate value or add a new plot function with confidence scores to the annotation.py file.

This is easy when I work locally, but how can I do it in the Kaggle environment? (For example, we replace the YAML file with %%writefile.)

sovit-123 commented 2 years ago

@uysalfurkan The CSV file update has been made. It's slightly different as of now compared to what you are asking for: it has the four losses, mAP @ 0.50, and mAP @ 0.50:0.95.

And yes, you can use the %%writefile method to overwrite the files with your own modifications. It will work.
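For example, in a Kaggle cell (the path is hypothetical; point it at wherever you cloned the repository):

    %%writefile fasterrcnn-pytorch-training-pipeline/train.py
    # paste the full, edited contents of train.py in this cell;
    # %%writefile overwrites the entire file with the cell body

Note that %%writefile replaces the whole file, so the cell needs the complete modified file, not just the changed lines.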

uysalfurkan commented 2 years ago

@sovit-123 I ran a training, but the generated CSV file has just 8 columns: epoch | map | map_05 | train loss | train cls loss | train box reg loss | train obj loss | train rpn loss

I need to get all of the columns in the CSV that I attached above. I need to see train and val performance on the same plot for analyzing epochs and overfitting. Did you make another update for that?

Or which .py file should I focus on to make these updates myself?

Thank you!

sovit-123 commented 2 years ago

@uysalfurkan I am not logging the validation values yet. It will take some time, as it requires modifying the validation function inside the engine.py script. This is because Faster RCNN models do not output any validation loss values in model.eval() mode; they do so only in model.train() mode.
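For anyone who wants to modify engine.py themselves in the meantime, a common workaround (a sketch, not the library's code) is to run the validation forward pass with the model left in train() mode but wrapped in torch.no_grad(), since that is the only mode in which torchvision's Faster RCNN returns its loss dict:

    import torch

    @torch.no_grad()
    def validation_loss(model, val_loader, device):
        # train() so the torchvision detection model returns losses;
        # no_grad() ensures nothing is learned from the validation data.
        # (train() can update BatchNorm running stats, but torchvision's
        # Faster RCNN backbones use FrozenBatchNorm2d, so it is safe there.)
        model.train()
        total = 0.0
        for images, targets in val_loader:
            images = [img.to(device) for img in images]
            targets = [{k: v.to(device) for k, v in t.items()} for t in targets]
            loss_dict = model(images, targets)
            total += sum(loss.item() for loss in loss_dict.values())
        return total / len(val_loader)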

uysalfurkan commented 2 years ago

@sovit-123 Ok, I'm waiting. I would be very happy if you let me know when the modification is finished. Also, it would be great if the train mAP values were in the file as well as the validation ones.

sovit-123 commented 2 years ago

@uysalfurkan mAP is a validation metric already. In object detection, we calculate mAP on the validation dataset only, which is the case with this code base as well. I hope this helps.

uysalfurkan commented 2 years ago

@sovit-123 Hi, I'm a little confused after this conversation. How can I tell whether there is overfitting or not from your results CSV?

sovit-123 commented 2 years ago

@uysalfurkan In object detection, you can tell that a model is overfitting when the mAP starts decreasing instead of increasing. mAP is always calculated on the validation dataset.
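A short sketch of how that check could be plotted from the results CSV (the column names follow the ones mentioned above and may differ from the actual file):

    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv('outputs/training/results.csv')  # hypothetical path

    fig, ax_map = plt.subplots()
    ax_map.plot(df['epoch'], df['map'], label='mAP@0.50:0.95')
    ax_map.plot(df['epoch'], df['map_05'], label='mAP@0.50')
    ax_map.set_xlabel('epoch')
    ax_map.set_ylabel('mAP')

    ax_loss = ax_map.twinx()  # second y-axis for the loss scale
    ax_loss.plot(df['epoch'], df['train loss'], color='gray', label='train loss')
    ax_loss.set_ylabel('loss')

    ax_map.legend(loc='upper left')
    ax_loss.legend(loc='upper right')
    plt.title('mAP falling while train loss keeps falling suggests overfitting')
    plt.show()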

uysalfurkan commented 1 year ago

Hi @sovit-123

I got the log below when I ran train.py with create_fasterrcnn_model as the model.

UserWarning: This DataLoader will create 4 worker processes in total. Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary. cpuset_checked))

Traceback (most recent call last):
  File "train.py", line 505, in <module>
    main(args)
  File "train.py", line 258, in main
    build_model = create_model[args['model']]
KeyError: 'create_fasterrcnn_model'

sovit-123 commented 1 year ago

Hello @uysalfurkan. The warning above says that there are 2 cores in the CPU but you are trying to use 4 workers. I think this is a common warning on Kaggle. For now, you may ignore it, or pass --workers 2 to the training command if you want to avoid it.

Regarding the KeyError: you need to pass a valid model name key to the --model flag in the train.py command. It looks like you have passed create_fasterrcnn_model as the key, which is not valid. By default, the key is fasterrcnn_resnet50_fpn_v2. You may also pass a model name key like this: python train.py --model fasterrcnn_resnet50_fpn <rest of the command>. You can find all the model name keys here: https://github.com/sovit-123/fasterrcnn-pytorch-training-pipeline#A-List-of-All-Model-Flags-to-Use-With-the-Training-Script
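For context, the KeyError comes from a dictionary lookup like the one in the traceback. A hypothetical sketch of that registry pattern with a friendlier error message (the real create_model mapping and its builder functions live in this repository):

    # hypothetical sketch of the registry lookup behind the KeyError
    def build_resnet50_fpn(num_classes):
        ...  # placeholder builder function

    create_model = {
        'fasterrcnn_resnet50_fpn': build_resnet50_fpn,
        # one builder per supported model key
    }

    model_name = args['model']
    if model_name not in create_model:
        raise KeyError(
            f"Unknown model '{model_name}'. Valid keys: {', '.join(sorted(create_model))}"
        )
    build_model = create_model[model_name]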

uysalfurkan commented 1 year ago

@sovit-123 Hi, I want to get mAP_0.5:0.90 rather than mAP_0.5:0.95. How can I change the code?

sovit-123 commented 1 year ago

@uysalfurkan Hello, that would require changing the pycocotools evaluation code. But at the moment, I cannot say for sure where exactly the code needs to change.
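If you want to experiment yourself, the IoU thresholds live in COCOeval's parameters. A sketch assuming the evaluation ultimately goes through pycocotools (where exactly this pipeline constructs its COCOeval object would still need to be located):

    import numpy as np
    from pycocotools.cocoeval import COCOeval

    # coco_gt and coco_dt are pycocotools COCO objects holding the
    # ground truth and the detections (assumed to exist already)
    coco_eval = COCOeval(coco_gt, coco_dt, iouType='bbox')

    # the default thresholds are np.linspace(0.5, 0.95, 10);
    # use 0.50:0.90 in steps of 0.05 instead
    coco_eval.params.iouThrs = np.linspace(0.5, 0.90, 9)

    coco_eval.evaluate()
    coco_eval.accumulate()
    coco_eval.summarize()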

uysalfurkan commented 1 year ago

Hi again,

After starting to run the train.py command, I got the error below:

OSError: /opt/conda/lib/python3.7/site-packages/nvidia/cublas/lib/libcublas.so.11: undefined symbol: cublasLtHSHMatmulAlgoInit, version libcublasLt.so.11

How can I fix this?

sovit-123 commented 1 year ago

@uysalfurkan Looks like a CUDA issue. Which GPU do you have, is it an RTX or a GTX GPU?

uysalfurkan commented 1 year ago

I am using Kaggle GPUs.

sovit-123 commented 1 year ago

@uysalfurkan Ok, I understand the issue now. It looks like pip install -r requirements.txt is installing torch 1.13.1, which has issues with CUDA on Kaggle. This is because of this line in the file: torch>=1.12.0, !=1.13.0. For now, PyTorch 1.12.0 works best. I will update the requirements file with torch==1.12.0 by the end of the day. You may also manually install it in the Kaggle environment and it will work fine.
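For example, in a Kaggle cell (assuming the CUDA 11.x runtime Kaggle used at the time; torchvision 0.13.0 is the release paired with torch 1.12.0):

    pip install torch==1.12.0 torchvision==0.13.0 --extra-index-url https://download.pytorch.org/whl/cu113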

uysalfurkan commented 1 year ago

Hi @sovit-123, thanks for your fast replies.

I got the error below and could not figure it out. I set the epoch number to 100, but at epoch 55 the process fails: RuntimeError: DataLoader worker (pid 18111) is killed by signal: Killed.

sovit-123 commented 1 year ago

@uysalfurkan Are you running on Kaggle?

uysalfurkan commented 1 year ago

@sovit-123 Yes I am running on Kaggle

sovit-123 commented 1 year ago

@uysalfurkan I was also facing that issue yesterday, but had never seen it before. I still need to debug it. Can you try --workers 2 and train again?