stefanklut / laypa

Layout analysis to find layout elements in documents (similar to P2PaLA)

GPU for training of baseline model #44

Open icarl-ad opened 1 month ago

icarl-ad commented 1 month ago

Hi there,

I set up laypa and started a training run for baseline recognition to check if it's working properly.

I was surprised to see that the GPU only seems to be used for validation and not for the training itself. Furthermore, I was confused about the duration of the training: I used approx. 1600 documents for training and the whole process was done after 10 minutes.

Obviously the training duration depends on the hardware etc., but is it possible that I need to change some configuration? And if so, where do I do that? I use the main.py script for training, passing the training and validation data as well as a config file. I used a very simple configuration with only the test and training weights defined and an output directory.

Thanks in advance!

stefanklut commented 1 month ago

My initial thought is that training doesn't actually start. As you observed, 10 minutes is too short.

What I think is going wrong: if you set the training weights, the model resumes training from that checkpoint, so it starts at the iteration at which the previous training finished. As a test, could you run a different config without setting the training weights? I recommend trying the baseline_general.yaml config, also just to see which configs can be changed.

If you do want to continue training from those weights at that step, you may want to look at SOLVER.MAX_ITER in the config and raise it. Or, if you want to start from step 0 but still use the given weights, set MODEL.WEIGHTS instead of TRAIN.WEIGHTS.
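A minimal sketch of what that could look like as a config override, assuming the keys nest the way the dotted names above suggest (the weight path and iteration count are placeholders):

```yaml
# Sketch only: load pretrained weights but start counting from iteration 0.
MODEL:
  WEIGHTS: "/path/to/pretrained_baseline.pth"  # placeholder path to the pretrained model
SOLVER:
  MAX_ITER: 20000  # raise this instead if you resume via TRAIN.WEIGHTS
# TRAIN:
#   WEIGHTS: ...   # set only if you want to resume from the exact step stored in the checkpoint
```

With TRAIN.WEIGHTS left unset, the iteration counter starts at 0 while the network is still initialized from the pretrained weights.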

If this turns out not to be the issue, let me know.

icarl-ad commented 1 month ago

Hi Stefan,

thank you for your help! That seems to have helped. However, now I have some other questions that you can hopefully help me answer:

I hope you can help me, thanks in advance!

stefanklut commented 1 month ago

> Or, if you want to start from step 0 but still use the given weights, set MODEL.WEIGHTS instead of TRAIN.WEIGHTS.

This is where you set the path to the baseline2 weights.

> However, I saw that the ETA is 1083 days

Again, you are right, that is way too long. Can you tell me which hardware you are using, and do you see your GPU being used when training starts? Changing parameters will obviously help, but not make such a big difference. Also, are you training through Docker or in a conda environment? It still seems like you are not using a GPU, or you have some other bottleneck. With a single GPU I get an ETA of 2 weeks, but that is for the full training; you should just be finetuning.

> SOLVER.MAX_ITER

This controls the number of iterations. If you are finetuning the model, set this to be a lot lower than it currently is.

Also look at DATALOADER.NUM_WORKERS and SOLVER.IMS_PER_BATCH.

Also check whether your GPU is capable of AMP, and change MODEL.AMP_TRAIN.ENABLED if it is not.
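As a rough illustration, those two overrides could look like this in the config; the values are only a guess for a laptop-class GPU, and the nesting of AMP_TRAIN is assumed from the dotted name above:

```yaml
# Illustrative finetuning values only; tune for your own hardware.
SOLVER:
  MAX_ITER: 2000    # much lower than the full-training default when finetuning
MODEL:
  AMP_TRAIN:
    ENABLED: false  # disable mixed precision if the GPU does not support AMP
```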

Keep me updated

icarl-ad commented 1 month ago

I'm using an NVIDIA RTX 3500 Ada with 8 GB of GDDR6. My CPU is an Intel i7-13850HX with 20 cores and 28 logical processors, and I have 64 GB of RAM.

For NUM_WORKERS I chose 28, and for IMS_PER_BATCH 32.

I set up a WSL environment for laypa on my Windows PC. I'm training through a conda environment.

I noticed that the GPU is only being used during the actual training, not during the preprocessing of the images. During training the GPU is at 100% utilization.

stefanklut commented 3 weeks ago

The code for the preprocessing is not made to run on the GPU, so it should not use it there. The fact that the GPU is at 100 percent during training is good to hear. I would suggest first trying a smaller batch size and number of workers, let's say 4 images and 4 workers, just to see if it runs in a more reasonable time. A batch of 32 is more suited to a server than a laptop, and as you are finetuning it shouldn't need as much compute anyway. The ETA you got was after training had started, right? Not with preprocessing as well?
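Put as config values, the smaller test run suggested above would look roughly like this (same assumed key layout; the numbers are just a starting point):

```yaml
# Smaller footprint for a single laptop GPU.
DATALOADER:
  NUM_WORKERS: 4    # 4 dataloading worker processes instead of 28
SOLVER:
  IMS_PER_BATCH: 4  # 4 images per batch instead of 32
```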

icarl-ad commented 3 weeks ago

I think I got the ETA after the preprocessing, although I'm not totally sure.

I was able to run a training with the same configuration (except that for max iterations I chose 100) and a base model in about 4 hours. I tested the model and was able to see an improvement.