stefanklut / laypa

Layout analysis to find layout elements in documents (similar to P2PaLA)
MIT License

Training too slow #28

Closed fattynoparents closed 6 months ago

fattynoparents commented 6 months ago

When trying to train a laypa model I'm not getting anywhere because the training is very slow. Here's part of the output:

[02/21 12:42:34 detectron2.engine.train_loop]: Starting training from iteration 0
[02/21 12:52:57 detectron2.utils.events]:  eta: 87 days, 12:43:12  iter: 19  total_loss: 3.551    time: 31.8952  last_time: 27.2403  data_time: 0.8271  last_data_time: 1.4133   lr: 3.9962e-06  max_mem: 8727M
[02/21 13:03:34 detectron2.utils.events]:  eta: 87 days, 12:33:07  iter: 39  total_loss: 1.935    time: 31.8659  last_time: 25.1255  data_time: 0.6448  last_data_time: 1.4127   lr: 7.9922e-06  max_mem: 8727M
^C[02/21 13:04:34 detectron2.engine.hooks]: Overall training speed: 39 iterations in 0:21:10 (32.5881 s / it)
[02/21 13:04:34 detectron2.engine.hooks]: Total training time: 0:21:11 (0:00:00 on hooks)

Is there any way to speed up the process, or is it because my machine's specs are too weak? Here's the environment info:

[02/21 12:09:07 laypa]: Environment info:
-------------------------------  -----------------------------------------------------------------------------
sys.platform                     linux
Python                           3.12.1 | packaged by conda-forge | (main, Dec 23 2023, 08:03:24) [GCC 12.3.0]
numpy                            1.26.3
detectron2                       0.6 @/opt/conda/envs/laypa/lib/python3.12/site-packages/detectron2
Compiler                         GCC 9.4
CUDA compiler                    not available
DETECTRON2_ENV_MODULE            <not set>
PyTorch                          2.2.0 @/opt/conda/envs/laypa/lib/python3.12/site-packages/torch
PyTorch debug build              False
torch._C._GLIBCXX_USE_CXX11_ABI  False
GPU available                    Yes
GPU 0                            NVIDIA GeForce GTX 1650 Ti (arch=7.5)
Driver version                   551.23
CUDA_HOME                        /opt/conda/envs/laypa
Pillow                           10.2.0
torchvision                      0.17.0 @/opt/conda/envs/laypa/lib/python3.12/site-packages/torchvision
torchvision arch flags           5.0, 6.0, 7.0, 7.5, 8.0, 8.6, 9.0
fvcore                           0.1.5.post20221221
iopath                           0.1.9
cv2                              4.9.0
-------------------------------  -----------------------------------------------------------------------------

Thanks in advance for any suggestions.

stefanklut commented 6 months ago

Hi there,

Could you please specify what config you are currently using? And are you trying to train a model from scratch or do you want to finetune an existing model?

fattynoparents commented 6 months ago

Thanks for the quick reply. I'm trying to finetune an existing model suggested for use by the loghi project; here's the link to the config: https://surfdrive.surf.nl/files/index.php/s/YA8HJuukIUKznSP?path=%2Flaypa%2Fgeneral%2Fbaseline#editor

The only things I changed were setting DATALOADER.NUM_WORKERS to 1 and SOLVER.IMS_PER_BATCH to 4.

Here's the code I run:

docker run $DOCKERGPUPARAMS --rm -it -u $(id -u ${USER}):$(id -g ${USER}) -m 32000m --shm-size 10240m -v $LAYPADIR:$LAYPADIR -v $TRAINDIR:$TRAINDIR $DOCKERLAYPA \
        python main.py \
        -c $LAYPAMODEL \
        -t $TRAINDIR \
        -v $TRAINDIR \
        --opts SOLVER.IMS_PER_BATCH 4

stefanklut commented 6 months ago

I think the main issue is that you are retraining the model from scratch, since you aren't loading the weights back in for the training. This is done using the MODEL.WEIGHTS config option. You can then lower the learning rate (SOLVER.BASE_LR) and the number of iterations you use (SOLVER.MAX_ITER). The config you are currently using is the one that was used for the full training, which took a couple of days on a more powerful machine.

Also make sure $DOCKERGPUPARAMS is set correctly so you are actually using the GPU. And I don't recommend having only one worker for the dataloader, as this prevents parallel processing of the images.
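
For example, an adjusted run could look something like the sketch below. The weight path, learning rate, and iteration count are placeholder values you will want to tune for your own data, and --gpus all is used here only as an example of what $DOCKERGPUPARAMS might contain:

docker run --gpus all --rm -it -u $(id -u ${USER}):$(id -g ${USER}) -m 32000m --shm-size 10240m \
        -v $LAYPADIR:$LAYPADIR -v $TRAINDIR:$TRAINDIR $DOCKERLAYPA \
        python main.py \
        -c $LAYPAMODEL \
        -t $TRAINDIR \
        -v $TRAINDIR \
        --opts SOLVER.IMS_PER_BATCH 4 \
               MODEL.WEIGHTS /path/to/existing/weights.pth \
               SOLVER.BASE_LR 0.0002 \
               SOLVER.MAX_ITER 5000 \
               DATALOADER.NUM_WORKERS 4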

fattynoparents commented 6 months ago

I see, thanks a lot for your explanations and suggestions! I will try to implement them and see if it helps.

fattynoparents commented 6 months ago

I tried to load the weights, so now in MODEL.WEIGHTS I have /home/user/laypa/general/baseline/model_best_mIoU.pth instead of detectron2://ImageNetPretrained/MSRA/R-50.pkl, but now I get the following error:

    assert os.path.isfile(path), "Checkpoint {} not found!".format(path)
AssertionError: Checkpoint /home/user/laypa/general/baseline/model_best_mIoU.pth not found!

The file is there, I double-checked. I am totally new to this subject, so I might really be doing something stupid in trying to train the model.

stefanklut commented 6 months ago

Because you are trying to run the training in a Docker environment, you also need to mount the folder correctly. This may be your problem, but I don't know for sure. Is -v $LAYPADIR:$LAYPADIR also where the weights are? Otherwise you may want to add something like -v <Location of weights>:<Location of weights>. And is your user actually called user, or is that not the real path? If this doesn't work, perhaps you could try running the training outside of the Docker environment?
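
For illustration, if the weights live under /home/user/laypa/general/baseline on the host, a (hypothetical) $WEIGHTSDIR variable could be mounted and passed along like this, so that the path inside the container matches the path on the host:

WEIGHTSDIR=/home/user/laypa/general/baseline

docker run $DOCKERGPUPARAMS --rm -it -u $(id -u ${USER}):$(id -g ${USER}) -m 32000m --shm-size 10240m \
        -v $LAYPADIR:$LAYPADIR -v $WEIGHTSDIR:$WEIGHTSDIR -v $TRAINDIR:$TRAINDIR $DOCKERLAYPA \
        python main.py \
        -c $LAYPAMODEL \
        -t $TRAINDIR \
        -v $TRAINDIR \
        --opts SOLVER.IMS_PER_BATCH 4 MODEL.WEIGHTS $WEIGHTSDIR/model_best_mIoU.pth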

fattynoparents commented 6 months ago

Jeez, that was a really stupid error on my side: I forgot to mount the directory. Thanks so much for your help.

fattynoparents commented 6 months ago

So I reduced the SOLVER.BASE_LR and SOLVER.MAX_ITER values and managed to run a training round in about 27 minutes. I don't know yet how the results will look, but since the training round ran successfully I will close the issue as completed. Thanks again for your help.

stefanklut commented 6 months ago

Glad to hear you got it working. 27 minutes seems rather quick, but I think it's best if you play around with it a bit. You can of course train longer if the results seem promising but could be better.