fattynoparents closed 9 months ago
Hi there,
Could you please specify what config you are currently using? And are you trying to train a model from scratch or do you want to finetune an existing model?
Thanks for quick reply. I'm trying to finetune an existing model suggested for use by the loghi project, here's the link to the config https://surfdrive.surf.nl/files/index.php/s/YA8HJuukIUKznSP?path=%2Flaypa%2Fgeneral%2Fbaseline#editor
The only things I changed were setting DATALOADER.NUM_WORKERS to 1 and SOLVER.IMS_PER_BATCH to 4.
Here's the code I run:
docker run $DOCKERGPUPARAMS --rm -it -u $(id -u ${USER}):$(id -g ${USER}) -m 32000m --shm-size 10240m -v $LAYPADIR:$LAYPADIR -v $TRAINDIR:$TRAINDIR $DOCKERLAYPA \
python main.py \
-c $LAYPAMODEL \
-t $TRAINDIR \
-v $TRAINDIR \
--opts SOLVER.IMS_PER_BATCH 4
I think the main issue is that you are retraining the model from scratch, since you aren't loading the weights back in for the training. This is done using the MODEL.WEIGHTS config option. Then you can lower the learning rate (SOLVER.BASE_LR) and the number of iterations (SOLVER.MAX_ITER). The config you are currently using is the one that was used for the full training, which took a couple of days on a more powerful machine.
Also make sure $DOCKERGPUPARAMS is set correctly so you are actually using the GPU. And I don't recommend having only one worker for the dataloader, since this prevents the images from being processed in parallel.
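One quick way to check the GPU part, sketched under two assumptions: that "--gpus all" is a suitable value for $DOCKERGPUPARAMS on your machine, and that PyTorch can be imported with the plain python inside the $DOCKERLAYPA image. The script only prints the check command; when you run that command yourself, it should print True if the container can see a GPU.

```shell
#!/bin/sh
# Assumption: "--gpus all" is one plausible value; use whatever your setup needs.
DOCKERGPUPARAMS="--gpus all"

# One-liner that asks PyTorch whether CUDA is usable inside the container.
CHECK='import torch; print(torch.cuda.is_available())'

# Build and print the check command instead of running it.
GPUCHECK="docker run $DOCKERGPUPARAMS --rm $DOCKERLAYPA python -c \"$CHECK\""
echo "$GPUCHECK"
```

If the check prints False, the training is running on the CPU, which would explain a very slow run on its own.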
I see, thanks a lot for your explanations and suggestions! I will try to implement them and see if it helps.
I tried to load the weights, so now in my MODEL.WEIGHTS I have /home/user/laypa/general/baseline/model_best_mIoU.pth instead of detectron2://ImageNetPretrained/MSRA/R-50.pkl, but now I get the following error:
assert os.path.isfile(path), "Checkpoint {} not found!".format(path)
AssertionError: Checkpoint /home/user/laypa/general/baseline/model_best_mIoU.pth not found!
The file is there, I double-checked. I am totally new to this subject, so I might really be doing something stupid in trying to train the model.
Because you are trying to run the training in a Docker environment, you also need to mount the folder correctly. This may be your problem, but I don't know for sure. Is -v $LAYPADIR:$LAYPADIR also where the weights are? Otherwise you may want to add -v <Location of weights>:<Location of weights>. And is your user really just called user, or is this not the actual path? If this doesn't work, perhaps you could try to run the training outside of the Docker environment?
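A minimal sketch of the mount fix, assuming the weights path from the error message is the real location on the host. WEIGHTSDIR and MOUNT are hypothetical helper names introduced only for this example. The script checks that the file exists on the host, derives the extra -v option, and prints a command that lists the file from inside the container.

```shell
#!/bin/sh
# Path taken from the error message above; replace it with your actual path.
WEIGHTS="/home/user/laypa/general/baseline/model_best_mIoU.pth"

# First confirm the file exists on the host at all.
[ -f "$WEIGHTS" ] || echo "warning: $WEIGHTS not found on the host" >&2

# Mount the directory that holds the weights at the same path inside the container.
WEIGHTSDIR=$(dirname "$WEIGHTS")
MOUNT="-v $WEIGHTSDIR:$WEIGHTSDIR"
echo "$MOUNT"

# Print a command that checks the file is visible from inside the container.
echo "docker run --rm $MOUNT $DOCKERLAYPA ls -l $WEIGHTS"
```

Mounting the directory at the same path on both sides means the MODEL.WEIGHTS value does not have to change between the host and the container.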
Jeez, that was a really stupid mistake on my side: I forgot to mount the directory. Thanks so much for your help.
So I lowered the SOLVER.BASE_LR and SOLVER.MAX_ITER values and managed to run a training round in about 27 minutes. I don't know yet how the results will look, but since the training round ran successfully I will close the issue as completed. Thanks again for your help.
Glad to hear you got it working. 27 minutes seems rather quick, but I think it's best if you play around with it a bit. You can of course train longer if the results seem promising but could be better.
When trying to train a laypa model I'm not getting anywhere because the training is very slow. Here's part of the output:
Is there any way to speed up the process, or is my machine simply not powerful enough? Here's the environment info:
Thanks in advance for any suggestions.