How to continue train?

ygjwd12345 commented 3 years ago

When I use a script like

CUDA_VISIBLE_DEVICES=0 python3 -u trainUDA_gta.py --config ./configs/configUDA_gta2city.json --name UDA-gta --resume /saved/DeepLabv2-depth-gtamono-cityscapestereo/05-03_02-13-UDA-gta/checkpoint-iter95000.pth | tee ./gta-corda.log

It would run again but the new checkpoint would be saved.

qinenergy commented 3 years ago

Hi. The training skeleton is taken directly from DACS, and we didn't test the resume function. We trained the model uninterrupted for 250000 iterations.
For your specific use case, maybe this can help:

Change "--resume /saved/DeepLabv2-depth-gtamono-cityscapestereo/05-03_02-13-UDA-gta/checkpoint-iter95000.pth" to "--resume ../saved/DeepLabv2-depth-gtamono-cityscapestereo/05-03_02-13-UDA-gta/checkpoint-iter95000.pth", as the default save folder is one level up. The new checkpoints should show up in ../saved/DeepLabv2-depth-gtamono-cityscapestereo/05-03_02-13-UDA-gta-resume/

We didn't test this, and it may be easier to train from scratch for 250000 iterations to reproduce the results. Please let me know if you have further questions.

ygjwd12345 commented 3 years ago

I found that the error is caused by this block:

    if args.resume:
        checkpoint_dir = os.path.join(*args.resume.split('/')[:-1]) + '_resume-' + start_writeable
    else:
        checkpoint_dir = os.path.join(config['utils']['checkpoint_dir'], start_writeable + '-' + args.name)

I removed these lines:

    if args.resume:
        checkpoint_dir = os.path.join(*args.resume.split('/')[:-1]) + '_resume-' + start_writeable
    else:

so that checkpoint_dir is always built from the config. The problem is solved.
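For anyone hitting the same thing, here is a minimal sketch of why the quoted expression may misbehave with an absolute --resume path. It is not the repository's code: the function names and the 'TIMESTAMP' value (standing in for start_writeable) are hypothetical. Splitting an absolute path on '/' yields an empty first element, and os.path.join then drops the leading slash, so the checkpoint directory silently becomes relative to the current working directory. os.path.dirname is shown as one possible alternative that keeps the path as given.

```python
import os

def resume_dir_split(resume_path, suffix):
    # Same expression as in the snippet quoted above (hypothetical wrapper).
    # For '/a/b/ckpt.pth', split('/') gives ['', 'a', 'b', 'ckpt.pth'];
    # joining ['', 'a', 'b'] loses the leading '/'.
    return os.path.join(*resume_path.split('/')[:-1]) + '_resume-' + suffix

def resume_dir_dirname(resume_path, suffix):
    # Possible alternative (an assumption, not the authors' fix):
    # dirname keeps absolute paths absolute and relative paths relative.
    return os.path.dirname(resume_path) + '_resume-' + suffix

absolute = '/saved/DeepLabv2-depth-gtamono-cityscapestereo/05-03_02-13-UDA-gta/checkpoint-iter95000.pth'
relative = '../saved/DeepLabv2-depth-gtamono-cityscapestereo/05-03_02-13-UDA-gta/checkpoint-iter95000.pth'

# 'TIMESTAMP' is a placeholder for start_writeable.
print(resume_dir_split(absolute, 'TIMESTAMP'))    # leading '/' is gone
print(resume_dir_dirname(absolute, 'TIMESTAMP'))  # leading '/' is kept
print(resume_dir_split(relative, 'TIMESTAMP'))    # relative input is unaffected
```

This would also explain why the relative "../saved/..." path suggested earlier in the thread avoids the problem: with no leading slash there is no empty first element to lose.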

qinenergy/corda, issue #3