Closed qijindao closed 3 years ago
Hi,
You can use pods_train --num-gpus 8 instead of running train_net.py directly.
BTW, could you provide more details about how you installed YOLOF and how you train with it?
Thank you for your reply! I appreciate it. I found pods_train, but it is not a .py file, so I don't know how to use it. My environment is torch 1.6 and Python 3.8. When I tried to run train_net.py, I kept installing modules one after another as the error messages asked for them. I also ran into a problem with cvpods; I just ran 'python setup.py develop' as the instructions say.
Sorry, I did not express myself clearly. What I meant to ask is: do you mean I need not worry about train_net.py even though it produces errors, and all I need to do is run 'pods_train --num-gpus 1'?
pods_train is a shell script; you can run it directly with pods_train --num-gpus 8 in the experiment directory (e.g., YOLOF/playground/detection/coco/yolof/yolof.res50.C5.1x).
BTW, you can find the pods_train file in YOLOF/tools/.
@qijindao Hi, you can try:
cd YOLOF/playground/detection/coco/yolof/yolof.res50.C5.1x
python YOLOF/tools/train_net.py --num-gpus 8
or
cd YOLOF/playground/detection/coco/yolof/yolof.res50.C5.1x
pods_train --num-gpus 8
Thank you for your reply.
Thank you for your reply. Have you trained the code successfully? I may have some questions.
I ran out of GPU memory. In my experience the fix is to change the batch size, but I could not find any batch-size code in this folder. Maybe I overlooked it.
Could you provide more details about how you train with YOLOF?
Following the directory layout, I put the COCO 2017 dataset in the datasets folder. I then started training with: cd YOLOF/playground/detection/coco/yolof/yolof.res50.C5.1x, followed by python YOLOF/tools/train_net.py --num-gpus 1
YOLOF_res50_C5 needs 5.2 to 5.3 GB of GPU memory to train. If your GPU has less memory than that, you should reduce IMS_PER_DEVICE in the config.py file.
OK, thank you very much. My computer has only one GPU. After I changed devices to 1 in the config, the program ran. But after running for a while, a new error appeared: AssertionError: Box regression deltas become infinite or NaN!
@qijindao Can you provide your log file? It is at YOLOF/playground/detection/coco/yolof/yolof.res50.C5.1x/log/log.txt.
BTW, I think it is because you modified the batch size but did not modify the learning rate or the warmup iterations.
Hi, cvpods can automatically adjust the learning rate and iterations if you use a different number of GPUs. However, the default setting is 8 images per GPU. If you use 1 image per GPU, you need to decrease the base learning rate by a factor of 8 and increase the number of iterations (as well as the warmup iterations) by a factor of 8. You should also replace BatchNorm with GroupNorm.
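The adjustment described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the helper name rescale_schedule and the starting values (base_lr, max_iter, warmup_iters) are placeholders chosen for the example, not values read from the YOLOF config or part of the cvpods API.

```python
def rescale_schedule(base_lr, max_iter, warmup_iters,
                     old_ims_per_device, new_ims_per_device):
    """Scale the learning rate down and the iteration counts up by the same factor."""
    if old_ims_per_device <= 0 or new_ims_per_device <= 0:
        raise ValueError("images per device must be positive")
    # Going from 8 images per GPU to 1 gives a factor of 8.
    factor = old_ims_per_device / new_ims_per_device
    return {
        "base_lr": base_lr / factor,
        "max_iter": round(max_iter * factor),
        "warmup_iters": round(warmup_iters * factor),
    }


# Placeholder starting values (assumptions, not taken from the YOLOF config).
scaled = rescale_schedule(base_lr=0.08, max_iter=10000, warmup_iters=500,
                          old_ims_per_device=8, new_ims_per_device=1)
print(scaled)
```

The resulting numbers would then be entered by hand into the experiment's config.py; the sketch does not touch any config file itself.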
OK, thank you for your detailed reply. I roughly understand your instructions, but I am still unsure about a few things. First, in the command 'pods_train --num-gpus 8', is the '8' the ID of a GPU in the computer, or the number of GPUs? Second, does IMS_PER_DEVICE=8 mean 8 images per GPU? Third, do the values of IMS_PER_BATCH and IMS_PER_DEVICE have to be proportional? After many experiments, it seems to run only when the ratio equals 8, and I don't know why.
@qijindao
- In 'pods_train --num-gpus 8', "8" means that a total of 8 GPUs are used.
- Yes, IMS_PER_DEVICE=8 means 8 images per GPU.
- IMS_PER_BATCH = IMS_PER_DEVICE * num-gpus
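As a quick worked check of the relation in the last bullet, here is a small Python sketch. The function names are made up for illustration and are not part of cvpods; the numbers are just the 8-GPU and 1-GPU cases discussed in this thread.

```python
def total_batch_size(ims_per_device, num_gpus):
    """IMS_PER_BATCH implied by the per-GPU image count and the GPU count."""
    return ims_per_device * num_gpus


def is_consistent(ims_per_batch, ims_per_device, num_gpus):
    """True when the three settings satisfy the relation stated above."""
    return ims_per_batch == total_batch_size(ims_per_device, num_gpus)


print(total_batch_size(8, 8))    # 8 GPUs with 8 images each
print(total_batch_size(8, 1))    # a single GPU with 8 images
print(is_consistent(64, 8, 1))   # settings that do not match a single GPU
```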
Thank you very much! I get it!
Hi! Sorry to bother you with more questions. When I try to run train_net.py, I cannot get past 'from config import config'. It fails with the error no module named 'config'. I tried 'pip install config', but there are still errors. I have searched for solutions, but nothing works. Can you help me?