Ema1997 closed this issue 4 years ago.
Hi,
Thanks for your kind question!
Could you share your environment details with us? We haven't encountered such problems using this code base.
We would really appreciate your reply.
Best, Hao.
Package                 Version
future                  0.18.2
numpy                   1.17.0
opencv-python           4.0.1.24
Pillow                  6.1.0
ptflops                 0.6.2
tensorboard             2.3.0
tensorboard-plugin-wit  1.7.0
tensorboardX            1.2
thop                    0.0.31.post2005241907
timm                    0.1.20
torch                   1.2.0
torchvision             0.2.1
yacs                    0.1.8
Thank you very much.
Same problem here.
Hi,
We have carefully checked the source code and the environment; this bug comes from torch.distributed. We previously thought that apex was not required. However, due to the implementation of torch DDP, we could not train the supernet with the SPOS mechanism without it.
Thus, to solve this bug, it is necessary to use the apex package. You should install apex before supernet training. We will add the installation steps to README.md.
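For reference, here is a minimal sketch of how supernet training is typically wired up with apex instead of plain torch DDP. The model, optimizer, and the "O1" opt level below are placeholder assumptions for illustration, not the exact settings of this repository:

```python
import torch
import torch.nn as nn
from apex import amp
from apex.parallel import DistributedDataParallel as DDP

# Hypothetical stand-ins for the supernet and its optimizer.
model = nn.Linear(8, 8).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Assumes the script is launched via torch.distributed.launch,
# which sets the rank/world-size environment variables.
torch.distributed.init_process_group(backend="nccl")
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")
# apex's DDP tolerates parameters that receive no gradient in an iteration,
# which happens when only a sampled sub-path of the supernet is trained.
model = DDP(model, delay_allreduce=True)
```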
Thanks. Hao.
@Ema1997 @macn3388 Would you check whether the issue has been solved? Thanks.
The same error occurs when using apex.
Adding "for name, param in model.named_parameters(recurse=True): param.grad = None" at the beginning of update_student_weights_only solves my problem. It is caused by optimizer.step(), which changes the parameters of meta network.
Adding "for name, param in model.named_parameters(recurse=True): param.grad = None" at the beginning of update_student_weights_only solves my problem. It is caused by optimizer.step(), which changes the parameters of meta network.
In our experience, if the installation strictly follows the README, this issue should not occur.
Adding "for name, param in model.named_parameters(recurse=True): param.grad = None" at the beginning of update_student_weights_only solves my problem. It is caused by optimizer.step(), which changes the parameters of meta network.
Hi,
Could you share your environment details with us?
We have tested the code; when using apex (installed following the README), this error should not occur.
Best, Hao.
Adding "for name, param in model.named_parameters(recurse=True): param.grad = None" at the beginning of update_student_weights_only solves my problem. It is caused by optimizer.step(), which changes the parameters of meta network.
Hello, I want to ask where you added the code? I run into the same problem after installing apex using pip.
Adding "for name, param in model.named_parameters(recurse=True): param.grad = None" at the beginning of update_student_weights_only solves my problem. It is caused by optimizer.step(), which changes the parameters of meta network.
hello i want to ask where you add the code? i ocuur the same problem after i have installed apex using pip.
Hi,
You should install apex with the cpp and cuda extensions, as indicated in this URL:
python ./apex/setup.py install --cpp_ext --cuda_ext
Or you could add the above code as SPOS did: set the grad to None in each training iteration.
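As an illustration of that SPOS-style reset, a minimal sketch of a training iteration; the train_loader, model, optimizer, and criterion names are assumed for the example:

```python
for step, (inputs, targets) in enumerate(train_loader):
    # SPOS-style reset: set all gradients to None at the start of the
    # iteration so optimizer.step() only applies gradients computed here.
    for param in model.parameters():
        param.grad = None

    loss = criterion(model(inputs.cuda()), targets.cuda())
    loss.backward()
    optimizer.step()
```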
Best, Hao.
I encountered a runtime error when I tried to search for an architecture based on your code.
I tried to locate the source of the error and found that the error above appears whenever the code updates the meta network or adds the kd_loss to the final loss. How can I fix this problem?