ruocwang / darts-pt

[ICLR2021 Outstanding Paper] Rethinking Architecture Selection in Differentiable NAS
Apache License 2.0

About the generation of structures #4

Open Liu-pf opened 2 years ago

Liu-pf commented 2 years ago

Hi Ruochen, while I was studying your code, some errors came up before a new epoch started training. [image: screenshot of the error message] Please give me some tips about this, thank you~

thuangb commented 2 years ago

First, check that your installed libraries match the requirements.txt file, then make sure you load the model on cuda, not on cpu.

ruocwang commented 2 years ago

Thanks thuangb for the comment. Yes, it looks like a mismatch of tensor locations. Some part of the model seems to be on CPU (either loaded parameter, original model parameter, or input), which causes the problem. I just downloaded and reran a fresh version but did not seem to encounter such an issue. I would suggest following thuangb's reply to make sure that all parts of the model (including inputs) are on cuda.
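
For anyone hitting the same device-mismatch error, a minimal sanity check along these lines (not from the repository, purely illustrative) can confirm that the network parameters and the input batch are all on the GPU:

```python
import torch

def check_devices(model, sample_input):
    # List every distinct device the model's parameters live on;
    # a 'cpu' entry here is what triggers the mismatch error above.
    param_devices = {p.device for p in model.parameters()}
    print("parameter devices:", param_devices)
    print("input device:     ", sample_input.device)

# Hypothetical usage before the training loop:
#   model = model.cuda()                      # move all parameters and buffers
#   images, _ = next(iter(train_queue))
#   check_devices(model, images.cuda())       # inputs must be moved per batch
```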

Liu-pf commented 2 years ago


Thanks for your suggestions, but the problem does not seem to be that simple; I have already checked the points you raised. I tested the author's code on CIFAR-10 and everything worked fine, but when I switched to a different dataset, these errors appeared.

ruocwang commented 2 years ago

Can you provide detailed steps to help us reproduce the issue you encountered?

Liu-pf commented 2 years ago


Sure, thank you, Wang. I am trying to test the performance of your proposed search algorithm on a specific task. This problem occurred when I loaded the dataset I wanted to use (the model and other configurations were not modified). After the first epoch has finished running, an error is reported before the second epoch starts training, at this point in the code:

```python
genotype = model.genotype()

logging.info('param size = %f', num_params)
logging.info('genotype = %s', genotype)
model.printing(logging)
train_acc, train_obj = train(train_queue, valid_queue, model, architect, model.optimizer, lr, epoch,
                             perturb_alpha, epsilon_alpha)
```

ruocwang commented 2 years ago

From your reply, it seems that you have changed the task. In this case, I might only be able to provide limited help since I could not run the adapted codebase. Not sure if you have already tried this, but from the error message you provided and the extra details on where the error occurs, it looks like the model weight was changed from GPU to CPU between epochs for some reason, whereas it should be on GPU. I would suggest doing a sanity check by tracing why the model weight's location changed between these epochs, and perhaps deleting any code that uses model weights in between two epochs (e.g. model saving).
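
In case it helps others debugging the same thing, here is a rough sketch of such a sanity check, plus a checkpoint routine that does not move the live model off the GPU (both purely illustrative, the helper names are made up):

```python
import torch

def assert_on_gpu(model, tag=""):
    # Fail fast if any parameter has silently migrated back to the CPU.
    on_cpu = [name for name, p in model.named_parameters() if not p.is_cuda]
    assert not on_cpu, f"{tag}: parameters found on CPU: {on_cpu[:5]}"

def save_checkpoint(model, path):
    # Copy the state dict to CPU for saving instead of calling model.cpu(),
    # which would also relocate the model that the next epoch still trains on.
    cpu_state = {k: v.detach().cpu() for k, v in model.state_dict().items()}
    torch.save(cpu_state, path)

# e.g. call assert_on_gpu(model, tag=f"epoch {epoch}") right before train(...)
```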

Liu-pf commented 2 years ago

As you said, this was indeed the problem. I just fixed it and the issue is now solved. Thank you very much for your work on NAS and your timely help. However, there is one more problem; I am not sure whether you have noticed it or whether it is caused on my side. When I set batch size = 32 with images of size 128×64, a RuntimeError occurs: `RuntimeError: CUDA out of memory. Tried to allocate 64.00 MiB (GPU 0; 31.75 GiB total capacity; 25.10 GiB already allocated; 66.44 MiB free; 25.12 GiB reserved in total by PyTorch)`. I am not sure whether it is caused by the way I load the data; I will keep looking into the cause.

ruocwang commented 2 years ago

You are welcome. The input dimensions of your task are much larger than those of CIFAR-10/CIFAR-100, which we used, so the memory consumption may exceed your GPU's limit. It could help to optimize memory usage here and there.
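
As a side note, PyTorch exposes simple counters that make it easy to see how close training is to the limit; a short illustrative helper:

```python
import torch

def log_gpu_memory(tag=""):
    # Memory currently allocated to tensors, the peak so far, and what the
    # caching allocator has reserved from the driver (all converted to MiB).
    allocated = torch.cuda.memory_allocated() / 2**20
    peak      = torch.cuda.max_memory_allocated() / 2**20
    reserved  = torch.cuda.memory_reserved() / 2**20
    print(f"[{tag}] allocated={allocated:.0f} MiB  "
          f"peak={peak:.0f} MiB  reserved={reserved:.0f} MiB")

# Calling log_gpu_memory(f"epoch {epoch}") once per epoch shows whether usage
# grows over time (tensors being kept alive) or is simply too high from the
# very first batch (the larger 128x64 inputs need a smaller batch size).
```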

thuangb commented 2 years ago

This is an out-of-memory problem. You can try a smaller batch size, a more efficient data-loading method, or a better GPU.
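
For reference, the most direct fix is a smaller batch size; another generic PyTorch option (not something the darts-pt code does out of the box) is mixed-precision training, which roughly halves activation memory. A hedged sketch of a single training step:

```python
import torch
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()

def train_step(model, criterion, optimizer, images, targets):
    optimizer.zero_grad()
    # Run the forward pass in float16 where it is safe to do so; the smaller
    # activations are often enough to fit the larger inputs on one GPU.
    with autocast():
        logits = model(images.cuda(non_blocking=True))
        loss = criterion(logits, targets.cuda(non_blocking=True))
    scaler.scale(loss).backward()  # scale the loss to avoid float16 underflow
    scaler.step(optimizer)
    scaler.update()
    return loss.item()
```

Whether the reduced precision affects the quality of the architecture search would need to be verified separately.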

Liu-pf commented 2 years ago

You are right; I am trying to use multithreading to solve the problem :)
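
For reference, PyTorch usually parallelizes data loading with DataLoader worker processes rather than explicit threads; a minimal illustrative sketch (the dataset below is a dummy stand-in for the custom task's data):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy stand-in: 512 random images of size 3x128x64 with 10 fake classes.
dataset = TensorDataset(torch.randn(512, 3, 128, 64),
                        torch.randint(0, 10, (512,)))

train_queue = DataLoader(
    dataset,
    batch_size=32,
    shuffle=True,
    num_workers=4,     # background worker processes load and preprocess batches
    pin_memory=True,   # faster host-to-GPU copies
)
```

Note that extra workers only speed up loading; they do not reduce GPU memory, so the batch-size and precision knobs above are still what matter for the out-of-memory error.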