Closed: ByungKwanLee closed this issue 2 years ago
I updated ffcv to the latest version and passed a gpu_id argument to torch.load, i.e. torch.load(checkpoint_path, map_location=gpu_id).
That resolved it!
Hi @ByungKwanLee, I am facing the same issue of unbalanced memory allocation. Which torch.load call are you referring to here? I can't find it anywhere in the authors' implementation.
First, download FFCV version 1.0.0 or 0.4.0 and copy all files from the ffcv folder of the download into: /home/$username/anaconda3/envs/$env_name(ex:ffcv)/lib/$python_version/site-packages/ffcv
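A minimal sketch of that copy step in Python, assuming the release archive was unpacked to ./ffcv-1.0.0 and that the conda environment and Python version below match your setup (both paths are assumptions, not from the original instructions):

```python
import shutil
from pathlib import Path

# Assumed locations -- adjust to your own download and conda environment.
src = Path("./ffcv-1.0.0/ffcv")  # ffcv folder inside the unpacked release (assumption)
dst = Path.home() / "anaconda3/envs/ffcv/lib/python3.9/site-packages/ffcv"  # installed package (assumption)

# Overwrite the installed ffcv package with the downloaded files.
shutil.copytree(src, dst, dirs_exist_ok=True)
```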
Second, in my code I need to load pre-trained weights, so I use torch.load. If I do not specify which GPU the checkpoint parameters should be mapped to, i.e. I just call torch.load(checkpoint_path), then the checkpoint parameters end up allocated on another GPU. But once I use torch.load(checkpoint_path, map_location=gpu_id), the problem is solved.
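For illustration, a minimal sketch of loading a checkpoint onto the GPU owned by the current process; the local_rank variable and checkpoint path are placeholders, not taken from the original code:

```python
import torch

# Placeholders -- in a real run these come from your training script / launcher.
local_rank = 0
checkpoint_path = "checkpoint.pth"

device = torch.device(f"cuda:{local_rank}")
torch.cuda.set_device(device)

# Without map_location, torch.load restores tensors onto the device they were
# saved from (often cuda:0), which allocates memory there in every process.
checkpoint = torch.load(checkpoint_path, map_location=device)
```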
Thanks for the prompt response. I'll try this, although I don't have a pretrained model. I'll just update to v1.0.0 and see if it works.
I have a problem with unexpected GPU memory allocation.
If I run code for training CIFAR-10 based on FFCV with 5 GPUs (ids 0,1,2,3,4), four unexpected memory allocations appear on GPU 0.
Likewise, if I run the code with 4 GPUs (ids 0,1,2,3), three unexpected memory allocations appear on GPU 0.
After debugging to investigate the cause, I found that the unexpected GPU allocation happens in the following line.
Therefore, I guess the problem comes from my modification of the data loader code, shown below.
However, I cannot figure out what is wrong.
I hope it is a trivial problem.
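Not part of the original report, but a minimal sketch of how the per-GPU imbalance described above can be confirmed from inside each training process:

```python
import torch

# Print how much memory this process has allocated on each visible GPU.
# Nonzero entries for cuda:0 in processes that should only use another GPU
# are the symptom described above.
for i in range(torch.cuda.device_count()):
    allocated_mb = torch.cuda.memory_allocated(i) / 1024**2
    print(f"cuda:{i}: {allocated_mb:.1f} MiB allocated by this process")
```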