Closed darkzyf closed 4 years ago
Excuse me. When I was training, I found that I would need to spend 960 hours to finish one epoch. My GPU is a single GTX 1080 Ti, but in the paper the training time is 72 hours for 50 epochs. Did I miss something?
Thanks for raising this issue.
First of all, the original paper was implemented in TensorFlow, whose implementation is faster than PyTorch's. Secondly, if you are using our ResNet configuration, you will notice that we use a larger image and a larger backbone than the original paper.
However, I still believe there is something wrong with 960 hours per epoch. Our experiments run on 8 GTX 1080 Ti GPUs, and ~100 epochs take about 3-4 days.
I see. Thank you very much. I'll check the code to see where it goes wrong.
Hello. When I train the ResNet model on 8 GTX 1080 Ti GPUs, the program gets stuck right after reporting that the checkpoint was not loaded. I set batchsize: 8 and num_workers: 8. I checked the GPU usage: the first GPU uses 6071MiB/11178MiB, the remaining 7 GPUs use 771MiB/11178MiB each, and GPU utilization is 0 on all of them. The CPU has only one thread at work.
Hi. Can you locate on which line the program is stuck? Try setting batchsize to 1 and num_workers to 0 to debug.
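For context, num_workers=0 keeps all data loading in the main process, which makes hangs much easier to localize. A generic PyTorch sketch of the idea (dummy data, nothing from this repo):

import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy stand-in data; num_workers=0 loads in the main process,
# which rules out deadlocks in worker processes.
dataset = TensorDataset(torch.randn(4, 3, 224, 224))
loader = DataLoader(dataset, batch_size=1, num_workers=0)
batch = next(iter(loader))  # if even this hangs, data loading is the problem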
OK, I'll try. Thank you very much!
It is stuck at line 96 of trainer.py: out = self.model(images). And maybe something went wrong in this line: self.model = torch.nn.DataParallel(self.model, device_ids=self.gpus).cuda()
This is likely a driver/CUDA/PyTorch issue. Maybe this will help. Try to write a minimal example of DataParallel to see whether it works.
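Such a minimal example might look like this (a toy model with made-up sizes; none of this comes from the repo):

import torch
import torch.nn as nn

# Tiny stand-in model wrapped in DataParallel across all visible GPUs.
model = nn.DataParallel(nn.Linear(128, 10)).cuda()
x = torch.randn(8, 128).cuda()  # one dummy batch, split across the GPUs
out = model(x)  # if this also hangs, suspect the driver/CUDA stack, not the repo
print(out.shape)  # expect torch.Size([8, 10])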
I tested an official PyTorch example and it worked well. And I tried the solution you provided, but it didn't work for me. I'm getting a little upset...
According to your statement, you believe there is something wrong with DataParallel. Did you test without DataParallel and it works fine?
Try the following things:
- Remove , device_ids=self.gpus.
- Remove .cuda().
- Remove DataParallel, only do .cuda().
Hack into the model code, and try the following:
- Return input immediately without calculating (see the sketch below).
- Remove some submodules to see whether it works without them.
If all of the above doesn't work, I don't think it's a problem with DataParallel; you can dig deeper into the forward method of the model to see on which line it gets stuck without DataParallel.
And keep calm. The best way to solve a problem is to rule out issues one by one, with patience.
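The "return input immediately" hack could look roughly like this (the class name and forward signature are placeholders; adapt them to the actual model code):

import torch.nn as nn

class Model(nn.Module):  # placeholder for the real model class
    def forward(self, images):
        # Short-circuit: return the input without any computation.
        # If training still hangs, the model's forward is not the cause.
        return images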
Thank you so much! I'll try.
Thank you very much! I found that when I turn off the ACS options, it works well. By the way, I got the solution from this. Thank you again!
Hi, do you know how to download the dataset from the link below and unzip it? https://drive.google.com/open?id=131dH36qXCabym1JjSmEpSQZg4dmZVQid
Hi, do you know how to use their provided YAML files and the pretrained weights downloaded from here to do the evaluation? To my understanding, I should use the downloaded meta folder and then the provided pretrained weights. The command I use is
python entrypoint_eval.py --name xxx --options experiments/default/tensorflow.yml --checkpoint datasets/data/pretrained/vgg16-p2m.pth
But it seems that it only generated some meaningless meshes, and the score is really low. I am not sure what I did wrong.