noahcao / Pixel2Mesh

A complete Pixel2Mesh implementation in PyTorch

About the training time #9

Closed darkzyf closed 4 years ago

darkzyf commented 5 years ago

Excuse me. When I was training, I found that it would take about 960 hours to finish one epoch. My GPU is a single GTX 1080 Ti, but the paper reports a training time of 72 hours for 50 epochs. Did I miss something?

ultmaster commented 5 years ago

Thanks for raising this issue.

First of all, the original paper was implemented in TensorFlow, whose implementation is faster than PyTorch's. Secondly, if you are using our ResNet configuration, you will notice that we use a larger image size and a larger backbone than the original paper.

However, I still believe something is wrong if one epoch takes 960 hours. Our experiments run on 8 GTX 1080 Ti GPUs, and ~100 epochs take about 3-4 days.

darkzyf commented 5 years ago

I see. Thank you very much. I'll check the code to see where it goes wrong.

darkzyf commented 5 years ago

Hello, when I train the ResNet configuration on 8 GTX 1080 Ti GPUs, the program gets stuck after "checkpoint not loaded". I set batchsize: 8 and num_workers: 8. I checked the GPU usage: the first GPU uses 6071MiB/11178MiB, the remaining 7 GPUs use 771MiB/11178MiB each, and GPU utilization is 0 on all of them. Only one CPU thread is working.

ultmaster commented 5 years ago

Hi. Can you locate which line the program is stuck on? Try setting batchsize to 1 and num_workers to 0 to debug.
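As a rough illustration of that debugging setup, a minimal, self-contained sketch could look like this (not code from this repository; the dummy tensors and the Conv2d stand-in are placeholders for the real ShapeNet data and Pixel2Mesh model):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy data standing in for the real ShapeNet images; this only checks that
# the loading/forward loop itself does not hang.
dummy_images = torch.randn(8, 3, 224, 224)
dataset = TensorDataset(dummy_images)

# num_workers=0 keeps data loading in the main process, so a hang cannot come
# from worker subprocesses; batch_size=1 keeps GPU memory pressure minimal.
loader = DataLoader(dataset, batch_size=1, num_workers=0, shuffle=False)

model = torch.nn.Conv2d(3, 16, 3).cuda()  # stand-in for the Pixel2Mesh model

for (batch,) in loader:
    out = model(batch.cuda())  # if this line hangs, the problem is on the GPU side
    print(out.shape)
    break
```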

darkzyf commented 5 years ago

OK, I'll try. Thank you very much!

darkzyf commented 5 years ago

It gets stuck at line 96 of trainer.py: out = self.model(images). Maybe something goes wrong in this line: self.model = torch.nn.DataParallel(self.model, device_ids=self.gpus).cuda()

ultmaster commented 5 years ago

This is likely a driver/CUDA/PyTorch issue. Maybe this will help. Try writing a minimal example with DataParallel to see whether it works.
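For example, a minimal DataParallel check could look like the sketch below (layer sizes are arbitrary and not taken from the repository):

```python
import torch
import torch.nn as nn

# A tiny model, just to exercise DataParallel's scatter/gather machinery.
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 8))

# Wrap across all visible GPUs, mirroring what trainer.py does.
model = nn.DataParallel(model).cuda()

x = torch.randn(16, 32).cuda()
y = model(x)  # if this hangs, the problem is in multi-GPU communication,
              # not in Pixel2Mesh itself
print(y.shape)  # expected: torch.Size([16, 8])
```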

darkzyf commented 5 years ago

I tested an official PyTorch example and it works well. I also tried the solution you provided, but it didn't work for me. I'm getting a little upset...

ultmaster commented 5 years ago

According to your statement, you believe there is something wrong with DataParallel. Did you test without DataParallel, and does it work fine?

Try the following:

  • Remove , device_ids=self.gpus.
  • Remove .cuda().
  • Remove DataParallel and only do .cuda().

Hack into the model code, and try the following (a rough sketch of these checks is shown after this comment):

  • Return the input immediately without calculating anything.
  • Remove some submodules to see whether it works without them.

If none of the above works, I don't think the problem is with DataParallel; you can dig deeper into the forward method of the model to see on which line it gets stuck without DataParallel.

And keep calm. The best way to solve a problem is to rule out issues one by one, with patience.
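The "return the input immediately" check might look roughly like the wrapper below. This is a sketch, not code from the repository; the Conv2d stand-in and the encoder submodule mentioned in the comments are assumptions for illustration.

```python
import torch
import torch.nn as nn

class DebugForward(nn.Module):
    """Wrapper whose forward returns the input untouched, to separate a hang in
    DataParallel's scatter/gather from a hang inside the real forward pass."""
    def __init__(self, wrapped):
        super().__init__()
        self.wrapped = wrapped  # real model kept so its parameters still live on the GPUs

    def forward(self, images):
        # Step 1: return the input immediately without calculating anything.
        return images
        # Step 2 (once step 1 passes): re-enable parts of the real model one by
        # one, e.g. `return self.wrapped.encoder(images)` if such a submodule
        # exists, to find which piece hangs.

# Self-contained usage with a dummy stand-in for the Pixel2Mesh model:
real_model = nn.Conv2d(3, 16, 3)
model = nn.DataParallel(DebugForward(real_model)).cuda()
out = model(torch.randn(4, 3, 64, 64).cuda())
print(out.shape)  # torch.Size([4, 3, 64, 64]) -- inputs passed straight through
```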

darkzyf commented 5 years ago

Thank you so much! I'll try.

darkzyf commented 5 years ago

Thank you very much! I found that when I turn off the ACS option, it works well. By the way, I got the solution from this. Thank you again!
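For anyone hitting the same symptom: PCIe ACS (Access Control Services) can interfere with GPU peer-to-peer transfers, which matches multi-GPU DataParallel hanging while single-GPU runs are fine. A quick way to probe peer-to-peer behaviour from PyTorch is sketched below (just a diagnostic sketch; the tensor sizes are arbitrary):

```python
import torch

# Probe GPU peer-to-peer (P2P) access, which DataParallel relies on when
# moving parameters and outputs between GPUs.
n = torch.cuda.device_count()
for i in range(n):
    for j in range(n):
        if i != j:
            ok = torch.cuda.can_device_access_peer(i, j)
            print(f"GPU {i} -> GPU {j}: can_device_access_peer = {ok}")

# A small cross-GPU copy; with ACS interfering, actual transfers like this are
# where hangs tend to show up, even if the capability query above looks fine.
if n >= 2:
    x = torch.randn(1024, 1024, device="cuda:0")
    y = x.to("cuda:1")
    torch.cuda.synchronize()
    print("cross-GPU copy finished:", y.device)
```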

zshyang commented 4 years ago

Hi, do you know how to download and unzip the dataset from the link below? https://drive.google.com/open?id=131dH36qXCabym1JjSmEpSQZg4dmZVQid

zshyang commented 4 years ago

Hi, do you know how to use the provided YAML files and pretrained weights downloaded from here to run the evaluation? To my understanding, I should use the downloaded meta folder together with the provided pretrained weights. The command I use is

python entrypoint_eval.py --name xxx --options experiments/default/tensorflow.yml --checkpoint datasets/data/pretrained/vgg16-p2m.pth 

But it seems to generate only meaningless meshes, and the score is really low. I am not sure what I did wrong.

[Screenshot from 2020-09-22 02-08-41]