Closed yijingshihenxiule closed 1 year ago
I am not sure why this is the case. If many people face this issue, I can try to look into it. Try to see what happens when using just one GPU (`CUDA_VISIBLE_DEVICES=1 python train.py ...`).
Thank you for your answer. Maybe I found the reason. In https://pytorch.org/tutorials/intermediate/ddp_tutorial.html#save-and-load-checkpoints, it says:

> If map_location is missing, torch.load will first load the module to CPU and then copy each parameter to where it was saved, which would result in all processes on the same machine using the same set of devices.

But I tried loading the checkpoint on CPU (as vits1 does) and loading it on GPU, and neither worked for me. Need help!
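For reference, a minimal sketch of the `map_location` remapping that the tutorial describes (the helper name here is hypothetical, not from the repository; pass its result to `torch.load`):

```python
def rank_map_location(rank):
    # Remap tensors that were saved from cuda:0 onto this rank's GPU,
    # so each DDP process restores the checkpoint onto its own device
    # instead of every rank allocating memory on GPU 0.
    # Usage: torch.load(checkpoint_path, map_location=rank_map_location(rank))
    return {"cuda:0": f"cuda:{rank}"}
```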
Maybe the case has nothing to do with checkpoint load. I am confused about it.
I'll keep this issue open. We can fix it later if it is causing any performance issues.
Thank you for the reply. I don't know whether it causes any performance issues, but it causes OOM even when I set num_workers to 0. I have not solved it so far.
I solved it.
By the way, I noticed that in `MonoTransformerFlowLayer()` in model.py:

```python
x0, x1 = torch.split(x, [self.half_channels] * 2, 1)
x0_ = x0 * x_mask
x0_ = self.pre_transformer(x0, x_mask)  # vits2
```

the input of `self.pre_transformer` is `x0`, not `x0_`. Could you please clarify? Thank you.
It is just a typo. The mask is being used inside the transformers either way. Fixed in the latest patch. Thanks for the catch. What was your solution to the many-process problem?
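For anyone reading along, a minimal runnable sketch of the corrected masking order (`PreTransformerStub` is a hypothetical stand-in for the repository's `pre_transformer`, which is a small transformer in model.py):

```python
import torch
import torch.nn as nn

class PreTransformerStub(nn.Module):
    # Hypothetical stand-in for self.pre_transformer; it just
    # applies the mask so the sketch runs end to end.
    def forward(self, x, x_mask):
        return x * x_mask

class MonoFlowSketch(nn.Module):
    # Minimal sketch of the split/mask order discussed above.
    def __init__(self, channels):
        super().__init__()
        self.half_channels = channels // 2
        self.pre_transformer = PreTransformerStub()

    def forward(self, x, x_mask):
        x0, x1 = torch.split(x, [self.half_channels] * 2, 1)
        x0_ = x0 * x_mask
        x0_ = self.pre_transformer(x0_, x_mask)  # corrected: pass x0_, not x0
        return x0_, x1
```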
It was just my fault. I used the wrong monotonic alignment module. It is fixed now. You can close this issue.
Hello, thank you for your awesome work. When training with two GPUs over the last two days, I find too many processes on GPU 0. How can I deal with it?
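On the too-many-processes-on-GPU-0 symptom: a common cause in DDP is every rank creating a CUDA context on the default device. A torch-free sketch of reading the per-process rank (the helper name is illustrative; in real training code you would follow it with `torch.cuda.set_device(rank)`):

```python
import os

def local_device(env=None):
    # torchrun sets LOCAL_RANK per process; pin each process to its
    # own GPU with torch.cuda.set_device(rank) and load checkpoints
    # with map_location=f"cuda:{rank}" so nothing lands on GPU 0.
    env = os.environ if env is None else env
    rank = int(env.get("LOCAL_RANK", "0"))
    return rank, f"cuda:{rank}"
```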