Hi @zhangjiewu ,

Thanks for your refreshing work! I am trying to reproduce your example by running

```bash
accelerate launch train_tuneavideo.py --config="configs/man-skiing.yaml"
```

Training looks fine, but when it reaches validation it raises `RuntimeError: Input type (c10::Half) and bias type (float) should be the same`. Is there any chance that your training and inference forward passes use different dtypes? If so, could you point me toward where to start debugging? Thanks!

Full traceback:

```
Traceback (most recent call last):                | 0/50 [00:00<?, ?it/s]
  File "/home/jeffliang/Tune-A-Video/train_tuneavideo.py", line 369, in <module>
    main(**OmegaConf.load(args.config))
  File "/home/jeffliang/Tune-A-Video/train_tuneavideo.py", line 328, in main
    ddim_inv_latent = ddim_inversion(
  File "/home/jeffliang/anaconda3/envs/txt2video/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/jeffliang/Tune-A-Video/tuneavideo/util.py", line 83, in ddim_inversion
    ddim_latents = ddim_loop(pipeline, ddim_scheduler, video_latent, num_inv_steps, prompt)
  File "/home/jeffliang/anaconda3/envs/txt2video/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/jeffliang/Tune-A-Video/tuneavideo/util.py", line 75, in ddim_loop
    noise_pred = get_noise_pred_single(latent, t, cond_embeddings, pipeline.unet)
  File "/home/jeffliang/Tune-A-Video/tuneavideo/util.py", line 63, in get_noise_pred_single
    noise_pred = unet(latents, t, encoder_hidden_states=context)["sample"]
  File "/home/jeffliang/anaconda3/envs/txt2video/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/jeffliang/Tune-A-Video/tuneavideo/models/unet.py", line 358, in forward
    sample = self.conv_in(sample)
  File "/home/jeffliang/anaconda3/envs/txt2video/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/jeffliang/Tune-A-Video/tuneavideo/models/resnet.py", line 15, in forward
    x = super().forward(x)
  File "/home/jeffliang/anaconda3/envs/txt2video/lib/python3.9/site-packages/torch/nn/modules/conv.py", line 463, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/home/jeffliang/anaconda3/envs/txt2video/lib/python3.9/site-packages/torch/nn/modules/conv.py", line 459, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: Input type (c10::Half) and bias type (float) should be the same
Steps:   2%|▊         | 10/500 [00:18<14:56,  1.83s/it, lr=3e-5, step_loss=0.0766]
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 69332) of binary: /home/jeffliang/anaconda3/envs/txt2video/bin/python
```
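For context while debugging: the traceback suggests the latents fed to `pipeline.unet` during DDIM inversion are fp16 (`c10::Half`) while the UNet's conv weights are still fp32. Below is a minimal sketch of the workaround I'm experimenting with, mirroring `get_noise_pred_single` in `tuneavideo/util.py`; the explicit `.to(...)` casts are my own assumption, not necessarily the intended fix.

```python
import torch

def get_noise_pred_single(latents, t, context, unet):
    # Cast inputs to the UNet's parameter dtype so F.conv2d sees matching
    # input/weight/bias dtypes. Per the traceback, the UNet is fp32 here
    # while the latents arrive as fp16.
    unet_dtype = next(unet.parameters()).dtype
    latents = latents.to(dtype=unet_dtype)
    context = context.to(dtype=unet_dtype)  # keep text embeddings consistent too
    noise_pred = unet(latents, t, encoder_hidden_states=context)["sample"]
    return noise_pred
```

With this cast the inversion loop runs for me, but I'm not sure whether casting up to fp32 (rather than moving the whole pipeline to fp16 for validation) is what you intended, so any pointers would be appreciated.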