Closed · tansangxtt closed this issue 1 year ago
Hi, I managed to run both phases of training without any problems, but evaluation does not work. Please check the following log. Thank you.
```
(DiffusionRet) hai@user:~/sang$ CUDA_VISIBLE_DEVICES=0 python -m torch.distributed.launch --master_port 2502 --nnodes=1 --nproc_per_node=1 eval.py --workers 8 --batch_size_val 128 --anno_path data/MSR-VTT/anns --video_path data/MSR-VTT/MSRVTT_Videos --datatype msrvtt --max_words 32 --max_frames 12 --video_framerate 1 --diffusion_steps 50 --noise_schedule cosine --init_model best.pth --output_dir output_eval
/home/hai/anaconda3/envs/DiffusionRet/lib/python3.9/site-packages/torch/distributed/launch.py:181: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use-env is set by default in torchrun.
If your script expects `--local-rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
  warnings.warn(
[2023-10-03 11:06:03,359 tvr 110 INFO]: local_rank: 0 world_size: 1
[2023-10-03 11:06:03,359 tvr 117 INFO]: Effective parameters:
[2023-10-03 11:06:03,359 tvr 119 INFO]: <<< agg_module: seqTransf
[2023-10-03 11:06:03,359 tvr 119 INFO]: <<< anno_path: data/MSR-VTT/anns
[2023-10-03 11:06:03,359 tvr 119 INFO]: <<< base_encoder: ViT-B/32
[2023-10-03 11:06:03,359 tvr 119 INFO]: <<< batch_size: 128
[2023-10-03 11:06:03,359 tvr 119 INFO]: <<< batch_size_val: 128
[2023-10-03 11:06:03,359 tvr 119 INFO]: <<< d_temp: 100
[2023-10-03 11:06:03,359 tvr 119 INFO]: <<< datatype: msrvtt
[2023-10-03 11:06:03,359 tvr 119 INFO]: <<< device: cuda:0
[2023-10-03 11:06:03,359 tvr 119 INFO]: <<< diffusion_steps: 50
[2023-10-03 11:06:03,359 tvr 119 INFO]: <<< distributed: 0
[2023-10-03 11:06:03,359 tvr 119 INFO]: <<< epochs: 5
[2023-10-03 11:06:03,359 tvr 119 INFO]: <<< init_model: best.pth
[2023-10-03 11:06:03,359 tvr 119 INFO]: <<< interaction: wti
[2023-10-03 11:06:03,359 tvr 119 INFO]: <<< local_rank: 0
[2023-10-03 11:06:03,359 tvr 119 INFO]: <<< max_frames: 12
[2023-10-03 11:06:03,359 tvr 119 INFO]: <<< max_words: 32
[2023-10-03 11:06:03,359 tvr 119 INFO]: <<< neg: 0
[2023-10-03 11:06:03,359 tvr 119 INFO]: <<< noise_schedule: cosine
[2023-10-03 11:06:03,359 tvr 119 INFO]: <<< num: 127
[2023-10-03 11:06:03,359 tvr 119 INFO]: <<< num_hidden_layers: 4
[2023-10-03 11:06:03,359 tvr 119 INFO]: <<< output_dir: output_eval
[2023-10-03 11:06:03,359 tvr 119 INFO]: <<< seed: 42
[2023-10-03 11:06:03,359 tvr 119 INFO]: <<< sigma_small: True
[2023-10-03 11:06:03,359 tvr 119 INFO]: <<< t2v_alpha: 1
[2023-10-03 11:06:03,359 tvr 119 INFO]: <<< t2v_num: 32
[2023-10-03 11:06:03,359 tvr 119 INFO]: <<< t2v_temp: 1
[2023-10-03 11:06:03,359 tvr 119 INFO]: <<< temp: 1
[2023-10-03 11:06:03,359 tvr 119 INFO]: <<< v2t_alpha: 1
[2023-10-03 11:06:03,360 tvr 119 INFO]: <<< v2t_num: 32
[2023-10-03 11:06:03,360 tvr 119 INFO]: <<< v2t_temp: 1
[2023-10-03 11:06:03,360 tvr 119 INFO]: <<< video_framerate: 1
[2023-10-03 11:06:03,360 tvr 119 INFO]: <<< video_path: data/MSR-VTT/MSRVTT_Videos
[2023-10-03 11:06:03,360 tvr 119 INFO]: <<< workers: 8
[2023-10-03 11:06:03,360 tvr 119 INFO]: <<< world_size: 1
[val] Unique sentence is 995 , all num is 1000
Video number: 1000
Total Pairs: 1000
[2023-10-03 11:06:10,770 tvr 159 INFO]: ***** Running test *****
[2023-10-03 11:06:10,770 tvr 160 INFO]: Num examples = 1000
[2023-10-03 11:06:10,770 tvr 161 INFO]: Batch size = 128
[2023-10-03 11:06:10,770 tvr 162 INFO]: Num steps = 8
[2023-10-03 11:06:10,770 tvr 163 INFO]: ***** Running val *****
[2023-10-03 11:06:10,770 tvr 164 INFO]: Num examples = 1000
[2023-10-03 11:06:10,773 tvr 375 INFO]: [start] extract text+video feature
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [01:10<00:00,  8.86s/it]
[2023-10-03 11:07:21,813 tvr 403 INFO]: [finish] extract text+video feature
[2023-10-03 11:07:21,813 tvr 407 INFO]: 1000 1000 1000 1000
[2023-10-03 11:07:21,813 tvr 411 INFO]: [start] calculate the similarity
[2023-10-03 11:07:21,813 tvr 205 INFO]: [finish] map to main gpu
[2023-10-03 11:07:21,814 tvr 214 INFO]: [finish] map to main gpu
[2023-10-03 11:07:22,397 tvr 227 INFO]: diffusion
Traceback (most recent call last):
  File "/home/hai/sang/eval.py", line 493, in <module>
    main()
  File "/home/hai/sang/eval.py", line 490, in main
    eval_epoch(args, model, test_dataloader, args.device, diffusion)
  File "/home/hai/sang/eval.py", line 413, in eval_epoch
    new_t2v_matrix, new_v2t_matrix = _run_on_single_gpu(args, model, batch_mask_t,
  File "/home/hai/sang/eval.py", line 255, in _run_on_single_gpu
    model.diffusion_model,
  File "/home/hai/anaconda3/envs/DiffusionRet/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1614, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'DiffusionRet' object has no attribute 'diffusion_model'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 259728) of binary: /home/hai/anaconda3/envs/DiffusionRet/bin/python
Traceback (most recent call last):
  File "/home/hai/anaconda3/envs/DiffusionRet/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/hai/anaconda3/envs/DiffusionRet/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/hai/anaconda3/envs/DiffusionRet/lib/python3.9/site-packages/torch/distributed/launch.py", line 196, in <module>
    main()
  File "/home/hai/anaconda3/envs/DiffusionRet/lib/python3.9/site-packages/torch/distributed/launch.py", line 192, in main
    launch(args)
  File "/home/hai/anaconda3/envs/DiffusionRet/lib/python3.9/site-packages/torch/distributed/launch.py", line 177, in launch
    run(args)
  File "/home/hai/anaconda3/envs/DiffusionRet/lib/python3.9/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/hai/anaconda3/envs/DiffusionRet/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/hai/anaconda3/envs/DiffusionRet/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
```
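The traceback points at the attribute lookup `model.diffusion_model` inside `_run_on_single_gpu` (eval.py, line 255). Until the script is updated, a defensive lookup like the sketch below can at least surface a clearer message and cope with a DataParallel/DistributedDataParallel wrapper whose submodules live under `.module`. This is only a stop-gap sketch, not the repository's actual fix; the helper name and the candidate attribute names are assumptions.

```python
# Minimal sketch, not the repository's actual fix: resolve the diffusion module
# defensively so a DDP/DP wrapper (attributes under .module) or a renamed field
# raises a clear error instead of the bare AttributeError above.
def get_diffusion_module(model):
    core = getattr(model, "module", model)  # unwrap DDP/DataParallel if present
    for name in ("diffusion_model", "diffusion"):  # candidate names are assumptions
        if hasattr(core, name):
            return getattr(core, name)
    raise AttributeError(
        f"{type(core).__name__} exposes none of the expected diffusion attributes; "
        "make sure eval.py matches the model definition used for training."
    )
```

Calling `get_diffusion_module(model)` in place of `model.diffusion_model` would then either return the module or fail with an actionable message if the attribute is genuinely missing from the loaded model.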
I'm sorry for the late reply. I have fixed this bug; just update the eval.py file and run it again.
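After pulling the updated eval.py, it can also help to confirm that the checkpoint actually contains the diffusion weights, independently of the evaluation code path. Here is a minimal sketch; the checkpoint path and the possible nesting key are assumptions about how best.pth was saved.

```python
# Minimal sketch (checkpoint path and nesting key are assumptions): list the
# diffusion-related tensors stored in best.pth to verify the weights exist.
import torch

state = torch.load("best.pth", map_location="cpu")
if isinstance(state, dict) and "state_dict" in state:
    state = state["state_dict"]  # some checkpoints nest weights under this key

diffusion_keys = [k for k in state if "diffusion" in k]
print(f"found {len(diffusion_keys)} diffusion-related tensors")
print(diffusion_keys[:10])
```

If no diffusion-related keys show up, the checkpoint itself is missing the second-phase weights and re-running the generation-phase training would be needed before evaluation.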