Closed lokeaichirou closed 3 years ago
@lokeaichirou Sorry for the confusion about the INFO message "Weight doesn't exsits". It is harmless and can be ignored. The files visual_pytorch_model.bin, cross_pytorch_model.bin, and decoder_pytorch_model.bin are not shipped with the repository; all pretrained weights are contained in univl.pretrained.bin. The missing files do not affect execution or results.
As for eval_epoch() not actually executing before the program finishes: are there any errors? Can you provide the full log of your run?
Hi, no errors are reported during evaluation. For the action arguments, I used parser.set_defaults(do_pretrain=False, do_train=False, do_eval=True) so that only evaluation runs, based on the pre-trained weights. I attach my log.txt below.
Hi, this is my log: log.txt
I also tried setting the --stage_two argument to True. The program then enters the 'for batch in test_dataloader' loop, but it fails at this step with 'RuntimeError: DataLoader worker (pid(s) 1643) exited unexpectedly'.
RuntimeError                              Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py in _try_get_data(self, timeout)
    985         try:
--> 986             data = self._data_queue.get(timeout=timeout)
    987             return (True, data)

11 frames
/usr/lib/python3.7/multiprocessing/queues.py in get(self, block, timeout)
    103         timeout = deadline - time.monotonic()
--> 104         if not self._poll(timeout):
    105             raise Empty

/usr/lib/python3.7/multiprocessing/connection.py in poll(self, timeout)
    256         self._check_readable()
--> 257         return self._poll(timeout)
    258

/usr/lib/python3.7/multiprocessing/connection.py in _poll(self, timeout)
    413     def _poll(self, timeout):
--> 414         r = wait([self], timeout)
    415         return bool(r)

/usr/lib/python3.7/multiprocessing/connection.py in wait(object_list, timeout)
    920             while True:
--> 921                 ready = selector.select(timeout)
    922                 if ready:

/usr/lib/python3.7/selectors.py in select(self, timeout)
    414         try:
--> 415             fd_event_list = self._selector.poll(timeout)
    416         except InterruptedError:

/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/signal_handling.py in handler(signum, frame)
     65         # Python can still get and update the process status successfully.
---> 66         _error_if_any_worker_fails()
     67         if previous_handler is not None:

RuntimeError: DataLoader worker (pid 1643) is killed by signal: Killed.

The above exception was the direct cause of the following exception:

RuntimeError                              Traceback (most recent call last)
@lokeaichirou It seems that something is wrong with multiprocessing in the DataLoader. Can you test the command below? The difference is that --do_train --num_thread_reader=4 is replaced with --do_eval --num_thread_reader=0. --num_thread_reader sets the number of data-loading subprocesses. Besides, --stage_two should be set when captioning.
python -m torch.distributed.launch --nproc_per_node=4 main_task_caption.py \
  --do_eval --num_thread_reader=0 --epochs=5 --batch_size=16 --n_display=100 \
  --train_csv ${TRAIN_CSV} --val_csv ${VAL_CSV} \
  --data_path ${DATA_PATH} --features_path ${FEATURES_PATH} \
  --output_dir ${OUTPUT_ROOT}/ckpt_youcook_caption \
  --bert_model bert-base-uncased --do_lower_case --lr 3e-5 \
  --max_words 128 --max_frames 96 --batch_size_val 64 \
  --visual_num_hidden_layers 6 --decoder_num_hidden_layers 3 \
  --stage_two --init_model ${INIT_MODEL}
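For readers unfamiliar with how these flags fit together, here is a minimal sketch. It is not the repository's actual parser: build_parser and describe_loading are hypothetical helpers, and the default values are assumptions. Only the flag names come from the command above, and the DataLoader snippet quoted later in this thread passes args.num_thread_reader as num_workers. In PyTorch, num_workers=0 means batches are loaded in the main process, so no worker subprocess exists that could be killed.

```python
import argparse


def build_parser():
    # Hypothetical, trimmed-down parser. Flag names are taken from the command
    # in this thread; the defaults below are assumptions for illustration only.
    parser = argparse.ArgumentParser(description="sketch of the evaluation flags")
    parser.add_argument("--do_pretrain", action="store_true")
    parser.add_argument("--do_train", action="store_true")
    parser.add_argument("--do_eval", action="store_true")
    parser.add_argument("--stage_two", action="store_true")
    parser.add_argument("--num_thread_reader", type=int, default=1)
    parser.add_argument("--batch_size_val", type=int, default=64)
    return parser


def describe_loading(args):
    # The value ends up as DataLoader(num_workers=...); 0 disables worker subprocesses.
    if args.num_thread_reader == 0:
        return "batches are loaded in the main process (no worker subprocesses)"
    return f"batches are loaded by {args.num_thread_reader} worker subprocess(es)"


# Parse the three flags that matter for the evaluation-only run discussed here.
args = build_parser().parse_args(["--do_eval", "--stage_two", "--num_thread_reader=0"])
print(describe_loading(args))
```

Passing the flags on the command line, as in the command above, avoids editing parser.set_defaults in the script itself.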
@lokeaichirou It does indeed take up a lot of GPU memory. You can reduce the batch size, e.g., --batch_size_val 64 -> --batch_size_val 8, or reduce the sequence lengths --max_words 128 --max_frames 96, to find a trade-off.
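To get a feel for the trade-off, here is a rough back-of-the-envelope sketch. positions_per_batch is a hypothetical helper, and counting input positions is only a crude proxy for memory, which also depends on the model's hidden sizes and on attention, whose cost typically grows faster than linearly with sequence length.

```python
def positions_per_batch(batch_size, max_words, max_frames):
    """Text-token positions plus video-frame positions fed to the model in one batch."""
    return batch_size * (max_words + max_frames)


# Values from the command in this thread, before and after the suggested change.
baseline = positions_per_batch(batch_size=64, max_words=128, max_frames=96)
reduced = positions_per_batch(batch_size=8, max_words=128, max_frames=96)

print(baseline, reduced, baseline / reduced)
```

Under this proxy, going from --batch_size_val 64 to --batch_size_val 8 cuts the per-batch input by a factor of 8 while leaving --max_words and --max_frames untouched.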
Hi @ArrowLuo, I followed your suggestion and set num_thread_reader=0. Evaluation with the provided pre-trained weights works now! Many thanks. I will try training later; can it also start from the pre-trained weights? I will follow up here if I run into any issues in the training stage. Thanks again!
Yes, I reduced them and it finally works. Thanks!
Hi @ArrowLuo, may I ask another basic question? youcookii_data.no_transcript.pickle, youcookii_val.csv, and youcookii_videos_features.pickle are all fed into the test dataloader. In the paper's captioning evaluation table, the input can be V only, T only, or V+T. Which input form is used with the default arguments and the following dataloader_youcook setup?
youcook_testset = Youcook_Caption_DataLoader(
    csv=args.val_csv,
    data_path=args.data_path,
    features_path=args.features_path,
    max_words=args.max_words,
    feature_framerate=args.feature_framerate,
    tokenizer=tokenizer,
    max_frames=args.max_frames,
)
test_sampler = SequentialSampler(youcook_testset)
dataloader_youcook = DataLoader(
    youcook_testset,
    sampler=test_sampler,
    batch_size=args.batch_size_val,
    num_workers=args.num_thread_reader,
    pin_memory=False,
)
Because youcookii_data.no_transcript.pickle has no transcript (it is replaced by 'none'), the input form is V only. We select the input type, V or T, by masking T or V, respectively. So V only, T only, and V+T all share the same DataLoader.
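As an illustration of selecting the input type by masking, here is a minimal sketch. build_modality_masks is a hypothetical helper, not code from the repository, and the 1/0 mask convention is an assumption; the actual implementation may differ (for example, it may rely on the 'none' placeholder transcript mentioned above).

```python
def build_modality_masks(mode, num_text_tokens, num_video_frames):
    """Return (text_mask, video_mask); 1 = position is visible, 0 = masked out.

    Hypothetical helper: the mode names follow the V / T / V+T notation
    used in this thread.
    """
    if mode not in {"V", "T", "V+T"}:
        raise ValueError(f"unknown input mode: {mode!r}")
    text_visible = 1 if mode in {"T", "V+T"} else 0
    video_visible = 1 if mode in {"V", "V+T"} else 0
    text_mask = [text_visible] * num_text_tokens
    video_mask = [video_visible] * num_video_frames
    return text_mask, video_mask


# The same batch layout serves all three settings; only the masks change.
for mode in ("V", "T", "V+T"):
    text_mask, video_mask = build_modality_masks(mode, num_text_tokens=4, num_video_frames=3)
    print(mode, text_mask, video_mask)
```

The point of the sketch is that one dataset and one DataLoader are enough: the modality choice is expressed entirely through which positions are masked.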
ok, I see. Many thanks!
Hello, could you tell me how you downloaded the original videos of the youcookii dataset? Thanks!
When I run evaluation based on the pre-trained weights, I run into the following issues:
eval_epoch() does not actually execute, and the program finishes.
visual_pytorch_model.bin, cross_pytorch_model.bin, and decoder_pytorch_model.bin are missing from the visual-base, cross-base, and decoder-base folders on the GitHub page.