jayleicn / ClipBERT

[CVPR 2021 Best Student Paper Honorable Mention, Oral] Official PyTorch code for ClipBERT, an efficient framework for end-to-end learning on image-text and video-text tasks.
https://arxiv.org/abs/2102.06183
MIT License

Disk full when fine-tuning Image Question Answering #7

Closed · junyi-tiger closed this issue 3 years ago

junyi-tiger commented 3 years ago

Thank you for your work! I encountered a problem when running VQA fine-tuning with:

```bash
horovodrun -np 1 python src/tasks/run_vqa.py \
    --config src/configs/vqa_base_resnet50.json \
    --output_dir $OUTPUT_DIR
```

The output message is as follows:

```
root@a2d64a8b9de3:/clipbert# horovodrun -np 1 python src/tasks/run_vqa.py --config src/configs/vqa_base_resnet50.json --output_dir ./output
[1,0]<stderr>:04/18/2021 11:07:07 - INFO - __main__ - device: cuda:0 n_gpu: 1, rank: 0, 16-bits training: True
[1,0]<stderr>:04/18/2021 11:07:07 - INFO - __main__ - Setup model...
[1,0]<stderr>:04/18/2021 11:07:07 - INFO - __main__ - setup e2e model
[1,0]<stdout>:cnn_cls <class 'src.modeling.grid_feat.GridFeatBackbone'>
[1,0]<stderr>:04/18/2021 11:07:10 - INFO - __main__ - Loading e2e weights from /pretrain/clipbert_image_text_pretrained.pt
[1,0]<stderr>:04/18/2021 11:07:34 - INFO - __main__ - You can ignore the keys with `num_batches_tracked` or from task heads
[1,0]:04/18/2021 11:07:34 - INFO - __main__ - Keys in loaded but not in model:
[1,0]:04/18/2021 11:07:34 - INFO - __main__ - In total 9, ['transformer.cls.predictions.bias', 'transformer.cls.predictions.decoder.bias', 'transformer.cls.predictions.decoder.weight', 'transformer.cls.predictions.transform.LayerNorm.bias', 'transformer.cls.predictions.transform.LayerNorm.weight', 'transformer.cls.predictions.transform.dense.bias', 'transformer.cls.predictions.transform.dense.weight', 'transformer.cls.seq_relationship.bias', 'transformer.cls.seq_relationship.weight']
[1,0]:04/18/2021 11:07:34 - INFO - __main__ - Keys in model but not in loaded:
[1,0]:04/18/2021 11:07:34 - INFO - __main__ - In total 4, ['transformer.classifier.0.bias', 'transformer.classifier.0.weight', 'transformer.classifier.2.bias', 'transformer.classifier.2.weight']
[1,0]:04/18/2021 11:07:34 - INFO - __main__ - Keys in model and loaded, but shape mismatched:
[1,0]:04/18/2021 11:07:34 - INFO - __main__ - In total 0, []
[1,0]:04/18/2021 11:07:37 - INFO - __main__ - Setup model done!
[1,0]:Selected optimization level O2: FP16 training with FP32 batchnorm and FP32 master weights.
[1,0]:
[1,0]:Defaults for this optimization level are:
[1,0]:enabled               : True
[1,0]:opt_level             : O2
[1,0]:cast_model_type       : torch.float16
[1,0]:patch_torch_functions : False
[1,0]:keep_batchnorm_fp32   : True
[1,0]:master_weights        : True
[1,0]:loss_scale            : dynamic
[1,0]:Processing user overrides (additional kwargs that are not None)...
[1,0]:After processing overrides, optimization options are:
[1,0]:enabled               : True
[1,0]:opt_level             : O2
[1,0]:cast_model_type       : torch.float16
[1,0]:patch_torch_functions : False
[1,0]:keep_batchnorm_fp32   : True
[1,0]:master_weights        : True
[1,0]:loss_scale            : dynamic
[1,0]:/pytorch/torch/csrc/utils/python_arg_parser.cpp:756: UserWarning: This overload of add is deprecated:
[1,0]:    add(Number alpha, Tensor other)
[1,0]:Consider using one of the following signatures instead:
[1,0]:    add(Tensor other, *, Number alpha)
[1,0]:04/18/2021 11:07:41 - INFO - transformers.tokenization_utils - Model name '/pretrain/bert-base-uncased/' not found in model shortcut name list (bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, bert-base-multilingual-cased, bert-base-chinese, bert-base-german-cased, bert-large-uncased-whole-word-masking, bert-large-cased-whole-word-masking, bert-large-uncased-whole-word-masking-finetuned-squad, bert-large-cased-whole-word-masking-finetuned-squad, bert-base-cased-finetuned-mrpc, bert-base-german-dbmdz-cased, bert-base-german-dbmdz-uncased, TurkuNLP/bert-base-finnish-cased-v1, TurkuNLP/bert-base-finnish-uncased-v1, wietsedv/bert-base-dutch-cased). Assuming '/pretrain/bert-base-uncased/' is a path, a model identifier, or url to a directory containing tokenizer files.
[1,0]:04/18/2021 11:07:41 - INFO - transformers.tokenization_utils - Didn't find file /pretrain/bert-base-uncased/added_tokens.json. We won't load it.
[1,0]:04/18/2021 11:07:41 - INFO - transformers.tokenization_utils - loading file /pretrain/bert-base-uncased/vocab.txt
[1,0]:04/18/2021 11:07:41 - INFO - transformers.tokenization_utils - loading file None
[1,0]:04/18/2021 11:07:41 - INFO - transformers.tokenization_utils - loading file /pretrain/bert-base-uncased/special_tokens_map.json
[1,0]:04/18/2021 11:07:41 - INFO - transformers.tokenization_utils - loading file /pretrain/bert-base-uncased/tokenizer_config.json
[1,0]:04/18/2021 11:07:41 - INFO - __main__ - Init. train_loader and val_loader...
[1,0]:Using example_unique_key question_id to check whether input and output ids match
[1,0]:04/18/2021 11:07:53 - INFO - __main__ - is_train True, dataset size 587314 groups, each group 2
[1,0]:Using example_unique_key question_id to check whether input and output ids match
[1,0]:04/18/2021 11:07:54 - INFO - __main__ - is_train False, dataset size 26280 groups, each group 1
[1,0]:04/18/2021 11:07:54 - INFO - __main__ - Saving training meta...
[1,0]:04/18/2021 11:07:54 - INFO - __main__ - Saving code from /clipbert to ./output/code.zip...
[1,0]:Traceback (most recent call last):
[1,0]:  File "/opt/conda/lib/python3.6/zipfile.py", line 1646, in write
[1,0]:    shutil.copyfileobj(src, dest, 1024*8)
[1,0]:  File "/opt/conda/lib/python3.6/shutil.py", line 82, in copyfileobj
[1,0]:    fdst.write(buf)
[1,0]:  File "/opt/conda/lib/python3.6/zipfile.py", line 1015, in write
[1,0]:    self._fileobj.write(data)
[1,0]:OSError: [Errno 28] No space left on device
[1,0]:
[1,0]:During handling of the above exception, another exception occurred:
[1,0]:
[1,0]:Traceback (most recent call last):
[1,0]:  File "/clipbert/src/utils/basic_utils.py", line 122, in make_zipfile
[1,0]:    zf.write(absname, arcname)
[1,0]:  File "/opt/conda/lib/python3.6/zipfile.py", line 1646, in write
[1,0]:    shutil.copyfileobj(src, dest, 1024*8)
[1,0]:  File "/opt/conda/lib/python3.6/zipfile.py", line 1043, in close
[1,0]:    raise RuntimeError('File size unexpectedly exceeded ZIP64 '
[1,0]:RuntimeError: File size unexpectedly exceeded ZIP64 limit
[1,0]:
[1,0]:During handling of the above exception, another exception occurred:
[1,0]:
[1,0]:Traceback (most recent call last):
[1,0]:  File "src/tasks/run_vqa.py", line 568, in <module>
[1,0]:    start_training(input_cfg)
[1,0]:  File "src/tasks/run_vqa.py", line 314, in start_training
[1,0]:    save_training_meta(cfg)
[1,0]:  File "/clipbert/src/utils/load_save.py", line 39, in save_training_meta
[1,0]:    exclude_extensions=[".pyc", ".ipynb", ".swap"])
[1,0]:  File "/clipbert/src/utils/basic_utils.py", line 122, in make_zipfile
[1,0]:    zf.write(absname, arcname)
[1,0]:  File "/opt/conda/lib/python3.6/zipfile.py", line 1174, in __exit__
[1,0]:    self.close()
[1,0]:  File "/opt/conda/lib/python3.6/zipfile.py", line 1695, in close
[1,0]:    raise ValueError("Can't close the ZIP file while there is "
[1,0]:ValueError: Can't close the ZIP file while there is an open writing handle on it. Close the writing handle before closing the zip.

Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.

mpirun detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:

  Process name: [[48786,1],0]
  Exit code:    1
```

I found my disk storage is full after running it:

```
/dev/nvme0n1p10   83G   83G     0 100% /
```

Is this normal? How can I solve this problem?

junyi-tiger commented 3 years ago

I found that the file './output/code.zip' is too big: more than 60G.
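For anyone hitting the same thing, this is easy to spot with plain `du`/`sort` (the `-h` flag for `sort` assumes GNU coreutils):

```bash
# list the 5 largest entries under the output directory
du -ah ./output | sort -rh | head -n 5
```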

jayleicn commented 3 years ago

Hi @straightAYiJun,

Our code automatically saves a copy of the current codebase, so please do not set `--output_dir` to a path inside the project directory. See more details here: https://github.com/jayleicn/ClipBERT/blob/d6385faa531d2999e647d46a4050e27e0749daf8/src/utils/load_save.py#L31-L35
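To illustrate why this fills the disk, here is a minimal sketch of what a code-snapshot helper like this does (a hypothetical `snapshot_code`, not the exact `make_zipfile` in `src/utils/basic_utils.py`). If the archive is written to a path *inside* the directory being walked, the walk eventually yields the half-written zip itself, the zip starts ingesting its own growing bytes, and the file balloons until the disk is full (Errno 28) or the ZIP64 size limit is exceeded, which matches the two errors in your traceback:

```python
import os
import zipfile

def snapshot_code(src_dir, save_path, exclude_extensions=(".pyc", ".ipynb", ".swap")):
    """Hypothetical sketch of a code-snapshot helper; shows the failure mode.

    If save_path sits under src_dir and is not skipped, zf.write() copies
    the archive into itself: the source never hits EOF because the writes
    keep extending the same file, so the zip grows without bound.
    """
    save_path = os.path.abspath(save_path)
    with zipfile.ZipFile(save_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _dirs, files in os.walk(src_dir):
            for name in files:
                if any(name.endswith(ext) for ext in exclude_extensions):
                    continue
                absname = os.path.abspath(os.path.join(root, name))
                if absname == save_path:
                    # Without this guard, the archive would include itself
                    # whenever save_path is located inside src_dir.
                    continue
                zf.write(absname, os.path.relpath(absname, src_dir))
```

So the simple fix is to point `--output_dir` anywhere outside the repo, e.g. (the `/storage` mount is just an example):

```bash
OUTPUT_DIR=/storage/vqa_output
horovodrun -np 1 python src/tasks/run_vqa.py \
    --config src/configs/vqa_base_resnet50.json \
    --output_dir $OUTPUT_DIR
```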

junyi-tiger commented 3 years ago

Thank you! It is solved.