Closed: jyrana closed this issue 1 year ago.
Hi, can I ask how you fixed your problem? I'm facing the same issue.
Downgrade your ruamel_yaml version to 0.16.6.
Or you can use the ruamel.yaml library. It needs a change to the config loading: replace

config = yaml.load(open(args.config, 'r'), Loader=yaml.Loader)

with

yaml = yaml.YAML(typ='rt')
with open(args.config, 'r') as config_file:
    config = yaml.load(config_file)
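For reference, a self-contained sketch of that new-style loading (not taken from the repo; the hard-coded config path stands in for what train_retrieval.py normally reads from args.config):

```python
import ruamel.yaml

config_path = './configs/retrieval_coco.yaml'  # stands in for args.config

# Old API, removed in recent ruamel.yaml releases:
#   config = yaml.load(open(config_path, 'r'), Loader=yaml.Loader)

# New-style API: instantiate a YAML object and load from an open file.
yaml = ruamel.yaml.YAML(typ='rt')              # 'rt' = round-trip loader
with open(config_path, 'r') as config_file:
    config = yaml.load(config_file)            # dict-like mapping of config keys

print(list(config)[:5])                        # peek at the first few keys
```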
Hello, I'm trying to run python -m torch.distributed.run --nproc_per_node=8 train_caption.py --evaluate. However, I'm not sure which dataset should be used, because there are many COCO versions. Can you give me some advice on this? It seems that the COCO 2014 val set should be used. Thanks!
You only need the validation dataset for evaluation.
Make sure to update the paths in the config folder.
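If it helps, here is a quick, hypothetical sanity check (not from the repo) that loads the caption config with the new-style ruamel.yaml API and prints the dataset paths that evaluation will read; the key names image_root and ann_root are assumptions about configs/caption_coco.yaml and may differ in your checkout:

```python
import ruamel.yaml

yaml = ruamel.yaml.YAML(typ='rt')
with open('configs/caption_coco.yaml') as f:
    config = yaml.load(f)

# Both paths should point at your local COCO copy (images plus the annotation
# JSON files) before running train_caption.py --evaluate.
print('image_root:', config.get('image_root'))
print('ann_root:  ', config.get('ann_root'))
```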
I have been running it on an HPC cluster with 1 GPU for starters, to make sure the code works, since I have the latest versions of torch and transformers. It's giving me an error loading the YAML config, and I'm unable to find any solution for it. Can you help me out here?
[jpr8961@gv002 BLIP]$ singularity exec --overlay /scratch/jpr8961/pytorch-example/torch.ext3:ro /scratch/work/public/singularity/cuda11.6.124-cudnn8.4.0.27-devel-ubuntu20.04.4.sif /bin/bash -c 'source /ext3/env.sh; python -m torch.distributed.run --nproc_per_node=1 train_retrieval.py --config ./configs/retrieval_coco.yaml --output_dir output/retrieval_coco --evaluate'
Traceback (most recent call last):
  File "/scratch/jpr8961/BLIP/train_retrieval.py", line 340, in <module>
    config = yaml.load(config_file, Loader=yaml.Loader)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jpr8961/.local/lib/python3.11/site-packages/ruamel/yaml/main.py", line 1085, in load
    error_deprecation('load', 'load', arg=_error_dep_arg, comment=_error_dep_comment)
  File "/home/jpr8961/.local/lib/python3.11/site-packages/ruamel/yaml/main.py", line 1039, in error_deprecation
    raise AttributeError(s, name=None)
AttributeError:
"load()" has been removed, use

  yaml = YAML(typ='rt')
  yaml.load(...)

and register any classes that you use, or check the tag attribute on the loaded data,
instead of file "/scratch/jpr8961/BLIP/train_retrieval.py", line 340

    config = yaml.load(config_file, Loader=yaml.Loader)
[2023-11-03 00:34:53,056] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 2705500) of binary: /ext3/miniconda3/bin/python
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/ext3/miniconda3/lib/python3.11/site-packages/torch/distributed/run.py", line 810, in <module>
    main()
  File "/ext3/miniconda3/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/ext3/miniconda3/lib/python3.11/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/ext3/miniconda3/lib/python3.11/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/ext3/miniconda3/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/ext3/miniconda3/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
train_retrieval.py FAILED
Failures: