jyrana commented 1 year ago

I have been running it on an HPC cluster with 1 GPU for starters, to make sure the code works. I have the latest versions of torch and transformers, and it fails with an error while loading the YAML config. I am unable to find any solution for this. Can you help me out here?

[jpr8961@gv002 BLIP]$ singularity exec --overlay /scratch/jpr8961/pytorch-example/torch.ext3:ro /scratch/work/public/singularity/cuda11.6.124-cudnn8.4.0.27-devel-ubuntu20.04.4.sif /bin/bash -c 'source /ext3/env.sh; python -m torch.distributed.run --nproc_per_node=1 train_retrieval.py --config ./configs/retrieval_coco.yaml --output_dir output/retrieval_coco --evaluate'
Traceback (most recent call last):
  File "/scratch/jpr8961/BLIP/train_retrieval.py", line 340, in <module>
    config = yaml.load(config_file, Loader=yaml.Loader)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jpr8961/.local/lib/python3.11/site-packages/ruamel/yaml/main.py", line 1085, in load
    error_deprecation('load', 'load', arg=_error_dep_arg, comment=_error_dep_comment)
  File "/home/jpr8961/.local/lib/python3.11/site-packages/ruamel/yaml/main.py", line 1039, in error_deprecation
    raise AttributeError(s, name=None)
AttributeError:
"load()" has been removed, use

  yaml = YAML(typ='rt')
  yaml.load(...)

and register any classes that you use, or check the tag attribute on the loaded data,
instead of file "/scratch/jpr8961/BLIP/train_retrieval.py", line 340

  config = yaml.load(config_file, Loader=yaml.Loader)

[2023-11-03 00:34:53,056] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 2705500) of binary: /ext3/miniconda3/bin/python
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/ext3/miniconda3/lib/python3.11/site-packages/torch/distributed/run.py", line 810, in <module>
    main()
  File "/ext3/miniconda3/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/ext3/miniconda3/lib/python3.11/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/ext3/miniconda3/lib/python3.11/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/ext3/miniconda3/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/ext3/miniconda3/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

train_retrieval.py FAILED

Failures:

------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-11-03_00:34:53
  host      : gv002.hpc.nyu.edu
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 2705500)
  error_file:
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

aTunass commented 10 months ago

Hi, can I ask how you fixed this problem? I'm facing the same issue.

Cuzyoung commented 10 months ago

> Hi, can I ask how you fixed this problem? I'm facing the same issue.

Downgrade ruamel_yaml to version 0.16.6.
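Before downgrading, it can help to confirm which version is actually installed. The following is a minimal sketch, not part of BLIP: the helper names are hypothetical, the distribution name `ruamel.yaml` is an assumption based on the `ruamel/yaml/main.py` path in the traceback (a conda `ruamel_yaml` package would be listed under a different name), and the 0.18 cutoff is an assumption about when the module-level `load()` was removed.

```python
from importlib import metadata

# Assumption: the module-level load() stopped working in the 0.18 series.
FIRST_RELEASE_WITHOUT_LOAD = (0, 18)


def major_minor(version):
    """Return the first two numeric components of a version string."""
    parts = version.split(".")[:2]
    return tuple(int(p) for p in parts)


def legacy_load_available(version):
    """True if yaml.load(stream, Loader=...) is expected to still work."""
    return major_minor(version) < FIRST_RELEASE_WITHOUT_LOAD


def installed_ruamel_version():
    """Return the installed ruamel.yaml version, or None if it is absent."""
    try:
        return metadata.version("ruamel.yaml")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_ruamel_version()
    if found is None:
        print("ruamel.yaml is not installed under that distribution name")
    else:
        print(found, "legacy load available:", legacy_load_available(found))
```

If this reports that the legacy call is unavailable, either the downgrade above or a change to the loading code should apply.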

jyrana commented 10 months ago

Or you can use the ruamel.yaml library's newer API. It needs a change in how the config variable is loaded. Change

config = yaml.load(open(args.config, 'r'), Loader=yaml.Loader)

to

yaml = yaml.YAML(typ='rt')
with open(args.config, 'r') as config_file:
    config = yaml.load(config_file)
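The same idea can be wrapped in a small helper so the script does not depend on which YAML package is present. This is a hedged sketch rather than the BLIP code: `load_config` is a hypothetical name, `typ="safe"` is my choice so that plain dicts come back (the error message itself suggests `typ='rt'`), and the PyYAML fallback assumes that package is installed when ruamel.yaml is not.

```python
import io


def load_config(stream):
    """Parse YAML from an open text stream into plain Python objects."""
    try:
        # Class-based ruamel.yaml API, as suggested by the error message.
        from ruamel.yaml import YAML
    except ImportError:
        # Assumption: PyYAML is available as a fallback.
        import yaml as pyyaml
        return pyyaml.safe_load(stream)
    # typ="safe" returns ordinary dicts and lists; typ="rt" would also work
    # but returns round-trip containers that preserve comments.
    return YAML(typ="safe").load(stream)


if __name__ == "__main__":
    example = io.StringIO("image_size: 384\nbatch_size_test: 64\n")
    print(load_config(example))
```

In the training script this would be called as `config = load_config(config_file)` inside the `with open(args.config, 'r') as config_file:` block. Keeping the parser in a helper also avoids rebinding the name `yaml` from the module to a `YAML` instance.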

Cuzyoung commented 10 months ago

Hello, I'm trying to run python -m torch.distributed.run --nproc_per_node=8 train_caption.py --evaluate. However, I'm not sure which dataset should be used, because there are many versions of COCO. Can you give some advice on this? It seems that COCO 2014 val should be used. Thanks!

jyrana commented 10 months ago

You only need the validation dataset for evaluation.

Make sure to update the dataset path in the config file under the configs folder.
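A quick way to catch a stale path before launching a distributed run is to check the configured directories up front. This is only a sketch: `missing_paths` is a hypothetical helper, and the key names `image_root` and `ann_root` are assumptions about what the config file contains, so adjust them to the keys in your own YAML.

```python
import os

# Hypothetical key names; replace with the path keys used in your config.
PATH_KEYS = ("image_root", "ann_root")


def missing_paths(config, keys=PATH_KEYS):
    """Return {key: path} for configured paths that do not exist on disk."""
    missing = {}
    for key in keys:
        path = config.get(key)
        if path is None or not os.path.exists(path):
            missing[key] = path
    return missing


if __name__ == "__main__":
    # Example config dict; in practice this comes from the loaded YAML file.
    config = {"image_root": "/path/to/coco/images", "ann_root": "annotation"}
    for key, path in missing_paths(config).items():
        print(f"config key {key!r} points to a missing path: {path!r}")
```

Running the check on a single process first means a wrong path shows up as a clear message instead of a failure inside the data loader.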

salesforce / BLIP

I am having trouble running evaluation code #189
