hku-systems / vpipe


AttributeError: module 'vgpus=4' has no attribute 'arch' #2

Closed Hyaloid closed 1 year ago

Hyaloid commented 1 year ago

I have set up the environment and downloaded the dataset using the Dockerfile provided in the repo, and I have already modified the data locations in the config files. When I execute python driver.py --config_file configs/bert_4vpipe.yml, the following command is generated in _commandhistory.log:

nvidia-docker run -it -v $(dirname $PWD):/workspace --net=host --ipc=host bert /bin/bash -c 'export GLOO_SOCKET_IFNAME=enp216s0; cp ../launch.py .; python -m launch --nnodes 1 --node_rank 0 --nproc_per_node 4 main_with_runtime.py --data_dir data/hdf5_lower_case_1_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5/bookcorpus --master_addr localhost --module vgpus=4 --checkpoint_dir output/2023-03-18T07:46:23 --partition vgpus=4/vpipe.json --sync_mode asp --distributed_backend gloo -b 16 --lr 0.050000 --lr_policy polynomial --weight-decay 0.000000 --epochs 40 --print-freq 100 --verbose 0 --num_ranks_in_server 4 --config_path vgpus=4/mp_conf.json 2>&1 | tee output/2023-03-18T07:46:23/output.log.0; rm launch.py'

To match my environment, I appended to PYTHONPATH and changed some of the config paths in the command. But when I execute the modified command

nvidia-docker run -it -v $(dirname $PWD):/workspace --net=host --ipc=host vpipe:bert /bin/bash -c 'export GLOO_SOCKET_IFNAME=enp216s0 PYTHONPATH=$PYTHONPATH:../runtime; cp ../runtime/bert/launch.py .; python -m launch --nnodes 1 --node_rank 0 --nproc_per_node 4 ../runtime/bert/main_with_runtime.py --data_dir data/hdf5_lower_case_1_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5/bookcorpus --master_addr localhost --module vgpus=4 --checkpoint_dir output/2023-03-18T07:46:23 --partition ../runtime/bert/vgpus=4/vpipe.json --sync_mode asp --distributed_backend gloo -b 16 --lr 0.050000 --lr_policy polynomial --weight-decay 0.000000 --epochs 40 --print-freq 100 --verbose 0 --num_ranks_in_server 4 --config_path ../runtime/bert/vgpus=4/mp_conf.json 2>&1 | tee ../runtime/bert/output/2023-03-18T07:46:23/output.log.0; rm ../runtime/bert/launch.py'

the following error occurs:

NVIDIA Release 20.03 (build 11122848)
PyTorch Version 1.5.0a0+8f84ded

Container image Copyright (c) 2019, NVIDIA CORPORATION.  All rights reserved.

Copyright (c) 2014-2019 Facebook Inc.
Copyright (c) 2011-2014 Idiap Research Institute (Ronan Collobert)
Copyright (c) 2012-2014 Deepmind Technologies    (Koray Kavukcuoglu)
Copyright (c) 2011-2012 NEC Laboratories America (Koray Kavukcuoglu)
Copyright (c) 2011-2013 NYU                      (Clement Farabet)
Copyright (c) 2006-2010 NEC Laboratories America (Ronan Collobert, Leon Bottou, Iain Melvin, Jason Weston)
Copyright (c) 2006      Idiap Research Institute (Samy Bengio)
Copyright (c) 2001-2004 Idiap Research Institute (Ronan Collobert, Samy Bengio, Johnny Mariethoz)
Copyright (c) 2015      Google Inc.
Copyright (c) 2015      Yangqing Jia
Copyright (c) 2013-2016 The Caffe contributors
All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION.  All rights reserved.
NVIDIA modifications are covered by the license terms that apply to the underlying project or file.

NOTE: MOFED driver for multi-node communication was not detected.
      Multi-node communication performance may be reduced.

Traceback (most recent call last):
  File "../runtime/bert/main_with_runtime.py", line 576, in <module>
    main()
  File "../runtime/bert/main_with_runtime.py", line 187, in main
    args.arch = module.arch()
AttributeError: module 'vgpus=4' has no attribute 'arch'
Traceback (most recent call last):
  File "../runtime/bert/main_with_runtime.py", line 576, in <module>
Traceback (most recent call last):
  File "../runtime/bert/main_with_runtime.py", line 576, in <module>
Traceback (most recent call last):
  File "../runtime/bert/main_with_runtime.py", line 576, in <module>
    main()
  File "../runtime/bert/main_with_runtime.py", line 187, in main
    main()
  File "../runtime/bert/main_with_runtime.py", line 187, in main
    main()
  File "../runtime/bert/main_with_runtime.py", line 187, in main
    args.arch = module.arch()
AttributeError: module 'vgpus=4' has no attribute 'arch'
    args.arch = module.arch()
AttributeError: module 'vgpus=4' has no attribute 'arch'
    args.arch = module.arch()
AttributeError: module 'vgpus=4' has no attribute 'arch'
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/workspace/bert/launch.py", line 173, in <module>
    main()
  File "/workspace/bert/launch.py", line 169, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/opt/conda/bin/python', '-u', '../runtime/bert/main_with_runtime.py', '--rank=3', '--local_rank=3', '--data_dir', 'data/hdf5_lower_case_1_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5/bookcorpus', '--master_addr', 'localhost', '--module', 'vgpus=4', '--checkpoint_dir', 'output/2023-03-18T07:46:23', '--partition', '../runtime/bert/vgpus=4/vpipe.json', '--sync_mode', 'asp', '--distributed_backend', 'gloo', '-b', '16', '--lr', '0.050000', '--lr_policy', 'polynomial', '--weight-decay', '0.000000', '--epochs', '40', '--print-freq', '100', '--verbose', '0', '--num_ranks_in_server', '4', '--config_path', '../runtime/bert/vgpus=4/mp_conf.json']' returned non-zero exit status 1.

Should I just use docker-in-docker, or connect one Docker container to another? Any help would be greatly appreciated.

SimonZsx commented 1 year ago

Hi @Hyaloid ,

You do not need a docker-in-docker setup.

I just figured out the cause: we missed several lines of code when uploading the initial commit to this repo. Please pull the latest commit and the error should be fixed.
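For anyone who hits the same traceback on an older checkout, here is a minimal sketch of a pre-flight check. It assumes that main_with_runtime.py loads the value of --module with importlib.import_module before calling module.arch(); the traceback suggests this, but the thread does not confirm it. The helper name missing_attributes is hypothetical and not part of vpipe.

```python
import importlib


def missing_attributes(module_name, required=("arch",)):
    """Import `module_name` and list the required names it does not define.

    Assumption: the runtime expects the package passed via --module
    (e.g. "vgpus=4") to expose a callable named `arch`, as the line
    `args.arch = module.arch()` in the traceback indicates.
    """
    # Make sure files created after interpreter start-up are visible.
    importlib.invalidate_caches()
    module = importlib.import_module(module_name)
    return [name for name in required if not hasattr(module, name)]
```

Calling missing_attributes("vgpus=4") from the directory that contains the vgpus=4 package (or with that directory on PYTHONPATH) returns ["arch"] when the package imports but does not define arch. That is the situation that produces the AttributeError above. An empty list means the attribute is present.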