Closed Hyaloid closed 1 year ago
Hi @Hyaloid ,
You do not need a docker-in-docker setting.
I just figured out the error: we missed several lines of code when uploading the initial commit to this repo. You can now pull the latest commit and the error should be fixed.
I have set up the environment and downloaded the dataset using the Dockerfile provided in the repo, and I have already modified the data locations in the config files. When I execute
python driver.py --config_file configs/bert_4vpipe.yml
, the command
nvidia-docker run -it -v $(dirname $PWD):/workspace --net=host --ipc=host bert /bin/bash -c 'export GLOO_SOCKET_IFNAME=enp216s0; cp ../launch.py .; python -m launch --nnodes 1 --node_rank 0 --nproc_per_node 4 main_with_runtime.py --data_dir data/hdf5_lower_case_1_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5/bookcorpus --master_addr localhost --module vgpus=4 --checkpoint_dir output/2023-03-18T07:46:23 --partition vgpus=4/vpipe.json --sync_mode asp --distributed_backend gloo -b 16 --lr 0.050000 --lr_policy polynomial --weight-decay 0.000000 --epochs 40 --print-freq 100 --verbose 0 --num_ranks_in_server 4 --config_path vgpus=4/mp_conf.json 2>&1 | tee output/2023-03-18T07:46:23/output.log.0; rm launch.py'
is generated in _commandhistory.log. To match my environment, I appended PYTHONPATH and changed some of the config paths in the command. But when I execute
nvidia-docker run -it -v $(dirname $PWD):/workspace --net=host --ipc=host vpipe:bert /bin/bash -c 'export GLOO_SOCKET_IFNAME=enp216s0 PYTHONPATH=$PYTHONPATH:../runtime; cp ../runtime/bert/launch.py .; python -m launch --nnodes 1 --node_rank 0 --nproc_per_node 4 ../runtime/bert/main_with_runtime.py --data_dir data/hdf5_lower_case_1_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5/bookcorpus --master_addr localhost --module vgpus=4 --checkpoint_dir output/2023-03-18T07:46:23 --partition ../runtime/bert/vgpus=4/vpipe.json --sync_mode asp --distributed_backend gloo -b 16 --lr 0.050000 --lr_policy polynomial --weight-decay 0.000000 --epochs 40 --print-freq 100 --verbose 0 --num_ranks_in_server 4 --config_path ../runtime/bert/vgpus=4/mp_conf.json 2>&1 | tee ../runtime/bert/output/2023-03-18T07:46:23/output.log.0; rm ../runtime/bert/launch.py'
, an error occurred:
Should I just use docker-in-docker, or connect one Docker container to another? Any help would be much appreciated.