Closed: shrijayan closed this issue 1 month ago
deepspeed --num_gpus=8 train.py --config=configs/train/mlm.yaml --deepspeed_config=configs/deepspeed/ds_config.json --dtype=bf16
[2024-05-09 13:37:31,470] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-05-09 13:37:31,821] [WARNING] [runner.py:202:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2024-05-09 13:37:31,834] [INFO] [runner.py:568:main] cmd = /usr/bin/python3 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None train.py --config=configs/train/mlm.yaml --deepspeed_config=configs/deepspeed/ds_config.json --dtype=bf16
[2024-05-09 13:37:33,078] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-05-09 13:37:33,315] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]}
[2024-05-09 13:37:33,316] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=8, node_rank=0
[2024-05-09 13:37:33,316] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]})
[2024-05-09 13:37:33,316] [INFO] [launch.py:163:main] dist_world_size=8
[2024-05-09 13:37:33,316] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
[2024-05-09 13:37:33,316] [INFO] [launch.py:253:main] process 50297 spawned with command: ['/usr/bin/python3', '-u', 'train.py', '--local_rank=0', '--config=configs/train/mlm.yaml', '--deepspeed_config=configs/deepspeed/ds_config.json', '--dtype=bf16']
[2024-05-09 13:37:33,316] [INFO] [launch.py:253:main] process 50298 spawned with command: ['/usr/bin/python3', '-u', 'train.py', '--local_rank=1', '--config=configs/train/mlm.yaml', '--deepspeed_config=configs/deepspeed/ds_config.json', '--dtype=bf16']
[2024-05-09 13:37:33,316] [INFO] [launch.py:253:main] process 50299 spawned with command: ['/usr/bin/python3', '-u', 'train.py', '--local_rank=2', '--config=configs/train/mlm.yaml', '--deepspeed_config=configs/deepspeed/ds_config.json', '--dtype=bf16']
[2024-05-09 13:37:33,317] [INFO] [launch.py:253:main] process 50300 spawned with command: ['/usr/bin/python3', '-u', 'train.py', '--local_rank=3', '--config=configs/train/mlm.yaml', '--deepspeed_config=configs/deepspeed/ds_config.json', '--dtype=bf16']
[2024-05-09 13:37:33,317] [INFO] [launch.py:253:main] process 50301 spawned with command: ['/usr/bin/python3', '-u', 'train.py', '--local_rank=4', '--config=configs/train/mlm.yaml', '--deepspeed_config=configs/deepspeed/ds_config.json', '--dtype=bf16']
[2024-05-09 13:37:33,317] [INFO] [launch.py:253:main] process 50302 spawned with command: ['/usr/bin/python3', '-u', 'train.py', '--local_rank=5', '--config=configs/train/mlm.yaml', '--deepspeed_config=configs/deepspeed/ds_config.json', '--dtype=bf16']
[2024-05-09 13:37:33,317] [INFO] [launch.py:253:main] process 50303 spawned with command: ['/usr/bin/python3', '-u', 'train.py', '--local_rank=6', '--config=configs/train/mlm.yaml', '--deepspeed_config=configs/deepspeed/ds_config.json', '--dtype=bf16']
[2024-05-09 13:37:33,317] [INFO] [launch.py:253:main] process 50304 spawned with command: ['/usr/bin/python3', '-u', 'train.py', '--local_rank=7', '--config=configs/train/mlm.yaml', '--deepspeed_config=configs/deepspeed/ds_config.json', '--dtype=bf16']
[2024-05-09 13:37:34,745] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-05-09 13:37:34,749] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-05-09 13:37:34,749] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-05-09 13:37:34,779] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-05-09 13:37:34,786] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-05-09 13:37:34,786] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-05-09 13:37:34,794] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-05-09 13:37:34,803] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Traceback (most recent call last):
  File "/home/karrtik/Projects/Embedding_Model/training/nomic-ai/contrastors/src/contrastors/train.py", line 9, in <module>
    from contrastors.read import read_config
ModuleNotFoundError: No module named 'contrastors'
Traceback (most recent call last):
  File "/home/karrtik/Projects/Embedding_Model/training/nomic-ai/contrastors/src/contrastors/train.py", line 9, in <module>
    from contrastors.read import read_config
ModuleNotFoundError: No module named 'contrastors'
Traceback (most recent call last):
  File "/home/karrtik/Projects/Embedding_Model/training/nomic-ai/contrastors/src/contrastors/train.py", line 9, in <module>
    from contrastors.read import read_config
ModuleNotFoundError: No module named 'contrastors'
Traceback (most recent call last):
  File "/home/karrtik/Projects/Embedding_Model/training/nomic-ai/contrastors/src/contrastors/train.py", line 9, in <module>
    from contrastors.read import read_config
ModuleNotFoundError: No module named 'contrastors'
Traceback (most recent call last):
  File "/home/karrtik/Projects/Embedding_Model/training/nomic-ai/contrastors/src/contrastors/train.py", line 9, in <module>
    from contrastors.read import read_config
ModuleNotFoundError: No module named 'contrastors'
Traceback (most recent call last):
  File "/home/karrtik/Projects/Embedding_Model/training/nomic-ai/contrastors/src/contrastors/train.py", line 9, in <module>
    from contrastors.read import read_config
ModuleNotFoundError: No module named 'contrastors'
Traceback (most recent call last):
Traceback (most recent call last):
  File "/home/karrtik/Projects/Embedding_Model/training/nomic-ai/contrastors/src/contrastors/train.py", line 9, in <module>
  File "/home/karrtik/Projects/Embedding_Model/training/nomic-ai/contrastors/src/contrastors/train.py", line 9, in <module>
    from contrastors.read import read_configfrom contrastors.read import read_config
ModuleNotFoundErrorModuleNotFoundError: : No module named 'contrastors'No module named 'contrastors'
[2024-05-09 13:37:36,321] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 50297
[2024-05-09 13:37:36,333] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 50298
[2024-05-09 13:37:36,337] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 50299
[2024-05-09 13:37:36,339] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 50300
[2024-05-09 13:37:36,342] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 50301
[2024-05-09 13:37:36,342] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 50302
[2024-05-09 13:37:36,345] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 50303
[2024-05-09 13:37:36,347] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 50304
[2024-05-09 13:37:36,349] [ERROR] [launch.py:322:sigkill_handler] ['/usr/bin/python3', '-u', 'train.py', '--local_rank=7', '--config=configs/train/mlm.yaml', '--deepspeed_config=configs/deepspeed/ds_config.json', '--dtype=bf16'] exits with return code = 1
torchrun --nproc-per-node=8 train.py --config=configs/train/contrastive_pretrain.yaml --dtype=bf16
[2024-05-09 13:41:03,942] torch.distributed.run: [WARNING]
[2024-05-09 13:41:03,942] torch.distributed.run: [WARNING] *****************************************
[2024-05-09 13:41:03,942] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-05-09 13:41:03,942] torch.distributed.run: [WARNING] *****************************************
[2024-05-09 13:41:05,237] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-05-09 13:41:05,238] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-05-09 13:41:05,254] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-05-09 13:41:05,254] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-05-09 13:41:05,254] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-05-09 13:41:05,257] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-05-09 13:41:05,262] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-05-09 13:41:05,285] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Traceback (most recent call last):
  File "/home/karrtik/Projects/Embedding_Model/training/nomic-ai/contrastors/src/contrastors/train.py", line 9, in <module>
    from contrastors.read import read_config
ModuleNotFoundError: No module named 'contrastors'
Traceback (most recent call last):
  File "/home/karrtik/Projects/Embedding_Model/training/nomic-ai/contrastors/src/contrastors/train.py", line 9, in <module>
    from contrastors.read import read_config
ModuleNotFoundError: No module named 'contrastors'
Traceback (most recent call last):
  File "/home/karrtik/Projects/Embedding_Model/training/nomic-ai/contrastors/src/contrastors/train.py", line 9, in <module>
    from contrastors.read import read_config
ModuleNotFoundError: No module named 'contrastors'
Traceback (most recent call last):
  File "/home/karrtik/Projects/Embedding_Model/training/nomic-ai/contrastors/src/contrastors/train.py", line 9, in <module>
    from contrastors.read import read_config
ModuleNotFoundError: No module named 'contrastors'
Traceback (most recent call last):
  File "/home/karrtik/Projects/Embedding_Model/training/nomic-ai/contrastors/src/contrastors/train.py", line 9, in <module>
    from contrastors.read import read_config
ModuleNotFoundError: No module named 'contrastors'
Traceback (most recent call last):
  File "/home/karrtik/Projects/Embedding_Model/training/nomic-ai/contrastors/src/contrastors/train.py", line 9, in <module>
    from contrastors.read import read_config
ModuleNotFoundError: No module named 'contrastors'
Traceback (most recent call last):
  File "/home/karrtik/Projects/Embedding_Model/training/nomic-ai/contrastors/src/contrastors/train.py", line 9, in <module>
    from contrastors.read import read_config
ModuleNotFoundError: No module named 'contrastors'
Traceback (most recent call last):
  File "/home/karrtik/Projects/Embedding_Model/training/nomic-ai/contrastors/src/contrastors/train.py", line 9, in <module>
    from contrastors.read import read_config
ModuleNotFoundError: No module named 'contrastors'
[2024-05-09 13:41:08,951] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 50848) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/home/karrtik/.local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/karrtik/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/karrtik/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/home/karrtik/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/home/karrtik/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/karrtik/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-05-09_13:41:08
  host      : karrtik-gpu
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 50849)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2024-05-09_13:41:08
  host      : karrtik-gpu
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 50850)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2024-05-09_13:41:08
  host      : karrtik-gpu
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 50851)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[4]:
  time      : 2024-05-09_13:41:08
  host      : karrtik-gpu
  rank      : 4 (local_rank: 4)
  exitcode  : 1 (pid: 50852)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[5]:
  time      : 2024-05-09_13:41:08
  host      : karrtik-gpu
  rank      : 5 (local_rank: 5)
  exitcode  : 1 (pid: 50853)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[6]:
  time      : 2024-05-09_13:41:08
  host      : karrtik-gpu
  rank      : 6 (local_rank: 6)
  exitcode  : 1 (pid: 50854)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[7]:
  time      : 2024-05-09_13:41:08
  host      : karrtik-gpu
  rank      : 7 (local_rank: 7)
  exitcode  : 1 (pid: 50855)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-05-09_13:41:08
  host      : karrtik-gpu
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 50848)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
Did you run pip install -e . from the base of the contrastors repository?
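A quick way to check whether the package is visible to the interpreter is sketched below. It assumes you run it with the same Python that the launchers use (/usr/bin/python3 in the logs above). The helper name is_importable is made up for illustration and is not part of contrastors.

```python
import importlib.util
import sys


def is_importable(package: str) -> bool:
    """Return True if a top-level package can be located on the current sys.path."""
    return importlib.util.find_spec(package) is not None


if __name__ == "__main__":
    name = "contrastors"
    if is_importable(name):
        print(f"{name} is importable from {sys.executable}")
    else:
        # Same condition that raises ModuleNotFoundError in train.py
        print(f"{name} not found for {sys.executable}; "
              "try `pip install -e .` from the repository root")
```

If this reports the package as missing, the editable install has probably not been done for that interpreter, which would match the ModuleNotFoundError raised on every rank.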
I am trying to finetune