nomic-ai / contrastors

Train Models Contrastively in Pytorch
Apache License 2.0
459 stars 35 forks

No module named 'contrastors' #36

Closed: shrijayan closed this issue 1 month ago

shrijayan commented 1 month ago

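I am trying to fine-tune a model, but both of the commands below fail with ModuleNotFoundError: No module named 'contrastors'.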

Command

deepspeed --num_gpus=8 train.py --config=configs/train/mlm.yaml --deepspeed_config=configs/deepspeed/ds_config.json --dtype=bf16

Error

[2024-05-09 13:37:31,470] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-05-09 13:37:31,821] [WARNING] [runner.py:202:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2024-05-09 13:37:31,834] [INFO] [runner.py:568:main] cmd = /usr/bin/python3 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None train.py --config=configs/train/mlm.yaml --deepspeed_config=configs/deepspeed/ds_config.json --dtype=bf16
[2024-05-09 13:37:33,078] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-05-09 13:37:33,315] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]}
[2024-05-09 13:37:33,316] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=8, node_rank=0
[2024-05-09 13:37:33,316] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]})
[2024-05-09 13:37:33,316] [INFO] [launch.py:163:main] dist_world_size=8
[2024-05-09 13:37:33,316] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
[2024-05-09 13:37:33,316] [INFO] [launch.py:253:main] process 50297 spawned with command: ['/usr/bin/python3', '-u', 'train.py', '--local_rank=0', '--config=configs/train/mlm.yaml', '--deepspeed_config=configs/deepspeed/ds_config.json', '--dtype=bf16']
[2024-05-09 13:37:33,316] [INFO] [launch.py:253:main] process 50298 spawned with command: ['/usr/bin/python3', '-u', 'train.py', '--local_rank=1', '--config=configs/train/mlm.yaml', '--deepspeed_config=configs/deepspeed/ds_config.json', '--dtype=bf16']
[2024-05-09 13:37:33,316] [INFO] [launch.py:253:main] process 50299 spawned with command: ['/usr/bin/python3', '-u', 'train.py', '--local_rank=2', '--config=configs/train/mlm.yaml', '--deepspeed_config=configs/deepspeed/ds_config.json', '--dtype=bf16']
[2024-05-09 13:37:33,317] [INFO] [launch.py:253:main] process 50300 spawned with command: ['/usr/bin/python3', '-u', 'train.py', '--local_rank=3', '--config=configs/train/mlm.yaml', '--deepspeed_config=configs/deepspeed/ds_config.json', '--dtype=bf16']
[2024-05-09 13:37:33,317] [INFO] [launch.py:253:main] process 50301 spawned with command: ['/usr/bin/python3', '-u', 'train.py', '--local_rank=4', '--config=configs/train/mlm.yaml', '--deepspeed_config=configs/deepspeed/ds_config.json', '--dtype=bf16']
[2024-05-09 13:37:33,317] [INFO] [launch.py:253:main] process 50302 spawned with command: ['/usr/bin/python3', '-u', 'train.py', '--local_rank=5', '--config=configs/train/mlm.yaml', '--deepspeed_config=configs/deepspeed/ds_config.json', '--dtype=bf16']
[2024-05-09 13:37:33,317] [INFO] [launch.py:253:main] process 50303 spawned with command: ['/usr/bin/python3', '-u', 'train.py', '--local_rank=6', '--config=configs/train/mlm.yaml', '--deepspeed_config=configs/deepspeed/ds_config.json', '--dtype=bf16']
[2024-05-09 13:37:33,317] [INFO] [launch.py:253:main] process 50304 spawned with command: ['/usr/bin/python3', '-u', 'train.py', '--local_rank=7', '--config=configs/train/mlm.yaml', '--deepspeed_config=configs/deepspeed/ds_config.json', '--dtype=bf16']
[2024-05-09 13:37:34,745] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-05-09 13:37:34,749] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-05-09 13:37:34,749] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-05-09 13:37:34,779] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-05-09 13:37:34,786] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-05-09 13:37:34,786] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-05-09 13:37:34,794] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-05-09 13:37:34,803] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Traceback (most recent call last):
  File "/home/karrtik/Projects/Embedding_Model/training/nomic-ai/contrastors/src/contrastors/train.py", line 9, in <module>
    from contrastors.read import read_config
ModuleNotFoundError: No module named 'contrastors'
Traceback (most recent call last):
  File "/home/karrtik/Projects/Embedding_Model/training/nomic-ai/contrastors/src/contrastors/train.py", line 9, in <module>
    from contrastors.read import read_config
ModuleNotFoundError: No module named 'contrastors'
Traceback (most recent call last):
  File "/home/karrtik/Projects/Embedding_Model/training/nomic-ai/contrastors/src/contrastors/train.py", line 9, in <module>
    from contrastors.read import read_config
ModuleNotFoundError: No module named 'contrastors'
Traceback (most recent call last):
  File "/home/karrtik/Projects/Embedding_Model/training/nomic-ai/contrastors/src/contrastors/train.py", line 9, in <module>
    from contrastors.read import read_config
ModuleNotFoundError: No module named 'contrastors'
Traceback (most recent call last):
  File "/home/karrtik/Projects/Embedding_Model/training/nomic-ai/contrastors/src/contrastors/train.py", line 9, in <module>
    from contrastors.read import read_config
ModuleNotFoundError: No module named 'contrastors'
Traceback (most recent call last):
  File "/home/karrtik/Projects/Embedding_Model/training/nomic-ai/contrastors/src/contrastors/train.py", line 9, in <module>
    from contrastors.read import read_config
ModuleNotFoundError: No module named 'contrastors'
Traceback (most recent call last):
  File "/home/karrtik/Projects/Embedding_Model/training/nomic-ai/contrastors/src/contrastors/train.py", line 9, in <module>
    from contrastors.read import read_config
ModuleNotFoundError: No module named 'contrastors'
Traceback (most recent call last):
  File "/home/karrtik/Projects/Embedding_Model/training/nomic-ai/contrastors/src/contrastors/train.py", line 9, in <module>
    from contrastors.read import read_config
ModuleNotFoundError: No module named 'contrastors'
[2024-05-09 13:37:36,321] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 50297
[2024-05-09 13:37:36,333] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 50298
[2024-05-09 13:37:36,337] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 50299
[2024-05-09 13:37:36,339] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 50300
[2024-05-09 13:37:36,342] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 50301
[2024-05-09 13:37:36,342] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 50302
[2024-05-09 13:37:36,345] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 50303
[2024-05-09 13:37:36,347] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 50304
[2024-05-09 13:37:36,349] [ERROR] [launch.py:322:sigkill_handler] ['/usr/bin/python3', '-u', 'train.py', '--local_rank=7', '--config=configs/train/mlm.yaml', '--deepspeed_config=configs/deepspeed/ds_config.json', '--dtype=bf16'] exits with return code = 1

Command

torchrun --nproc-per-node=8 train.py --config=configs/train/contrastive_pretrain.yaml --dtype=bf16

Error

[2024-05-09 13:41:03,942] torch.distributed.run: [WARNING]
[2024-05-09 13:41:03,942] torch.distributed.run: [WARNING] *****************************************
[2024-05-09 13:41:03,942] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-05-09 13:41:03,942] torch.distributed.run: [WARNING] *****************************************
[2024-05-09 13:41:05,237] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-05-09 13:41:05,238] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-05-09 13:41:05,254] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-05-09 13:41:05,254] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-05-09 13:41:05,254] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-05-09 13:41:05,257] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-05-09 13:41:05,262] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-05-09 13:41:05,285] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Traceback (most recent call last):
  File "/home/karrtik/Projects/Embedding_Model/training/nomic-ai/contrastors/src/contrastors/train.py", line 9, in <module>
    from contrastors.read import read_config
ModuleNotFoundError: No module named 'contrastors'
Traceback (most recent call last):
  File "/home/karrtik/Projects/Embedding_Model/training/nomic-ai/contrastors/src/contrastors/train.py", line 9, in <module>
    from contrastors.read import read_config
ModuleNotFoundError: No module named 'contrastors'
Traceback (most recent call last):
  File "/home/karrtik/Projects/Embedding_Model/training/nomic-ai/contrastors/src/contrastors/train.py", line 9, in <module>
    from contrastors.read import read_config
ModuleNotFoundError: No module named 'contrastors'
Traceback (most recent call last):
  File "/home/karrtik/Projects/Embedding_Model/training/nomic-ai/contrastors/src/contrastors/train.py", line 9, in <module>
    from contrastors.read import read_config
ModuleNotFoundError: No module named 'contrastors'
Traceback (most recent call last):
  File "/home/karrtik/Projects/Embedding_Model/training/nomic-ai/contrastors/src/contrastors/train.py", line 9, in <module>
    from contrastors.read import read_config
ModuleNotFoundError: No module named 'contrastors'
Traceback (most recent call last):
  File "/home/karrtik/Projects/Embedding_Model/training/nomic-ai/contrastors/src/contrastors/train.py", line 9, in <module>
    from contrastors.read import read_config
ModuleNotFoundError: No module named 'contrastors'
Traceback (most recent call last):
  File "/home/karrtik/Projects/Embedding_Model/training/nomic-ai/contrastors/src/contrastors/train.py", line 9, in <module>
    from contrastors.read import read_config
ModuleNotFoundError: No module named 'contrastors'
Traceback (most recent call last):
  File "/home/karrtik/Projects/Embedding_Model/training/nomic-ai/contrastors/src/contrastors/train.py", line 9, in <module>
    from contrastors.read import read_config
ModuleNotFoundError: No module named 'contrastors'
[2024-05-09 13:41:08,951] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 50848) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/home/karrtik/.local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/karrtik/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/karrtik/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/home/karrtik/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/home/karrtik/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/karrtik/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-05-09_13:41:08
  host      : karrtik-gpu
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 50849)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2024-05-09_13:41:08
  host      : karrtik-gpu
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 50850)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2024-05-09_13:41:08
  host      : karrtik-gpu
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 50851)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[4]:
  time      : 2024-05-09_13:41:08
  host      : karrtik-gpu
  rank      : 4 (local_rank: 4)
  exitcode  : 1 (pid: 50852)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[5]:
  time      : 2024-05-09_13:41:08
  host      : karrtik-gpu
  rank      : 5 (local_rank: 5)
  exitcode  : 1 (pid: 50853)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[6]:
  time      : 2024-05-09_13:41:08
  host      : karrtik-gpu
  rank      : 6 (local_rank: 6)
  exitcode  : 1 (pid: 50854)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[7]:
  time      : 2024-05-09_13:41:08
  host      : karrtik-gpu
  rank      : 7 (local_rank: 7)
  exitcode  : 1 (pid: 50855)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-05-09_13:41:08
  host      : karrtik-gpu
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 50848)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
zanussbaum commented 1 month ago

Did you run pip install -e . from the base of the contrastors repository?
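
For context on why the install step matters: when train.py is run as a script, Python puts the script's own directory (here src/contrastors) on sys.path, not its parent src, so the contrastors package is only found if it has been installed into the interpreter that the launcher uses (/usr/bin/python3 in the logs above). A minimal sketch for checking this before launching, assuming it is run with that same interpreter; the helper name is_importable is made up for illustration and is not part of contrastors:

```python
import importlib.util
import sys


def is_importable(module_name: str) -> bool:
    """Return True if this interpreter can locate a top-level module by name."""
    # find_spec returns None when a top-level module cannot be found on sys.path.
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    name = "contrastors"
    # Show which interpreter is being checked; it should match the one the launcher spawns.
    print(f"interpreter: {sys.executable}")
    if is_importable(name):
        print(f"{name} is importable")
    else:
        print(f"{name} not found; try `pip install -e .` from the repository root")
```

If the check reports that the package is missing even after installing, one possible cause is that pip installed into a different Python environment than the one deepspeed or torchrun launches.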