jxmorris12 / vec2text

utilities for decoding deep representations (like sentence embeddings) back to text

DDP error when running training script on mac #26

Closed: sciencecw closed this issue 7 months ago

sciencecw commented 7 months ago

I tried to run the trainer on an M1 Mac and it gives me errors related to distributed computing, which I am not familiar with. Any help would be appreciated!

python run.py --per_device_train_batch_size 128 --per_device_eval_batch_size 128 --max_seq_length 128 --model_name_or_path t5-base --dataset_name msmarco --embedder_model_name gtr_base --num_repeat_tokens 16 --embedder_no_grad True --num_train_epochs 100 --max_eval_samples 500 --eval_steps 20000 --warmup_steps 10000 --use_frozen_embeddings_as_input True --experiment inversion --lr_scheduler_type constant_with_warmup  --learning_rate 0.001 --output_dir ./saves/gtr-1
Set num workers to 0
Experiment output_dir = ./saves/gtr-1
01/22/2024 23:31:59 - WARNING - experiments - Process rank: 0, device: mps, n_gpu: 1, fp16 training: False, bf16 training: False
/Users/sciencecw/Repos/references/vec2text/.venv/lib/python3.11/site-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  return self.fget.__get__(instance, owner)()
/Users/sciencecw/Repos/references/vec2text/.venv/lib/python3.11/site-packages/transformers/models/t5/tokenization_t5_fast.py:160: FutureWarning: This tokenizer was incorrectly instantiated with a model max length of 512 which will be corrected in Transformers v5.
For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.
- To avoid this warning, please instantiate this tokenizer with `model_max_length` set to your preferred value.
  warnings.warn(
Loading datasets with TOKENIZERS_PARALLELISM = False
>> using fast tokenizers: True True
Traceback (most recent call last):
  File "/Users/sciencecw/Repos/references/vec2text/vec2text/utils/utils.py", line 111, in dataset_map_multi_worker
    rank = torch.distributed.get_rank()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sciencecw/Repos/references/vec2text/.venv/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 1469, in get_rank
    default_pg = _get_default_group()
                 ^^^^^^^^^^^^^^^^^^^^
  File "/Users/sciencecw/Repos/references/vec2text/.venv/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 940, in _get_default_group
    raise RuntimeError(
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/sciencecw/Repos/references/vec2text/vec2text/run.py", line 16, in <module>
    main()
  File "/Users/sciencecw/Repos/references/vec2text/vec2text/run.py", line 12, in main
    experiment.run()
  File "/Users/sciencecw/Repos/references/vec2text/vec2text/experiments.py", line 158, in run
    self.train()
  File "/Users/sciencecw/Repos/references/vec2text/vec2text/experiments.py", line 175, in train
    trainer = self.load_trainer()
              ^^^^^^^^^^^^^^^^^^^
  File "/Users/sciencecw/Repos/references/vec2text/vec2text/experiments.py", line 631, in load_trainer
    train_dataset, eval_dataset = self.load_train_and_val_datasets(
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sciencecw/Repos/references/vec2text/vec2text/experiments.py", line 570, in load_train_and_val_datasets
    train_datasets = self._load_train_dataset_uncached(
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sciencecw/Repos/references/vec2text/vec2text/experiments.py", line 396, in _load_train_dataset_uncached
    raw_datasets[key] = dataset_map_multi_worker(
                        ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sciencecw/Repos/references/vec2text/vec2text/utils/utils.py", line 118, in dataset_map_multi_worker
    kwargs["num_proc"] = kwargs.get("num_proc", len(os.sched_getaffinity(0)))
                                                    ^^^^^^^^^^^^^^^^^^^^
AttributeError: module 'os' has no attribute 'sched_getaffinity'
[2024-01-22 23:32:03,873] torch._dynamo.utils: [INFO] TorchDynamo compilation metrics:
[2024-01-22 23:32:03,873] torch._dynamo.utils: [INFO] Function    Runtimes (s)
[2024-01-22 23:32:03,873] torch._dynamo.utils: [INFO] ----------  --------------
jxmorris12 commented 7 months ago

Hi! It's a long stack trace, but this is the issue:

    kwargs["num_proc"] = kwargs.get("num_proc", len(os.sched_getaffinity(0)))

I think someone fixed this somewhere else in our code, but the idea is that os.sched_getaffinity only works on Linux. Let me try to fix this for you.
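
For reference, a minimal sketch of the cross-platform fallback (the helper names here are illustrative, not necessarily the exact code that was pushed) would catch the missing attribute and fall back to os.cpu_count(), and guard the rank lookup from the same traceback in a similar way:

    import os
    import torch

    def safe_num_proc() -> int:
        # os.sched_getaffinity only exists on Linux; macOS and Windows
        # raise AttributeError, so fall back to os.cpu_count() there.
        try:
            return len(os.sched_getaffinity(0))
        except AttributeError:
            return os.cpu_count() or 1

    def safe_rank() -> int:
        # torch.distributed.get_rank() raises a RuntimeError when no process
        # group has been initialized (e.g. a single-process run on MPS).
        if torch.distributed.is_available() and torch.distributed.is_initialized():
            return torch.distributed.get_rank()
        return 0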

jxmorris12 commented 7 months ago

Hi again @sciencecw -- I just pushed a new commit that should hopefully fix this error. The new release isn't on PyPI yet, so you'll have to download the code and install from source (pip install -e .). Can you try that, rerun your command, and let me know how it goes?

sciencecw commented 7 months ago

Almost there -- I just had to remove the assert torch.cuda.is_available() check, and then it runs.

(I found other issues with using sentence transformers; I will open another issue for that.)
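
For anyone hitting the same assert on Apple Silicon, a hedged sketch of a more permissive device choice (the helper name is illustrative and this is not the repository's exact code) might look like:

    import torch

    def pick_device() -> torch.device:
        # Prefer CUDA, then Apple's MPS backend, then CPU, instead of
        # hard-asserting that CUDA must be available.
        if torch.cuda.is_available():
            return torch.device("cuda")
        if torch.backends.mps.is_available():
            return torch.device("mps")
        return torch.device("cpu")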

jxmorris12 commented 7 months ago

great!