tunib-ai / parallelformers

Parallelformers: An Efficient Model Parallelization Toolkit for Deployment
https://tunib-ai.github.io/parallelformers
Apache License 2.0

Title: RuntimeError: Timed out initializing process group in store based barrier #54

Open hugocool opened 1 year ago

hugocool commented 1 year ago

    from transformers import TrainingArguments
    import torch

    # Count the GPUs visible to PyTorch.
    num_gpus = torch.cuda.device_count()
    if num_gpus > 1:
        from parallelformers import parallelize

        # `model` is created earlier; its definition is not shown in this report.
        parallelize(model, num_gpus=num_gpus, fp16=True, verbose="detail")

This gives:

    RuntimeError: Timed out initializing process group in store based barrier on rank: 7, for key: store_based_barrier_key:1 (world_size=8, worker_count=9, timeout=0:30:00)
    WARNING  No nodes ran. Repeat the previous command to attempt a new run.  runner.py:213
    [10/15/23 12:57:26] ERROR  Node 'sort_using_baal: preprocess_and_sort([baal.reed_textkernel_labeled,params:reed.pretrained_model_name,reed.aimwel_labeled.finetuned_pre_trained_isco_classifier]) -> [reed.textkernel_labeled.sorted_jobs,baal.reed_textkernel_labeled_parquet]' failed with error: Timed out initializing process group in store based barrier on rank: 7, for key: store_based_barrier_key:1 (world_size=8, worker_count=9, timeout=0:30:00)  node.py:356
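One detail in the message may be worth a closer look: it reports worker_count=9 against world_size=8. Below is a minimal sketch for pulling those numbers out of such a message. The helper name `parse_barrier_error` is hypothetical and is not part of PyTorch or parallelformers. Reading a surplus as "one more process registered on the barrier key than the group expects" is an assumption, not a confirmed diagnosis of this issue.

```python
import re


# Hypothetical helper for illustration; not part of PyTorch or parallelformers.
def parse_barrier_error(message: str) -> dict:
    """Extract rank, world_size and worker_count from a barrier timeout message."""
    pattern = re.compile(
        r"on rank: (?P<rank>\d+).*?"
        r"world_size=(?P<world_size>\d+), worker_count=(?P<worker_count>\d+)"
    )
    match = pattern.search(message)
    if match is None:
        raise ValueError("message does not look like a store-based barrier timeout")
    return {name: int(value) for name, value in match.groupdict().items()}


# The message text is copied from the traceback in this report.
message = (
    "Timed out initializing process group in store based barrier on rank: 7, "
    "for key: store_based_barrier_key:1 "
    "(world_size=8, worker_count=9, timeout=0:30:00)"
)
info = parse_barrier_error(message)

# Assumption: a positive surplus means more processes joined than expected.
extra_workers = info["worker_count"] - info["world_size"]
print(info, "extra workers:", extra_workers)
```

If the surplus is positive, it may be worth checking whether `parallelize` was called more than once in the same run or whether worker processes from an earlier run were still alive. Both are guesses to rule out.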

Environment

Python: 3.10.1
parallelformers: latest
OS: Ubuntu