huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers

Loading HuBERT models with DeepSpeed ZeRO-3 causes program to hang #31797

Open · anferico opened this issue 2 months ago

anferico commented 2 months ago

System Info

Who can help?

@sanchit-gandhi @muellerzr

Information

Tasks

Reproduction

hubert_mre.py:

from transformers import AutoConfig, HubertModel, TrainingArguments, HfArgumentParser

def main():
    # Parsing TrainingArguments with --deepspeed zero3.json registers the
    # ZeRO-3 config, so from_pretrained() below initializes the model under
    # deepspeed.zero.Init with partitioned weights.
    parser = HfArgumentParser(TrainingArguments)
    training_args = parser.parse_args_into_dataclasses()[0]

    config = AutoConfig.from_pretrained("facebook/hubert-large-ls960-ft")

    # On 2 GPUs, the program hangs inside this call.
    model = HubertModel.from_pretrained(
        "facebook/hubert-large-ls960-ft", config=config
    )

if __name__ == "__main__":
    main()

hubert_mre.sh:

# Verbose NCCL logging and synchronous CUDA launches, to help localize the hang
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=ALL
export CUDA_LAUNCH_BLOCKING=1

OUTPUT_DIR=$HOME/hubert_mre

deepspeed \
    --num_gpus 2 \
    --master_port 60000 \
    ./hubert_mre.py \
    --output_dir $OUTPUT_DIR \
    --deepspeed zero3.json

zero3.json:

{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "bf16": {
        "enabled": "auto"
    },
    "train_micro_batch_size_per_gpu": "auto",
    "train_batch_size": "auto",
    "gradient_accumulation_steps": "auto",
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": true
    }
}

Run hubert_mre.sh and watch the script hang indefinitely.
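
To see where each rank is stuck, one quick diagnostic is to have Python dump every thread's stack after a timeout. Below is a minimal sketch using the standard-library faulthandler module (the 60-second timeout is an arbitrary choice); add it at the top of hubert_mre.py:

import faulthandler
import sys

# If the process is still alive after 60 seconds, print every thread's
# traceback to stderr and exit, revealing the frame where each rank blocks.
faulthandler.dump_traceback_later(60, exit=True, file=sys.stderr)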

The curious thing is that this seems to happen only with HuBERT models. If, for example, you replace HubertModel.from_pretrained("facebook/hubert-large-ls960-ft") with Wav2Vec2BertModel.from_pretrained("facebook/w2v-bert-2.0"), the script runs just fine.
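
One hedged guess as to why HuBERT is special: HubertPositionalConvEmbedding wraps its Conv1d in weight_norm, and when ZeRO-3 is enabled its initialization goes through a deepspeed.zero.GatheredParameters block, i.e. a collective call that can deadlock if the ranks do not reach it consistently. Wav2Vec2BertModel uses a different positional embedding without weight norm, which would be consistent with it not hanging. This is an assumption, not a verified diagnosis. A single-process check that the weight-normed layer is actually there (attribute path taken from modeling_hubert.py; the derived parameter names differ across torch versions):

from transformers import HubertModel

model = HubertModel.from_pretrained("facebook/hubert-large-ls960-ft")

# weight_norm replaces `weight` with derived parameters (weight_g/weight_v on
# older torch, parametrizations.weight.original0/original1 on newer torch).
conv = model.encoder.pos_conv_embed.conv
print([name for name, _ in conv.named_parameters()])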

Also, this works fine if you pass --num_gpus 1.
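
A possible workaround sketch, assuming the hang happens inside deepspeed.zero.Init during from_pretrained: load the model before parsing TrainingArguments. The ZeRO-3 config is only registered when TrainingArguments is constructed, so loading first materializes the weights normally and leaves partitioning to DeepSpeed engine setup, at the cost of each rank briefly holding the full model in CPU memory:

from transformers import AutoConfig, HubertModel, TrainingArguments, HfArgumentParser

def main():
    # Load first: no ZeRO-3 config is registered yet, so from_pretrained()
    # does not enter deepspeed.zero.Init.
    config = AutoConfig.from_pretrained("facebook/hubert-large-ls960-ft")
    model = HubertModel.from_pretrained(
        "facebook/hubert-large-ls960-ft", config=config
    )

    # Parse the DeepSpeed-enabled TrainingArguments afterwards.
    parser = HfArgumentParser(TrainingArguments)
    training_args = parser.parse_args_into_dataclasses()[0]

if __name__ == "__main__":
    main()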

Expected behavior

The script runs to completion without hanging indefinitely.

anferico commented 1 month ago

Any help with this? @sanchit-gandhi @muellerzr

github-actions[bot] commented 4 weeks ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

anferico commented 4 weeks ago

Up @sanchit-gandhi @muellerzr