huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Loading HuBERT models with DeepSpeed ZeRO-3 causes program to hang #31797

Open · anferico opened 5 months ago

anferico commented 5 months ago

System Info

Who can help?

@sanchit-gandhi @muellerzr

Information

Tasks

Reproduction

hubert_mre.py:

from transformers import AutoConfig, HubertModel, TrainingArguments, HfArgumentParser

def main():
    # Parsing --deepspeed zero3.json into TrainingArguments is what
    # activates the DeepSpeed ZeRO-3 integration in transformers
    parser = HfArgumentParser(TrainingArguments)
    training_args = parser.parse_args_into_dataclasses()[0]

    config = AutoConfig.from_pretrained("facebook/hubert-large-ls960-ft")

    # Loading the HuBERT checkpoint is where the program hangs
    model = HubertModel.from_pretrained(
        "facebook/hubert-large-ls960-ft", config=config
    )

if __name__ == "__main__":
    main()

hubert_mre.sh:

# Verbose NCCL logging, useful when a collective never completes
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=ALL
# Synchronous kernel launches so CUDA errors surface at the offending call
export CUDA_LAUNCH_BLOCKING=1

OUTPUT_DIR=$HOME/hubert_mre

deepspeed \
    --num_gpus 2 \
    --master_port 60000 \
    ./hubert_mre.py \
    --output_dir $OUTPUT_DIR \
    --deepspeed zero3.json
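
For completeness (an assumption on my side, not something I verified for this report): the same repro should also be launchable through torchrun instead of the deepspeed launcher, which can help rule out the launcher itself:

torchrun \
    --nproc_per_node 2 \
    --master_port 60000 \
    ./hubert_mre.py \
    --output_dir $OUTPUT_DIR \
    --deepspeed zero3.json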

zero3.json:

{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "bf16": {
        "enabled": "auto"
    },
    "train_micro_batch_size_per_gpu": "auto",
    "train_batch_size": "auto",
    "gradient_accumulation_steps": "auto",
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": true
    }
}

Run hubert_mre.sh and watch the script hang indefinitely.
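
To see where each rank is blocked, one option is to dump the Python stacks of the hung processes with py-spy (a third-party tool; the PIDs below are placeholders for the two ranks):

pip install py-spy
py-spy dump --pid <rank0_pid>   # stack of rank 0
py-spy dump --pid <rank1_pid>   # stack of rank 1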

The curious thing is that this seems to happen only with HuBERT models. If, for example, you replace HubertModel.from_pretrained("facebook/hubert-large-ls960-ft") with Wav2Vec2BertModel.from_pretrained("facebook/w2v-bert-2.0"), the script runs just fine (see the snippet below).
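
For reference, the substitution that works:

from transformers import Wav2Vec2BertModel

# Identical setup, different model class: this loads fine under ZeRO-3
model = Wav2Vec2BertModel.from_pretrained("facebook/w2v-bert-2.0")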

Also, the script runs fine if you pass --num_gpus 1 to the deepspeed launcher; the sketch below suggests why the rank count might matter.
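
That pattern points at a stuck collective during model loading. With a ZeRO-3 config active, from_pretrained builds the model under deepspeed.zero.Init, which shards each parameter across ranks as it is created, so all ranks must construct modules in lockstep. A minimal sketch of that mechanism (an illustration under a multi-rank torch.distributed launch, not the actual transformers code path):

import deepspeed
import torch.nn as nn

# Inside zero.Init, every nn.Parameter is partitioned across ranks the
# moment it is allocated; the partitioning involves collective calls, so
# any rank that diverges deadlocks the others. With --num_gpus 1 there
# is no peer to wait for, which would explain the single-GPU success.
with deepspeed.zero.Init():
    layer = nn.Linear(1024, 1024)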

Expected behavior

The script runs to completion without hanging indefinitely.

anferico commented 4 months ago

Any help with this? @sanchit-gandhi @muellerzr

github-actions[bot] commented 3 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

anferico commented 3 months ago

Bumping this: @sanchit-gandhi @muellerzr

github-actions[bot] commented 2 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

LysandreJik commented 2 months ago

Pinging @ylacombe and @eustlb as Sanchit is away for a few months

ylacombe commented 2 months ago

Hey @eustlb, I've never used DeepSpeed myself; would you like to take a stab at it? If not, I'll try to reproduce the issue on my side.

github-actions[bot] commented 1 month ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

Rocketknight1 commented 1 month ago

Gentle ping @eustlb @ylacombe!

github-actions[bot] commented 1 week ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.