huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers

Loading HuBERT models with DeepSpeed ZeRO-3 causes program to hang #31797

Open · anferico opened this issue 2 months ago

anferico commented 2 months ago

System Info

Who can help?

@sanchit-gandhi @muellerzr

Information

Tasks

Reproduction

hubert_mre.py:

from transformers import AutoConfig, HubertModel, TrainingArguments, HfArgumentParser

def main():
    # Parsing TrainingArguments with --deepspeed zero3.json registers the
    # ZeRO-3 config, so from_pretrained() below initializes the model under
    # deepspeed.zero.Init with partitioned weights.
    parser = HfArgumentParser(TrainingArguments)
    training_args = parser.parse_args_into_dataclasses()[0]

    config = AutoConfig.from_pretrained("facebook/hubert-large-ls960-ft")

    # On 2 GPUs, the program hangs inside this call.
    model = HubertModel.from_pretrained(
        "facebook/hubert-large-ls960-ft", config=config
    )

if __name__ == "__main__":
    main()

hubert_mre.sh:

# Verbose NCCL logging and synchronous CUDA launches, to help localize the hang
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=ALL
export CUDA_LAUNCH_BLOCKING=1

OUTPUT_DIR=$HOME/hubert_mre

deepspeed \
    --num_gpus 2 \
    --master_port 60000 \
    ./hubert_mre.py \
    --output_dir $OUTPUT_DIR \
    --deepspeed zero3.json

zero3.json:

{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "bf16": {
        "enabled": "auto"
    },
    "train_micro_batch_size_per_gpu": "auto",
    "train_batch_size": "auto",
    "gradient_accumulation_steps": "auto",
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": true
    }
}

Run hubert_mre.sh and watch the script hang indefinitely.
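
To see where each rank is stuck, one quick diagnostic is to have Python dump every thread's stack after a timeout. Below is a minimal sketch using the standard-library faulthandler module (the 60-second timeout is an arbitrary choice); add it at the top of hubert_mre.py:

import faulthandler
import sys

# If the process is still alive after 60 seconds, print every thread's
# traceback to stderr and exit, revealing the frame where each rank blocks.
faulthandler.dump_traceback_later(60, exit=True, file=sys.stderr)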

The curious thing is that this seems to happen only with HuBERT models. If, for example, you replace HubertModel.from_pretrained("facebook/hubert-large-ls960-ft") with Wav2Vec2BertModel.from_pretrained("facebook/w2v-bert-2.0"), the script runs just fine.
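
One hedged guess as to why HuBERT is special: HubertPositionalConvEmbedding wraps its Conv1d in weight_norm, and when ZeRO-3 is enabled its initialization goes through a deepspeed.zero.GatheredParameters block, i.e. a collective call that can deadlock if the ranks do not reach it consistently. Wav2Vec2BertModel uses a different positional embedding without weight norm, which would be consistent with it not hanging. This is an assumption, not a verified diagnosis. A single-process check that the weight-normed layer is actually there (attribute path taken from modeling_hubert.py; the derived parameter names differ across torch versions):

from transformers import HubertModel

model = HubertModel.from_pretrained("facebook/hubert-large-ls960-ft")

# weight_norm replaces `weight` with derived parameters (weight_g/weight_v on
# older torch, parametrizations.weight.original0/original1 on newer torch).
conv = model.encoder.pos_conv_embed.conv
print([name for name, _ in conv.named_parameters()])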

Also, this works fine if you pass --num_gpus 1.
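
A possible workaround sketch, assuming the hang happens inside deepspeed.zero.Init during from_pretrained: load the model before parsing TrainingArguments. The ZeRO-3 config is only registered when TrainingArguments is constructed, so loading first materializes the weights normally and leaves partitioning to DeepSpeed engine setup, at the cost of each rank briefly holding the full model in CPU memory:

from transformers import AutoConfig, HubertModel, TrainingArguments, HfArgumentParser

def main():
    # Load first: no ZeRO-3 config is registered yet, so from_pretrained()
    # does not enter deepspeed.zero.Init.
    config = AutoConfig.from_pretrained("facebook/hubert-large-ls960-ft")
    model = HubertModel.from_pretrained(
        "facebook/hubert-large-ls960-ft", config=config
    )

    # Parse the DeepSpeed-enabled TrainingArguments afterwards.
    parser = HfArgumentParser(TrainingArguments)
    training_args = parser.parse_args_into_dataclasses()[0]

if __name__ == "__main__":
    main()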

Expected behavior

The script runs to completion without hanging indefinitely.

anferico commented 1 month ago

Any help with this? @sanchit-gandhi @muellerzr

github-actions[bot] commented 4 weeks ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

anferico commented 4 weeks ago

Up @sanchit-gandhi @muellerzr