huggingface / optimum-habana

Easy and lightning fast training of 🤗 Transformers on Habana Gaudi processor (HPU)
Apache License 2.0

Device Acquire failed #367

Closed DannyAtal closed 8 months ago

DannyAtal commented 1 year ago

System Info

Running this command on a single Gaudi device works very well:
python optimum-habana/examples/language-modeling/run_lora_clm.py \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --train_file merged_final_ultimate_andy.json \
    --bf16 True \
    --output_dir ./model_lora_llama \
    --num_train_epochs 3 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 4 \
    --evaluation_strategy no \
    --save_strategy steps \
    --save_total_limit 1 \
    --learning_rate 1e-4 \
    --logging_steps 1 \
    --dataset_concatenation \
    --do_train \
    --use_habana \
    --use_lazy_mode \
    --throughput_warmup_steps 1

and it finishes the training on Gaudi.

But running this command fails:
python optimum-habana/examples/gaudi_spawn.py \
    --world_size 4 --use_mpi optimum-habana/examples/language-modeling/run_lora_clm.py \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --train_file merged_final_ultimate_andy.json \
    --bf16 True \
    --output_dir ./model_lora_llama \
    --num_train_epochs 3 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 4 \
    --evaluation_strategy no \
    --save_strategy steps \
    --save_total_limit 1 \
    --learning_rate 1e-4 \
    --logging_steps 1 \
    --dataset_concatenation \
    --do_train \
    --use_habana \
    --use_lazy_mode \
    --throughput_warmup_steps 1

The same happens when using only 2 devices:
python optimum-habana/examples/gaudi_spawn.py \
    --world_size 2 --use_mpi optimum-habana/examples/language-modeling/run_lora_clm.py \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --train_file merged_final_ultimate_andy.json \
    --bf16 True \
    --output_dir ./model_lora_llama \
    --num_train_epochs 3 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 4 \
    --evaluation_strategy no \
    --save_strategy steps \
    --save_total_limit 1 \
    --learning_rate 1e-4 \
    --logging_steps 1 \
    --dataset_concatenation \
    --do_train \
    --use_habana \
    --use_lazy_mode \
    --throughput_warmup_steps 1

This command also fails with a "failed to acquire device" error:
python optimum-habana/examples/gaudi_spawn.py \
    --world_size 4 --use_deepspeed optimum-habana/examples/language-modeling/run_lora_clm.py \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --train_file merged_final_ultimate_andy.json \
    --bf16 True \
    --output_dir ./model_lora_llama \
    --num_train_epochs 3 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 4 \
    --evaluation_strategy no \
    --save_strategy steps \
    --save_total_limit 1 \
    --learning_rate 1e-4 \
    --logging_steps 1 \
    --dataset_concatenation \
    --do_train \
    --use_habana \
    --use_lazy_mode \
    --throughput_warmup_steps 1 \
    --deepspeed gaudi_config.json
I used this DeepSpeed config file:

{
    "steps_per_print": 64,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "bf16": {
        "enabled": true
    },
    "gradient_clipping": 1.0,
    "zero_optimization": {
        "stage": 2,
        "overlap_comm": false,
        "reduce_scatter": false,
        "contiguous_gradients": false
    }
}

Note: I'm using 7 devices in my template, which gives me 7 HPUs.

For a while now I have been facing issues running distributed training across multiple Gaudi devices. I really want to run a 70B Llama model, but for now I'm stuck.

Information

Tasks

Reproduction

Described in the System Info section above.

Expected behavior

Multi-card training should launch and run without the device acquire failure.

DannyAtal commented 1 year ago
[screenshot of the error attached]
DannyAtal commented 1 year ago

Another issue I faced with Gaudi while trying to write my own training script is this: [screenshot attached]

I'm facing this whenever I use device_map="auto".

When using device_map="hpu" instead, I receive this error: [screenshot attached]

I believe this is due to loading the shards onto the device and then trying to train on them. The best practice would be device_map="auto", but that again runs into issues with intel_extension_for_pytorch (XPU). I would love it if you could give me some hints here.

regisss commented 1 year ago

Hi @DannyAtal! Running the LoRA example with DeepSpeed is currently not optimized for HPUs and requires a lot of memory. We are working on improving this! But it's weird that you get a device acquisition issue; I'm going to look into it.

Regarding the multi-card examples, let me try to reproduce it and I'll get back to you.

The undefined symbol error means that you installed a version of PyTorch that is not the one provided by Habana. I think Intel Extension for PyTorch targets Intel CPUs and GPUs; why do you want to use it on Gaudi?
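
If it helps, a quick way to check what actually got installed (only plain python/pip here, nothing Habana-specific assumed):
# print the PyTorch version string; it should be the Habana-provided build, not a stock PyPI one
python3 -c "import torch; print(torch.__version__)"
# list torch- and habana-related packages to see whether another torch wheel was installed on top
pip list | grep -iE "torch|habana"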

The other error means that you're trying to move to HPU a tensor that requires more memory than available.

DannyAtal commented 1 year ago

@regisss Yes, but without LoRA I'm receiving this error:

[screenshot of the error attached]

That error is related to the Transformers version. At first I used 4.28.1, which works with LoRA, but then the code specified that I need 4.32.0, and when I edited the version check in the code back to 4.28.1 I received this error:

[screenshot of the error attached]

It seems to be searching for the config file on the Hugging Face Hub and not locally.

About the PyTorch version and the undefined symbol error I got when running my own code: if it's not the right one, how come it works with your code on one or two HPUs? I'm also trying the Intel Extension for Transformers example, and apparently it is needed for inference.

About the last error: yes, that occurs when I try a bigger model such as Llama 13B or 70B, and I want to be able to use the 70B eventually. So how can we distribute the data between the cards? I have 3 nodes, each containing 8 HPUs, but they run on Kubernetes, so I don't think we can use them all in the same job; I can use up to 8 for the same job.

Any suggestions?

DannyAtal commented 1 year ago

@regisss Also, when running the same command on Gaudi2 I get this error:

[screenshot of the error attached]

This command:

python3 optimum-habana/examples/language-modeling/run_lora_clm.py \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --train_file merged_final_ultimate_andy.json \
    --bf16 True \
    --output_dir ./model_lora_llama \
    --num_train_epochs 3 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 4 \
    --evaluation_strategy no \
    --save_strategy steps \
    --save_steps 2000 \
    --save_total_limit 1 \
    --learning_rate 1e-4 \
    --logging_steps 1 \
    --dataset_concatenation \
    --do_train \
    --use_habana \
    --use_lazy_mode \
    --throughput_warmup_steps 1

regisss commented 1 year ago

That error is related to the Transformers version. At first I used 4.28.1, which works with LoRA, but then the code specified that I need 4.32.0

Yes, you need Transformers v4.32 with the latest release. For multi-card LoRA, you can install the library from source, because we merged a fix this morning.
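
If it helps, installing from source is the usual pip-from-git command (assuming the default branch):
pip install git+https://github.com/huggingface/optimum-habana.git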

and when I edited the version check in the code back to 4.28.1 I received this error

It looks for the config file both online and locally. Are you sure the path you gave is the right one? It's looking for None as the path. You need to pass the --gaudi_config_name arg to run_clm.py; for instance, --gaudi_config_name Habana/gpt2 should work well with Llama.
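
For example, something along these lines should pick up the Gaudi config (only a sketch: the dataset and output directory are placeholders to adapt to your setup):
python3 optimum-habana/examples/language-modeling/run_clm.py \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --gaudi_config_name Habana/gpt2 \
    --dataset_name tatsu-lab/alpaca \
    --do_train \
    --output_dir ./model_clm_llama \
    --bf16 True \
    --use_habana \
    --use_lazy_mode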

About the PyTorch version and the undefined symbol error I got when running my own code: if it's not the right one, how come it works with your code on one or two HPUs? I'm also trying the Intel Extension for Transformers example, and apparently it is needed for inference.

My guess is that the Intel extension overrides only a few methods that were not used by the examples you ran. But there is absolutely no guarantee that it will work with Gaudi as it's not meant for it. Where did you see that it is needed for inference?

About the last error: yes, that occurs when I try a bigger model such as Llama 13B or 70B, and I want to be able to use the 70B eventually. So how can we distribute the data between the cards?

We need DeepSpeed for this but it doesn't work well with LoRA at the moment. We are investigating it. See how to use DeepSpeed on Gaudi if you want to try: https://huggingface.co/docs/optimum/habana/usage_guides/deepspeed
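
If you want to try it on a full node anyway, the launch looks like your earlier 4-card command, just spread over 8 cards (a trimmed sketch; ds_config.json is a placeholder for whatever DeepSpeed config you use, e.g. the ZeRO-2 one you posted above):
python optimum-habana/examples/gaudi_spawn.py \
    --world_size 8 --use_deepspeed optimum-habana/examples/language-modeling/run_lora_clm.py \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --dataset_name tatsu-lab/alpaca \
    --bf16 True \
    --output_dir ./model_lora_llama \
    --do_train \
    --use_habana \
    --use_lazy_mode \
    --deepspeed ds_config.json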

Also, when running the same command on Gaudi2 I get this error:

I cannot reproduce it on Gaudi2 with --dataset_name tatsu-lab/alpaca. Could you share merged_final_ultimate_andy.json or find a public dataset so that I can reproduce it?

DannyAtal commented 1 year ago

Yes, it is reproduced for me with this dataset on Gaudi2 as well. Note that I'm using Gaudi2 on Kubernetes. It was reproduced with this command:

python3 optimum-habana/examples/language-modeling/run_lora_clm.py \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --dataset_name tatsu-lab/alpaca \
    --bf16 True \
    --output_dir ./model_lora_llama \
    --num_train_epochs 3 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 4 \
    --evaluation_strategy no \
    --save_strategy steps \
    --save_steps 2000 \
    --save_total_limit 1 \
    --learning_rate 1e-4 \
    --logging_steps 1 \
    --dataset_concatenation \
    --do_train \
    --use_habana \
    --use_lazy_mode \
    --throughput_warmup_steps 1

regisss commented 1 year ago

I cannot reproduce it on my Gaudi2 instance. Maybe it's related to Kubernetes; I suggest you ask on Habana's forum: https://forum.habana.ai/

ZhaiFeiyue commented 1 year ago

@DannyAtal Which release are you using? Could you supply more info, e.g. the hl-smi output, the Habana release version, and the Docker image?

DannyAtal commented 1 year ago

@ZhaiFeiyue These are the versions I'm using:

[screenshot of the installed versions]

and this is the Docker image I'm using: public.ecr.aws/habanalabs/pytorch-installer:2.0.1-ubuntu20.04-1.11.0-latest

These are the devices:

[screenshot of the devices]

I'm pulling the optimum-habana master version and installing this from PyPI: pip install optimum-habana==1.7.2, along with this DeepSpeed version: pip install git+https://github.com/HabanaAI/DeepSpeed.git@1.11.0

ZhaiFeiyue commented 1 year ago

@DannyAtal Can you run hl-smi -L and provide the output?

DannyAtal commented 1 year ago

@ZhaiFeiyue Sure, here it is: hl-smi-l.txt

ZhaiFeiyue commented 1 year ago

@DannyAtal The driver, firmware, and Docker image already seem to match each other. Have you followed the k8s guide to set up your environment? Or have you tried running a model on Gaudi with k8s before?

DannyAtal commented 1 year ago

@ZhaiFeiyue No, I didn't install it; it was someone from Habana who installed it. It runs on a single card but fails on multiple cards. I also didn't try multi-node, because if it's not working on a single node, it won't work across multiple nodes. Yes, I was able to run a model on Gaudi DL1 and DL2 in k8s, but only on a single device, not on multiple cards.

regisss commented 11 months ago

@DannyAtal Have you managed to solve this?