Closed · DannyAtal closed this issue 8 months ago
Another issue I faced with Gaudi while trying to create my own training script:
I'm facing this whenever I use `device_map="auto"`.
When using `device_map="hpu"` instead, I receive this error:
I believe this is due to loading the shards to the device and trying to train on them. The best practice would be to use `"auto"`, which again runs into issues with `intel_extension_for_pytorch` (XPU). I would love it if you could give me some hints here.
Hi @DannyAtal! Running the LoRA example with DeepSpeed is currently not optimized for HPUs and requires a lot of memory. We are working on improving this! But it's weird that you get a device acquisition issue; I'm going to look into it.
Regarding the multi-card examples, let me try to reproduce it and I'll get back to you.
The undefined symbol error means that you installed a version of PyTorch that is not the one provided by Habana. I think the Intel extension for PyTorch targets Intel CPUs and GPUs; why do you want to use it on Gaudi?
The other error means that you're trying to move to the HPU a tensor that requires more memory than is available.
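As a back-of-the-envelope check (my own sketch, not from this thread), you can estimate whether a checkpoint's weights even fit in a single card's memory before moving the model to `hpu`. This counts weights only, assuming bf16 (2 bytes per parameter), and ignores activations, gradients, optimizer state, and framework overhead, so the real requirement is considerably higher:

```python
def model_memory_gib(n_params: float, bytes_per_param: int = 2) -> float:
    """Rough weight-only memory footprint in GiB (bf16 -> 2 bytes/param)."""
    return n_params * bytes_per_param / 1024**3

# Illustrative, approximate parameter counts
for name, n_params in [("Llama-2-7b", 7e9), ("Llama-2-13b", 13e9), ("Llama-2-70b", 70e9)]:
    need = model_memory_gib(n_params)
    # First-gen Gaudi cards have 32 GB of HBM; Gaudi2 has 96 GB
    print(f"{name}: ~{need:.0f} GiB for weights alone; fits on one Gaudi2 card: {need < 96}")
```

The 70B model needs roughly 130 GiB for the weights alone, so it can never fit on a single card and must be sharded across devices (e.g. with DeepSpeed), which matches the out-of-memory behavior described above.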
@regisss yes, but without LoRA I'm receiving this error:
And that error is related to the transformers version: at first I used 4.28.1, which works with LoRA, but then the code specified that I need 4.32.0, and when I edited the version in the code back to 4.28.1 I received this error:
It seems to be trying to search for the config file on the HF Hub and not locally.
About the PyTorch version: when I tried to run my code I got the undefined symbol error. If it's not the right version, how come it works with your code on single/double HPUs? I'm also trying the Intel Extension for Transformers example, and apparently it is needed for inference.
About the last error: yes, that occurs when I try a bigger model such as Llama 13B or 70B, and I want to be able to use the 70B eventually. So how can we distribute the data between the cards? I have 3 nodes, each containing 8 HPUs, but they run on Kubernetes, so I don't think we can use them all in the same job; I can use up to 8 for the same job.
Any suggestions?
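For reference (my own sketch, not from the thread), spreading samples across cards is plain data parallelism, so with the flags used in the command above the effective global batch size scales linearly with the number of HPUs:

```python
def global_batch_size(per_device_batch: int, grad_accum: int, n_devices: int) -> int:
    """Effective batch size under data parallelism: each of the n_devices
    processes per_device_batch samples per step, accumulated grad_accum times
    before an optimizer update."""
    return per_device_batch * grad_accum * n_devices

# Flags from the command in this thread, scaled from 1 card to a full node
print(global_batch_size(2, 4, 1))  # single card
print(global_batch_size(2, 4, 8))  # one node of 8 HPUs
```

Note that data parallelism alone still replicates the full model on every card; to fit a model that is too large for one card, the weights themselves must be sharded (ZeRO/DeepSpeed), which is addressed below.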
@regisss also, when running the same command with Gaudi2, I get this error:
This command:

```
python3 optimum-habana/examples/language-modeling/run_lora_clm.py \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --train_file merged_final_ultimate_andy.json \
    --bf16 True \
    --output_dir ./model_lora_llama \
    --num_train_epochs 3 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 4 \
    --evaluation_strategy no \
    --save_strategy steps \
    --save_steps 2000 \
    --save_total_limit 1 \
    --learning_rate 1e-4 \
    --logging_steps 1 \
    --dataset_concatenation \
    --do_train \
    --use_habana \
    --use_lazy_mode \
    --throughput_warmup_steps 1
```
> and that error is related to the transformers version, which at the first place I used 4.28.1 that is working with LoRA but then the code specified that I need 4.32.0
Yes, you need Transformers v4.32 with the latest release. For multi-card LoRA, you can install the library from source, because we merged a fix this morning.
> and when I edited the version in the code to 4.28.1 I received this error
It looks for the config file both online and locally. Are you sure the path you gave is the right one? Because it's looking for `None` as the path. You need to give the `--gaudi_config_name` arg for `run_clm.py`; for instance, `--gaudi_config_name Habana/gpt2` should work well with Llama.
> about the Pytorch version when I tried to run in my code with The undefined symbol error, if its not the right one how come it works with your code on single/double HPUs. Im also trying Intel Extension for transformers example and apparently it is needed for the inference.
My guess is that the Intel extension only overrides a few methods that were not used by the examples you ran. But there is absolutely no guarantee that it will work with Gaudi, as it's not meant for it. Where did you see that it is needed for inference?
> about the last error yes thats occur when I try bigger model such as Llama 13B or 70B and I want to be able to use the 70B eventually. so how can we distribute the data between the cards?
We need DeepSpeed for this, but it doesn't work well with LoRA at the moment. We are investigating it. See how to use DeepSpeed on Gaudi if you want to try it: https://huggingface.co/docs/optimum/habana/usage_guides/deepspeed
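For anyone trying the DeepSpeed route from that guide, a minimal ZeRO-2 config could look like the sketch below. This is illustrative, not a tested recipe for Gaudi; `"auto"` lets the HF Trainer fill in values from its own arguments:

```json
{
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "bf16": {
    "enabled": true
  },
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": false,
    "contiguous_gradients": false
  }
}
```

If I remember the launcher correctly, the config is then passed to the training script via `--deepspeed ds_config.json`, with the run launched through optimum-habana's `gaudi_spawn.py` (`--use_deepspeed --world_size 8`); check the linked guide for the exact invocation.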
> also when running the same command with Gaudi2 I get this error:
I cannot reproduce it on Gaudi2 with `--dataset_name tatsu-lab/alpaca`. Could you share `merged_final_ultimate_andy.json` or find a public dataset so that I can reproduce it?
Yes, it reproduced for me with this dataset on Gaudi2 as well. Note that I'm using Gaudi2 on Kubernetes. It was reproduced with this command:

```
python3 optimum-habana/examples/language-modeling/run_lora_clm.py \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --dataset_name tatsu-lab/alpaca \
    --bf16 True \
    --output_dir ./model_lora_llama \
    --num_train_epochs 3 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 4 \
    --evaluation_strategy no \
    --save_strategy steps \
    --save_steps 2000 \
    --save_total_limit 1 \
    --learning_rate 1e-4 \
    --logging_steps 1 \
    --dataset_concatenation \
    --do_train \
    --use_habana \
    --use_lazy_mode \
    --throughput_warmup_steps 1
```
I cannot reproduce it on my Gaudi2 instance. Maybe it's related to Kubernetes, I suggest you ask on Habana's forum: https://forum.habana.ai/
@DannyAtal which release are you using? Could you supply more info, e.g. the `hl-smi` output, the Habana release version, and the Docker image?
@ZhaiFeiyue these are the versions I'm using:
And this is the Docker image I'm using: `public.ecr.aws/habanalabs/pytorch-installer:2.0.1-ubuntu20.04-1.11.0-latest`
These are the devices:
And I'm pulling the optimum-habana master version, installing this from PyPI: `pip install optimum-habana==1.7.2`, and this DeepSpeed version: `pip install git+https://github.com/HabanaAI/DeepSpeed.git@1.11.0`
@DannyAtal can you run `hl-smi -L` and provide the output?
@ZhaiFeiyue sure, here it is: hl-smi-l.txt
@DannyAtal the driver, firmware, and Docker image already seem to match each other. Have you followed the k8s guide to set up your environment? Or have you tried running a model on Gaudi with k8s before?
@ZhaiFeiyue no, I didn't install it; it was someone from Habana who installed it. It runs on a single card but fails on multiple cards. I also didn't try multi-node, because if it's not functioning on a single node, it won't work across nodes. Yes, I was able to run a model on Gaudi DL1 and DL2 in k8s, but only on a single device, not on multiple cards.
@DannyAtal Have you managed to solve this?
System Info
Information
Tasks
`examples` folder (such as GLUE/SQuAD, ...)
Reproduction
described in info
Expected behavior
to work