intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, MiniCPM, etc.) on Intel CPU and GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, GraphRAG, DeepSpeed, vLLM, FastChat, Axolotl, etc.
Apache License 2.0

LoRA LLaMA-70B finetuning fails on multi-GPU. #10069

Closed sriraman2020 closed 6 months ago

sriraman2020 commented 7 months ago

https://github.com/intel-analytics/BigDL/tree/main/python/llm/example/GPU/LLM-Finetuning/LoRA

lora_finetune_llama2_7b_pvc_1550_4_card.sh works fine with the 7B model.

When the workload is replaced with Llama-70B (meta-llama/Llama-2-70b-hf), it fails.

System config

```
(bigdl_31J) sdp@aia-sdp-pvc-135536:/localdisk/sdp/sudarsh2/rsrirama/BigDL/python/llm/example/GPU/LLM-Finetuning/LoRA$ clinfo | grep "compute"
  Max compute units    224
  Max compute units    224
  Max compute units    512
  Max compute units    512
  Max compute units    512
  Max compute units    512
  Max compute units    512
  Max compute units    512
  Max compute units    512
  Max compute units    512
```
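For reference, the visible devices can also be double-checked before launching with the commands below (a sketch only; it assumes the oneAPI `sycl-ls` tool and the optional `xpu-smi` utility are installed):

```bash
# Sketch: confirm the XPU devices the runtime will see (assumes oneAPI tools are installed).
sycl-ls                  # list Level Zero / OpenCL devices visible to SYCL
xpu-smi discovery        # per-card view on PVC, if xpu-smi is available
```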

Error log below

```
RuntimeError: Native API failed. Native API returns: -5 (PI_ERROR_OUT_OF_RESOURCES) -5 (PI_ERROR_OUT_OF_RESOURCES)

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 7 PID 2019202 RUNNING AT aia-sdp-pvc-135536
=   KILLED BY SIGNAL: 9 (Killed)
```

jason-dai commented 7 months ago

Try https://github.com/intel-analytics/BigDL/blob/main/python/llm/example/GPU/LLM-Finetuning/QLoRA/alpaca-qlora/qlora_finetune_llama2_70b_pvc_1550_4_card.sh?
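A minimal sketch of trying that script (directory and file name taken from the link above; any environment setup from the example's README is assumed to already be in place):

```bash
# Sketch: run the 4-card PVC 1550 QLoRA example for Llama-2-70B from the repo root.
cd BigDL/python/llm/example/GPU/LLM-Finetuning/QLoRA/alpaca-qlora
bash qlora_finetune_llama2_70b_pvc_1550_4_card.sh
```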

sriraman2020 commented 7 months ago

It seems to be stuck here for 15 mins:

[screenshot of the run log]
plusbang commented 7 months ago

> It seems to be stuck here for 15 mins: [screenshot of the run log]

According to the log, `AMX state allocation in the OS failed!` It seems you still need to bypass AMX, as we discussed in the previous issue.
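For example, a minimal sketch of bypassing AMX in the launching shell (the environment variable name is the one used in the follow-up below; the exact launch script may differ):

```bash
# Sketch: disable the AMX path before starting the finetuning run.
export BIGDL_LLM_AMX_DISABLED=1
bash qlora_finetune_llama2_70b_pvc_1550_4_card.sh
```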

sriraman2020 commented 6 months ago

We are still seeing this error after disabling AMX with `export BIGDL_LLM_AMX_DISABLED=1`:

```
Uptime: 461.239374 s
2024:02:01-23:06:50:(3084264) |CCL_ERROR| exchange_utils.cpp:220 recvmsg_fd: condition !check_msg_retval("recvmsg", recv_bytes, iov, msg, sizeof(u.cntr_buf), sock, *fd) failed
errno: No such file or directory
2024:02:01-23:06:50:(3084264) |CCL_ERROR| worker.cpp:338 ccl_worker_func: worker 0 caught internal exception: oneCCL: exchange_utils.cpp:220 recvmsg_fd: EXCEPTION: errno: No such file or directory
terminate called after throwing an instance of 'ccl::v1::exception'
  what():  oneCCL: exchange_utils.cpp:220 recvmsg_fd: EXCEPTION: errno: No such file or directory
```

```
LIBXSMM_VERSION: main_stable-1.17-3651 (25693763)
LIBXSMM WARNING: AMX state allocation in the OS failed!
LIBXSMM_TARGET: clx [Intel(R) Xeon(R) Platinum 8480+]
Registry and code: 13 MB
Command: python -u ./alpaca_qlora_finetuning.py --base_model meta-llama/Llama-2-70b-hf --data_path yahma/alpaca-cleaned --output_dir ./bigdl-qlora-alpaca --gradient_checkpointing True --micro_batch_size 8 --batch_size 128 --deepspeed ./deepspeed_zero2.json --saved_low_bit_model ./llama-2-70b-hf-nf4
Uptime: 461.489362 s
Terminated
```
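One thing that may also be worth double-checking (an assumption on my side, not verified): whether the variable actually reaches every MPI rank, for example by passing it explicitly if the script launches through Intel MPI's `mpirun`. A sketch, with the arguments copied from the Command line in the log above:

```bash
# Sketch: pass the AMX-disable flag to every rank explicitly
# (assumes the script launches through Intel MPI's mpirun; adapt to the real command).
mpirun -n 8 -genv BIGDL_LLM_AMX_DISABLED 1 \
    python -u ./alpaca_qlora_finetuning.py \
        --base_model meta-llama/Llama-2-70b-hf \
        --data_path yahma/alpaca-cleaned \
        --output_dir ./bigdl-qlora-alpaca \
        --deepspeed ./deepspeed_zero2.json
```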

plusbang commented 6 months ago

Could you please provide more details about your environment (dependency version list)? Please make sure you've prepared your environment following the installation instructions in https://github.com/intel-analytics/BigDL/tree/main/python/llm/example/GPU/LLM-Finetuning/QLoRA/alpaca-qlora#1-install

sriraman2020 commented 6 months ago

```
/BigDL/python/llm/example/GPU/LLM-Finetuning/QLoRA/alpaca-qlora$ pip list
Package                        Version
------------------------------ ------------------
accelerate                     0.23.0
aiohttp                        3.9.3
aiosignal                      1.3.1
annotated-types                0.6.0
async-timeout                  4.0.3
attrs                          23.2.0
bigdl-core-xe-21               2.5.0b20240201
bigdl-core-xe-esimd-21         2.5.0b20240201
bigdl-llm                      2.5.0b20240201
bitsandbytes                   0.42.0
certifi                        2024.2.2
charset-normalizer             3.3.2
datasets                       2.14.7
deepspeed                      0.11.2+78c518ed
dill                           0.3.7
filelock                       3.13.1
fire                           0.5.0
frozenlist                     1.4.1
fsspec                         2023.10.0
hjson                          3.1.0
huggingface-hub                0.17.3
idna                           3.6
intel-extension-for-deepspeed  0.9.4+ec33277
intel-extension-for-pytorch    2.1.10+xpu
intel-openmp                   2024.0.2
Jinja2                         3.1.3
MarkupSafe                     2.1.4
mpmath                         1.3.0
multidict                      6.0.5
multiprocess                   0.70.15
networkx                       3.2.1
ninja                          1.11.1.1
numpy                          1.26.3
oneccl-bind-pt                 2.1.100+xpu
packaging                      23.2
pandas                         2.2.0
peft                           0.5.0
pillow                         10.2.0
pip                            23.3.1
protobuf                       5.26.0rc1
psutil                         5.9.8
py-cpuinfo                     9.0.0
pyarrow                        15.0.0
pyarrow-hotfix                 0.6
pydantic                       2.6.0
pydantic_core                  2.16.1
python-dateutil                2.8.2
pytz                           2024.1
PyYAML                         6.0.1
regex                          2023.12.25
requests                       2.31.0
safetensors                    0.4.2
scipy                          1.12.0
sentencepiece                  0.1.99
setuptools                     68.2.2
six                            1.16.0
sympy                          1.12
tabulate                       0.9.0
termcolor                      2.4.0
tokenizers                     0.14.1
torch                          2.1.0a0+cxx11.abi
torchvision                    0.16.0a0+cxx11.abi
tqdm                           4.66.1
transformers                   4.34.0
typing_extensions              4.9.0
tzdata                         2023.4
urllib3                        2.2.0
wheel                          0.41.2
xxhash                         3.4.1
yarl                           1.9.4
```

sriraman2020 commented 6 months ago

@plusbang Looks like a oneCCL issue? Do let me know if any more information is required.

[screenshot of the oneCCL error]
plusbang commented 6 months ago

> @plusbang Looks like a oneCCL issue? Do let me know if any more information is required. [screenshot of the oneCCL error]

Yeah, it seems like a oneCCL-related issue. We previously encountered another oneCCL-related bug and solved it with `sudo apt install level-zero-dev` (https://github.com/intel-analytics/BigDL/tree/main/python/llm/example/GPU/LLM-Finetuning/QLoRA/alpaca-qlora#7-troubleshooting). Maybe you could also try that.
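For example (the `dpkg` check is just a quick way to confirm what is actually installed):

```bash
# Sketch: install the Level Zero development package and confirm what is present.
sudo apt install level-zero-dev
dpkg -l | grep -i level-zero    # check installed level-zero / level-zero-dev versions
```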

sriraman2020 commented 6 months ago

The Level Zero driver is already installed and present.

plusbang commented 6 months ago

> The Level Zero driver is already installed and present.

Maybe you could try `export CCL_LOG_LEVEL=debug` to obtain more detailed error messages from oneCCL.
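For example, a sketch of capturing the extra output to a file so it can be attached here (the `tee` redirection and script name are just suggestions):

```bash
# Sketch: enable verbose oneCCL logging and keep a copy of the output to attach here.
export CCL_LOG_LEVEL=debug
export ONECCL_BINDINGS_FOR_PYTORCH_ENV_VERBOSE=1
bash qlora_finetune_llama2_70b_pvc_1550_4_card.sh 2>&1 | tee ccl_debug.log
```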

sriraman2020 commented 6 months ago

Below is the log with `export CCL_LOG_LEVEL=debug` and `export ONECCL_BINDINGS_FOR_PYTORCH_ENV_VERBOSE=1`:

[screenshot: CCL_LOG_LEVEL debug output]

plusbang commented 6 months ago

> Below is the log with `export CCL_LOG_LEVEL=debug` and `export ONECCL_BINDINGS_FOR_PYTORCH_ENV_VERBOSE=1`:
>
> [screenshot: CCL_LOG_LEVEL debug output]

According to the log (`Too many open files`), maybe you could try to raise the system open-file limit with `ulimit -n 1048576`.
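A minimal sketch (the persistent limits.conf change is optional and depends on the system setup):

```bash
# Sketch: check and raise the per-process open-file limit in the launching shell.
ulimit -n            # show the current soft limit
ulimit -n 1048576    # raise it for this session before starting the run
# For a persistent change, the nofile limits in /etc/security/limits.conf
# (or a file under /etc/security/limits.d/) would also need to be raised.
```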

sriraman2020 commented 6 months ago

It is running with the above fix (`ulimit -n 1048576`), but after modifying the code as below (changes marked with `**`) it gets an error. Error message details and CCL debug logs are attached here.

```
**with torch.autograd.profiler_legacy.profile(enabled=True, use_xpu=True, record_shapes=True) as prof:**
    trainer = transformers.Trainer(
        model=model,
        train_dataset=train_data,
        eval_dataset=val_data,
        args=transformers.TrainingArguments(
            per_device_train_batch_size=micro_batch_size,
            gradient_accumulation_steps=gradient_accumulation_steps,
            # warmup_ratio=0.03,
            # warmup_steps=100,
            max_grad_norm=0.3,
            # num_train_epochs=num_epochs,
            learning_rate=learning_rate,
            lr_scheduler_type="cosine",
            bf16=True,  # ensure training more stable
            logging_steps=1,
            optim="adamw_torch",
            evaluation_strategy="steps" if val_set_size > 0 else "no",
            save_strategy="steps",
            eval_steps=1 if val_set_size > 0 else None,
            save_steps=1,
            max_steps = 1,
            output_dir=output_dir,
            save_total_limit=1,
            load_best_model_at_end=True if val_set_size > 0 else False,
            ddp_find_unused_parameters=False if ddp else None,
            group_by_length=group_by_length,
            report_to="wandb" if use_wandb else None,
            run_name=wandb_run_name if use_wandb else None,
            gradient_checkpointing=gradient_checkpointing,
            ddp_backend="ccl",
            deepspeed=deepspeed,
            save_safetensors=False,
        ),
        data_collator=transformers.DataCollatorForSeq2Seq(
            tokenizer, pad_to_multiple_of=8, return_tensors="pt", padding=True
        ),
    )
    model.config.use_cache = False

    trainer.train(resume_from_checkpoint=resume_from_checkpoint)

    model.save_pretrained(output_dir)

    print(
        "\n If there's a warning about missing keys above, please disregard :)"
    )
**torch.save(prof.table(sort_by="id", row_limit=-1),"./qlora_llama7b_finetuning_profile_id.pt")**
**torch.save(prof.key_averages(group_by_input_shape=True).table(row_limit=-1),"./qlora_llama7b_finetuning_profile_detail.pt")**
**prof.export_chrome_trace("./qlora_llama7b_finetuning_trace.json")**

if __name__ == "__main__":
    fire.Fire(train)
```

jason-dai commented 6 months ago

> It is running with the above fix (`ulimit -n 1048576`), but after modifying the code as below (changes marked with `**`) it gets an error. Error message details and CCL debug logs are attached here.

It seems there are no error messages or logs here?

sriraman2020 commented 6 months ago

Actually it's working fine. The error was due to the shared system. We are able to successfully collect performance stats. Thanks!

hkvision commented 6 months ago

Thanks for your response. Since the issue is resolved, we are closing it. Feel free to raise new issues in the future :)