Closed sriraman2020 closed 6 months ago
seems to be stuck here for 15 mins
According to the log ("AMX state allocation in the OS failed!"), it seems you still need to bypass AMX, as we discussed in the previous issue.
We are still seeing this error after disabling AMX with export BIGDL_LLM_AMX_DISABLED=1:
Uptime: 461.239374 s
2024:02:01-23:06:50:(3084264) |CCL_ERROR| exchange_utils.cpp:220 recvmsg_fd: condition !check_msg_retval("recvmsg", recv_bytes, iov, msg, sizeof(u.cntr_buf), sock, *fd) failed errno: No such file or directory
2024:02:01-23:06:50:(3084264) |CCL_ERROR| worker.cpp:338 ccl_worker_func: worker 0 caught internal exception: oneCCL: exchange_utils.cpp:220 recvmsg_fd: EXCEPTION: errno: No such file or directory
terminate called after throwing an instance of 'ccl::v1::exception'
what(): oneCCL: exchange_utils.cpp:220 recvmsg_fd: EXCEPTION: errno: No such file or directory
LIBXSMM_VERSION: main_stable-1.17-3651 (25693763)
LIBXSMM WARNING: AMX state allocation in the OS failed!
LIBXSMM_TARGET: clx [Intel(R) Xeon(R) Platinum 8480+]
Registry and code: 13 MB
Command: python -u ./alpaca_qlora_finetuning.py --base_model meta-llama/Llama-2-70b-hf --data_path yahma/alpaca-cleaned --output_dir ./bigdl-qlora-alpaca --gradient_checkpointing True --micro_batch_size 8 --batch_size 128 --deepspeed ./deepspeed_zero2.json --saved_low_bit_model ./llama-2-70b-hf-nf4
Uptime: 461.489362 s
Terminated
(bigdl_31J) sdp@aia-sdp-pvc-135536:/localdisk/sdp/sudarsh2/rsrirama/BigDL/python/llm/example/GPU/LLM-Finetuning/QLoRA/alpaca-qlora$
Could you please provide more details about your environment (dependency version list)? Please make sure you've prepared your environment following installation instructions in https://github.com/intel-analytics/BigDL/tree/main/python/llm/example/GPU/LLM-Finetuning/QLoRA/alpaca-qlora#1-install
/BigDL/python/llm/example/GPU/LLM-Finetuning/QLoRA/alpaca-qlora$ pip list
Package Version
accelerate 0.23.0
aiohttp 3.9.3
aiosignal 1.3.1
annotated-types 0.6.0
async-timeout 4.0.3
attrs 23.2.0
bigdl-core-xe-21 2.5.0b20240201
bigdl-core-xe-esimd-21 2.5.0b20240201
bigdl-llm 2.5.0b20240201
bitsandbytes 0.42.0
certifi 2024.2.2
charset-normalizer 3.3.2
datasets 2.14.7
deepspeed 0.11.2+78c518ed
dill 0.3.7
filelock 3.13.1
fire 0.5.0
frozenlist 1.4.1
fsspec 2023.10.0
hjson 3.1.0
huggingface-hub 0.17.3
idna 3.6
intel-extension-for-deepspeed 0.9.4+ec33277
intel-extension-for-pytorch 2.1.10+xpu
intel-openmp 2024.0.2
Jinja2 3.1.3
MarkupSafe 2.1.4
mpmath 1.3.0
multidict 6.0.5
multiprocess 0.70.15
networkx 3.2.1
ninja 1.11.1.1
numpy 1.26.3
oneccl-bind-pt 2.1.100+xpu
packaging 23.2
pandas 2.2.0
peft 0.5.0
pillow 10.2.0
pip 23.3.1
protobuf 5.26.0rc1
psutil 5.9.8
py-cpuinfo 9.0.0
pyarrow 15.0.0
pyarrow-hotfix 0.6
pydantic 2.6.0
pydantic_core 2.16.1
python-dateutil 2.8.2
pytz 2024.1
PyYAML 6.0.1
regex 2023.12.25
requests 2.31.0
safetensors 0.4.2
scipy 1.12.0
sentencepiece 0.1.99
setuptools 68.2.2
six 1.16.0
sympy 1.12
tabulate 0.9.0
termcolor 2.4.0
tokenizers 0.14.1
torch 2.1.0a0+cxx11.abi
torchvision 0.16.0a0+cxx11.abi
tqdm 4.66.1
transformers 4.34.0
typing_extensions 4.9.0
tzdata 2023.4
urllib3 2.2.0
wheel 0.41.2
xxhash 3.4.1
yarl 1.9.4
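For future reports, the versions that matter most for this kind of problem can also be collected from inside the Python environment. This is only a sketch: the helper names collect_versions and format_versions are hypothetical, and the choice of which packages count as "key" is an assumption based on the package list in this thread.

```python
from importlib import metadata

# Package names copied from the pip list in this thread; which ones are
# "key" for debugging is an assumption, so adjust the list as needed.
KEY_PACKAGES = [
    "bigdl-llm",
    "torch",
    "intel-extension-for-pytorch",
    "oneccl-bind-pt",
    "deepspeed",
    "intel-extension-for-deepspeed",
    "transformers",
    "peft",
    "accelerate",
]


def collect_versions(packages=KEY_PACKAGES):
    """Return {package name: installed version, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def format_versions(versions):
    """Render the versions as aligned 'name version' lines."""
    width = max((len(name) for name in versions), default=0)
    lines = []
    for name, version in versions.items():
        shown = version if version is not None else "not installed"
        lines.append(f"{name.ljust(width)} {shown}")
    return "\n".join(lines)
```

Calling print(format_versions(collect_versions())) gives a short block that can be pasted into an issue instead of the full pip list.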
@plusbang Looks like a oneCCL issue? Let me know if any more information is required.
Yeah, it seems like a oneCCL-related issue. We previously encountered another oneCCL-related bug and solved it with sudo apt install level-zero-dev (https://github.com/intel-analytics/BigDL/tree/main/python/llm/example/GPU/LLM-Finetuning/QLoRA/alpaca-qlora#7-troubleshooting). Maybe you could also try it.
The driver is already installed and present.
Maybe you could try export CCL_LOG_LEVEL=debug to obtain more error messages about oneCCL.
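The variables mentioned so far can also be applied to a single run while capturing the output in a file to attach to the issue. A minimal sketch, assuming the training script is started as a child process: the helper name run_with_debug_env and the log path are hypothetical, and only the two variable names and their values come from this thread.

```python
import os
import subprocess

# Variable names and values as suggested in this thread.
DEBUG_ENV = {
    "CCL_LOG_LEVEL": "debug",
    "BIGDL_LLM_AMX_DISABLED": "1",
}


def run_with_debug_env(cmd, log_path, extra_env=None):
    """Run cmd with the debug variables set; write stdout and stderr to log_path.

    Returns the child's exit code. Hypothetical helper, not part of BigDL.
    """
    env = dict(os.environ)      # start from the current environment
    env.update(DEBUG_ENV)       # add the variables from this thread
    if extra_env:
        env.update(extra_env)   # caller-supplied overrides win
    with open(log_path, "w") as log_file:
        result = subprocess.run(
            cmd, env=env, stdout=log_file, stderr=subprocess.STDOUT
        )
    return result.returncode
```

Here cmd would be the usual launch command as a list of arguments, for example the python -u ./alpaca_qlora_finetuning.py ... invocation shown in the log, and the resulting file holds the oneCCL debug output.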
Log below, captured with export CCL_LOG_LEVEL=debug and export ONECCL_BINDINGS_FOR_PYTORCH_ENV_VERBOSE=1.
According to the log ("Too many open files"), maybe you could try to raise the system open file limit using ulimit -n 1048576.
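The same limit can be inspected, and raised up to the hard limit, from inside a Python process with the standard resource module. This is a sketch rather than a replacement for the ulimit command: the helper name raise_open_file_limit is hypothetical, the change applies only to the current process and its children, and the soft limit cannot exceed the hard limit without elevated privileges.

```python
import resource


def raise_open_file_limit(target=1048576):
    """Raise the soft open-file limit toward target, capped at the hard limit.

    Returns the (soft, hard) pair in effect afterwards. The default target
    is the value suggested in this thread.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # An unlimited hard limit places no cap on the requested value.
    new_soft = target if hard == resource.RLIM_INFINITY else min(target, hard)
    # Only ever raise the soft limit; never lower an existing higher one.
    if soft != resource.RLIM_INFINITY and new_soft > soft:
        try:
            resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
        except (ValueError, OSError):
            # Some systems reject the request; leave the limit unchanged.
            pass
    return resource.getrlimit(resource.RLIMIT_NOFILE)
```

Calling it at the top of a launcher script and printing the returned pair shows whether the requested value was actually granted.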
It runs with the above fix (ulimit -n 1048576), but after adding the profiling changes below (the lines marked "# added") it fails. The error message details and CCL debug logs are attached here.

    # Fragment from inside train(); lines marked "# added" are the profiling changes.
    with torch.autograd.profiler_legacy.profile(enabled=True, use_xpu=True, record_shapes=True) as prof:  # added
        trainer = transformers.Trainer(
            model=model,
            train_dataset=train_data,
            eval_dataset=val_data,
            args=transformers.TrainingArguments(
                per_device_train_batch_size=micro_batch_size,
                gradient_accumulation_steps=gradient_accumulation_steps,
                # warmup_ratio=0.03,
                # warmup_steps=100,
                max_grad_norm=0.3,
                # num_train_epochs=num_epochs,
                learning_rate=learning_rate,
                lr_scheduler_type="cosine",
                bf16=True,  # ensure training is more stable
                logging_steps=1,
                optim="adamw_torch",
                evaluation_strategy="steps" if val_set_size > 0 else "no",
                save_strategy="steps",
                eval_steps=1 if val_set_size > 0 else None,
                save_steps=1,
                max_steps=1,  # stop after one step so the profile stays small
                output_dir=output_dir,
                save_total_limit=1,
                load_best_model_at_end=True if val_set_size > 0 else False,
                ddp_find_unused_parameters=False if ddp else None,
                group_by_length=group_by_length,
                report_to="wandb" if use_wandb else None,
                run_name=wandb_run_name if use_wandb else None,
                gradient_checkpointing=gradient_checkpointing,
                ddp_backend="ccl",
                deepspeed=deepspeed,
                save_safetensors=False,
            ),
            data_collator=transformers.DataCollatorForSeq2Seq(
                tokenizer, pad_to_multiple_of=8, return_tensors="pt", padding=True
            ),
        )
        model.config.use_cache = False
        trainer.train(resume_from_checkpoint=resume_from_checkpoint)
        model.save_pretrained(output_dir)
        print(
            "\n If there's a warning about missing keys above, please disregard :)"
        )
    # Save the profiler tables and the Chrome trace once profiling has finished.
    torch.save(prof.table(sort_by="id", row_limit=-1), "./qlora_llama7b_finetuning_profile_id.pt")  # added
    torch.save(prof.key_averages(group_by_input_shape=True).table(row_limit=-1), "./qlora_llama7b_finetuning_profile_detail.pt")  # added
    prof.export_chrome_trace("./qlora_llama7b_finetuning_trace.json")  # added


if __name__ == "__main__":
    fire.Fire(train)
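Once a trace file like the one exported above exists, a quick per-operator total can be pulled out of it without opening a trace viewer. The sketch below assumes the file follows the Chrome Trace Event Format (either a bare JSON array of events or an object with a "traceEvents" list, with complete events marked "ph": "X" and a "dur" field in microseconds). The helper name summarize_trace is hypothetical, and the exact fields written by this profiler build have not been checked here.

```python
import json
from collections import defaultdict


def summarize_trace(path, top=10):
    """Return the top (event name, total duration) pairs from a Chrome trace file.

    Only complete events ("ph" == "X") that carry a "dur" field are counted.
    """
    with open(path) as f:
        data = json.load(f)
    # Accept both layouts of the trace format: object wrapper or bare array.
    events = data.get("traceEvents", []) if isinstance(data, dict) else data
    totals = defaultdict(float)
    for event in events:
        if isinstance(event, dict) and event.get("ph") == "X" and "dur" in event:
            totals[event.get("name", "<unnamed>")] += float(event["dur"])
    # Largest total duration first.
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top]
```

For example, summarize_trace("./qlora_llama7b_finetuning_trace.json") would list the ten event names with the largest summed duration, which is often enough to spot where a step spends its time.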
It seems there are no error messages or logs here?
Actually, it's working fine. The error was due to the shared system. We were able to collect the performance stats successfully. Thanks!
Thanks for your response. Since the issue is resolved, we are closing it. Feel free to raise new issues in the future :)
https://github.com/intel-analytics/BigDL/tree/main/python/llm/example/GPU/LLM-Finetuning/LoRA
lora_finetune_llama2_7b_pvc_1550_4_card.sh works fine with the 7B model.
After replacing the workload with Llama 70B (meta-llama/Llama-2-70b-hf), it fails.
System config
(bigdl_31J) sdp@aia-sdp-pvc-135536:/localdisk/sdp/sudarsh2/rsrirama/BigDL/python/llm/example/GPU/LLM-Finetuning/LoRA$ clinfo | grep "compute"
Max compute units 224
Max compute units 224
Max compute units 512
Max compute units 512
Max compute units 512
Max compute units 512
Max compute units 512
Max compute units 512
Max compute units 512
Max compute units 512
(bigdl_31J) sdp@aia-sdp-pvc-135536:/localdisk/sdp/sudarsh2/rsrirama/BigDL/python/llm/example/GPU/LLM-Finetuning/LoRA$
Error log below
RuntimeError: Native API failed. Native API returns: -5 (PI_ERROR_OUT_OF_RESOURCES) -5 (PI_ERROR_OUT_OF_RESOURCES)
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 7 PID 2019202 RUNNING AT aia-sdp-pvc-135536
= KILLED BY SIGNAL: 9 (Killed)
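The messages that came up across this thread can be checked for in a saved log with a small lookup. This is a sketch only: the helper name triage is hypothetical, the suggestions are the ones given by commenters above rather than verified fixes, and the two signatures from the 70B LoRA run have no recorded fix in this thread.

```python
# (signature text, suggestion from this thread or None if none was recorded)
SIGNATURES = [
    ("AMX state allocation in the OS failed",
     "bypass AMX: export BIGDL_LLM_AMX_DISABLED=1"),
    ("Too many open files",
     "raise the open file limit: ulimit -n 1048576"),
    ("CCL_ERROR",
     "rerun with export CCL_LOG_LEVEL=debug for more oneCCL detail"),
    ("PI_ERROR_OUT_OF_RESOURCES", None),
    ("KILLED BY SIGNAL: 9", None),
]


def triage(log_text):
    """Return (signature, suggestion) pairs for every known message in log_text."""
    findings = []
    for needle, suggestion in SIGNATURES:
        if needle in log_text:
            findings.append(
                (needle, suggestion or "no fix recorded in this thread")
            )
    return findings
```

Reading a log file into a string and passing it to triage gives a short list of which known messages appeared, in the order they are listed in SIGNATURES.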