intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, etc.) on Intel CPU and GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, DeepSpeed, vLLM, FastChat, Axolotl, etc.
Apache License 2.0

System crash after increasing training steps to 100 in QLora fine-tuning on CPU code #9375

Open tsantra opened 8 months ago

tsantra commented 8 months ago

Hi,

After increasing max_steps to 100 in the qlora_finetuning_cpu.py code, my system crashes.

My system configuration:
- CPU: Xeon Gold
- Memory: 128 GB
- Disk capacity: 3.8 TB
- OS: Ubuntu 22.04.3 LTS

model id: llama-2-7b-chat-hf

```python
trainer = transformers.Trainer(
    model=model,
    train_dataset=data["train"],
    args=transformers.TrainingArguments(
        per_device_train_batch_size=4,
        gradient_accumulation_steps=1,
        warmup_steps=20,
        max_steps=100,
        learning_rate=2e-4,
        save_steps=10,
        bf16=True,
        logging_steps=20,
        output_dir="outputs",
        optim="adamw_hf",  # paged_adamw_8bit is not supported yet
        gradient_checkpointing=True,  # can further reduce memory but slower
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
```

Attached the htop screenshot: [screenshot not reproduced]

hzjane commented 8 months ago

It seems that memory has run out. When we run this example with 48 cores, memory usage stays within 100 GB. Perhaps you are running with a different dataset, which would increase memory usage. You can try adding more memory or reducing per_device_train_batch_size to test.
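
For example, the relevant change to the TrainingArguments from the snippet above might look like this (the gradient_accumulation_steps value here is only an illustration to keep the effective batch size comparable, not a value from the original example):

```python
args = transformers.TrainingArguments(
    per_device_train_batch_size=1,   # reduced from 4 to lower peak memory per step
    gradient_accumulation_steps=4,   # illustrative: keeps the effective batch size at 4
    warmup_steps=20,
    max_steps=100,
    learning_rate=2e-4,
    save_steps=10,
    bf16=True,
    logging_steps=20,
    output_dir="outputs",
    optim="adamw_hf",
)
```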

tsantra commented 8 months ago

@hzjane I am using the default dataset the example comes with. As you suggested, I reduced per_device_train_batch_size to 1. However, at close to 162 steps I had to force-stop the training, as memory usage kept increasing steadily, crossing 112 GB, and swap usage kept growing too. Attaching some screenshots to show the steady increase in memory usage. This doesn't seem right; it would soon have led to a system crash. It looks like there may be a memory leak.

[htop screenshots, including one at step 155, not reproduced]

I then tried gradient_checkpointing=True, and this gave a segfault before any training step even ran.

Some settings in the env:

```
(base) ceed-user@ceed-server:~$ conda activate bigdl_cpu
Sourcing bigdl-nano-init in: /home/ceed-user/anaconda3/envs/bigdl_cpu/bin
Setting OMP_NUM_THREADS...
Setting KMP_AFFINITY...
Setting KMP_BLOCKTIME...
Setting jemalloc...
nano_vars.sh already exists
+++++ Env Variables +++++
LD_PRELOAD            = /home/ceed-user/anaconda3/envs/bigdl_cpu/lib/libiomp5.so /home/ceed-user/anaconda3/envs/bigdl_cpu/lib/python3.9/site-packages/bigdl/nano/libs/libjemalloc.so
MALLOC_CONF           = oversize_threshold:1,background_thread:false,metadata_thp:always,dirty_decay_ms:-1,muzzy_decay_ms:-1
OMP_NUM_THREADS       = 64
KMP_AFFINITY          = granularity=fine,none
KMP_BLOCKTIME         = 1
TF_ENABLE_ONEDNN_OPTS = 1
+++++++++++++++++++++++++
Complete.
```

Please help!

hzjane commented 8 months ago

I think it may be caused by `source bigdl-nano-init -j`. You can try running `source bigdl-nano-init -t` or `source bigdl-nano-unset-env` instead to test.

jason-dai commented 8 months ago

> I think it may be caused by `source bigdl-nano-init -j`. You can try running `source bigdl-nano-init -t` or `source bigdl-nano-unset-env` instead to test.

Do we have the same problem in our testing? If not, what's the difference in the setup/config?

hzjane commented 8 months ago

> > I think it may be caused by `source bigdl-nano-init -j`. You can try running `source bigdl-nano-init -t` or `source bigdl-nano-unset-env` instead to test.
>
> Do we have the same problem in our testing? If not, what's the difference in the setup/config?

I found that with jemalloc, more and more memory is used until the run fails. And we are not using `source bigdl-nano-init` in the fine-tuning examples.
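
As a quick sanity check (a minimal sketch; it only inspects the variables shown in the bigdl-nano-init output above), you can confirm from inside Python whether jemalloc is still being preloaded in the environment you launch training from:

```python
import os

# If bigdl-nano-init is active, LD_PRELOAD points at libjemalloc.so and
# MALLOC_CONF carries the jemalloc tuning flags shown in the output above.
for var in ("LD_PRELOAD", "MALLOC_CONF"):
    value = os.environ.get(var, "")
    status = "set" if value else "not set"
    print(f"{var}: {status} {value}")

if "libjemalloc" in os.environ.get("LD_PRELOAD", ""):
    print("jemalloc is preloaded; consider `source bigdl-nano-unset-env` before training.")
```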

tsantra commented 8 months ago

@hzjane I created a new conda env without bigdl-nano-init, and memory consumption is now steady rather than increasing with the training steps. So, as you mentioned, bigdl-nano-init was the issue.

I want to understand what happens when we use bigdl-nano-init. Could you please help me understand why it causes this behavior?

hzjane commented 8 months ago

The problem is caused by jemalloc, and we have already excluded bigdl-nano-init from BigDL-LLM. If you still don't have enough memory, you can try setting use_gradient_checkpointing=True to reduce memory usage, but it will make training slower.
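
For reference, in a typical QLoRA setup this flag is passed when preparing the quantized model for training. Below is a minimal sketch using the standard peft API; the exact import path and LoRA hyperparameters in the actual qlora_finetuning_cpu.py example may differ, so treat the values here as illustrative.

```python
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Enable gradient checkpointing when preparing the quantized model for training.
# This trades extra recomputation (slower steps) for a lower peak memory footprint.
model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)

# Illustrative LoRA settings, not necessarily the example's defaults.
lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
```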