redotvideo / haven

LLM fine-tuning and eval
https://haven.run
Apache License 2.0

Llamatune fails with your example code from its home page #82

Open · IridiumMaster opened this issue 1 year ago

IridiumMaster commented 1 year ago

Steps to reproduce:

1. Start a RunPod container with the PyTorch 2.0.1 template and lots of disk space.
2. Run your sample command on a properly formatted dataset:

```bash
python -m llamatune.train \
    --model_name meta-llama/Llama-2-13b-chat-hf \
    --data_path master_qa.json \
    --training_recipe lora \
    --batch_size 8 \
    --gradient_accumulation_steps 4 \
    --learning_rate 1e-4 \
    --output_dir chat_llama2_13b \
    --use_auth_token xxxzzz
```

3. The result is:

```
Model ready for training!
trainable params: 250347520 || all params: 6922337280 || trainable: 3.616517223500557
WARNING:root:Loading data...
WARNING:root:Tokenizing inputs... This may take some time...
config TrainingConfig(model_name='meta-llama/Llama-2-13b-chat-hf', data_path='master_qa.json', output_dir='chat_llama2_13b', training_recipe='lora', optim='paged_adamw_8bit', batch_size=8, gradient_accumulation_steps=4, n_epochs=3, weight_decay=0.0, learning_rate=0.0001, max_grad_norm=0.3, gradient_checkpointing=True, do_train=True, lr_scheduler_type='cosine', warmup_ratio=0.03, logging_steps=1, group_by_length=True, save_strategy='epoch', save_total_limit=3, fp16=True, tokenizer_type='llama', trust_remote_code=False, compute_dtype=torch.float16, max_tokens=4096, do_eval=True, evaluation_strategy='epoch', use_auth_token='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', use_fast=False, bits=4, double_quant=True, quant_type='nf4', lora_r=64, lora_alpha=16, lora_dropout=0.0)
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/dist-packages/llamatune/train.py", line 50, in <module>
    trainer.train()
  File "/usr/local/lib/python3.10/dist-packages/llamatune/trainer.py", line 25, in train
    self.model_engine.train(data_module=self.data_module)
  File "/usr/local/lib/python3.10/dist-packages/llamatune/model_engines/llama_model_engine.py", line 33, in train
    trainer = Trainer(
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 405, in __init__
    raise ValueError(
ValueError: The model you want to train is loaded in 8-bit precision. if you want to fine-tune an 8-bit model, please make sure that you have installed bitsandbytes>=0.41.1.
```
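For what it's worth, a quick way to confirm whether the environment satisfies the floor that the transformers `Trainer` complains about is to check the installed bitsandbytes version before launching `llamatune.train`. A minimal sketch (the `>=0.41.1` threshold is taken from the error message above; the script itself is not part of llamatune):

```python
# Sketch: verify the installed bitsandbytes version before running llamatune.train.
# The 0.41.1 floor comes from the ValueError raised by transformers' Trainer above.
from importlib.metadata import PackageNotFoundError, version

try:
    bnb_version = version("bitsandbytes")
except PackageNotFoundError:
    raise SystemExit("bitsandbytes is not installed")

# Compare only the numeric major.minor.patch components.
major, minor, patch = (int(x) for x in bnb_version.split(".")[:3])
if (major, minor, patch) < (0, 41, 1):
    raise SystemExit(
        f"bitsandbytes {bnb_version} is too old for quantized fine-tuning; "
        "upgrade with: pip install -U 'bitsandbytes>=0.41.1'"
    )
print(f"bitsandbytes {bnb_version} looks OK")
```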

jayantkhannadocplix1 commented 1 year ago

Hello @IridiumMaster, I encountered a similar problem and managed to resolve it by executing:

```bash
pip install bitsandbytes==0.41.1
```

IridiumMaster commented 1 year ago

Aye, that is the correct fix. Hoping the maintainers update their requirements.txt accordingly.
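For anyone hitting this in the meantime, a minimal sketch of the kind of pin being asked for (a hypothetical excerpt, assuming only the bitsandbytes floor needs to change in llamatune's requirements.txt):

```
# requirements.txt (hypothetical excerpt)
bitsandbytes>=0.41.1
```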