oobabooga / text-generation-webui

A Gradio web UI for Large Language Models.
GNU Affero General Public License v3.0
40.71k stars 5.32k forks

Unable to train the models. #6425

Open ashunaveed opened 1 month ago

ashunaveed commented 1 month ago

Describe the bug

While training the models, I get the following error. image_2024-10-02_163727112

Is there an existing issue for this?

Reproduction

training the models.

Screenshot

image_2024-10-02_163752169

Logs

running the model and training.

System Info

i7, 64 GB RAM, 4060 GPU.

taaaibu commented 1 month ago

I had the same issue. In my trainer.py the code

        model.train()
        if hasattr(self.optimizer, "train") and callable(self.optimizer.train):
            self.optimizer.train()

caused the issue. I changed it to:

        model.train()
        if hasattr(self.optimizer, 'train'):
            self.model.train()

and the training started without an error. The code is at line 3477 of trainer.py, as the traceback in your screenshot showed. The file is located under text-generation-webui\installer_files\env\Lib\site-packages\transformers
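Note that the edited version calls self.model.train() a second time, so optimizer.train() is never invoked. That is harmless for optimizers that have no train mode, but it would silently skip the call for ones that do. A less invasive variant is sketched below. It assumes the failure came from a wrapper that exposes train() but forwards to an inner optimizer lacking it; the screenshot is not reproduced in this thread, so that cause is a guess. The helper set_optimizer_train_mode and the three stand-in classes are hypothetical and are not part of transformers.

```python
def set_optimizer_train_mode(optimizer):
    """Call optimizer.train() if possible; return True when the call succeeded."""
    train_fn = getattr(optimizer, "train", None)
    if not callable(train_fn):
        return False  # ordinary optimizers have no train() method
    try:
        train_fn()
    except AttributeError:
        # Assumed failure mode: a wrapper exposes train() but the
        # optimizer it forwards to does not define one.
        return False
    return True


class PlainOptimizer:
    """Hypothetical stand-in for an optimizer without train()."""


class ForwardingWrapper:
    """Hypothetical wrapper that forwards train() to the wrapped optimizer."""

    def __init__(self, inner):
        self.inner = inner

    def train(self):
        return self.inner.train()


class ScheduleFreeLike:
    """Hypothetical optimizer that does define train()."""

    def __init__(self):
        self.training = False

    def train(self):
        self.training = True


print(set_optimizer_train_mode(PlainOptimizer()))
print(set_optimizer_train_mode(ForwardingWrapper(PlainOptimizer())))
print(set_optimizer_train_mode(ForwardingWrapper(ScheduleFreeLike())))
```

Catching AttributeError this broadly can also hide unrelated bugs inside train(), so this is only reasonable as a temporary local patch. Updating the packages so that their versions match is likely the cleaner fix.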

ashunaveed commented 1 month ago

It worked, thank you. One more thing: I was able to train a full Llama 3 8B model on 1 MB of text data with a LoRA rank of 1024, an overlap length of 255, 3 epochs, and a constant learning rate of 3e-5. I was able to train in 2 hours. But when I loaded the LoRA and tried to interact with it, no output is generated. Please help. Thank you.

11:35:38-221065 INFO Loading "Llama3"
11:35:38-225992 INFO TRANSFORMERS_PARAMS= { 'low_cpu_mem_usage': False, 'torch_dtype': torch.bfloat16, 'use_flash_attention_2': True}

C:\Users\genco\OneDrive\Documents\text-generation-webui-main\installer_files\env\Lib\site-packages\transformers\generation\configuration_utils.py:611: UserWarning: do_sample is set to False. However, min_p is set to 0.0 -- this flag is only used in sample-based generation modes. You should set do_sample=True or unset min_p.
  warnings.warn(
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 4.39it/s]
11:35:54-700169 INFO Loaded "Llama3" in 16.48 seconds.
11:35:54-702178 INFO LOADER: "Transformers"
11:35:54-702712 INFO TRUNCATION LENGTH: 8192
11:35:54-704798 INFO INSTRUCTION TEMPLATE: "Custom (obtained from model metadata)"
11:36:18-158534 INFO Loading raw text file dataset
11:38:01-168398 INFO Getting model ready
11:38:01-171155 INFO Preparing for training
11:38:01-172765 INFO Creating LoRA model
11:38:04-561745 INFO Starting training
Training 'llama' model using (q, v) projections
Trainable params: 436,207,616 (5.1522 %), All params: 8,466,468,864 (Model: 8,030,261,248)
11:38:04-578419 INFO Log file 'train_dataset_sample.json' created in the 'logs' directory.
Step: 159 {'loss': 1058159467901747.2, 'grad_norm': nan, 'learning_rate': 3e-05, 'epoch': 0.00564712526029718}
Step: 319 Stop Loss 0 reached.
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 3e-05, 'epoch': 0.01129425052059436}
Step: 319 {'train_runtime': 13177.6232, 'train_samples_per_second': 25.801, 'train_steps_per_second': 0.201, 'train_loss': 529079733950873.6, 'epoch': 0.01129425052059436}
15:17:43-696778 INFO LoRA training run is completed and saved.
15:17:44-195770 INFO Training complete, saving
15:17:45-335153 INFO Training complete!
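Two numbers in the training log above can be sanity-checked with a few lines of Python. The sketch below first reproduces the trainable parameter count for rank 1024 on the q and v projections, then flags the logged metrics as diverged. The layer shapes (hidden size 4096, 8 key/value heads of dimension 128, 32 layers) are the published Llama 3 8B values and are an assumption here, since the thread does not state them. The helper names lora_param_count and diverged, and the max_loss threshold, are hypothetical.

```python
import math


def lora_param_count(rank, shapes, num_layers):
    """Parameters added by LoRA: each adapted (d_in x d_out) matrix gains
    two low-rank factors holding rank * (d_in + d_out) parameters in total."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in shapes)
    return per_layer * num_layers


def diverged(metrics, max_loss=1e4):
    """Return True if a logged metrics dict looks like a diverged run."""
    for key in ("loss", "grad_norm"):
        value = metrics.get(key)
        if value is not None and not math.isfinite(value):
            return True  # nan or inf in the loss or the gradient norm
    loss = metrics.get("loss")
    # max_loss is an arbitrary illustrative threshold, not a library default.
    return loss is not None and loss > max_loss


# Assumed Llama 3 8B shapes (not stated in the thread).
HIDDEN = 4096       # q_proj maps 4096 -> 4096
KV_DIM = 8 * 128    # v_proj maps 4096 -> 1024 (grouped-query attention)
NUM_LAYERS = 32

trainable = lora_param_count(1024, [(HIDDEN, HIDDEN), (HIDDEN, KV_DIM)], NUM_LAYERS)
total = 8_030_261_248 + trainable  # base model size taken from the log
print(f"Trainable params: {trainable:,} ({100 * trainable / total:.4f} %)")

# Metrics copied from the "Step: 159" log line.
step_159 = {"loss": 1058159467901747.2, "grad_norm": float("nan"), "learning_rate": 3e-05}
print("diverged:", diverged(step_159))
```

The first print matches the logged 436,207,616 (5.1522 %), and the second reports the step 159 metrics as diverged. A nan gradient norm that early suggests the saved adapter weights may themselves contain nan values, which would be consistent with the RuntimeError raised when sampling later in this post. That link is an inference and is not confirmed in the thread.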
15:19:50-069684 INFO Applying the following LoRAs to Llama3: GCC
Traceback (most recent call last):
  File "C:\Users\genco\OneDrive\Documents\text-generation-webui-main\modules\callbacks.py", line 61, in gentask
    ret = self.mfunc(callback=_callback, *args, **self.kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\genco\OneDrive\Documents\text-generation-webui-main\modules\text_generation.py", line 398, in generate_with_callback
    shared.model.generate(**kwargs)
  File "C:\Users\genco\OneDrive\Documents\text-generation-webui-main\installer_files\env\Lib\site-packages\peft\peft_model.py", line 1638, in generate
    outputs = self.base_model.generate(*args, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\genco\OneDrive\Documents\text-generation-webui-main\installer_files\env\Lib\site-packages\torch\utils\_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\genco\OneDrive\Documents\text-generation-webui-main\installer_files\env\Lib\site-packages\transformers\generation\utils.py", line 2048, in generate
    result = self._sample(
             ^^^^^^^^^^^^^
  File "C:\Users\genco\OneDrive\Documents\text-generation-webui-main\installer_files\env\Lib\site-packages\transformers\generation\utils.py", line 3044, in _sample
    next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: probability tensor contains either inf, nan or element < 0
Output generated in 4.19 seconds (0.00 tokens/s, 0 tokens, context 151, seed 164313411)
Traceback (most recent call last):
  File "C:\Users\genco\OneDrive\Documents\text-generation-webui-main\modules\callbacks.py", line 61, in gentask
    ret = self.mfunc(callback=_callback, *args, **self.kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\genco\OneDrive\Documents\text-generation-webui-main\modules\text_generation.py", line 398, in generate_with_callback
    shared.model.generate(**kwargs)
  File "C:\Users\genco\OneDrive\Documents\text-generation-webui-main\installer_files\env\Lib\site-packages\peft\peft_model.py", line 1638, in generate
    outputs = self.base_model.generate(*args, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\genco\OneDrive\Documents\text-generation-webui-main\installer_files\env\Lib\site-packages\torch\utils\_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\genco\OneDrive\Documents\text-generation-webui-main\installer_files\env\Lib\site-packages\transformers\generation\utils.py", line 2048, in generate
    result = self._sample(
             ^^^^^^^^^^^^^
  File "C:\Users\genco\OneDrive\Documents\text-generation-webui-main\installer_files\env\Lib\site-packages\transformers\generation\utils.py", line 3044, in _sample
    next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: probability tensor contains either inf, nan or element < 0
Output generated in 4.31 seconds (0.00 tokens/s, 0 tokens, context 165, seed 1511014698)