Open ashunaveed opened 1 month ago
I had the same issue. In my trainer.py, this code caused it:

```python
model.train()
if hasattr(self.optimizer, "train") and callable(self.optimizer.train):
    self.optimizer.train()
```

I changed it to:

```python
model.train()
if hasattr(self.optimizer, "train"):
    self.model.train()
```

and the training started without an error. You can find it at line 3477 of trainer.py, as the traceback in your screenshot showed, under:
text-generation-webui\installer_files\env\Lib\site-packages\transformers
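For anyone hitting the same error: the point of the guard is that plain optimizers (e.g. torch.optim.SGD) have no `train()` method, while some newer ones (e.g. schedule-free optimizers) do and need it called. A minimal self-contained sketch of that, using hypothetical stand-in classes (`SGDLike` and `ScheduleFreeLike` are illustrative names, not real APIs):

```python
class SGDLike:
    """Stand-in for a plain optimizer with no train() method (hypothetical)."""
    pass


class ScheduleFreeLike:
    """Stand-in for an optimizer that must be switched into train mode (hypothetical)."""
    def __init__(self):
        self.mode = "eval"

    def train(self):
        self.mode = "train"


def set_train_mode(optimizer):
    # Only call optimizer.train() when the attribute exists and is callable;
    # calling it unconditionally raises AttributeError on plain optimizers.
    if hasattr(optimizer, "train") and callable(optimizer.train):
        optimizer.train()


set_train_mode(SGDLike())       # no-op, does not raise
opt = ScheduleFreeLike()
set_train_mode(opt)             # switches the optimizer into train mode
```

If the guarded version still fails in your install, the workaround above (skipping the optimizer call entirely) is harmless for optimizers that don't define `train()`.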
It worked, thank you! One thing, though: I was able to train a full Llama 3 8B model on 1 MB of text data with a LoRA rank of 1024, an overlap length of 255, 3 epochs, and a constant 3e-5 learning rate, in about 2 hours. But when I load the LoRA and try to interact, no output is generated. Please help. Thank you.
11:35:38-221065 INFO Loading "Llama3"
11:35:38-225992 INFO TRANSFORMERS_PARAMS=
{'low_cpu_mem_usage': False, 'torch_dtype': torch.bfloat16, 'use_flash_attention_2': True}
C:\Users\genco\OneDrive\Documents\text-generation-webui-main\installer_files\env\Lib\site-packages\transformers\generation\configuration_utils.py:611: UserWarning: `do_sample` is set to `False`. However, `min_p` is set to `0.0` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `min_p`.
  warnings.warn(
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 4.39it/s]
11:35:54-700169 INFO Loaded "Llama3" in 16.48 seconds.
11:35:54-702178 INFO LOADER: "Transformers"
11:35:54-702712 INFO TRUNCATION LENGTH: 8192
11:35:54-704798 INFO INSTRUCTION TEMPLATE: "Custom (obtained from model metadata)"
11:36:18-158534 INFO Loading raw text file dataset
11:38:01-168398 INFO Getting model ready
11:38:01-171155 INFO Preparing for training
11:38:01-172765 INFO Creating LoRA model
11:38:04-561745 INFO Starting training
Training 'llama' model using (q, v) projections
Trainable params: 436,207,616 (5.1522 %), All params: 8,466,468,864 (Model: 8,030,261,248)
11:38:04-578419 INFO Log file 'train_dataset_sample.json' created in the 'logs' directory.
Step: 159 {'loss': 1058159467901747.2, 'grad_norm': nan, 'learning_rate': 3e-05, 'epoch': 0.00564712526029718}
Step: 319 Stop Loss 0 reached.
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 3e-05, 'epoch': 0.01129425052059436}
Step: 319 {'train_runtime': 13177.6232, 'train_samples_per_second': 25.801, 'train_steps_per_second': 0.201, 'train_loss': 529079733950873.6, 'epoch': 0.01129425052059436}
15:17:43-696778 INFO LoRA training run is completed and saved.
15:17:44-195770 INFO Training complete, saving
15:17:45-335153 INFO Training complete!
15:19:50-069684 INFO Applying the following LoRAs to Llama3: GCC
Traceback (most recent call last):
  File "C:\Users\genco\OneDrive\Documents\text-generation-webui-main\modules\callbacks.py", line 61, in gentask
    ret = self.mfunc(callback=_callback, *args, **self.kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\genco\OneDrive\Documents\text-generation-webui-main\modules\text_generation.py", line 398, in generate_with_callback
    shared.model.generate(**kwargs)
  File "C:\Users\genco\OneDrive\Documents\text-generation-webui-main\installer_files\env\Lib\site-packages\peft\peft_model.py", line 1638, in generate
    outputs = self.base_model.generate(*args, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\genco\OneDrive\Documents\text-generation-webui-main\installer_files\env\Lib\site-packages\torch\utils\_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\genco\OneDrive\Documents\text-generation-webui-main\installer_files\env\Lib\site-packages\transformers\generation\utils.py", line 2048, in generate
    result = self._sample(
             ^^^^^^^^^^^^
  File "C:\Users\genco\OneDrive\Documents\text-generation-webui-main\installer_files\env\Lib\site-packages\transformers\generation\utils.py", line 3044, in _sample
    next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
Output generated in 4.19 seconds (0.00 tokens/s, 0 tokens, context 151, seed 164313411)
Traceback (most recent call last):
  File "C:\Users\genco\OneDrive\Documents\text-generation-webui-main\modules\callbacks.py", line 61, in gentask
    ret = self.mfunc(callback=_callback, *args, **self.kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\genco\OneDrive\Documents\text-generation-webui-main\modules\text_generation.py", line 398, in generate_with_callback
    shared.model.generate(**kwargs)
  File "C:\Users\genco\OneDrive\Documents\text-generation-webui-main\installer_files\env\Lib\site-packages\peft\peft_model.py", line 1638, in generate
    outputs = self.base_model.generate(*args, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\genco\OneDrive\Documents\text-generation-webui-main\installer_files\env\Lib\site-packages\torch\utils\_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\genco\OneDrive\Documents\text-generation-webui-main\installer_files\env\Lib\site-packages\transformers\generation\utils.py", line 2048, in generate
    result = self._sample(
             ^^^^^^^^^^^^
  File "C:\Users\genco\OneDrive\Documents\text-generation-webui-main\installer_files\env\Lib\site-packages\transformers\generation\utils.py", line 3044, in _sample
    next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
Output generated in 4.31 seconds (0.00 tokens/s, 0 tokens, context 165, seed 1511014698)
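The training log above shows `grad_norm: nan` and a loss around 1e15, so the run almost certainly diverged and baked NaN/Inf values into the adapter weights; that is exactly what makes `torch.multinomial` reject the probability tensor at generation time. A quick way to check for this, sketched in plain Python with a toy `{name: list-of-floats}` state dict (with a real model you would iterate `model.named_parameters()` and test each tensor with `torch.isnan`/`torch.isinf`):

```python
import math


def find_bad_values(state_dict):
    """Return the names of entries that contain NaN or Inf values.

    `state_dict` here is a plain {name: list of floats} mapping standing in
    for a LoRA adapter's weights; the scan logic is the same for tensors.
    """
    bad = []
    for name, values in state_dict.items():
        if any(math.isnan(v) or math.isinf(v) for v in values):
            bad.append(name)
    return bad


# Toy adapter: one healthy layer, one poisoned by a NaN gradient update.
weights = {
    "lora_A.weight": [0.01, -0.02, 0.03],
    "lora_B.weight": [float("nan"), 0.5, 1.0],
}
print(find_bad_values(weights))  # prints ['lora_B.weight']
```

If the scan flags the adapter, the fix is on the training side, not the generation side: a lower learning rate, a much smaller LoRA rank than 1024, and/or gradient clipping usually keeps the loss finite.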
Describe the bug
While training the models, I'm facing the following issue.
Is there an existing issue for this?
Reproduction
Training the models.
Screenshot
Logs
System Info