Closed: Hurri08 closed this 1 year ago
Just tried it with a GTX 1060 6 GB that I had lying around; still the same error. CUDA 11.7 and driver version 515, with the PyTorch build for CUDA 11.7.
The dataset I tried this with was pretty damn big, around 50 MB or so, so that shouldn't be the problem. What's your system memory? I think it might be running out of that. My rig has 16 GB.
System memory is 64 GB. Did your dataset also contain emojis? Maybe that's a problem?
Everything else on my PC looks fine; it just fills up the VRAM really quickly and then errors out.
Could you post the output of "pip list"? Maybe we're not running the same versions, which might be causing the trouble on my end.
Sure:
Package Version
---------------------- ------------
absl-py 0.13.0
aiohttp 3.7.4.post0
aitextgen 0.5.2
async-timeout 3.0.1
attrs 21.2.0
autopep8 1.5.7
cachetools 4.2.2
certifi 2021.5.30
chardet 4.0.0
charset-normalizer 2.0.3
click 8.0.1
discord 1.7.3
discord.py 1.7.3
filelock 3.0.12
fire 0.4.0
fsspec 2021.7.0
future 0.18.2
google-auth 1.33.1
google-auth-oauthlib 0.4.4
grpcio 1.39.0
huggingface-hub 0.0.12
idna 3.2
joblib 1.0.1
Markdown 3.3.4
multidict 5.1.0
numpy 1.21.1
oauthlib 3.1.1
packaging 21.0
Pillow 8.3.2
pip 21.2.4
protobuf 3.17.3
pyasn1 0.4.8
pyasn1-modules 0.2.8
pycodestyle 2.7.0
pyDeprecate 0.3.0
pyparsing 2.4.7
pytorch-lightning 1.3.8
PyYAML 5.4.1
regex 2021.7.6
requests 2.26.0
requests-oauthlib 1.3.0
rsa 4.7.2
sacremoses 0.0.45
setuptools 58.0.4
six 1.16.0
tensorboard 2.4.1
tensorboard-plugin-wit 1.8.0
termcolor 1.1.0
tokenizers 0.10.3
toml 0.10.2
torch 1.10.0+cu113
torchmetrics 0.4.1
tqdm 4.61.2
transformers 4.9.0
typing-extensions 3.10.0.0
urllib3 1.26.6
Werkzeug 2.0.1
wheel 0.37.0
yarl 1.6.3
I made a separate conda environment before installing any dependencies, so you might want to try that too; it could be clashing with system installations of other libraries.
Still running OOM, even with num_steps=1. The last thing I can think of is that I'm using a different Python version in my conda env. I installed everything you have (plus mkl==2021.1.1, which was required), so I'm not sure what's causing these problems on my end.
Here is the full error message I get:
Traceback (most recent call last):
  File "/home/slb/Desktop/Gravital-master/main.py", line 57, in <module>
    main()
  File "/home/slb/Desktop/Gravital-master/main.py", line 36, in main
    ai.train("dataset.txt",
  File "/home/slb/miniconda3/envs/aibot/lib/python3.9/site-packages/aitextgen/aitextgen.py", line 752, in train
    trainer.fit(train_model)
  File "/home/slb/miniconda3/envs/aibot/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 460, in fit
    self._run(model)
  File "/home/slb/miniconda3/envs/aibot/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 758, in _run
    self.dispatch()
  File "/home/slb/miniconda3/envs/aibot/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 799, in dispatch
    self.accelerator.start_training(self)
  File "/home/slb/miniconda3/envs/aibot/lib/python3.9/site-packages/pytorch_lightning/accelerators/accelerator.py", line 96, in start_training
    self.training_type_plugin.start_training(trainer)
  File "/home/slb/miniconda3/envs/aibot/lib/python3.9/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 144, in start_training
    self._results = trainer.run_stage()
  File "/home/slb/miniconda3/envs/aibot/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 809, in run_stage
    return self.run_train()
  File "/home/slb/miniconda3/envs/aibot/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 871, in run_train
    self.train_loop.run_training_epoch()
  File "/home/slb/miniconda3/envs/aibot/lib/python3.9/site-packages/pytorch_lightning/trainer/training_loop.py", line 499, in run_training_epoch
    batch_output = self.run_training_batch(batch, batch_idx, dataloader_idx)
  File "/home/slb/miniconda3/envs/aibot/lib/python3.9/site-packages/pytorch_lightning/trainer/training_loop.py", line 738, in run_training_batch
    self.optimizer_step(optimizer, opt_idx, batch_idx, train_step_and_backward_closure)
  File "/home/slb/miniconda3/envs/aibot/lib/python3.9/site-packages/pytorch_lightning/trainer/training_loop.py", line 434, in optimizer_step
    model_ref.optimizer_step(
  File "/home/slb/miniconda3/envs/aibot/lib/python3.9/site-packages/pytorch_lightning/core/lightning.py", line 1403, in optimizer_step
    optimizer.step(closure=optimizer_closure)
  File "/home/slb/miniconda3/envs/aibot/lib/python3.9/site-packages/pytorch_lightning/core/optimizer.py", line 214, in step
    self.__optimizer_step(*args, closure=closure, profiler_name=profiler_name, **kwargs)
  File "/home/slb/miniconda3/envs/aibot/lib/python3.9/site-packages/pytorch_lightning/core/optimizer.py", line 134, in __optimizer_step
    trainer.accelerator.optimizer_step(optimizer, self._optimizer_idx, lambda_closure=closure, **kwargs)
  File "/home/slb/miniconda3/envs/aibot/lib/python3.9/site-packages/pytorch_lightning/accelerators/accelerator.py", line 325, in optimizer_step
    make_optimizer_step = self.precision_plugin.pre_optimizer_step(
  File "/home/slb/miniconda3/envs/aibot/lib/python3.9/site-packages/pytorch_lightning/plugins/precision/native_amp.py", line 93, in pre_optimizer_step
    result = lambda_closure()
  File "/home/slb/miniconda3/envs/aibot/lib/python3.9/site-packages/pytorch_lightning/trainer/training_loop.py", line 732, in train_step_and_backward_closure
    result = self.training_step_and_backward(
  File "/home/slb/miniconda3/envs/aibot/lib/python3.9/site-packages/pytorch_lightning/trainer/training_loop.py", line 823, in training_step_and_backward
    result = self.training_step(split_batch, batch_idx, opt_idx, hiddens)
  File "/home/slb/miniconda3/envs/aibot/lib/python3.9/site-packages/pytorch_lightning/trainer/training_loop.py", line 290, in training_step
    training_step_output = self.trainer.accelerator.training_step(args)
  File "/home/slb/miniconda3/envs/aibot/lib/python3.9/site-packages/pytorch_lightning/accelerators/accelerator.py", line 204, in training_step
    return self.training_type_plugin.training_step(*args)
  File "/home/slb/miniconda3/envs/aibot/lib/python3.9/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 155, in training_step
    return self.lightning_module.training_step(*args, **kwargs)
  File "/home/slb/miniconda3/envs/aibot/lib/python3.9/site-packages/aitextgen/train.py", line 35, in training_step
    outputs = self({"input_ids": batch, "labels": batch})
  File "/home/slb/miniconda3/envs/aibot/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/slb/miniconda3/envs/aibot/lib/python3.9/site-packages/aitextgen/train.py", line 32, in forward
    return self.model(**inputs, return_dict=False)
  File "/home/slb/miniconda3/envs/aibot/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/slb/miniconda3/envs/aibot/lib/python3.9/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 943, in forward
    transformer_outputs = self.transformer(
  File "/home/slb/miniconda3/envs/aibot/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/slb/miniconda3/envs/aibot/lib/python3.9/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 791, in forward
    outputs = block(
  File "/home/slb/miniconda3/envs/aibot/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/slb/miniconda3/envs/aibot/lib/python3.9/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 317, in forward
    attn_outputs = self.attn(
  File "/home/slb/miniconda3/envs/aibot/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/slb/miniconda3/envs/aibot/lib/python3.9/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 258, in forward
    attn_output, attn_weights = self._attn(query, key, value, attention_mask, head_mask)
  File "/home/slb/miniconda3/envs/aibot/lib/python3.9/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 194, in _attn
    attn_weights = self.attn_dropout(attn_weights)
  File "/home/slb/miniconda3/envs/aibot/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/slb/miniconda3/envs/aibot/lib/python3.9/site-packages/torch/nn/modules/dropout.py", line 58, in forward
    return F.dropout(input, self.p, self.training, self.inplace)
  File "/home/slb/miniconda3/envs/aibot/lib/python3.9/site-packages/torch/nn/functional.py", line 1169, in dropout
    return _VF.dropout_(input, p, training) if inplace else _VF.dropout(input, p, training)
RuntimeError: CUDA out of memory. Tried to allocate 96.00 MiB (GPU 0; 5.93 GiB total capacity; 4.63 GiB already allocated; 79.25 MiB free; 4.67 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
  0%|          | 0/1 [00:00<?, ?it/s]
OK, I solved it. It seems to have been a problem with the block_size used. If I use aitextgen directly with their test script, my training data, and the same model, it works. Maybe it would be a good idea to mention that people can change block_size if they run into problems?
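For anyone who lands here, a minimal sketch of the kind of change I mean, assuming aitextgen's TokenDataset API; block_size=64, the file name, and the model choice are example values, not the repo's exact script:

from aitextgen import aitextgen
from aitextgen.TokenDataset import TokenDataset

# Pre-tokenize the training text with a smaller context window.
# The GPT-2 default block_size of 1024 is what overflowed 6 GB of VRAM.
data = TokenDataset("dataset.txt", block_size=64)

ai = aitextgen(tf_gpt2="124M", to_gpu=True)
ai.train(data, batch_size=1, num_steps=5000)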
Hello!
First of all, thanks for your work. I have a problem with training the bot. I've got a dataset of ~150k lines and have already slimmed it down to 2500. Every time I set num_steps higher than 1, my GPU runs out of memory. It's a Vega 64 with 8 GB of VRAM, running ROCm 5.4.3 on Ubuntu with PyTorch version 2.1.0a0+gitefc3887. batch_size is also set to 2; anything higher likewise results in OOM.
Is the dataset just too big or could there be another problem?
Thanks very much in advance for any help! Also let me know if you need additional info!
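For context, the training call is roughly this (simplified; the file name and model choice are placeholders, not my exact script):

from aitextgen import aitextgen

ai = aitextgen(tf_gpt2="124M", to_gpu=True)
# batch_size=2 and num_steps=1 are the highest values that don't OOM
# on the 8 GB Vega 64.
ai.train("dataset.txt", batch_size=2, num_steps=1)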