huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

HF Trainer Segmentation Fault #5590

Closed ksjae closed 4 years ago

ksjae commented 4 years ago

🐛 Bug

Information

Model I am using (Bert, XLNet ...): GPT2-medium & large

Language I am using the model on (English, Chinese ...): Korean (with custom trained tokenizer)

The problem arises when using:

    import faulthandler

    from transformers import (
        GPT2Config,
        GPT2LMHeadModel,
        GPT2TokenizerFast,
        LineByLineTextDataset,
        Trainer,
        TrainingArguments,
    )

    # Model built from the gpt2-medium config (weights are not loaded)
    config = GPT2Config.from_pretrained('gpt2-medium')
    model = GPT2LMHeadModel(config=config)
    tokenizer = GPT2TokenizerFast.from_pretrained("./data/TOKEN", model_max_length=1024)

    print('loading dataset...')
    dataset = LineByLineTextDataset(
        tokenizer=tokenizer,
        file_path="./data/kowiki.txt",
        block_size=512,
    )

    training_args = TrainingArguments(
        output_dir='./m',               # output directory
        num_train_epochs=1,             # total number of training epochs
        per_device_train_batch_size=1,  # batch size per device during training - the higher the better, but may OOM
        per_device_eval_batch_size=1,   # batch size for evaluation
        logging_dir='./logs',           # directory for storing logs
        save_steps=10000,
        do_train=True,
    )

    trainer = Trainer(
        model=model,            # the instantiated Transformers model to be trained
        args=training_args,     # training arguments, defined above
        train_dataset=dataset,  # training dataset
    )
    faulthandler.enable()  # dump Python tracebacks on a fatal signal
    trainer.train()
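The script calls `faulthandler.enable()` just before training, which is what produces the per-thread tracebacks in the error message. A minimal sketch of the same idea that writes the dump to a file, so it is not interleaved with progress-bar output. The helper name and the log path are hypothetical, chosen only for illustration:

```python
import faulthandler
import os
import tempfile


def enable_crash_log(path):
    """Enable faulthandler and send fatal-error tracebacks to `path`.

    Hypothetical helper, not part of transformers. The file object is
    returned because faulthandler needs it to stay open.
    """
    log_file = open(path, "w")
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file


# Illustrative path; any writable location works.
crash_log_path = os.path.join(tempfile.gettempdir(), "trainer_crash.log")
crash_log = enable_crash_log(crash_log_path)
```

Calling this once at the top of the training script, before `trainer.train()`, should be enough for a later segmentation fault to leave its traceback in the log file.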


The task I am working on is:
* [ ] an official GLUE/SQuAD task: (give the name)
* [ O ] my own task or dataset: (give details below)
Text generation with prompt (trained on wiki & novel)

## To reproduce

Steps to reproduce the behavior:

1. Modify the path to the data file.
2. Use any file (tested with Korean, UTF-8).
3. Use any tokenizer (tested with my own custom-trained tokenizer and the GPT-2 tokenizers).


### Error message

    loading dataset...
    Epoch:   0%|          | 0/1 [00:00<?, ?it/s]
    Fatal Python error: Segmentation fault | 0/99996 [00:00<?, ?it/s]

    Thread 0x00007f872dfff700 (most recent call first):
      File "/opt/conda/lib/python3.6/threading.py", line 299 in wait
      File "/opt/conda/lib/python3.6/threading.py", line 551 in wait
      File "/opt/conda/lib/python3.6/site-packages/tqdm/_monitor.py", line 69 in run
      File "/opt/conda/lib/python3.6/threading.py", line 916 in _bootstrap_inner
      File "/opt/conda/lib/python3.6/threading.py", line 884 in _bootstrap

    Thread 0x00007f8736bb5700 (most recent call first):
      File "/opt/conda/lib/python3.6/threading.py", line 299 in wait
      File "/opt/conda/lib/python3.6/queue.py", line 173 in get
      File "/opt/conda/lib/python3.6/site-packages/tensorboard/summary/writer/event_file_writer.py", line 205 in run
      File "/opt/conda/lib/python3.6/threading.py", line 916 in _bootstrap_inner
      File "/opt/conda/lib/python3.6/threading.py", line 884 in _bootstrap

    Current thread 0x00007f88273e7740 (most recent call first):
      File "/opt/conda/lib/python3.6/site-packages/torch/cuda/comm.py", line 39 in broadcast_coalesced
      File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/_functions.py", line 21 in forward
      File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/replicate.py", line 71 in _broadcast_coalesced_reshape
      File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/replicate.py", line 88 in replicate
      File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 159 in replicate
      File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 154 in forward
      File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 577 in __call__
      File "/opt/conda/lib/python3.6/site-packages/transformers/trainer.py", line 622 in _training_step
      File "/opt/conda/lib/python3.6/site-packages/transformers/trainer.py", line 499 in train
      File "trainer.py", line 34 in <module>
    Segmentation fault (core dumped)
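The current thread's stack passes through `torch/nn/parallel/data_parallel.py` and `replicate.py`, so the crash happens while the model is being copied to more than one GPU. One way to check whether the fault is specific to that multi-GPU path is to expose a single device to the process before PyTorch is imported. A minimal sketch, assuming the standard `CUDA_VISIBLE_DEVICES` environment variable; the helper name is hypothetical:

```python
import os


def single_gpu_env(env, device_index=0):
    """Return a copy of `env` that exposes only one CUDA device.

    Hypothetical helper. CUDA_VISIBLE_DEVICES is read by the CUDA runtime,
    so it has to be set before the first CUDA call (in practice, before
    `import torch`).
    """
    restricted = dict(env)
    restricted["CUDA_VISIBLE_DEVICES"] = str(device_index)
    return restricted


# Apply to the current process before importing torch.
os.environ.update(single_gpu_env(os.environ))
```

If training then gets past the first step, the problem is more likely in the multi-GPU broadcast than in the model or the dataset.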



## Expected behavior
Training proceeds normally.

## Environment info

- `transformers` version: 3.0.2
- Platform: Linux-4.4.0-178-generic-x86_64-with-debian-buster-sid
- Python version: 3.6.10
- PyTorch version (GPU?): 1.6.0a0+9907a3e (True)
- Tensorflow version (GPU?): not installed (NA)
- Using GPU in script?: Yes
- Using distributed or parallel set-up in script?: Planning to (don't see any flags!)

LysandreJik commented 4 years ago

Hi! Could you paste the output of `pip list` from your environment here?

ksjae commented 4 years ago

absl-py                0.9.0
apex                   0.1
astor                  0.8.1
astunparse             1.6.3
backcall               0.1.0
beautifulsoup4         4.9.1
blis                   0.4.1
Bottleneck             1.3.2
cachetools             4.1.0
catalogue              1.0.0
certifi                2020.6.20
chardet                3.0.4
click                  7.1.2
cycler                 0.10.0
cymem                  2.0.3
Cython                 0.29.20
decorator              4.4.2
fastai                 1.0.61
fastprogress           0.2.3
filelock               3.0.12
fire                   0.3.1
future                 0.18.2
gast                   0.2.2
gluonnlp               0.9.1
google-auth            1.18.0
google-auth-oauthlib   0.4.1
google-pasta           0.2.0
graphviz               0.8.4
grpcio                 1.29.0
h5py                   2.10.0
idna                   2.8
importlib-metadata     1.6.1
ipython                7.14.0
ipython-genutils       0.2.0
jedi                   0.17.0
joblib                 0.15.1
Keras-Applications     1.0.8
Keras-Preprocessing    1.1.2
kiwisolver             1.2.0
kobert-transformers    0.4.1
kogpt2                 0.1.1
kss                    1.3.1
Markdown               3.2.2
matplotlib             3.2.2
mecab-python3          1.0.0
murmurhash             1.0.2
mxnet                  1.6.0
natto                  0.1.7
numexpr                2.7.1
numpy                  1.19.0
nvidia-ml-py3          7.352.0
oauthlib               3.1.0
opt-einsum             3.2.1
packaging              20.4
pandas                 1.0.5
parso                  0.7.0
pdf2image              1.9.0
pexpect                4.8.0
pickleshare            0.7.5
Pillow                 6.2.0
pip                    20.1.1
plac                   1.1.3
preshed                3.0.2
prompt-toolkit         3.0.5
protobuf               3.12.2
psutil                 5.7.0
ptyprocess             0.6.0
pyasn1                 0.4.8
pyasn1-modules         0.2.8
Pygments               2.6.1
pyparsing              2.4.7
pytesseract            0.2.7
python-dateutil        2.8.1
pytz                   2020.1
PyYAML                 5.3.1
regex                  2017.4.5
requests               2.21.0
requests-oauthlib      1.3.0
rsa                    4.6
sacremoses             0.0.43
scikit-learn           0.23.1
scipy                  1.4.1
sentencepiece          0.1.91
setuptools             41.2.0
six                    1.14.0
soupsieve              2.0.1
soynlp                 0.0.493
spacy                  2.3.0
srsly                  1.0.2
tensorboard            1.15.0
tensorboard-plugin-wit 1.6.0.post3
tensorflow             1.15.0
tensorflow-estimator   1.15.1
termcolor              1.1.0
thinc                  7.4.1
threadpoolctl          2.1.0
tokenizers             0.7.0
torch                  1.5.1+cu101
torchvision            0.6.1+cu101
tqdm                   4.46.1
traitlets              4.3.3
transformers           2.11.0
urllib3                1.24.3
wasabi                 0.7.0
wcwidth                0.1.9
Werkzeug               1.0.1
wheel                  0.34.2
wrapt                  1.12.1
zipp                   3.1.0
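The list above shows `transformers` 2.11.0 and `torch` 1.5.1+cu101, while the environment info in the report gives 3.0.2 and 1.6.0a0+9907a3e, so the two outputs may come from different environments. A minimal sketch for printing the versions seen by the interpreter that actually runs the script. It assumes Python 3.8+ for `importlib.metadata` (on Python 3.6, the `importlib_metadata` backport listed above provides the same functions), and the helper name is hypothetical:

```python
import sys
from importlib import metadata


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Show which interpreter is running and what it can import.
print(sys.executable)
print(installed_versions(["transformers", "torch", "tokenizers"]))
```

Running this at the top of the training script, rather than in a separate shell, avoids picking up a different environment by accident.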

ksjae commented 4 years ago

Is there anything else I should post?

ksjae commented 4 years ago

Bumping @sgugger to analyze this issue.

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.