nlp-uoregon / trankit

Trankit is a Light-Weight Transformer-based Python Toolkit for Multilingual Natural Language Processing
Apache License 2.0

Limit input string to 512 characters to avoid CUDA crash #58

Open ulf1 opened 2 years ago

ulf1 commented 2 years ago

Problem

# If the input is longer than 512 characters
assert len(sentence) > 512
# then this call
annotated = model_trankit(sentence, is_sent=True)
# results in a CUDA error, e.g.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [19635,0,0], thread: [112,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.

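Cause

XLM-RoBERTa accepts at most 512 subword tokens per input sequence (special tokens included), not 512 characters. An input that exceeds this limit can lead to the index-out-of-bounds assertion shown above. Trimming to 512 characters is only a rough proxy for the token limit.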
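
Since the encoder limit is typically counted in subword tokens rather than raw characters, a pre-check on the token count may be more reliable than one on string length. Below is a minimal sketch, not trankit's own code. `fits_encoder` is a hypothetical helper, and `tokenize` is a caller-supplied function (for example, the `tokenize` method of a Hugging Face XLM-RoBERTa tokenizer). The limit of 512 and the two reserved special tokens are assumptions.

```python
from typing import Callable, List

MAX_TOKENS = 512   # assumed maximum sequence length of the encoder
NUM_SPECIAL = 2    # assumed room for the start and end special tokens


def fits_encoder(
    text: str,
    tokenize: Callable[[str], List[str]],
    max_tokens: int = MAX_TOKENS,
    num_special: int = NUM_SPECIAL,
) -> bool:
    """Return True if the tokenized text plus special tokens fits the limit."""
    return len(tokenize(text)) + num_special <= max_tokens


# Usage with a whitespace tokenizer as a stand-in for a real subword tokenizer
print(fits_encoder("a short sentence", str.split))
```

With a real subword tokenizer the count will usually be higher than the whitespace word count, so the stand-in above only illustrates the call pattern.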

Possible fix

https://github.com/nlp-uoregon/trankit/blob/1c19b9b7df3be1de91c2dd6879e0e325af5e2898/trankit/pipeline.py#L1066

Change

...

                ori_text = deepcopy(input)
                tagged_sent = self._posdep_sent(input)
...

to

...

                input = input[:512]   # <<< TRIM STRING TO MAX 512 (trim `input` itself, since it is what gets passed on)
                ori_text = deepcopy(input)
                tagged_sent = self._posdep_sent(input)
...
ulf1 commented 2 years ago

A quick fix for other trankit users would be to trim the string before calling the pipeline:

annotated = model_trankit(sentence[:512], is_sent=True)
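
Trimming discards everything after the 512th character. If the rest of the text matters, one alternative is to split the input into chunks of at most 512 characters and annotate each chunk separately. Below is a minimal sketch, assuming a hypothetical helper `split_for_pipeline` (not part of trankit) that prefers to cut at spaces:

```python
from typing import List


def split_for_pipeline(text: str, max_chars: int = 512) -> List[str]:
    """Split text into chunks of at most max_chars characters.

    Cuts at the last space inside the window when there is one,
    otherwise falls back to a hard cut at max_chars.
    """
    chunks: List[str] = []
    rest = text.strip()
    while len(rest) > max_chars:
        # Look for the last space at index <= max_chars
        cut = rest.rfind(" ", 0, max_chars + 1)
        if cut <= 0:
            cut = max_chars  # no usable space: hard cut
        chunks.append(rest[:cut].rstrip())
        rest = rest[cut:].lstrip()
    if rest:
        chunks.append(rest)
    return chunks


# Usage (model_trankit is the pipeline object from the snippet above):
# annotated = [model_trankit(chunk, is_sent=True)
#              for chunk in split_for_pipeline(sentence)]
```

Each chunk is parsed as its own sentence, so dependency relations across chunk boundaries are lost. Whether that is acceptable depends on the use case.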