stanfordnlp / stanza

Stanford NLP Python library for tokenization, sentence segmentation, NER, and parsing of many human languages
https://stanfordnlp.github.io/stanza/

Tokenizer doesn't respect combined_electra-large's max_length #1294

Open rmalouf opened 11 months ago

rmalouf commented 11 months ago

Describe the bug

When parsing a long text using the latest "combined_electra-large" model, I get the error:

```
Token indices sequence length is longer than the specified maximum sequence length for this model (630 > 512). Running this sequence through the model will result in indexing errors
Exception in thread parse_chunks:
Traceback (most recent call last):
  File "/home1/malouf/.pyenv/versions/3.11.3/lib/python3.11/threading.py", line 1038, in _bootstrap_inner
    self.run()
  File "/home1/malouf/batch/treebank/threadpipe.py", line 113, in run
    for tag, result in zip(tags, self.function(items)):
  File "/home1/malouf/batch/treebank/parse.py", line 125, in parse_chunks
    for doc_id, doc in zip(
  File "/home1/malouf/.pyenv/versions/treebank/lib/python3.11/site-packages/stanza/pipeline/core.py", line 456, in stream
    batch = self.bulk_process(batch, *args, **kwargs)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home1/malouf/.pyenv/versions/treebank/lib/python3.11/site-packages/stanza/pipeline/core.py", line 433, in bulk_process
    return self.process(docs, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home1/malouf/.pyenv/versions/treebank/lib/python3.11/site-packages/stanza/pipeline/core.py", line 422, in process
    doc = process(doc)
          ^^^^^^^^^^^^
  File "/home1/malouf/.pyenv/versions/treebank/lib/python3.11/site-packages/stanza/pipeline/processor.py", line 258, in bulk_process
    self.process(combined_doc) # annotations are attached to sentence objects
    ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home1/malouf/.pyenv/versions/treebank/lib/python3.11/site-packages/stanza/pipeline/pos_processor.py", line 84, in process
    batch.doc.set([doc.UPOS, doc.XPOS, doc.FEATS], [y for x in preds for y in x])
  File "/home1/malouf/.pyenv/versions/treebank/lib/python3.11/site-packages/stanza/models/common/doc.py", line 254, in set
    assert (to_token and self.num_tokens == len(contents)) or self.num_words == len(contents), \
                                                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: Contents must have the same length as the original file.
```


AngledLuffa commented 11 months ago

Yes, this is a known issue. We either need to use a transformer that allows a bigger window, or somehow combine representations to get decent results from a longer sentence. The biggest reason we can't simply run two consecutive passes of the transformer is that the second half of the sequence would treat a word in the middle of the sentence as the start of a sentence, given how the transformer's positional encodings work.

We hope to address this by the end of the year, but there are several things in our task list which need handling. A simple enough stopgap might be to fall back to the non-transformer model for sentences which are too long. In the meantime, you might consider discarding sentences which are that long.

rmalouf commented 11 months ago

I'd be fine with truncating or discarding long sentences for now, but unfortunately I can't tell that they're too long until after the text is tokenized. Is there an easy built-in way to truncate sentences mid-pipeline, or will I need to add a custom processor?
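One possible workaround (my own sketch, not a built-in stanza feature): run a tokenize-only pipeline first, drop any sentence whose estimated subword count would exceed the transformer window, and feed only the surviving sentences to the full pipeline. The 512-token window and the subwords-per-token ratio below are assumptions, and the stanza wiring in the comments is only a sketch.

```python
def filter_long_sentences(sentences, max_len=512, ratio=2.0):
    """Keep only sentences whose estimated subword count fits the window.

    `ratio` is a rough (assumed) upper bound on subwords per whitespace
    token; electra-style tokenizers can split numbers and rare words into
    many pieces, so the estimate is deliberately conservative.
    """
    kept = []
    for sent in sentences:
        n_tokens = len(sent.split())
        if n_tokens * ratio <= max_len:
            kept.append(sent)
    return kept

# Hypothetical two-stage usage with stanza (processor names are real,
# the overall wiring is a sketch):
#   tok = stanza.Pipeline(lang="en", processors="tokenize")
#   sents = [s.text for s in tok(raw_text).sentences]
#   safe_text = "\n\n".join(filter_long_sentences(sents))
#   nlp = stanza.Pipeline(lang="en", processors="tokenize,pos,lemma,depparse")
#   doc = nlp(safe_text)
```

This throws away some data, but it keeps the transformer POS model inside its window without needing a custom processor.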

AngledLuffa commented 11 months ago

I've got a rather high priority thing to work on today and tomorrow, but I can try to have something ready by Friday which at least avoids the crash.


rmalouf commented 11 months ago

Thanks, that's very generous! But you can hold off -- I'll take a stab at it first and come back next week if I can't make it work.

BLKSerene commented 9 months ago

Hi, I also get this error. Any updates or workarounds on this?

AngledLuffa commented 9 months ago

Let me see if I can get to it this winter break

rmalouf commented 4 months ago

I'm still running into this in stanza 1.8.2. An offending text fragment is:

```
Call Government Securities TOTAL . 31 December 1845 3,590,014 563,072 628,500 1,039,745 2,231,317 31 December 1846 3,280,864 634,575 423,060 938,717 1,996,352 31 December 1847 2,733,753 7,231,325 350,108 791,899 1,863,332 30 June 1848 3,170,118 588,871 159,724 1,295,047 2,043,642 31 December 1848 3,089,659 645,468 176,824 1,189,213 2,011,505 30 June 1849 3,392,857 552,642 246,494 964,800 1,763,936 31 December 1849 3,680,623 686,761 264,577 973,691 1,224,029 30 June 1850 3,821,022 654,649 258,177 972,055 1,884,881 31 December 1850 3,969,648 566,039 334,982 1,089,794 1,990,815 30 June 1851 4,414,179 691,719 424,195 1,054,018 2,169,932 31 December 1851 4,677,298 653,946 378,337 1,054,018 2,080,301 30 June 1852 5,245,135 861,778 136,687 1,054,018 2,122,483 31 December 1852 5,581,706 855,057 397,087 1,119,477 2,371,621 30 June 1853 6,219,817 904,252 499,467 1,218,852 2,622,571 31 December 1853 6,259,540 791,699 677,392 1,468,902 2,937,993 30 June 1854 6,892,470 827,397 917,557 1,457,415 3,202,369 31 December 1854 7,177,244 694,309 486,400 1,451,074 2,631,783 30 June 1855 8,166,553 722,243 483,890 1,754,074 2,960,207 31 December 1855 8,744,095 847,856 451,575 1,949,074 3,248,505 30 June 1856 11,170,010 906,876 601,800 1,980,489 3,489,165 31 December 1856 11,438,461 1,119,591 432,000 2,922,625 4,474,216 30 June 1857 13,913,058 967,078 687,730 3,353,179 5,007,987 31 December 1857 113,889,021 2,226,441 1,115,883 3,582,797 6,923,121 1191
```

Obviously I'm not expecting to get a useful parse of that. I'd just like the stream to not crash so I can continue processing text chunks.
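Until a fix lands, one way I'd sketch keeping the stream alive is to wrap each chunk's processing in a try/except and skip chunks that blow up (the `AssertionError` in the tracebacks here is what escapes from stanza; the helper below is hypothetical, not part of stanza):

```python
def process_safely(chunks, process_fn, on_error=None):
    """Yield (chunk_id, result) pairs, skipping chunks whose processing
    raises, so one bad chunk doesn't kill the whole stream.

    `chunks` is an iterable of (chunk_id, text) pairs and `process_fn`
    is whatever callable does the work (e.g. a stanza Pipeline).
    """
    for chunk_id, chunk in chunks:
        try:
            yield chunk_id, process_fn(chunk)
        except (AssertionError, IndexError) as exc:
            # The POS processor raises AssertionError on over-long
            # sentences (see tracebacks in this thread); log and move on.
            if on_error is not None:
                on_error(chunk_id, exc)

# usage sketch, assuming `nlp` is a stanza Pipeline:
#   results = dict(process_safely(enumerate(texts), nlp))
```

Catching a bare `AssertionError` is blunt, but it at least lets a long batch job finish while dropping the pathological chunks.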

AngledLuffa commented 3 months ago

Are you getting a different exception, though? I get the following log & traceback:

```
Token indices sequence length is longer than the specified maximum sequence length for this model (715 > 512). Running this sequence through the model will result in indexing errors
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/john/stanza/stanza/pipeline/core.py", line 480, in __call__
    return self.process(doc, processors)
  File "/home/john/stanza/stanza/pipeline/core.py", line 431, in process
    doc = process(doc)
  File "/home/john/stanza/stanza/pipeline/pos_processor.py", line 91, in process
    dataset.doc.set([doc.UPOS, doc.XPOS, doc.FEATS], [y for x in preds for y in x])
  File "/home/john/stanza/stanza/models/common/doc.py", line 303, in set
    assert (to_token and self.num_tokens == len(contents)) or self.num_words == len(contents), \
AssertionError: Contents must have the same length as the original file.
```

rmalouf commented 3 months ago

Oh, you're right! I didn't look closely enough. The first and last lines are the same, but it's a different assertion that's failing. Sorry about that.