stanfordnlp / CoreNLP

CoreNLP: A Java suite of core NLP tools for tokenization, sentence segmentation, NER, parsing, coreference, sentiment analysis, etc.
http://stanfordnlp.github.io/CoreNLP/
GNU General Public License v3.0

PTBTokenizer Unrecognizable: (U+2063, decimal: 8291) #1281

Open rlyCarlson opened 2 years ago

rlyCarlson commented 2 years ago

PTBTokenizer crashed on this Unicode character (U+2063, decimal: 8291), which is an invisible comma/separator, and threw this error:

Untokenizable: ⁣ (U+2063, decimal: 8291)
Exception in thread "main" java.lang.ArithmeticException: integer overflow
	at java.lang.Math.toIntExact(Math.java:1011)
	at edu.stanford.nlp.process.PTBLexer.getNext(PTBLexer.java)
	at edu.stanford.nlp.process.PTBLexer.next(PTBLexer.java)
	at edu.stanford.nlp.process.PTBTokenizer.getNext(PTBTokenizer.java:301)
	at edu.stanford.nlp.process.PTBTokenizer.getNext(PTBTokenizer.java:185)
	at edu.stanford.nlp.process.AbstractTokenizer.hasNext(AbstractTokenizer.java:69)
	at edu.stanford.nlp.process.PTBTokenizer.tokReader(PTBTokenizer.java:493)
	at edu.stanford.nlp.process.PTBTokenizer.tok(PTBTokenizer.java:464)
	at edu.stanford.nlp.process.PTBTokenizer.main(PTBTokenizer.java:890)

I also tried using -filter \u2063, which threw the same error.

AngledLuffa commented 2 years ago

I think these are two separate issues & it's just a coincidence that the errors are printed this way.

1) We could figure out something to do with the "invisible comma". Currently "untokenizable" characters should just be dropped. You can test this by tokenizing a small file with such a character; it should not crash.

2) The documentation very clearly says that it will properly handle files longer than Integer.MAX_VALUE characters, but instead it just crashes. Oops.

https://github.com/stanfordnlp/CoreNLP/blob/f05cb54ec0a4f3c90395771817f44a81eb549baf/src/edu/stanford/nlp/process/PTBLexer.flex#L483
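For reference, Math.toIntExact is a narrowing call that refuses to truncate: once a long offset passes Integer.MAX_VALUE it throws exactly the ArithmeticException seen in the trace. A minimal standalone sketch of the JDK behavior (not CoreNLP code):

```java
public class ToIntExactDemo {
    public static void main(String[] args) {
        // A character offset one past Integer.MAX_VALUE (2^31 - 1),
        // i.e. roughly what the lexer reaches in a file over ~2G chars.
        long offset = (long) Integer.MAX_VALUE + 1;
        try {
            int narrowed = Math.toIntExact(offset); // throws instead of truncating
            System.out.println("narrowed to " + narrowed);
        } catch (ArithmeticException e) {
            // Same "integer overflow" message as in the stack trace above.
            System.out.println("ArithmeticException: " + e.getMessage());
        }
    }
}
```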

AngledLuffa commented 2 years ago

I suggest only tokenizing files smaller than 2GB until we figure this out.
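Until this is fixed, a caller could guard against the overflow up front. A hedged sketch (the cutoff is Integer.MAX_VALUE bytes, which is conservative, since a file can never hold more characters than bytes):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class SizeGuard {
    // Conservative cutoff: Integer.MAX_VALUE bytes (~2GB). A file at or
    // under this size cannot contain more than Integer.MAX_VALUE chars.
    static final long MAX_SAFE_BYTES = Integer.MAX_VALUE;

    static boolean safeToTokenize(Path file) throws IOException {
        return Files.size(file) <= MAX_SAFE_BYTES;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("ptb", ".txt");
        Files.writeString(tmp, "Unban\u2063mox\u2063opal");
        System.out.println(safeToTokenize(tmp)); // small file -> true
        Files.delete(tmp);
    }
}
```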

AngledLuffa commented 2 years ago

Just to confirm regarding the invisible separator:

java edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize -file foo.txt
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator tokenize

Processing file /home/john/CoreNLP/foo.txt ... writing to /home/john/CoreNLP/foo.txt.out
Untokenizable: ⁣ (U+2063, decimal: 8291)
Annotating file /home/john/CoreNLP/foo.txt ... done [0.1 sec].

Annotation pipeline timing information:
TokenizerAnnotator: 0.1 sec.
TOTAL: 0.1 sec. for 3 tokens at 56.6 tokens/sec.
Pipeline setup: 0.0 sec.
Total time for StanfordCoreNLP pipeline: 0.2 sec.

cat foo.txt.out
Document: ID=foo.txt (1 sentences, 3 tokens)

Sentence #1 (3 tokens):
Unban⁣mox⁣opal

Tokens:
[Text=Unban CharacterOffsetBegin=0 CharacterOffsetEnd=5]
[Text=mox CharacterOffsetBegin=6 CharacterOffsetEnd=9]
[Text=opal CharacterOffsetBegin=10 CharacterOffsetEnd=14]

Note that Chrome turns the invisible separator into a visible space, so it's proving quite difficult to paste the example file here. You can get as many invisible separators as you need here, though.
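Since pasting the character is unreliable, a test file can be generated with the \u2063 escape instead. The sketch below also checks the offset behavior shown in the output above: the dropped separator still counts toward character offsets (Unban ends at 5, mox begins at 6). The foo.txt filename is just an example:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class MakeTestFile {
    public static void main(String[] args) throws IOException {
        String text = "Unban\u2063mox\u2063opal"; // U+2063 INVISIBLE SEPARATOR

        // The separators sit at offsets 5 and 9, matching the gaps in the
        // tokenizer's CharacterOffset output (5->6 and 9->10).
        System.out.println(text.charAt(5) == '\u2063'); // true
        System.out.println(text.indexOf("mox"));        // 6
        System.out.println(text.indexOf("opal"));       // 10

        // Write the example file to feed to the tokenizer.
        Files.writeString(Path.of("foo.txt"), text);
    }
}
```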