rlyCarlson opened 2 years ago
I think these are two separate issues & it's just a coincidence that the errors are printed this way.
1) We could figure out something better to do with the "invisible comma". Currently, "untokenizable" characters are simply dropped. You can test this by tokenizing a small file containing such a character; it should not crash.
2) The documentation very clearly says that it will properly handle files longer than Integer.MAX_VALUE characters, but instead it just crashes. Oops.
In the meantime, I suggest only tokenizing files smaller than 2GB until we figure it out.
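Until that's fixed, a workaround could be a pre-flight size check in whatever script drives the pipeline. This is just a sketch, not anything in CoreNLP itself: it compares the byte count (which is an upper bound on the character count for UTF-8 input) against Integer.MAX_VALUE, and `can_tokenize` is a made-up helper name:

```shell
# can_tokenize FILE: succeed only if FILE is safely under Integer.MAX_VALUE
# characters. wc -c counts bytes, and bytes >= characters for UTF-8, so this
# guard is conservative.
can_tokenize() {
    [ "$(wc -c < "$1")" -lt 2147483647 ]
}

# Usage: only run the pipeline when the check passes, e.g.
# can_tokenize foo.txt && \
#     java edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize -file foo.txt
```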
Just to confirm regarding the invisible separator:
java edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize -file foo.txt
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator tokenize
Processing file /home/john/CoreNLP/foo.txt ... writing to /home/john/CoreNLP/foo.txt.out
Untokenizable: (U+2063, decimal: 8291)
Annotating file /home/john/CoreNLP/foo.txt ... done [0.1 sec].
Annotation pipeline timing information:
TokenizerAnnotator: 0.1 sec.
TOTAL: 0.1 sec. for 3 tokens at 56.6 tokens/sec.
Pipeline setup: 0.0 sec.
Total time for StanfordCoreNLP pipeline: 0.2 sec.
cat foo.txt.out
Document: ID=foo.txt (1 sentences, 3 tokens)
Sentence #1 (3 tokens):
Unbanmoxopal
Tokens:
[Text=Unban CharacterOffsetBegin=0 CharacterOffsetEnd=5]
[Text=mox CharacterOffsetBegin=6 CharacterOffsetEnd=9]
[Text=opal CharacterOffsetBegin=10 CharacterOffsetEnd=14]
Note that Chrome turns the invisible separator into a visible space, so it's proving quite difficult to paste the example file here. You can get as many invisible separators as you need here, though.
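For what it's worth, the example file can also be rebuilt locally without any pasting, by emitting the UTF-8 bytes of U+2063 directly (the name foo.txt just matches the run above):

```shell
# U+2063 (INVISIBLE SEPARATOR) is e2 81 a3 in UTF-8; emitting the raw bytes
# avoids depending on printf's \uXXXX support.
sep=$(printf '\xe2\x81\xa3')
printf 'Unban%smox%sopal\n' "$sep" "$sep" > foo.txt
wc -c < foo.txt   # 19 bytes: 12 letters + two 3-byte separators + newline
```

The character offsets in the output above (0-5, 6-9, 10-14) are consistent with this: each dropped separator still occupies one character position.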
PTBTokenizer crashed on this Unicode character (U+2063, decimal: 8291), which is an invisible comma/separator, and threw this error:
Also tried using -filter \u2063, and it threw the same error.