tema16 / tt4j

Automatically exported from code.google.com/p/tt4j

Consider filtering out very long tokens #3

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
It seems that TreeTagger (TT) has problems with very long tokens. Consider 
writing a test that checks what the maximum token size is and subsequently 
adding code to tt4j that ignores such long tokens.
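
One way such a test could locate the limit is a binary search over token lengths. This is only a sketch: `MaxTokenLengthProbe`, `findMax` and the `accepts` predicate are hypothetical names, not part of tt4j. The predicate stands in for "TreeTagger handled a token of n bytes without failing", and the search assumes that every length up to some threshold works and every length above it fails.

```java
import java.util.function.IntPredicate;

// Hypothetical helper, not part of tt4j.
final class MaxTokenLengthProbe {

    /**
     * Returns the largest n in [low, high] for which accepts.test(n) is true,
     * assuming accepts is true up to some threshold and false above it.
     * Returns low - 1 if even the smallest length is rejected.
     */
    static int findMax(int low, int high, IntPredicate accepts) {
        int best = low - 1;
        while (low <= high) {
            int mid = low + (high - low) / 2; // avoids int overflow
            if (accepts.test(mid)) {
                best = mid;       // mid works, look for something longer
                low = mid + 1;
            } else {
                high = mid - 1;   // mid fails, look for something shorter
            }
        }
        return best;
    }
}
```

In a real test the predicate would start a tagger process, send one token of the given length, and report whether a result came back; each probe may need a fresh process if the previous one died.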

Original issue reported on code.google.com by richard.eckart on 5 May 2011 at 8:20

GoogleCodeExporter commented 9 years ago
- Improved handling of a dying TreeTagger process.
- Added a setting to control the maximum token length (in bytes); the default 
is 90000.
- Empirically determined that, at least on my machine, the maximum token length 
is 99998. I expect that there is a 100000-byte buffer in TreeTagger: this 
corresponds to 99998 one-byte characters + line break + terminating NUL (end of 
string in C).
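
A byte-length filter of the kind described above could look roughly like the following. This is a minimal sketch, not the actual tt4j code: the class and method names are hypothetical, and the choice of charset is an assumption (the length has to be measured in the encoding that is actually sent to TreeTagger). Only the default of 90000 bytes comes from the comment above.

```java
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not the tt4j API.
final class TokenLengthFilter {

    // Default from the comment above; it stays below the observed limit of 99998.
    static final int DEFAULT_MAX_TOKEN_BYTES = 90000;

    private final int maxBytes;
    private final Charset charset;

    TokenLengthFilter(int maxBytes, Charset charset) {
        this.maxBytes = maxBytes;
        this.charset = charset;
    }

    /** True if the encoded token fits within the configured byte limit. */
    boolean accepts(String token) {
        // Count bytes, not chars: multi-byte characters use more buffer space.
        return token.getBytes(charset).length <= maxBytes;
    }

    /** Returns a new list with the over-long tokens dropped, order preserved. */
    List<String> filter(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (accepts(token)) {
                kept.add(token);
            }
        }
        return kept;
    }
}
```

Measuring in bytes matters because a token of 60000 two-byte characters would pass a character-count check of 90000 but still exceed a 100000-byte buffer.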

Original comment by richard.eckart on 3 Jun 2011 at 9:46