idumiY / lucene-gosen

Automatically exported from code.google.com/p/lucene-gosen

Invalid token split for text longer than 4096 characters #32

Open GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago
When the input text is longer than 4096 characters, 
GosenTokenizer (StreamTagger2) outputs invalid tokens.

This happens when the first 4096 characters (positions 0 to 4096) contain none of the following characters:

0x000D:CARRIAGE RETURN(CR)
0x000A:LINE FEED(LF)
0x0085:NEXT LINE (NEL)
0x2028:LINE SEPARATOR
0x2029:PARAGRAPH SEPARATOR

When lucene-gosen tokenizes such text, a word that straddles the 
4096-character boundary is split into two terms.
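
The behavior described above can be illustrated with a small stand-alone sketch. It does not use lucene-gosen itself: the class `ChunkBoundaryDemo`, the `chunk` method, and the assumption that the input is cut at the last separator inside a 4096-character window (or cut hard at 4096 when no separator is found) are hypothetical, chosen only to mirror the symptom in this report.

```java
import java.util.ArrayList;
import java.util.List;

public class ChunkBoundaryDemo {
    // Assumed window size, taken from the 4096 figure in this report.
    static final int BUFFER_SIZE = 4096;

    // The five separator code points listed in this report.
    static boolean isSeparator(char c) {
        return c == 0x000D || c == 0x000A || c == 0x0085 || c == 0x2028 || c == 0x2029;
    }

    // Hypothetical chunker: cut after the last separator inside the window,
    // or cut hard at BUFFER_SIZE when the window contains no separator.
    static List<String> chunk(String text) {
        List<String> chunks = new ArrayList<>();
        int start = 0;
        while (start < text.length()) {
            int end = Math.min(start + BUFFER_SIZE, text.length());
            if (end < text.length()) {
                for (int i = end - 1; i >= start; i--) {
                    if (isSeparator(text.charAt(i))) {
                        end = i + 1;
                        break;
                    }
                }
            }
            chunks.add(text.substring(start, end));
            start = end;
        }
        return chunks;
    }

    // Small helper so the sketch also compiles on older Java versions.
    static String repeat(char c, int n) {
        StringBuilder sb = new StringBuilder(n);
        for (int i = 0; i < n; i++) {
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // A three-character word (U+6771 U+4EAC U+90FD) placed across the boundary.
        String word = "\u6771\u4EAC\u90FD";
        String noSeparator = repeat('a', 4094) + word;
        String withSeparator = repeat('a', 100) + "\n" + repeat('a', 3993) + word;

        for (String chunk : chunk(noSeparator)) {
            System.out.println("no separator:   chunk length " + chunk.length());
        }
        for (String chunk : chunk(withSeparator)) {
            System.out.println("with separator: chunk length " + chunk.length());
        }
    }
}
```

Under these assumptions, the input without a separator is cut in the middle of the word, while the input with a line feed is cut at the line feed and the word stays whole in the second chunk.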

Original issue reported on code.google.com by johtani on 5 Jun 2012 at 11:39

GoogleCodeExporter commented 8 years ago
Added a test case.
Also added a fix for one specific pattern:
if the 4096 characters include 0x3002 (。), lucene-gosen now outputs correct tokens.
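
One plausible reading of this fix is that 0x3002 is treated as an additional safe cut point. The sketch below is an assumption for illustration only, not the attached patch: the class `SentenceCutDemo` and the methods `isBoundary` and `cutPoint` are hypothetical names.

```java
public class SentenceCutDemo {
    // Line separators from the original report, plus 0x3002 (ideographic full stop).
    static boolean isBoundary(char c) {
        return c == 0x000D || c == 0x000A || c == 0x0085
                || c == 0x2028 || c == 0x2029 || c == 0x3002;
    }

    // Returns the index at which to cut the text: just after the last boundary
    // character inside the first `limit` characters, or `limit` if none exists.
    static int cutPoint(String text, int limit) {
        int end = Math.min(limit, text.length());
        if (end == text.length()) {
            return end; // the whole text fits, nothing to cut
        }
        for (int i = end - 1; i >= 0; i--) {
            if (isBoundary(text.charAt(i))) {
                return i + 1;
            }
        }
        return end; // no boundary found: a hard cut that may split a word
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 4000; i++) {
            sb.append('x');
        }
        sb.append('\u3002');
        for (int i = 0; i < 200; i++) {
            sb.append('y');
        }
        System.out.println("cut at " + cutPoint(sb.toString(), 4096));
    }
}
```

With 0x3002 in the boundary set, the cut lands right after the full stop instead of at the 4096th character, so a word near the limit is no longer split.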

Patch and test case attached.

Original comment by johtani on 5 Jun 2012 at 11:45

Attachments: