flaviomartins closed this 8 years ago
I'd like to add to this too:
The "…" character still causes problems sometimes (this normally happens to manual retweets when the url gets mangled)
Example Tweet Text:
Some cars are in the river #NBC4NY http://t.co/WmK9Hc…
Is tokenized as:
some, cars, are, in, the, river, #nbc4ny, http, t, co, wmk9hc
@igorbrigadir thanks for this test case. Should we add the character "…" to the list in this pull request? It should fix the problem in your test case.
Yep, it looks like that would do it, but as @lintool pointed out on the mailing list, this requires re-indexing.
Not sure how much of an impact this makes on retrieval accuracy.
@igorbrigadir I added an additional commit with a check for the ellipsis character and included your example as a test case.
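For anyone following along, here is a rough sketch of the kind of ellipsis check being discussed. It is not the code from the commit: `EllipsisSketch`, `stripTrailingEllipsis`, and `tokenize` are hypothetical names, and the plain whitespace split only stands in for the real analyzer. It assumes the goal is to keep the trailing "…" (U+2026) from mangling the last token.

```java
import java.util.ArrayList;
import java.util.List;

public class EllipsisSketch {
    // Horizontal ellipsis, the character from the example tweet.
    static final char ELLIPSIS = '\u2026';

    // Hypothetical helper: drop any trailing ellipsis characters from a token.
    static String stripTrailingEllipsis(String token) {
        int end = token.length();
        while (end > 0 && token.charAt(end - 1) == ELLIPSIS) {
            end--;
        }
        return token.substring(0, end);
    }

    // Hypothetical stand-in for the analyzer: split on whitespace, clean each token.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.trim().split("\\s+")) {
            String cleaned = stripTrailingEllipsis(raw);
            if (!cleaned.isEmpty()) {
                tokens.add(cleaned);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String tweet = "Some cars are in the river #NBC4NY http://t.co/WmK9Hc\u2026";
        // Prints: [Some, cars, are, in, the, river, #NBC4NY, http://t.co/WmK9Hc]
        System.out.println(tokenize(tweet));
    }
}
```

With the ellipsis stripped, the truncated URL stays in one piece in this sketch. How the real analyzer should treat a truncated URL is a separate question that this does not settle.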
....
WhitespaceTokenizer splits according to Java's Character.isWhitespace(), which excludes the non-breaking space (nbsp); this breaks twitter-text's detection regex for valid URLs.
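A quick way to see the nbsp behaviour in plain Java, independent of Lucene or twitter-text. The `isSeparator` helper is a hypothetical illustration of one possible workaround (also accepting `Character.isSpaceChar`), not something taken from this pull request.

```java
public class NbspWhitespaceDemo {
    // Non-breaking space (U+00A0).
    static final char NBSP = '\u00A0';

    // Hypothetical helper: treat a code point as a separator if Java considers it
    // whitespace OR a Unicode space character (which includes non-breaking spaces).
    static boolean isSeparator(int codePoint) {
        return Character.isWhitespace(codePoint) || Character.isSpaceChar(codePoint);
    }

    public static void main(String[] args) {
        System.out.println(Character.isWhitespace(' '));   // true
        System.out.println(Character.isWhitespace(NBSP));  // false
        System.out.println(Character.isSpaceChar(NBSP));   // true
        System.out.println(isSeparator(NBSP));             // true
    }
}
```

So a tokenizer that relies only on Character.isWhitespace() leaves an nbsp glued to whatever text sits next to it.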