lintool / twitter-tools

Twitter Tools
twittertools.cc

Handle the nbsp character to fix the detection of valid URLs (inc. test)... #43

Closed: flaviomartins closed 8 years ago

flaviomartins commented 10 years ago


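WhitespaceTokenizer splits on the characters for which Java's Character.isWhitespace() returns true. That excludes the non-breaking space (nbsp, U+00A0), so an nbsp stays attached to the neighbouring token, which breaks twitter-text's regex-based detection of valid URLs.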
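To illustrate the behaviour described above, here is a minimal, self-contained sketch. It does not use Lucene or twitter-text; `NbspSplitDemo` and its methods are hypothetical names, and the `example.com` URL is a placeholder. It only shows that `Character.isWhitespace()` rejects U+00A0 and what that does to a simple whitespace split.

```java
import java.util.ArrayList;
import java.util.List;

public class NbspSplitDemo {
    // Non-breaking space; Character.isWhitespace() returns false for it.
    public static final char NBSP = '\u00A0';

    // Splits only where Character.isWhitespace() is true.
    public static List<String> splitOnJavaWhitespace(String text) {
        return split(text, false);
    }

    // Hypothetical variant that also treats the nbsp as a delimiter.
    public static List<String> splitTreatingNbspAsWhitespace(String text) {
        return split(text, true);
    }

    private static List<String> split(String text, boolean nbspIsDelimiter) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean delimiter = Character.isWhitespace(c) || (nbspIsDelimiter && c == NBSP);
            if (delimiter) {
                // Close the current token, if any.
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.append(c);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "river" + NBSP + "http://example.com/a";
        System.out.println(Character.isWhitespace(NBSP));
        System.out.println(splitOnJavaWhitespace(text).size());
        System.out.println(splitTreatingNbspAsWhitespace(text));
    }
}
```

With the first method the word and the URL end up in one token, glued together by the nbsp, so a URL matcher applied to that token sees extra leading text. With the second they are separated.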

igorbrigadir commented 10 years ago

I'd like to add to this too:

The "…" character still causes problems sometimes (this usually happens with manual retweets, when the URL gets mangled).

Example Tweet Text:

Some cars are in the river #NBC4NY http://t.co/WmK9Hc…

Is tokenized as:

some, cars, are, in, the, river, #nbc4ny, http, t, co, wmk9hc

flaviomartins commented 10 years ago

@igorbrigadir thanks for this test case. Should we add the character "…" to the list in this pull request? It should fix the problem in your test case.

igorbrigadir commented 10 years ago

Yep, it looks like that would do it, but as @lintool pointed out on the mailing list, this requires re-indexing.

I'm not sure how much of an impact this has on retrieval accuracy.

flaviomartins commented 10 years ago

@igorbrigadir I added an additional commit with a check for the ellipsis character and included your example as a test case.
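For readers following along, one way such a check could work is sketched below. This is not the code from the commit: `EllipsisPreprocessDemo`, `normalize`, and `whitespaceTokens` are hypothetical names, and the sketch assumes the nbsp (U+00A0) and the ellipsis (U+2026) are simply replaced with a plain space before whitespace splitting. Lowercasing and the actual URL detection are left out.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class EllipsisPreprocessDemo {
    // Assumption: nbsp and the ellipsis are mapped to a plain space
    // so that they act as token boundaries.
    public static String normalize(String text) {
        return text.replace('\u00A0', ' ').replace('\u2026', ' ');
    }

    // Splits the normalized text on runs of whitespace.
    public static List<String> whitespaceTokens(String text) {
        String trimmed = normalize(text).trim();
        if (trimmed.isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    public static void main(String[] args) {
        // Example tweet text from the discussion above.
        String tweet = "Some cars are in the river #NBC4NY http://t.co/WmK9Hc\u2026";
        for (String token : whitespaceTokens(tweet)) {
            System.out.println(token);
        }
    }
}
```

In this sketch the truncated link survives as a single token without the trailing ellipsis, so a URL matcher can be applied to it as a whole instead of to the fragments http, t, co, wmk9hc.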