jiemakel / arpa

Keyword extraction tool using LAS+SPARQL
MIT License

Whitespace in the beginning of the processed text is not trimmed #3

Closed (evsheino closed this issue 7 years ago)

evsheino commented 8 years ago

The string " foo bar" yields the ngrams [ " foo", "bar", " foo bar" ], whereas the expected ngrams would be [ "foo", "bar", "foo bar" ]. Also, multiple whitespace characters at the beginning of the string produce a whitespace-only ngram: "  foo bar" => [ " ", " foo", "bar" ]. Since whitespace at the end of the string is already trimmed, the intention was presumably to trim whitespace from both ends.
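To make the expected behavior concrete, here is a minimal sketch in Python. It is not ARPA's actual implementation: the function name `ngrams` and the `max_n` parameter are made up for illustration, and the tokenization is a plain whitespace split, which is an assumption rather than what ARPA or LAS really does.

```python
def ngrams(text, max_n=2):
    """Return all word n-grams of length 1..max_n, shortest first.

    Hypothetical helper for illustration only; not ARPA's API.
    """
    # str.split() with no arguments ignores leading and trailing whitespace
    # and treats any run of whitespace as a single separator, so no
    # whitespace-only or whitespace-prefixed tokens can appear.
    tokens = text.split()
    result = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            result.append(" ".join(tokens[i:i + n]))
    return result


print(ngrams(" foo bar"))       # one leading space
print(ngrams("   foo bar  "))   # several leading and trailing spaces
```

Both calls should give the same three ngrams as the unpadded string "foo bar", which is the behavior requested above.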

jiemakel commented 8 years ago

This is actually a symptom of a larger problem: ARPA tokenizes and processes the text itself before passing it on to LAS. In the past, this made some sense in terms of data throughput efficiency, because LAS didn't make use of contextual information for word disambiguation. Nowadays it does, and it also supports smarter tokenization.

However, LAS output currently doesn't include the punctuation characters that are stripped out as part of tokenization before analysis, and ARPA needs them. Thus, the real fix will be to add this information to what LAS returns, and then refactor ARPA to work entirely off of that.