Whitespace in the beginning of the processed text is not trimmed

jiemakel / arpa

Keyword extraction tool using LAS+SPARQL

MIT License


This is actually a symptom of a larger problem: ARPA tokenizes and processes the text itself before passing it on to LAS. In the past, this made some sense in terms of data throughput efficiency, because LAS didn't make use of contextual information for word disambiguation. Nowadays it does, and it also supports smarter tokenization.

However, LAS output currently doesn't include the punctuation characters that are stripped out as part of tokenization before analysis, and ARPA needs them. Thus, the real fix will be to add this information to what LAS returns, and then refactor ARPA to work entirely off of that.
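
To illustrate what "including the stripped characters" could look like, here is a minimal Python sketch of a lossless tokenizer. It is not the LAS or ARPA API: the names `Token` and `tokenize_lossless` and the regex-based splitting are assumptions made purely for illustration. The point is that every character, including leading whitespace and punctuation, ends up in some token with its offsets, so a consumer can pick out the words and still reconstruct the original text.

```python
import re
from typing import List, NamedTuple


class Token(NamedTuple):
    """Hypothetical token record; not the actual LAS output format."""
    text: str
    start: int  # offset of the first character in the original text
    end: int    # offset one past the last character
    kind: str   # "word", "space" or "punct"


# The three alternatives together cover every possible character,
# so no part of the input is dropped.
_TOKEN_RE = re.compile(r"(?P<word>\w+)|(?P<space>\s+)|(?P<punct>[^\w\s]+)")


def tokenize_lossless(text: str) -> List[Token]:
    """Split text into word, whitespace and punctuation tokens with offsets."""
    return [
        Token(m.group(), m.start(), m.end(), m.lastgroup)
        for m in _TOKEN_RE.finditer(text)
    ]


if __name__ == "__main__":
    sample = "  Helsinki, Finland."
    tokens = tokenize_lossless(sample)
    # Only the word tokens would be sent for analysis ...
    print([t.text for t in tokens if t.kind == "word"])
    # ... while the full token list still reproduces the input exactly.
    print("".join(t.text for t in tokens) == sample)
```

With output shaped like this, trimming leading whitespace becomes a decision made on the token list (for example, skipping an initial `space` token) rather than something lost during preprocessing.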

Whitespace in the beginning of the processed text is not trimmed #3