kennycason / kumo

Kumo - Java Word Cloud
http://kennycason.com/posts/2014-07-03-kumo-wordcloud.html
MIT License
617 stars 156 forks

Use Guava HashMultiset #37

Open ChrisHennickAtGoogle opened 8 years ago

ChrisHennickAtGoogle commented 8 years ago

Guava's HashMultiset class would make it much faster to preprocess text. I'd suggest converting the raw tokens from languagetool to a HashMultiset before any further processing, and using the entrySet() method to process each distinct token only once during normalization, filtering, etc.
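
As a rough illustration of the idea (not Kumo's actual code), here is a minimal sketch that collapses raw tokens into per-token counts and then normalizes and filters each distinct token once. To stay dependency-free it uses a plain JDK `Map<String, Integer>` as a stand-in for the multiset; with Guava, `HashMultiset.create(rawTokens)` and its `entrySet()` (`getElement()` / `getCount()`) would play the same role. The class name `TokenCounts`, the `normalize` helper, and the `minLength` filter are hypothetical, chosen only for this example.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class TokenCounts {

    // Collapse the raw token stream into distinct token -> occurrence count.
    // (Stand-in for Guava's HashMultiset; assumption for this sketch.)
    public static Map<String, Integer> count(List<String> rawTokens) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String token : rawTokens) {
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    // Hypothetical normalizer: trim whitespace and lowercase.
    public static String normalize(String token) {
        return token.trim().toLowerCase(Locale.ROOT);
    }

    // Normalize and filter each DISTINCT token once, carrying its count along.
    // Tokens that normalize to the same form have their counts summed.
    public static Map<String, Integer> normalizeAndFilter(Map<String, Integer> counts, int minLength) {
        Map<String, Integer> result = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            String normalized = normalize(entry.getKey());
            if (normalized.length() < minLength) {
                continue; // drop tokens that are too short
            }
            result.merge(normalized, entry.getValue(), Integer::sum);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> raw = Arrays.asList("Cloud", "cloud", "a", "word", "Cloud", "word ");
        // prints {cloud=3, word=2}
        System.out.println(normalizeAndFilter(count(raw), 2));
    }
}
```

The saving comes from the second pass: normalization and filtering run once per distinct token rather than once per occurrence, which matters when a long text repeats a small vocabulary.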

kennycason commented 8 years ago

I intentionally started with the mindset of not pulling in too many dependencies. But if people are interested in performance (outside of better data structures/algorithms), I'd probably hook in GS Collections (now Eclipse Collections) :)

kennycason commented 6 years ago

Coming back to this, I now realize I misunderstood your initial intent. I agree that the normalizer could probably also just operate on the already-tokenized text. The current string copying/processing in Normalizer is a bit overkill.