isaackd / wcloud

Create word clouds
MIT License

Tokenization option `min_word_length` counts length in bytes #5

Open isaackd opened 1 year ago

isaackd commented 1 year ago

This should count by actual "characters" rather than bytes: https://github.com/isaackd/wcloud-dev/blob/e368d53dd4d6fb7fcef084ed98225dc54a054a29/src/tokenizer.rs#L46-L48

From https://doc.rust-lang.org/std/primitive.str.html#method.len:

> This length is in bytes, not chars or graphemes. In other words, it might not be what a human considers the length of the string.
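
To illustrate the difference, here is a minimal sketch using only the standard library. The helper names `byte_len`, `char_len`, and `keep_word` are hypothetical and not taken from the tokenizer, and the `>=` comparison is an assumption about how `min_word_length` is meant to be applied. Counting with `chars()` counts Unicode scalar values, which is closer to what a user expects but still not the same as graphemes; grapheme counting would need an external crate.

```rust
/// Length in UTF-8 bytes, which is what `str::len` returns.
fn byte_len(word: &str) -> usize {
    word.len()
}

/// Length in Unicode scalar values (`char`s).
/// Note: this is still not a grapheme count (e.g. combining marks count separately).
fn char_len(word: &str) -> usize {
    word.chars().count()
}

/// Hypothetical filter: keep a word only if it has at least
/// `min_word_length` chars (assumed semantics, not the crate's actual code).
fn keep_word(word: &str, min_word_length: usize) -> bool {
    char_len(word) >= min_word_length
}
```

For example, `"\u{e9}t\u{e9}"` ("été" with precomposed accents) is 5 bytes but 3 chars, so with `min_word_length = 4` a byte-based check would keep it while the char-based `keep_word` would drop it.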