Closed micimize closed 1 year ago
Hi – I was looking at the NearDup details in Appendix A and saw that "space tokenized consecutive 5-grams" were used for shingling.

I've found very little discussion on the internet comparing character-level and word-level shingling, and was wondering whether there was a particular reason or analysis that led to this choice over character k-shingling?

micimize replied:

We chose to tokenize at the word level rather than the character level for efficiency reasons. Five characters is not very long (less than the length of an average English word), so we would need to set k much higher than the 5 we used for word-level shingling in order to capture distinctive character-level k-grams.
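For concreteness, space-tokenized word-level 5-gram shingling can be sketched as follows. This is a minimal illustration of the general technique, not the paper's actual NearDup implementation; the function name and the example document are made up.

```python
def word_shingles(text: str, k: int = 5) -> set[tuple[str, ...]]:
    """Return the set of consecutive k-grams over space-separated tokens."""
    words = text.split()
    # Each shingle is a tuple of k consecutive words; a set gives the
    # document's shingle "fingerprint" for Jaccard-style comparison.
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

doc = "the quick brown fox jumps over the lazy dog"
for shingle in sorted(word_shingles(doc)):
    print(shingle)
```

A character-level shingling would instead slide a window over `text` itself (e.g. `{text[i:i + k] for i in range(len(text) - k + 1)}`), which is why k must be much larger there: a 5-character window spans barely one English word, so most 5-character shingles are shared across unrelated documents.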