google-research / deduplicate-text-datasets

Apache License 2.0

[Paper Question] Why use w-shingles over k-shingles? #28

Closed by micimize 1 year ago

micimize commented 1 year ago

Hi – I was looking at the NearDup details in Appendix A and saw that "space tokenized consecutive 5-grams" were used for shingling.

I've found very little discussion online comparing character-wise and word-wise shingling, and was wondering whether there was a particular reason or analysis that led to this choice over k-shingling?

daphnei commented 1 year ago

We chose to tokenize at the word level rather than the character level for efficiency reasons. Five characters is not very long (less than the length of an average English word), so to capture distinctive character-level k-grams we would need to set k much higher than the 5 we used for word-wise shingling.
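
To make the trade-off concrete, here is a minimal Python sketch comparing the two shingling schemes at the same window size of 5. It is not the repository's NearDup code: the helper names (`word_shingles`, `char_shingles`, `jaccard`) and the two example sentences are made up for illustration, and exact Jaccard similarity over sets stands in for whatever approximate similarity estimate a real pipeline would use.

```python
def word_shingles(text, n=5):
    """Return the set of consecutive n-grams over whitespace-split tokens."""
    tokens = text.split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def char_shingles(text, k=5):
    """Return the set of consecutive k-character substrings."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical example documents: different sentences that happen to
# share a few common short phrases.
doc_a = "the quick brown fox jumps over the lazy dog near the river bank"
doc_b = "a slow green turtle crawls under the lazy cat near the river bend"

word_sim = jaccard(word_shingles(doc_a), word_shingles(doc_b))
char_sim = jaccard(char_shingles(doc_a), char_shingles(doc_b))

print(f"word 5-gram shingles: {len(word_shingles(doc_a))}, similarity {word_sim:.2f}")
print(f"char 5-gram shingles: {len(char_shingles(doc_a))}, similarity {char_sim:.2f}")
```

With a window of 5, the word-level sets share nothing, while the character-level sets overlap on fragments such as `" lazy"` and `" near"` that say little about whether the documents are duplicates. The character-level set is also several times larger for the same text, which is the efficiency cost mentioned above.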