Closed: MiladMolazadeh closed this issue 6 months ago
As far as our results in the paper go, it doesn't look like it's a problem. Here's my intuition for why:
Maybe in some settings it could matter a lot. In those settings you may be better off removing entire examples that contain duplicates; I don't know, that's task-specific. But as far as the code goes, it appears to work for our cases.
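If dropping whole examples fits your task better, a minimal sketch of that filter could look like the following. The per-example list of duplicate spans is a hypothetical stand-in for whatever the suffix-array tooling reports; it is supplied by hand here for illustration:

```python
def drop_duplicated_examples(examples, duplicate_spans):
    """Keep only examples with no identified duplicate span.

    duplicate_spans[i] is the (possibly empty) list of duplicate
    spans found in examples[i] -- a hypothetical stand-in for the
    suffix-array tool's per-example output.
    """
    return [ex for ex, spans in zip(examples, duplicate_spans) if not spans]


examples = ["a fresh sentence", "a copied sentence", "another fresh one"]
spans = [[], [(2, 8)], []]  # only the middle example has a duplicate span
print(drop_duplicated_examples(examples, spans))
# -> ['a fresh sentence', 'another fresh one']
```

This trades some data loss (whole examples instead of short spans) for never breaking a sentence mid-word.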
Hello!
I'm currently using a suffix array on Persian-language text. However, in some examples the outcome of deduplication is not ideal when substrings are removed from the text: the removal boundaries fall in the middle of words, leaving degraded and sometimes meaningless text. How can I rectify this issue?
One example (translated to English):
In the repository you mention this, stating that it doesn't disrupt the language model; the reason given is that only a relatively small amount of text is removed. However, I'm having difficulty understanding why this isn't considered harmful. Do you mean that this disruption has no effect whatsoever on the perplexity of the language model?
In our paper we suggest taking all of these identified duplicate sequences and completely striking them from the dataset. This somewhat breaks the flow of text: for example, if we previously had the example "Alice wanted to go to the store" and we deduplicated at the level of 10 characters, we might completely strike " to go to the " and be left with "Alice wantedstore". In practice we have found this doesn't break the language model, because we remove relatively little text and so these breaks don't cause harm.
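The removal step described above amounts to cutting identified character ranges out of each example and concatenating whatever remains. Here is a minimal sketch; the range is supplied by hand for illustration, not produced by the actual suffix-array tooling:

```python
def strike_ranges(text, ranges):
    """Remove each [start, end) range from text and join the remainder.

    `ranges` stands in for the duplicate spans the suffix-array step
    would identify; here they are chosen by hand.
    """
    kept = []
    pos = 0
    for start, end in sorted(ranges):
        kept.append(text[pos:start])
        pos = max(pos, end)
    kept.append(text[pos:])
    return "".join(kept)


example = "Alice wanted to go to the store"
# Strike the duplicated span " to go to the " (characters 12..26):
print(strike_ranges(example, [(12, 26)]))  # -> "Alice wantedstore"
```

This also makes the questioner's complaint concrete: nothing constrains `start` and `end` to word boundaries, so the surviving pieces can fuse into non-words like "wantedstore".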