google-research / deduplicate-text-datasets

Apache License 2.0

Incomplete Sentences #34

Closed MiladMolazadeh closed 6 months ago

MiladMolazadeh commented 10 months ago

Hello!

I'm currently using the suffix array on Persian-language text. However, for some examples the result of deduplication is not ideal: the removed substrings cut through word boundaries, leaving degraded and sometimes meaningless text. How can I fix this?

One example (translated to English):

ORIGINAL: According to BBC and quoted by Currency, the dollar to ruble rate increased by 0.32% to 55.19 rubles and the euro decreased by 0.36% to 56.09 rubles.

RESULT AFTER DEDUP: o ruble rate increased by 0.32% to 55.19 rubles and the euro decreased by 0.36% to 56.09 rubles.

In the repository you state that this doesn't disrupt the language model, the reason given being that only a relatively small amount of text is removed. However, I'm having difficulty understanding why this isn't considered harmful. Do you mean that this disruption has no effect at all on the language model's perplexity?

In our paper we suggest just taking all of these duplicate sequences that have been identified and completely striking them from the dataset. This somewhat breaks the flow of text: for example, if we previously had an example "Alice wanted to go to the store" and we deduplicated at the level of 10 characters, we might completely strike " to go to the " and be left with "Alice wantedstore". In practice we have found this doesn't break the language model because we remove relatively little text, and so these breaks don't cause harm.
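The striking behavior described above can be sketched in a few lines. This is an illustrative helper, not the repository's actual code: it deletes a list of `[start, end)` character ranges from a string, reproducing the "Alice wantedstore" example.

```python
def strike_spans(text: str, spans: list[tuple[int, int]]) -> str:
    """Remove the given [start, end) character ranges from text.

    Spans may arrive unsorted; overlapping spans are merged implicitly
    by never moving the copy position backwards.
    """
    out = []
    pos = 0
    for start, end in sorted(spans):
        out.append(text[pos:start])
        pos = max(pos, end)
    out.append(text[pos:])
    return "".join(out)

original = "Alice wanted to go to the store"
# Suppose the duplicate finder flagged the 14-character match " to go to the ".
dup = " to go to the "
start = original.index(dup)
print(strike_spans(original, [(start, start + len(dup))]))
# -> Alice wantedstore
```

Because the spans are raw character (or byte) offsets from the suffix-array match, nothing aligns them to word boundaries, which is exactly why mid-word cuts like the one in your Persian example can appear.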

carlini commented 6 months ago

As far as our results in the paper go, it doesn't look like it's a problem. Here's my intuition for why:

  1. there's a whole lot of garbage on the internet. Training on something that's only marginally worse quality is still a lot better than the average, and probably better than training on repeated text.
  2. models already see partial sentences. When we cut apart long sentences, we're already training the model on weird stuff that's out of the ordinary. So this isn't so much worse.
  3. Most datasets are converted from rendered HTML -> .txt and this process already has a bunch of noise. A small amount of noise from the deduplication doesn't hurt.

Maybe in some settings it could matter a lot. In those settings you may be better off removing entire examples that contain duplicates; I don't know, that's task-specific. But as far as the code goes, it appears to work for our cases.
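The alternative suggested above, dropping whole examples rather than striking spans out of them, could be sketched like this. The helper and the `flagged` mapping format are illustrative assumptions, not part of this repository's API:

```python
def drop_duplicate_examples(examples, flagged):
    """Keep only examples with no flagged duplicate spans.

    examples: list of text examples.
    flagged: dict mapping example index -> list of (start, end) spans
             reported as duplicates (illustrative format, not the
             repository's actual output).
    """
    return [ex for i, ex in enumerate(examples) if not flagged.get(i)]

docs = ["a unique document", "boilerplate seen 1000 times", "another unique doc"]
flagged = {1: [(0, 27)]}  # example 1 contains a duplicated span
print(drop_duplicate_examples(docs, flagged))
# -> ['a unique document', 'another unique doc']
```

This avoids mid-word cuts entirely, at the cost of discarding more data, so whether it is the better trade-off is, as noted, task-specific.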