google-research / deduplicate-text-datasets

Apache License 2.0

remove_ex in finish_dedup_wiki40b #35

Closed wead-hsu closed 8 months ago

wead-hsu commented 11 months ago

Thanks for your excellent code.

I have successfully rerun the exact-dedup code in this repository. However, I have a question about the following code:

    remove_ex[i].append((max(int(remove[ptr][0] - byte_start - 6), 0),
                         min(int(remove[ptr][1] - byte_start), byte_end-byte_start)))

I understand what the "6" means, but why isn't "6" also subtracted on the right-hand side?

carlini commented 8 months ago

We need to start 6 bytes before the start (that's the left side), but the end is still the last byte, so it doesn't need to change or be offset by 6 bytes.
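
To make the asymmetry concrete, here is a minimal sketch of the arithmetic in the quoted expression. The function name `local_range`, the parameter `header_len`, and the sample numbers are hypothetical and only for illustration; this is not the repository's code, just the same `max`/`min` clamping applied to made-up byte positions.

```python
def local_range(remove_start, remove_end, byte_start, byte_end, header_len=6):
    """Map a global (start, end) byte range onto one example's local offsets.

    Illustrative only: mirrors the max/min expression quoted in the question.
    """
    # Left edge: shifted back by header_len bytes, but never below offset 0.
    left = max(int(remove_start - byte_start - header_len), 0)
    # Right edge: no header shift, only capped at the example's length.
    right = min(int(remove_end - byte_start), byte_end - byte_start)
    return left, right


# Hypothetical example occupying global bytes 100..150.
print(local_range(120, 140, 100, 150))  # (14, 40): left shifted by 6, right unchanged
print(local_range(103, 140, 100, 150))  # (0, 40): left would be -3, clamped to 0
print(local_range(120, 400, 100, 150))  # (14, 50): right capped at the example length
```

In this sketch only the left edge moves by 6, and the two clamps keep the resulting range inside the example's own bounds.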