google-research / deduplicate-text-datasets

Apache License 2.0

Unexpected behavior with ending symbols #15

Closed by mitya52 2 years ago

mitya52 commented 2 years ago

Hi again,

I found that count-occurrences behaves unexpectedly when the query is the last symbols of a sequence. Here are the examples:

Can you fix this? Thanks!

carlini commented 2 years ago

Thanks for catching this. It should be fixed as of 0008e616acfb1cdee33d036cf426642050b9a74d. Previously, for queries of length N, I unintentionally ignored the last N bytes of the file. That was a bug in the old code (it should have ignored only the last N-1 bytes), but as part of the fix for #14 I no longer need any length cap.
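To illustrate the off-by-one described above, here is a minimal sketch in Python. It is not the repository's actual implementation: it uses a naive linear scan, and the function name `count_matches` and its `ignored_tail` parameter are hypothetical, introduced only to show why skipping the last N bytes misses a match at the very end while skipping N-1 does not.

```python
def count_matches(data: bytes, query: bytes, ignored_tail: int) -> int:
    """Count start positions where `query` occurs in `data`.

    `ignored_tail` (hypothetical parameter) is the number of trailing
    bytes that are never considered as a start position.
    """
    n = len(query)
    # Start positions are 0 .. len(data) - ignored_tail - 1.
    last_start_exclusive = max(len(data) - ignored_tail, 0)
    return sum(
        1 for i in range(last_start_exclusive) if data[i:i + n] == query
    )


data = b"abcabc"
query = b"bc"          # occurs at offsets 1 and 4; offset 4 is the file's tail
n = len(query)

# Old behavior as described: the last N bytes are skipped, so the
# occurrence that ends exactly at the end of the file is missed.
buggy = count_matches(data, query, ignored_tail=n)

# Intended behavior: only the last N-1 bytes cannot start a full match.
fixed = count_matches(data, query, ignored_tail=n - 1)

print(buggy, fixed)
```

A query of length N can start at any offset up to `len(data) - N` inclusive, which is why only the final N-1 offsets are invalid start positions.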

mitya52 commented 2 years ago

I think you forgot to fix the same workaround further up, at line 286. After I patched it, everything started working correctly.