google-research / deduplicate-text-datasets

Apache License 2.0

"failed to fill whole buffer" errors #14

Closed (mitya52 closed 2 years ago)

mitya52 commented 2 years ago

Hi,

I tried to run the code on a simple string, and count-occurrences fails with a "failed to fill whole buffer" error.

Here are the steps to reproduce:

  1. Run ./target/debug/dedup_dataset make --data-file dup.txt, where the data file dup.txt contains the simple string "aaabbb".
  2. Then run ./target/debug/dedup_dataset count-occurrences --data-file dup.txt --query-file query.txt, where query.txt contains one of the following:
    • "bb". Expected: Number of times present: 2. Actual: thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Error { kind: UnexpectedEof, message: "failed to fill whole buffer" }', src/main.rs:275:31
    • "ab". Expected: Number of times present: 1. Actual: thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Error { kind: UnexpectedEof, message: "failed to fill whole buffer" }', src/main.rs:297:31
    • "b". Expected: Number of times present: 2. Actual: Number of times present: 1
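For cross-checking the counts above, here is a minimal brute-force sketch that counts occurrences of a query in a byte string, including overlapping ones. It is not the suffix-array lookup the tool performs, and count_occurrences is a hypothetical helper name, not a function from the repository. It assumes dup.txt holds exactly the six bytes "aaabbb" with no trailing newline.

```rust
/// Count how many times `query` appears in `data`, counting overlapping
/// matches separately. Brute force: compares every window of the same
/// length as the query. Hypothetical helper, not part of dedup_dataset.
fn count_occurrences(data: &[u8], query: &[u8]) -> usize {
    // `windows` panics on a zero length, so treat an empty query as "no matches".
    if query.is_empty() {
        return 0;
    }
    data.windows(query.len()).filter(|w| *w == query).count()
}
```

Under that assumption, this gives 2 for "bb" and 1 for "ab", which matches the expectations listed. For "b" it gives 3, which differs from both the expected 2 and the reported 1.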

Maybe I'm doing something wrong? Thanks.

carlini commented 2 years ago

Ah, sorry, this is a bug from the rewrite. I've pushed a new version of the code; if you pull, please let me know whether it works correctly.
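The thread does not say what the underlying bug was. As general background, the sketch below shows how this error arises in Rust: std::io::Read::read_exact returns an UnexpectedEof error with the message "failed to fill whole buffer" when the source ends before the requested number of bytes has been read. The helper read_u64_le and the 8-byte width are assumptions for illustration, not code from the repository.

```rust
use std::io::{Cursor, Read};

/// Try to read one 8-byte little-endian integer from the start of `bytes`.
/// Hypothetical helper for illustration only.
fn read_u64_le(bytes: &[u8]) -> std::io::Result<u64> {
    let mut buf = [0u8; 8];
    // read_exact fails with ErrorKind::UnexpectedEof if fewer than
    // 8 bytes are available in the source.
    Cursor::new(bytes).read_exact(&mut buf)?;
    Ok(u64::from_le_bytes(buf))
}
```

Calling unwrap() on the Err returned for a too-short input produces a panic of the same form as the ones quoted in the report.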

mitya52 commented 2 years ago

Thanks, all the "failed to fill whole buffer" errors are gone with the fix!