google-research / deduplicate-text-datasets

Apache License 2.0

Should newline char be removed #16

Closed by cperiz 2 years ago

cperiz commented 2 years ago

Hi, I noticed that this read here adds a \n char to the end of the query. This then causes an issue with the count if it's not actually an end-of-line. Should a .strip() be added here?

arr = open(args.query_file,"rb").read().strip()

Thanks.

carlini commented 2 years ago

This Python command doesn't itself add a newline to the query. It only reads what's already in the file. If you don't want a newline, you should remove it from the input file. For example, if you write the file with echo, that adds a newline to the end by default.
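
To illustrate the point, here is a minimal sketch rather than code from the repository. The helper names `read_query` and `write_temp` and the sample `corpus` are made up for this example, and `bytes.count` stands in for the tool's real counting logic. It shows that a binary read returns exactly the bytes on disk, so a trailing newline in the query only appears if the file already contains one.

```python
import os
import tempfile


def read_query(path):
    """Read the query file as raw bytes, exactly as stored on disk."""
    with open(path, "rb") as f:
        return f.read()


def write_temp(data):
    """Hypothetical helper: write bytes to a temporary file and return its path."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    return path


# Stand-in for the dataset being searched (an assumption for this sketch).
corpus = b"hello world and hello again"

# One query file ends with a newline (as `echo hello > file` would produce),
# the other does not.
path_with_newline = write_temp(b"hello\n")
path_without_newline = write_temp(b"hello")

query_with_newline = read_query(path_with_newline)
query_without_newline = read_query(path_without_newline)

# The read returns the file contents unchanged in both cases.
print(query_with_newline)
print(query_without_newline)

# The trailing newline becomes part of the search pattern and changes the count.
count_with_newline = corpus.count(query_with_newline)
count_without_newline = corpus.count(query_without_newline)

# Removing only a trailing newline is narrower than .strip(), which would also
# drop leading and trailing spaces or tabs that may be part of the query.
count_after_rstrip = corpus.count(query_with_newline.rstrip(b"\n"))

print(count_with_newline, count_without_newline, count_after_rstrip)

os.remove(path_with_newline)
os.remove(path_without_newline)
```

If the query file is produced from a shell, writing it with `printf '%s'` instead of `echo` is one way to avoid the trailing newline in the first place.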