huggingface / datatrove

Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.
Apache License 2.0

blank lines treated as duplicates in the GopherFilter #273

Closed lfoppiano closed 2 weeks ago

lfoppiano commented 2 weeks ago

Hi all, I noticed that the Gopher filter recognises paragraphs and lines and filters documents based on their ratio. However, in processed documents the data is often surrounded by blank lines.

Because of that, the blank lines bias the filter toward removing documents once the ratio passes the threshold. I was wondering whether blank lines should have a separate metric/threshold.
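To illustrate the bias, here is a minimal sketch of a duplicate-line ratio. It is not datatrove's actual implementation: the function name `duplicate_line_fraction` and the `skip_blank` flag are hypothetical and only show how repeated empty lines can inflate such a ratio.

```python
from collections import Counter


def duplicate_line_fraction(text: str, skip_blank: bool = False) -> float:
    """Fraction of lines that repeat an earlier line (illustrative metric only)."""
    lines = text.split("\n")
    if skip_blank:
        # Hypothetical option: ignore empty or whitespace-only lines.
        lines = [line for line in lines if line.strip()]
    if not lines:
        return 0.0
    counts = Counter(lines)
    # Every occurrence of a line beyond its first counts as a duplicate.
    duplicates = sum(n - 1 for n in counts.values())
    return duplicates / len(lines)


# Four distinct paragraphs separated by blank lines.
doc = "First paragraph.\n\nSecond paragraph.\n\nThird paragraph.\n\nFourth paragraph."

with_blanks = duplicate_line_fraction(doc)
without_blanks = duplicate_line_fraction(doc, skip_blank=True)
print(f"blank lines counted: {with_blanks:.2f}")
print(f"blank lines skipped: {without_blanks:.2f}")
```

In this sketch the only repeated line is the empty one, so the ratio is non-zero when blank lines are counted and zero when they are skipped.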

What do you think?

justHungryMan commented 2 weeks ago

Wasn't this issue resolved in https://github.com/huggingface/datatrove/commit/a8d21e2ba3bb84f721311cd7a22365fd400f0681?

I’ve also encountered a similar issue. It occurs in version 0.2.0, but it seems that the FineWeb team has recognized and resolved it in the latest version of the repository. It’s likely that FineWeb removed data containing many blank spaces.

lfoppiano commented 2 weeks ago

@justHungryMan thanks! Indeed it seems this could solve the issue. I didn't mention I'm using version 0.2.0.

I wonder whether there is a plan for a datatrove patch release?

guipenedo commented 2 weeks ago

Hi, we've just pushed version 0.3.0 to PyPI, which should fix this issue :)
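For anyone unsure which version they are running, a small sketch using only the standard library can check whether the installed release is at least 0.3.0. The helper names `parse_release` and `is_at_least` are hypothetical, and the comparison only looks at the leading numeric parts of the version string, so treat it as an approximation.

```python
from importlib.metadata import PackageNotFoundError, version


def parse_release(v: str) -> tuple:
    """Leading numeric components of a version string, e.g. '0.3.0' -> (0, 3, 0)."""
    parts = []
    for part in v.split("."):
        if not part.isdigit():
            break  # stop at suffixes such as 'dev0' or 'rc1'
        parts.append(int(part))
    return tuple(parts)


def is_at_least(installed: str, minimum: str = "0.3.0") -> bool:
    """True if the installed release is at or above the minimum release."""
    return parse_release(installed) >= parse_release(minimum)


try:
    installed = version("datatrove")
    status = "is at or above 0.3.0" if is_at_least(installed) else "predates 0.3.0"
    print(f"datatrove {installed} {status}")
except PackageNotFoundError:
    print("datatrove is not installed in this environment")
```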

lfoppiano commented 2 weeks ago

Great! That was quick. Thanks!