Closed lfoppiano closed 2 weeks ago
Wasn't this issue resolved in https://github.com/huggingface/datatrove/commit/a8d21e2ba3bb84f721311cd7a22365fd400f0681?
I've also encountered a similar issue. It occurs in version 0.2.0, but it seems the FineWeb team has recognized and resolved it in the latest version of the repository. It's likely that FineWeb removed data containing many blank spaces.
@justHungryMan thanks! Indeed it seems this could solve the issue. I didn't mention I'm using version 0.2.0.
Is there a plan for a datatrove patch release?
Hi, we've just pushed version 0.3.0 to PyPI, which should fix this issue :)
Great! That was quick. Thanks!
Hi all, I noticed that the Gopher filter recognises paragraphs and lines and filters documents based on their ratios. However, in processed documents the text is often surrounded by blank lines.
Because of that, the blank lines bias the filter toward removing documents once the ratio passes the threshold. I was wondering whether blank lines should have a separate metric/threshold.
What do you think?
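
To illustrate the effect, here is a minimal sketch of how blank lines can inflate a line-level ratio. It is not datatrove's implementation: the function `duplicate_line_fraction` and its `skip_blank` flag are hypothetical names made up for this example, and the duplicate-line fraction is only one kind of line-based ratio such a filter might compute.

```python
from collections import Counter


def duplicate_line_fraction(text: str, skip_blank: bool = False) -> float:
    """Fraction of lines that repeat an earlier line (hypothetical helper).

    With skip_blank=True, blank or whitespace-only lines are dropped
    before counting, so they cannot count as duplicates of each other.
    """
    lines = text.split("\n")
    if skip_blank:
        lines = [line for line in lines if line.strip()]
    if not lines:
        return 0.0
    counts = Counter(lines)
    # Every occurrence beyond the first counts as a duplicate.
    duplicates = sum(count - 1 for count in counts.values())
    return duplicates / len(lines)


# Four distinct paragraphs separated by blank lines.
doc = "First paragraph.\n\nSecond paragraph.\n\nThird paragraph.\n\nFourth paragraph."

with_blanks = duplicate_line_fraction(doc)
without_blanks = duplicate_line_fraction(doc, skip_blank=True)
print(f"blank lines counted: {with_blanks:.3f}")
print(f"blank lines skipped: {without_blanks:.3f}")
```

In this toy document no real line is repeated, yet the three blank separators count as copies of one another, so the ratio is nonzero unless blank lines are skipped. Whether to drop blank lines entirely or to track them under a separate threshold is the design question raised above.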