huggingface / datatrove

Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.
Apache License 2.0

Fix `test_basic_article_trafilatura` test failure #264

Closed tylerjthomas9 closed 3 months ago

tylerjthomas9 commented 3 months ago

"Hello World!" is duplicated in the trafilatura extraction test when running against trafilatura==1.12.0 (released last week).

=========================== short test summary info ============================
FAILED tests/pipeline/test_extractors.py::TestExtractors::test_basic_article_trafilatura - AssertionError: 'Hello World!\nHello World!' != 'Hello World!'
- Hello World!
  Hello World!
======= 1 failed, 63 passed, 1 skipped, 19 warnings in 90.94s (0:01:30) ========

This issue appears relevant: trafilatura is duplicating the text because the extracted text is too short.

To fix the test failure, I modified the trafilatura config to set `MIN_EXTRACTED_SIZE=0`.

I also verified that the failure only appears in v1.12.0 (see https://github.com/tylerjthomas9/datatrove/pull/1).
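For readers unfamiliar with trafilatura's settings, here is a minimal sketch of what such an override might look like. It assumes trafilatura exposes its settings as a `configparser.ConfigParser` via `trafilatura.settings.use_config()` and that `extract()` accepts a `config` argument, as in the 1.x releases. The helper names `override_min_extracted_size` and `extract_short_html` are hypothetical and are not taken from the datatrove change itself.

```python
from configparser import ConfigParser


def override_min_extracted_size(config: ConfigParser, size: int = 0) -> ConfigParser:
    """Set MIN_EXTRACTED_SIZE in the DEFAULT section of a trafilatura-style config.

    Hypothetical helper: ConfigParser stores values as strings, so the
    integer is converted before being written.
    """
    config.set("DEFAULT", "MIN_EXTRACTED_SIZE", str(size))
    return config


def extract_short_html(html: str):
    """Run trafilatura extraction with the minimum extracted size set to 0.

    trafilatura is imported lazily so the helper above can be used
    (and tested) without trafilatura installed.
    """
    from trafilatura import extract
    from trafilatura.settings import use_config

    # Assumption: use_config() returns a ConfigParser loaded with the library defaults.
    config = override_min_extracted_size(use_config())
    return extract(html, config=config)
```

Calling `extract_short_html("<html><body><article><p>Hello World!</p></article></body></html>")` would exercise the same kind of short input as the failing test. Whether the duplication disappears depends on the installed trafilatura version.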

hynky1999 commented 3 months ago

Great catch!

I think the best way to handle this is to pin trafilatura to < 1.12.0. They must have changed the way they process short HTML documents, which is why the duplication happens. The config change would hide the problem rather than solve it.
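As a rough illustration of the suggested pin, the sketch below checks whether a version string falls under the `< 1.12.0` bound. The names `PIN_UPPER_BOUND`, `parse_release`, `satisfies_pin`, and `installed_trafilatura_satisfies_pin` are hypothetical. The parser is deliberately simplified: it compares only the leading numeric release segment and ignores pre-release or dev suffixes. In practice the pin would more likely live in the project's dependency specification.

```python
import re
from importlib.metadata import PackageNotFoundError, version

# Upper bound proposed in the discussion: trafilatura < 1.12.0
PIN_UPPER_BOUND = (1, 12, 0)


def parse_release(version_string: str) -> tuple:
    """Parse the leading numeric release segment, e.g. '1.11.0' -> (1, 11, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version_string)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version_string!r}")
    parts = [int(p) for p in match.group(0).split(".")]
    # Pad short versions such as '1.9' to three components for comparison.
    parts += [0] * (3 - len(parts))
    return tuple(parts)


def satisfies_pin(version_string: str) -> bool:
    """Return True if the version is strictly below the pinned upper bound."""
    return parse_release(version_string) < PIN_UPPER_BOUND


def installed_trafilatura_satisfies_pin():
    """Check the installed trafilatura, or return None if it is not installed."""
    try:
        return satisfies_pin(version("trafilatura"))
    except PackageNotFoundError:
        return None
```

A check like this could guard a test or print a warning, but it does not replace declaring the constraint where the project lists its dependencies.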