Closed tylerjthomas9 closed 3 months ago
Great catch!
I think the best way to handle this is to pin trafilatura to < 1.12.0. They must have changed the way they process short HTML documents, which is why the duplication happens. This fix would hide the problem rather than solve it.
"Hello World!" is being duplicated in the `trafilatura` extraction test on `trafilatura==1.12.0` (released last week). This issue appears relevant. `trafilatura` is duplicating the text because it is too short.

To fix this issue, I modified the config for `trafilatura` to set `MIN_EXTRACTED_SIZE=0`. I also verified that the issue only appears in `v1.12.0`: https://github.com/tylerjthomas9/datatrove/pull/1.
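
For reference, here is a minimal sketch of the kind of config override described above. It is not the exact diff from this PR. It assumes trafilatura's settings object is a standard `ConfigParser` with `MIN_EXTRACTED_SIZE` in the `DEFAULT` section, and the helper names `allow_short_extractions` and `extract_short_html` are made up for illustration.

```python
from configparser import ConfigParser
from copy import deepcopy


def allow_short_extractions(config: ConfigParser) -> ConfigParser:
    """Return a copy of `config` with MIN_EXTRACTED_SIZE set to 0.

    Hypothetical helper: the input is copied so a shared default
    config is never mutated in place.
    """
    relaxed = deepcopy(config)
    # ConfigParser stores option values as strings.
    relaxed.set("DEFAULT", "MIN_EXTRACTED_SIZE", "0")
    return relaxed


def extract_short_html(html: str):
    """Run trafilatura on `html` with the relaxed config.

    Hypothetical helper: trafilatura is imported lazily, so it only
    needs to be installed when this function is actually called.
    """
    from trafilatura import extract
    from trafilatura.settings import use_config

    config = allow_short_extractions(use_config())
    return extract(html, config=config)
```

Calling `extract_short_html("<html><body><p>Hello World!</p></body></html>")` on both `v1.12.0` and an earlier release would be one way to compare the outputs and check whether the duplication is gone.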