miso-belica / jusText

Heuristic-based boilerplate removal tool
https://pypi.python.org/pypi/jusText
BSD 2-Clause "Simplified" License

Duplicate text output #42

Open adbar opened 3 years ago

adbar commented 3 years ago

jusText outputs the title of this webpage twice:

https://wiadomosci.gazeta.pl/wiadomosci/7,114883,27025667,ziemniaki-na-szostej-surowka-na-dziesiatej-jak-pomoc-zeby.html (archived as https://web.archive.org/web/20211020174043/https://wiadomosci.gazeta.pl/wiadomosci/7,114883,27025667,ziemniaki-na-szostej-surowka-na-dziesiatej-jak-pomoc-zeby.html)

The rest of the extraction is not completely clean either (e.g. "REKLAMA" elements).

miso-belica commented 3 years ago

I fixed some issues in the main branch, and now if I run python -m justext -s Polish "https://wiadomosci.gazeta.pl/wiadomosci/7,114883,27025667,ziemniaki-na-szostej-surowka-na-dziesiatej-jak-pomoc-zeby.html" I think it gets you what you expect. The title "Ziemniaki na szóstej, surówka na dziesiątej". Jak pomagać, żeby nie zaszkodzić? [PORADNIK W PIGUŁCE] appears twice in the original HTML too, and jusText has no deduplication logic. jusText is intended for creating corpora, IMHO, and some duplication there is not so bad. It would be nice to add some deduplication, though. I don't have the motivation to do it because I am no longer using jusText for my projects.
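
Since jusText itself does no deduplication, one option is to deduplicate after extraction. The following is only a minimal sketch of that idea, not code from the jusText project: the helper names `dedupe_paragraphs` and `extract_unique` and the normalization rule (collapse whitespace, ignore case) are assumptions made for illustration. The only jusText calls used are `justext.justext`, `justext.get_stoplist`, and the paragraph attributes `is_boilerplate` and `text`.

```python
import re
from typing import Iterable, List


def _normalize(text: str) -> str:
    """Collapse whitespace and ignore case so trivially different copies compare equal."""
    return re.sub(r"\s+", " ", text).strip().casefold()


def dedupe_paragraphs(texts: Iterable[str]) -> List[str]:
    """Return texts in their original order, keeping the first copy of each paragraph.

    Paragraphs that are empty after normalization are dropped.
    (Hypothetical helper, not part of jusText.)
    """
    seen = set()
    unique = []
    for text in texts:
        key = _normalize(text)
        if not key or key in seen:
            continue  # skip empty paragraphs and repeats
        seen.add(key)
        unique.append(text)
    return unique


def extract_unique(html: bytes, language: str = "Polish") -> List[str]:
    """Run jusText, drop boilerplate, then deduplicate what remains.

    (Hypothetical wrapper; the stoplist language is an assumption for this page.)
    """
    import justext  # imported here so dedupe_paragraphs works without jusText installed

    paragraphs = justext.justext(html, justext.get_stoplist(language))
    return dedupe_paragraphs(p.text for p in paragraphs if not p.is_boilerplate)
```

This only catches whole paragraphs that match after normalization, such as a title that occurs twice in the HTML. Near-duplicates and leftover elements like "REKLAMA" would need separate handling.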

adbar commented 3 years ago

OK, I understand, I'll see what I can do.