Open adbar opened 3 years ago
I fixed some issues in the main branch, and now if I run python -m justext -s Polish "https://wiadomosci.gazeta.pl/wiadomosci/7,114883,27025667,ziemniaki-na-szostej-surowka-na-dziesiatej-jak-pomoc-zeby.html" I think it gets you what you expect.
The title "Ziemniaki na szóstej, surówka na dziesiątej". Jak pomagać, żeby nie zaszkodzić? [PORADNIK W PIGUŁCE] appears twice in the original HTML too, and there is no deduplication logic. jusText is intended for creating corpora, IMHO, and some duplication there is not so bad. Some deduplication would be nice, but I don't have the motivation to implement it because I no longer use jusText in my projects.
OK, I understand, I'll see what I can do.
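For anyone who wants to work around this in their own code, here is a minimal sketch of post-processing deduplication. It is not part of jusText: the helper names (normalize, dedupe_texts, extract_unique_paragraphs) are hypothetical, and it assumes that exact repeats, after whitespace and case normalization, are the only duplicates worth dropping.

```python
import re


def normalize(text):
    """Collapse whitespace and lowercase so trivially different copies compare equal."""
    return re.sub(r"\s+", " ", text).strip().lower()


def dedupe_texts(texts):
    """Return texts in original order, keeping only the first copy of each.

    Empty or whitespace-only strings are dropped as well.
    """
    seen = set()
    unique = []
    for text in texts:
        key = normalize(text)
        if key and key not in seen:
            seen.add(key)
            unique.append(text)
    return unique


def extract_unique_paragraphs(html, language="Polish"):
    """Hypothetical wrapper: run jusText, skip boilerplate, drop repeated paragraphs."""
    import justext  # third-party; imported here so dedupe_texts works without it

    paragraphs = justext.justext(html, justext.get_stoplist(language))
    return dedupe_texts(p.text for p in paragraphs if not p.is_boilerplate)
```

Note that this also removes paragraphs that are legitimately repeated in the page, so it may not suit every corpus.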
jusText outputs the title of this webpage twice:
https://wiadomosci.gazeta.pl/wiadomosci/7,114883,27025667,ziemniaki-na-szostej-surowka-na-dziesiatej-jak-pomoc-zeby.html (archived as https://web.archive.org/web/20211020174043/https://wiadomosci.gazeta.pl/wiadomosci/7,114883,27025667,ziemniaki-na-szostej-surowka-na-dziesiatej-jak-pomoc-zeby.html)
The rest of the extraction is not completely clean either (e.g. "REKLAMA" elements).