postlight / parser

📜 Extract meaningful content from the chaos of a web page
https://reader.postlight.com
Apache License 2.0
5.42k stars, 445 forks

incomplete content on multiple pages #739

Open. Grienauer opened this issue 1 year ago

Grienauer commented 1 year ago

Currently the parser seems to get lost on the following pages. I don't see any markup problems. Maybe the newspapers detect and block the scraper?

https://www.derstandard.at/story/2000145508819/franzoesischer-verfassungsrat-stimmt-umstrittener-pensionsreform-zu : here a notice is inserted into the extracted text saying that some "software" is blocking things and that it should be removed.

https://kurier.at/wirtschaft/atomausstieg-wie-die-abschaltung-eines-kernkraftwerks-funktioniert/402412829 : only one line of text is extracted.

Thanks for any info. Happy to help.

Overwatching commented 1 year ago

There are multiple reports in the issues section of header content being removed erroneously. I think this is the same problem.

I came here to report the same thing happening on Hackaday.com/blog.

ctipper commented 1 year ago

The same happens on https://www.thetimes.co.uk/ across multiple articles: it clips the first one or two paragraphs on every page I've tried. Kind of useless in this state.