niklaswais / gesp

https://nwais.de/gesp
MIT License

Occasional broken PDF files from the SN OVG scraper #11

Open zeuner opened 1 month ago

zeuner commented 1 month ago

The intermediate pages on www.justiz.sachsen.de sometimes contain invalid //a/@href paths that are not visible to human users but confuse the scraper, which then produces invalid PDF files. For example, as of today, page [1] makes the scraper fail on the 2 B 9/20.NC ruling of April 16, 2020.

[1] https://web.archive.org/web/20240922101217/https://www.justiz.sachsen.de/ovgentschweb/document.phtml?id=5775

zeuner commented 1 month ago

It turns out there is a different site quirk that also leads to defective PDF downloads:

For some decisions, a string is appended to the PDF URL, so the download misses the actual file. For example, as of today, intermediate page [1] points to [2] (visible only in the page source). With everything after .pdf removed, the link points to the actual full text of the decision.

[1] https://web.archive.org/web/20240922171316/https://www.justiz.sachsen.de/ovgentschweb/document.phtml?id=5869
[2] https://www.justiz.sachsen.de/ovgentschweb/documents/19A665.U01.pdf%3BVolltext+%28hier+klicken%29
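
A minimal sketch of the workaround described above: cut the link off right after the first ".pdf". The helper name strip_pdf_suffix is hypothetical and not part of gesp, and the sketch assumes ".pdf" never occurs earlier in the URL than the real file extension.

```python
import re

# Hypothetical helper, not part of gesp: keep everything up to and
# including the first ".pdf" and drop whatever the site appended after it.
_PDF_PREFIX = re.compile(r"^(.*?\.pdf)", re.IGNORECASE)


def strip_pdf_suffix(url: str) -> str:
    """Return the URL truncated after the first '.pdf', or unchanged if there is none."""
    match = _PDF_PREFIX.match(url)
    return match.group(1) if match else url


cleaned = strip_pdf_suffix(
    "https://www.justiz.sachsen.de/ovgentschweb/documents/"
    "19A665.U01.pdf%3BVolltext+%28hier+klicken%29"
)
print(cleaned)
```

The match runs on the raw URL, without percent-decoding, so any encoding in the path before ".pdf" stays untouched.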

zeuner commented 1 month ago

I encountered yet another variant of the issue: [1]. In this case, the filename needed for the download appears only in the link text. It seems fairly arbitrary which parts of the link end up in the href attribute and which in the link text.

[1] https://web.archive.org/web/20240923061228/https://www.justiz.sachsen.de//ovgentschweb/document.phtml?id=6578
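
One way to cope with this, sketched under assumptions: look for a filename-like token in the href first and fall back to the link text. The function find_filename and the filename pattern are hypothetical; the pattern assumes names shaped like 19A665.U01.pdf, which may not hold for every decision.

```python
import re
from typing import Optional

# Assumed filename shape (letters, digits, dots, hyphens, then .pdf or .docx).
# This is a guess based on the one filename seen in this issue.
_FILENAME = re.compile(r"[\w.\-]+\.(?:pdf|docx)", re.IGNORECASE)


def find_filename(href: str, link_text: str) -> Optional[str]:
    """Return the first filename-like token from the href, else from the link text."""
    for candidate in (href, link_text):
        match = _FILENAME.search(candidate)
        if match:
            return match.group(0)
    return None
```

If neither part contains a match, the function returns None, so the caller can skip or log the decision instead of saving a broken file.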

zeuner commented 1 month ago

Sometimes (e.g. see [1]), the decisions are provided as DOCX instead of PDF. I haven't fixed this yet; it will also require adapting gesp/src/create_file.py.

[1] https://web.archive.org/web/20240923150003/https://www.justiz.sachsen.de//ovgentschweb/document.phtml?id=5998
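
A sketch of how the two formats could be told apart from the downloaded bytes rather than from the URL. This is an assumption about a possible approach, not how gesp/src/create_file.py works; sniff_extension is a hypothetical name. PDF files begin with "%PDF-", and DOCX files are ZIP containers that begin with "PK\x03\x04".

```python
def sniff_extension(data: bytes) -> str:
    """Guess a file extension from the first bytes of a download."""
    if data.startswith(b"%PDF-"):
        return "pdf"
    if data.startswith(b"PK\x03\x04"):
        # DOCX is a ZIP container; other ZIP-based formats share this signature,
        # so this is only a heuristic.
        return "docx"
    # Anything else, e.g. an HTML error page saved in place of the document.
    return "unknown"
```

The same check would also flag the defective downloads from the earlier comments, since those would not start with a PDF header.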

zeuner commented 1 month ago

The DOCX case is now also fixed in https://github.com/niklaswais/gesp/pull/12.

zeuner commented 1 month ago

Yet another case of failing downloads: [1]. Here the filename contains a double dot (..), which seems unusual. I haven't figured out whether a working URL can be derived.

[1] https://web.archive.org/web/20240926140837/https://www.justiz.sachsen.de//ovgentschweb/document.phtml?id=1623