Open zeuner opened 1 month ago
It turns out there is a different site quirk that also leads to defective PDF downloads:
For some decisions, a string is appended to the PDF URL, causing the download to miss the actual file. E.g., as of today, the intermediate page [1] points to [2] (only visible in the page source). After removing the part after pdf, the link points to the real full text of the decision.
[1] https://web.archive.org/web/20240922171316/https://www.justiz.sachsen.de/ovgentschweb/document.phtml?id=5869
[2] https://www.justiz.sachsen.de/ovgentschweb/documents/19A665.U01.pdf%3BVolltext+%28hier+klicken%29
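The workaround described above (dropping everything after the pdf extension) could look roughly like the following. This is only a minimal sketch: the helper name `strip_pdf_suffix` is hypothetical and not part of gesp, and it assumes that the first `.pdf` in the URL marks the end of the real file name.

```python
import re

# Matches the first ".pdf" in the URL, regardless of case.
PDF_EXTENSION = re.compile(r"\.pdf", re.IGNORECASE)


def strip_pdf_suffix(url: str) -> str:
    """Return the URL truncated right after the first '.pdf'.

    Hypothetical helper, not part of gesp. URLs that do not contain
    '.pdf' at all are returned unchanged.
    """
    match = PDF_EXTENSION.search(url)
    if match is None:
        return url
    return url[: match.end()]


broken = (
    "https://www.justiz.sachsen.de/ovgentschweb/documents/"
    "19A665.U01.pdf%3BVolltext+%28hier+klicken%29"
)
print(strip_pdf_suffix(broken))
```

Truncating on the raw (still percent-encoded) URL avoids having to decode and re-encode the rest of the path.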
I encountered yet another variant of the issue: [1]. In this case, the filename required for the file download is only found in the link text. It looks pretty arbitrary whether the different parts of the link end up in the href attribute or in the link text.
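Since the filename may show up in either place, one option is to search both the href attribute and the link text for something that looks like a PDF name. The sketch below is an assumption about how that could be done, not the fix that was actually merged. `find_pdf_name` and the sample inputs are hypothetical.

```python
import re
from typing import Optional

# A run of characters without whitespace, "/" or ";" that ends in ".pdf".
PDF_NAME = re.compile(r"[^\s/;]+\.pdf", re.IGNORECASE)


def find_pdf_name(href: str, link_text: str) -> Optional[str]:
    """Return the first PDF-like filename found in href, then link text.

    Hypothetical helper: returns None when neither string contains
    a candidate filename.
    """
    for candidate in (href, link_text):
        match = PDF_NAME.search(candidate)
        if match:
            return match.group(0)
    return None


# Hypothetical inputs illustrating both variants.
print(find_pdf_name("documents/19A665.U01.pdf%3BVolltext", "Volltext"))
print(find_pdf_name("document.phtml?id=1", "19A665.U01.pdf Volltext"))
```

Checking the href first keeps the behavior unchanged for pages where the link is already well-formed.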
Sometimes (e.g. see [1]), the decisions are provided as DOCX instead of PDF. I haven't fixed this yet; it will also require adapting gesp/src/create_file.py.
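One possible way to tell the two formats apart is to look at the extension of the URL path before deciding how to process the file. This is only a sketch under that assumption. `document_extension` is a hypothetical name, and nothing here reflects how gesp/src/create_file.py actually works.

```python
from pathlib import PurePosixPath
from typing import Optional
from urllib.parse import urlparse

# Formats the scraper would need to handle, per the observation above.
SUPPORTED_EXTENSIONS = {".pdf", ".docx"}


def document_extension(url: str) -> Optional[str]:
    """Return '.pdf' or '.docx' based on the URL path, else None.

    Hypothetical helper: the query string is ignored and the
    comparison is case-insensitive.
    """
    suffix = PurePosixPath(urlparse(url).path).suffix.lower()
    return suffix if suffix in SUPPORTED_EXTENSIONS else None


print(document_extension(
    "https://www.justiz.sachsen.de/ovgentschweb/documents/19A665.U01.pdf"
))
```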
Also fixed in https://github.com/niklaswais/gesp/pull/12 now.
Yet another case of failing downloads: [1]. It turns out that there is a ".." in the filename, which seems unusual. I haven't figured out whether it's possible to derive a working URL.
Sometimes, the intermediate pages on www.justiz.sachsen.de contain invalid //a/@href paths which are not visible to human users but confuse the scraper, leading to invalid PDF files. E.g., as of today, page [1] makes the scraper fail on the 2 B 9/20.NC ruling from April 16th, 2020.
[1] https://web.archive.org/web/20240922101217/https://www.justiz.sachsen.de/ovgentschweb/document.phtml?id=5775