But this builds an empty `httpx.URL`, which then raises because the script tries to fetch `/robots.txt` for that empty URL. Temporary fix:
@@ -378,6 +394,9 @@ def _fulltext_urls_from_meta(data: bytes) -> tuple[httpx.URL, str] | None:
         if field not in meta_dict:
             continue
         for fulltext_url in meta_dict[field]:
+            # Skip blank URLs
+            if fulltext_url.strip() == "":
+                continue
             return httpx.URL(fulltext_url), filetype
For PDFs, skipping the blank URL seems like the right answer, but for HTML, for example, the 'right' thing is probably to use the URL of the page itself. On the other hand, that seems like a very rare case, and HTML is not our priority anyway.
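If we ever want the HTML fallback, it could look roughly like the sketch below. This is only an illustration under assumptions: `pick_fulltext_url`, its `page_url` parameter, and the `"html"` filetype label are hypothetical names, not what the script currently has, and it works on plain strings so the caller would still wrap the result in `httpx.URL`.

```python
from __future__ import annotations


def pick_fulltext_url(
    candidates: list[str], filetype: str, page_url: str
) -> tuple[str, str] | None:
    """Return the first usable (url, filetype) pair, or None.

    Hypothetical helper: the name, the signature, and the "html" label
    are assumptions made for this sketch, not the script's real API.
    """
    for candidate in candidates:
        if candidate.strip():
            # Non-blank value: use it (the caller would wrap it in httpx.URL).
            return candidate, filetype
    if candidates and filetype == "html":
        # Only blank values were declared: for HTML, assume the page
        # itself is the full text and fall back to its own URL.
        return page_url, filetype
    # Blank PDF URLs (or no values at all) are skipped, as in the temp fix.
    return None
```

With this shape, a blank PDF URL still yields `None`, while a blank HTML URL resolves to the page that declared it.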
Page https://www.wjgnet.com/2220-3230/full/v11/i4/88.htm contains this snippet: