It seems that relative (and possibly absolute) hrefs in the HTML document are not correctly prefixed with the base URL, so aiohttp rejects them with InvalidURL and the linked files cannot be downloaded.
AC:
Double-check that relative and absolute hrefs are correctly prefixed with the base URL before downloading.
HTML sample for the log below:
Futter für Europas Vieh (<a href="/medien/248846/Futter_fuer_Europas_Vieh.pdf" class="styled-link styled-link--internal"><span class="a-visually-hidden">Interner Link: </span>Grafik zum Download</a>)
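One possible way to satisfy the AC is to resolve every href against the URL of the page it was found on before handing it to the downloader. The following is only a minimal sketch using the Python standard library: `HrefCollector` and `resolve_hrefs` are hypothetical names, not part of metalookup, and the sketch assumes the page URL is the correct base (it ignores a possible `<base href>` tag in the document).

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class HrefCollector(HTMLParser):
    """Hypothetical helper: collect <a href> values resolved against the page URL."""

    def __init__(self, page_url: str) -> None:
        super().__init__()
        self.page_url = page_url
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # urljoin leaves full URLs untouched and resolves relative
                # or root-relative hrefs against page_url.
                self.links.append(urljoin(self.page_url, value))


def resolve_hrefs(html: str, page_url: str) -> list[str]:
    """Return all anchor hrefs in `html` as full URLs."""
    parser = HrefCollector(page_url)
    parser.feed(html)
    return parser.links


page_url = (
    "https://www.bpb.de/gesellschaft/umwelt/anthropozaen/"
    "248847/futter-fuer-europas-vieh"
)
sample = (
    '<a href="/medien/248846/Futter_fuer_Europas_Vieh.pdf" '
    'class="styled-link styled-link--internal">Grafik zum Download</a>'
)
print(resolve_hrefs(sample, page_url))
# -> ['https://www.bpb.de/medien/248846/Futter_fuer_Europas_Vieh.pdf']
```

With the href from the sample, this yields a full URL that `session.get` should accept instead of the bare path `/medien/248846/Futter_fuer_Europas_Vieh.pdf` reported in the traceback.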
Log example:
extractor_1 | 2022-06-21 14:56:00,728 INFO metalookup.app.api Received request for https://www.bpb.de/gesellschaft/umwelt/anthropozaen/248847/futter-fuer-europas-vieh
extractor_1 | 2022-06-21 14:56:12,266 INFO metalookup.core.metadata_manager Built WebsiteData object in 11.54s.
... [other extractors] ...
extractor_1 | 2022-06-21 14:56:12,590 ERROR metalookup.core.metadata_manager Failed to extract extract_from_files
extractor_1 | Traceback (most recent call last):
extractor_1 | File "/usr/local/lib/python3.10/site-packages/metalookup/core/metadata_manager.py", line 113, in run_extractor
extractor_1 | stars, explanation, extra_data = await extractor.extract(site=site, executor=self.process_pool)
extractor_1 | File "/usr/local/lib/python3.10/site-packages/metalookup/features/extract_from_files.py", line 73, in extract
extractor_1 | values = await self._work_files(files=extractable_files, executor=executor)
extractor_1 | File "/usr/local/lib/python3.10/site-packages/metalookup/features/extract_from_files.py", line 200, in _work_files
extractor_1 | extractable_files: tuple[Optional[str], ...] = await asyncio.gather(*tasks)
extractor_1 | File "/usr/local/lib/python3.10/site-packages/metalookup/features/extract_from_files.py", line 193, in task
extractor_1 | file = await self._download_file(url, session)
extractor_1 | File "/usr/local/lib/python3.10/site-packages/metalookup/features/extract_from_files.py", line 163, in _download_file
extractor_1 | result = await session.get(url=file)
extractor_1 | File "/usr/local/lib/python3.10/site-packages/aiohttp/client.py", line 507, in _request
extractor_1 | req = self._request_class(
extractor_1 | File "/usr/local/lib/python3.10/site-packages/aiohttp/client_reqrep.py", line 305, in __init__
extractor_1 | self.update_host(url)
extractor_1 | File "/usr/local/lib/python3.10/site-packages/aiohttp/client_reqrep.py", line 366, in update_host
extractor_1 | raise InvalidURL(url)
extractor_1 | aiohttp.client_exceptions.InvalidURL: /medien/248846/Futter_fuer_Europas_Vieh.pdf
extractor_1 | 2022-06-21 14:56:12,976 INFO metalookup.features.adblock_based Found 0 links that should be blocked according to ad-block rules in 0.078s
URL:
https://www.bpb.de/gesellschaft/umwelt/anthropozaen/248847/futter-fuer-europas-vieh