openeduhub / metalookup

Provide metadata about domains w.r.t. accessibility, licencing, ads, etc.
GNU General Public License v3.0

ExtractFromFiles tries to download invalid URLs #118

Closed MRuecklCC closed 2 years ago

MRuecklCC commented 2 years ago

It seems that relative hrefs in the HTML document (and possibly host-relative ones starting with `/`) are not resolved against the base URL, so they cannot be downloaded.

AC:

URL: https://www.bpb.de/gesellschaft/umwelt/anthropozaen/248847/futter-fuer-europas-vieh

HTML Sample for below log:

 Futter für Europas Vieh (<a href="/medien/248846/Futter_fuer_Europas_Vieh.pdf" class="styled-link styled-link--internal"><span class="a-visually-hidden">Interner Link: </span>Grafik zum Download</a>)
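
For illustration, a minimal sketch of how such an href could be resolved against the page URL before downloading, using only the standard library. The helper name `resolve_href` is hypothetical and not part of metalookup; where exactly this would belong in `extract_from_files.py` is left open.

```python
from urllib.parse import urljoin


def resolve_href(base_url: str, href: str) -> str:
    """Resolve an href found in a page against the URL of that page.

    Hypothetical helper for illustration only. Absolute URLs pass through
    unchanged; relative and host-relative paths are joined with the base.
    """
    return urljoin(base_url, href)


page_url = "https://www.bpb.de/gesellschaft/umwelt/anthropozaen/248847/futter-fuer-europas-vieh"
href = "/medien/248846/Futter_fuer_Europas_Vieh.pdf"

print(resolve_href(page_url, href))
# https://www.bpb.de/medien/248846/Futter_fuer_Europas_Vieh.pdf
```

With this, the href from the sample above would become a full URL that the HTTP client can accept.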

Log example:

extractor_1   | 2022-06-21 14:56:00,728 INFO metalookup.app.api Received request for https://www.bpb.de/gesellschaft/umwelt/anthropozaen/248847/futter-fuer-europas-vieh
extractor_1   | 2022-06-21 14:56:12,266 INFO metalookup.core.metadata_manager Built WebsiteData object in 11.54s.
... [other extractors] ...
extractor_1   | 2022-06-21 14:56:12,590 ERROR metalookup.core.metadata_manager Failed to extract extract_from_files
extractor_1   | Traceback (most recent call last):
extractor_1   |   File "/usr/local/lib/python3.10/site-packages/metalookup/core/metadata_manager.py", line 113, in run_extractor
extractor_1   |     stars, explanation, extra_data = await extractor.extract(site=site, executor=self.process_pool)
extractor_1   |   File "/usr/local/lib/python3.10/site-packages/metalookup/features/extract_from_files.py", line 73, in extract
extractor_1   |     values = await self._work_files(files=extractable_files, executor=executor)
extractor_1   |   File "/usr/local/lib/python3.10/site-packages/metalookup/features/extract_from_files.py", line 200, in _work_files
extractor_1   |     extractable_files: tuple[Optional[str], ...] = await asyncio.gather(*tasks)
extractor_1   |   File "/usr/local/lib/python3.10/site-packages/metalookup/features/extract_from_files.py", line 193, in task
extractor_1   |     file = await self._download_file(url, session)
extractor_1   |   File "/usr/local/lib/python3.10/site-packages/metalookup/features/extract_from_files.py", line 163, in _download_file
extractor_1   |     result = await session.get(url=file)
extractor_1   |   File "/usr/local/lib/python3.10/site-packages/aiohttp/client.py", line 507, in _request
extractor_1   |     req = self._request_class(
extractor_1   |   File "/usr/local/lib/python3.10/site-packages/aiohttp/client_reqrep.py", line 305, in __init__
extractor_1   |     self.update_host(url)
extractor_1   |   File "/usr/local/lib/python3.10/site-packages/aiohttp/client_reqrep.py", line 366, in update_host
extractor_1   |     raise InvalidURL(url)
extractor_1   | aiohttp.client_exceptions.InvalidURL: /medien/248846/Futter_fuer_Europas_Vieh.pdf
extractor_1   | 2022-06-21 14:56:12,976 INFO metalookup.features.adblock_based Found 0 links that should be blocked according to ad-block rules in 0.078s

MRuecklCC commented 2 years ago

In the current version this actually results in internal server errors, which is bad: [image]
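
As a possible safeguard, links that are still not full HTTP(S) URLs could be skipped before any download is attempted, so that one bad link does not fail the whole request. This is only a sketch under that assumption: `is_downloadable` is a hypothetical name, not an existing metalookup function.

```python
from urllib.parse import urlsplit


def is_downloadable(url: str) -> bool:
    """Return True only for URLs with an http(s) scheme and a host.

    Hypothetical pre-check for illustration; it does not verify that the
    resource actually exists.
    """
    parts = urlsplit(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


candidates = [
    "/medien/248846/Futter_fuer_Europas_Vieh.pdf",
    "https://www.bpb.de/medien/248846/Futter_fuer_Europas_Vieh.pdf",
]
# Keep only the URLs that an HTTP client can request directly.
valid = [url for url in candidates if is_downloadable(url)]
print(valid)
```

Skipped links could be logged as warnings instead of raising, which would keep the other extractors' results intact.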