openeduhub / metalookup

Provide metadata about domains w.r.t. accessibility, licencing, ads, etc.
GNU General Public License v3.0

Gracefully handle disappeared and non-HTML content #155

Closed: MRuecklCC closed this issue 2 years ago

MRuecklCC commented 2 years ago

See also: https://github.com/openeduhub/metalookup/pull/154#discussion_r931041382

MRuecklCC commented 2 years ago

While the legacy endpoint (#148) generally replies with a 502 in case of errors, the current MetaDataManager implementation still responds with 500 errors when it cannot fetch from splash or lighthouse. Here, we need to distinguish between:

MRuecklCC commented 2 years ago

Since the rework of the Content class in #149, the communication with splash is initiated from within the extractors, so the extractors also need to handle potential errors when communicating with splash/lighthouse.

The first quick fix will be to simply:

These exceptions will then propagate outwards and should result in the correct HTTP error codes (400, 502).

The downside of this is that the API-layer HTTPExceptions (fastapi) leak into the extractor/Content classes. It also means that the MetadataManager class has a hard time consolidating those errors, as it can only intercept the HTTPExceptions.

A much cleaner long-term solution would be to introduce custom exception classes such as:

This way, we can keep the API-layer exceptions out of the core classes and give the metadata manager a chance to consolidate exceptions from different extractors.
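
A minimal sketch of how such a separation might look. The class names (`ExtractionError`, `UpstreamServiceError`, `UnsupportedContentError`), the helper functions, and the assignment of 400 versus 502 to each case are all hypothetical and chosen only for illustration; they are not taken from the metalookup code base.

```python
from typing import Sequence


class ExtractionError(Exception):
    """Hypothetical base class for errors raised inside extractors/Content."""


class UpstreamServiceError(ExtractionError):
    """Assumed case: splash or lighthouse could not be reached or failed."""


class UnsupportedContentError(ExtractionError):
    """Assumed case: the target URL has disappeared or is not HTML."""


# The status mapping lives in the API layer only; core classes never see it.
# Which error maps to which code is an assumption made for this sketch.
STATUS_BY_ERROR = {
    UnsupportedContentError: 400,
    UpstreamServiceError: 502,
}


def to_status_code(error: Exception) -> int:
    """Translate a core exception into an HTTP status code (500 if unknown)."""
    for error_type, status in STATUS_BY_ERROR.items():
        if isinstance(error, error_type):
            return status
    return 500


def consolidate(errors: Sequence[ExtractionError]) -> ExtractionError:
    """Pick one error to report when several extractors failed.

    Assumed policy: an upstream failure takes precedence over a content
    problem; otherwise the first error is reported.
    """
    if not errors:
        raise ValueError("no errors to consolidate")
    for error in errors:
        if isinstance(error, UpstreamServiceError):
            return error
    return errors[0]
```

With something along these lines, the extractors would raise only `ExtractionError` subclasses, the metadata manager could call a function like `consolidate` on whatever the extractors raised, and a single FastAPI exception handler (registered via `app.exception_handler`) could turn the result into an HTTP response using `to_status_code`.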