thoth-pub / thoth

Metadata management and dissemination system for Open Access books
https://thoth.pub
Apache License 2.0

Consider adding check to Location fullTextUrl field to ensure it points directly to file #439

Open rhigman opened 2 years ago

rhigman commented 2 years ago

Work on https://github.com/thoth-pub/thoth/issues/405 required downloading PDF publication files directly from Location fullTextUrls, including checking that the URL returned Content-Type: application/pdf. This uncovered many user-entered fullTextUrls which instead returned Content-Type: text/html. These were fortunately simple to fix via a bulk database update, but would have been onerous for the user to change individually.

Perhaps we could/should do a Content-Type check when the user tries to save a fullTextUrl, to prevent similar issues in future.
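A rough sketch of what such a save-time check might look like, in Python for brevity rather than as a proposal for where it should live in Thoth. The function names `media_type` and `points_to_pdf` are hypothetical, and using a HEAD request is an assumption: some servers reject HEAD or report a different Content-Type than they would for GET.

```python
from urllib.request import Request, urlopen


def media_type(content_type_header: str) -> str:
    """Return the bare media type, dropping parameters such as charset."""
    return content_type_header.split(";", 1)[0].strip().lower()


def points_to_pdf(url: str, timeout: float = 10.0) -> bool:
    """Hypothetical check: does a HEAD request to `url` report a PDF?

    urlopen follows redirects and raises URLError/HTTPError on failure,
    which a caller would need to turn into a validation message.
    """
    request = Request(url, method="HEAD")
    with urlopen(request, timeout=timeout) as response:
        header = response.headers.get("Content-Type", "")
    return media_type(header) == "application/pdf"
```

A landing page would typically come back as `text/html; charset=utf-8`, which `media_type` reduces to `text/html`, so `points_to_pdf` would return `False` for it.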

ja573 commented 1 year ago

This should be prioritised soon, to help reduce the number of auto-dissemination errors.

rhigman commented 8 months ago

Prioritising this might be the simplest mitigation for the issue described here.

rhigman commented 6 months ago

Note that more than one Content-Type may be valid for any given Publication Type, e.g. application/octet-stream is also permissible for PDFs. This may require a change to the thoth-dissemination check if we start hitting issues with it (we haven't so far).
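One way to allow for this would be a per-Publication-Type allow-list rather than a single expected value. The sketch below is illustrative only: `PERMITTED_MEDIA_TYPES` and `is_permitted` are hypothetical names, the only entries taken from this thread are the two PDF media types, and rejecting unlisted Publication Types is an assumption rather than agreed behaviour.

```python
# Media types accepted for each Publication Type.
# Only the PDF entries come from this issue; others would need adding.
PERMITTED_MEDIA_TYPES = {
    "PDF": {"application/pdf", "application/octet-stream"},
}


def is_permitted(publication_type: str, content_type_header: str) -> bool:
    """Check a Content-Type header value against the allow-list.

    Parameters such as "; charset=utf-8" are ignored, and Publication
    Types with no entry are rejected (an assumption of this sketch).
    """
    media_type = content_type_header.split(";", 1)[0].strip().lower()
    return media_type in PERMITTED_MEDIA_TYPES.get(publication_type, set())
```

With this shape, `is_permitted("PDF", "application/octet-stream")` is accepted while `is_permitted("PDF", "text/html; charset=utf-8")` is not. Note that `application/octet-stream` says nothing about the file actually being a PDF, so accepting it trades some strictness for fewer false rejections.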