Base moodle's file_exist method on the file content instead of headers

marcospri commented 2 weeks ago

In moodle we can't make a API call to check if a file exists and we have to rely on the URL where file will be downloaded.

When a problem occurs we get a 200 response with a JSON document with the reason of the error.

We don't want to always download the document as it will actually download the full PDF file for success cases, the most common case.

Until now we been relying on the headers of a HEAD HTTP request, interpreting JSON responses as the file being missing.

We have found at least one school that doesn't include the content-type header on the response so he are switch the approach to inspect the first bytes of the response and check if we are getting a PDF back from Moodle.

Here's the headers from the file in the Moodle instance that doesn't include content-type:

 {'Date': 'Tue, 05 Nov 2024 14:42:55 GMT', 'Content-Length': '694351', 'Connection': 'keep-alive', 'Expires': 'Tue, 05 Nov 2024 16:43:00 GMT', 'Cache-Control': 'private, max-age=7205', 'Last-Modified': 'Fri, 01 Nov 2024 10:38:06 GMT', 'X-Frame-Options': 'SAMEORIGIN', 'X-Content-Type-Options': 'nosniff', 'X-XSS-Protection': '1; mode=block', 'x-trace': 'ip-172-31-65-239->amazon_172_31_65_222', 'Age': '0', 'Strict-Transport-Security': 'max-age=31536000; includeSubdomains', 'X-Cache': 'MISS'}

Testing

Open: localhost - Moodle File

localhost - Deleted Moodle file

Getting a file on the first and an error in the second.

marcospri commented 2 weeks ago

@robertknight @acelaya

Made some changes based on the suggestions:

We keep the old behaviour when we do get headers, that's being working for over a year so let's keep it that way. Use the PDF detection as a fallback when there's no headers.
Now we read the first 1024 bytes of the document instead of just 5
We check response.content instead of response.text. I don't think it would make a big difference but we are looking for bytes so that more explicit.

robertknight commented 2 weeks ago

We check response.content instead of response.text. I don't think it would make a big difference but we are looking for bytes so that more explicit.

In the present, I agree. My comment was mostly about doing the "proper" thing to avoid nasty surprises in future because our code is improperly treating binary data as text and the handling of this changes (see eg. https://github.com/hypothesis/h/pull/8953)

hypothesis / lms

Base moodle's file_exist method on the file content instead of headers #6841

Testing