skyjake / lagrange

A Beautiful Gemini Client
https://gmi.skyjake.fi/lagrange/
BSD 2-Clause "Simplified" License

"View as Text" option for incorrect MIME types #600

Open acidus99 opened 1 year ago

acidus99 commented 1 year ago

The Problem:

There are 20K+ files across Geminispace that are text files served with an incorrect MIME type. Lagrange will not display these, forcing you to download them and view them outside of Lagrange. This is cumbersome for the user at best, and sometimes impossible (e.g. on iOS).

A Solution:

Please consider using the WHATWG's "mislabeled binary resource" rules to detect these files, and give the user an option to view the file as plain text inside Lagrange. Something like this:

[image: screenshot of Lagrange's download prompt]

Background

There are a number of files in Geminispace that are text files, but the capsules are misconfigured and send a non-text/* MIME type. This happens primarily with servers that use the default file-extension-to-MIME-type mappings; for example, *.crd files get sent with a MIME type of application/x-mscardfile. As a result, Lagrange does not display the file and instead offers to download it.

Lots of files in the various TextFiles.com mirrors, and the entire music archive on gemini://blitter.com, suffer from this. Using the crawler for Kennedy, I found 20K+ files that are affected.

This problem is super common on the web as well, so much so that the WHATWG maintains an entire living standard, the MIME Sniffing Standard, on how to determine the content type of a file regardless of its declared MIME type.

While the full scope of that is probably overkill, section 7.2 shows how to detect a text document that has been mislabeled with a binary MIME type.

Basically, you scan the first 1445 bytes to see if any binary/control characters appear (the specific ones are defined in the standard). If they don't appear (or if you detect a BOM), it's a mislabeled text file.
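
As a rough illustration of the check described above, here is a minimal Python sketch. The byte ranges and the three BOMs follow my reading of section 7.2 of the standard; the function and constant names are made up for this example and are not Lagrange or Kennedy code.

```python
# "Binary data bytes" as listed in the WHATWG MIME Sniffing standard:
# 0x00-0x08, 0x0B, 0x0E-0x1A and 0x1C-0x1F. Tab, LF, FF, CR and ESC are allowed.
BINARY_DATA_BYTES = frozenset(
    list(range(0x00, 0x09)) + [0x0B] + list(range(0x0E, 0x1B)) + list(range(0x1C, 0x20))
)

# Only the start of the response body (the "resource header") is inspected.
SNIFF_LENGTH = 1445

# UTF-16BE, UTF-16LE and UTF-8 byte order marks.
BOMS = (b"\xfe\xff", b"\xff\xfe", b"\xef\xbb\xbf")


def looks_like_mislabeled_text(body: bytes) -> bool:
    """Return True if the first 1445 bytes look like text (hypothetical helper)."""
    header = body[:SNIFF_LENGTH]
    # A BOM at the very start is treated as text without scanning further.
    if header.startswith(BOMS):
        return True
    # Otherwise the header is text only if it contains no binary data bytes.
    return not any(byte in BINARY_DATA_BYTES for byte in header)
```

Note that only the first 1445 bytes are examined, so a file with binary data further in would still be reported as text by this sketch.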

I recently updated the Kennedy search engine's indexer (gemini://kennedy.gemi.dev) to detect mislabeled text files and still index them. However, the usefulness of this is reduced, since if you get one of these files as a search result, clicking on the result in Lagrange prompts you to download it (as seen in the screenshot above).

acidus99 commented 1 year ago

Fun fact: I also now use the mislabeled text file detection rules to find binary files that are mistakenly served with a text/gemini MIME type. Let me tell you, trying to dump 5 MB of binary data into a Full Text Search index causes a ton of problems. Adopting this approach removed junk from my index and made things faster. You might consider using this to detect incorrectly labeled gemtext files that are really binary and that crash or lock up Lagrange. I'll file that separately.
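
The reverse check might look something like the sketch below. It assumes the same byte ranges from the WHATWG standard, and the function name is hypothetical rather than taken from the Kennedy indexer.

```python
# Same "binary data bytes" as in the WHATWG MIME Sniffing standard.
BINARY_DATA_BYTES = frozenset(
    list(range(0x00, 0x09)) + [0x0B] + list(range(0x0E, 0x1B)) + list(range(0x1C, 0x20))
)
BOMS = (b"\xfe\xff", b"\xff\xfe", b"\xef\xbb\xbf")


def claims_text_but_looks_binary(media_type: str, body: bytes) -> bool:
    """Return True when a text/* response appears to carry binary data (hypothetical helper)."""
    # Drop parameters such as "; charset=utf-8" before comparing.
    declared = media_type.split(";", 1)[0].strip().lower()
    if not declared.startswith("text/"):
        return False
    header = body[:1445]
    # A leading BOM counts as text, matching the standard's rules.
    if header.startswith(BOMS):
        return False
    return any(byte in BINARY_DATA_BYTES for byte in header)
```

One caveat under these assumptions: UTF-16 text without a BOM contains 0x00 bytes and would be flagged as binary.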

skyjake commented 1 year ago

An earlier related issue: #359

skyjake commented 1 year ago

On a general level, I feel that clients should not be too smart about guessing what the real media type is supposed to be. For example, it would be inappropriate for clients to automatically start ignoring the media types and using ones that they have autodetected. This would remove all incentives on server side to use correct typing.

Providing a way to manually try some autodetection/fix is fine, as suggested, if the server's claimed media type turned out to be unexpected or unsupported.
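
To make that distinction concrete, here is a small Python sketch of one possible policy: the server's claimed type is always used unless the user explicitly overrides it, and autodetection only decides whether a manual option is shown. The RENDERABLE set and both function names are assumptions for illustration, not Lagrange's actual behavior.

```python
from typing import Optional

# Hypothetical set of media types the client can display inline.
RENDERABLE = {"text/gemini", "text/plain", "text/markdown"}


def effective_media_type(declared: str, user_override: Optional[str] = None) -> str:
    """Type used for rendering: the server's claim unless the user overrides it."""
    if user_override is not None:
        return user_override
    # Drop parameters such as "; lang=en" and normalize case.
    return declared.split(";", 1)[0].strip().lower()


def should_offer_view_as_text(declared: str, sniffed_as_text: bool) -> bool:
    """Offer the manual option only for unsupported types whose content looks like text."""
    return effective_media_type(declared) not in RENDERABLE and sniffed_as_text
```

Here the sniffing result never changes the type on its own; it only controls whether the user sees a "View as Text" choice.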

zzo38 commented 9 months ago

I am opposed to using WHATWG's rules, at least automatically. It should use the MIME type specified in the response from the server unless the user manually overrides it. (So, I agree with skyjake's comment above.)

I do think that a "view as plain text" option (or, to generally allow the user to manually override what MIME type to use) is useful, whether or not Lagrange is already capable of displaying that file.

Another possible option would be a hex dump option, although if you are given the option to override the MIME type and the user can add an external hex dump program using MIME hooks, then that would already be made possible for free anyways, so it is probably unnecessary to add a built-in hex dump option.