ipfs / kubo

An IPFS implementation in Go
https://docs.ipfs.tech/how-to/command-line-quick-start/

OS specific content type handling #7418

Open Stebalien opened 4 years ago

Stebalien commented 4 years ago

Version information:

v0.6.0-rc1

Description:

go-ipfs now reads /etc/mime.types when determining the content type of a file from its extension. Unfortunately, this leads to hard-to-diagnose, platform-specific behavior, where ideally all go-ipfs nodes should behave the same way.
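To make the platform dependence concrete, here is a minimal sketch, assuming the lookup goes through Go's standard `mime` package (an assumption; the exact call site in go-ipfs is not shown in this thread). On Unix-like systems that package extends its built-in table from files such as /etc/mime.types, so the answer for less common extensions can differ between hosts. The helper name `contentTypeFor` is made up for illustration.

```go
package main

import (
	"fmt"
	"mime"
)

// contentTypeFor returns whatever Go's standard library maps the
// extension to on the current machine. The built-in table is small;
// on Unix-like systems it is extended from files such as
// /etc/mime.types, so results for rarer extensions vary per host.
func contentTypeFor(ext string) string {
	return mime.TypeByExtension(ext)
}

func main() {
	// Run this on two differently configured machines and compare.
	// An empty string means "no mapping known on this host".
	for _, ext := range []string{".html", ".md", ".1"} {
		fmt.Printf("%-6s -> %q\n", ext, contentTypeFor(ext))
	}
}
```

No expected output is listed on purpose: apart from the handful of built-in entries, the output is exactly the thing that changes from one OS to another.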

See https://github.com/ipfs-shipyard/ipfs-companion/issues/886#issuecomment-634301352.

markg85 commented 4 years ago

Interesting issue! Yes, I did read https://github.com/ipfs-shipyard/ipfs-companion/issues/886#issuecomment-634301352

The original issue talks about this blog post: https://ipfs.io/ipns/blog.ipfs.io/2020-05-20-gossipsub-v1.1

That's - at the very least - a bug that should be fixed in the software that's used to make the blog posts. A dot should not be part of the URL unless it introduces an extension. URLs like that will break checks that rely solely on the extension, which is apparently what is happening here.

Next, there is the difference between web servers and files. If a web server serves a page like that, the content type the web server provides should be used. This is where things get wonky: when going through IPFS, everything is probably handled as files. And if different nodes have different MIME databases, you might indeed get different results.
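A small sketch of why that URL trips an extension-only check, using Go's standard `path.Ext`. The helper name `extOf` and the second path are made up for illustration.

```go
package main

import (
	"fmt"
	"path"
)

// extOf returns the suffix after the last dot of the final path element,
// which is all an extension-only check gets to look at.
func extOf(urlPath string) string {
	return path.Ext(urlPath)
}

func main() {
	// The version number "v1.1" ends the path, so ".1" looks like an extension.
	fmt.Println(extOf("/ipns/blog.ipfs.io/2020-05-20-gossipsub-v1.1")) // prints ".1"
	fmt.Println(extOf("/ipfs/example/index.html"))                     // prints ".html"
}
```

Whether ".1" then maps to a content type depends on the MIME database of the machine doing the lookup.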

But there's a fix for that :) This kind of issue was seen in the open source desktop world years ago (talking about KDE specifically). In that world (C++ with Qt), solutions have been built to fix it. In particular, this function is used these days: https://doc.qt.io/qt-5/qmimedatabase.html#mimeTypeForFile Look closely at the second argument. You can specify whether you want to determine the MIME type by extension, by content, or by both.

In the IPFS world it makes a lot of sense to determine the MIME type by content, not by extension, since the node that is going to respond with the data already has the data. Determining the MIME type at that point is practically free.

So I'd advise you to look at how this is done in the Qt world and use that logic instead. Your starting point would be https://code.qt.io/cgit/qt/qtbase.git/tree/src/corelib/mimetypes/qmimedatabase.cpp I don't quite know how the actual database is built, but I do know that it has been working quite reliably for years (since its introduction in Qt 5.0, I think).

Just as a little reminder of what is possible if you detect the MIME type solely by extension: do realize that on Linux (Windows too I think, not entirely sure) a dot is allowed in any entry name. So you could actually have a folder called "bigfolder.jpg", which would not be a JPEG file but a folder! It's stupid, but possible.
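As a rough Go counterpart to the content-based mode (not a port of the Qt logic), the standard library's `http.DetectContentType` sniffs a type from the leading bytes and consults no OS-level database. The helper name `sniff` is made up for illustration.

```go
package main

import (
	"fmt"
	"net/http"
)

// sniff guesses a content type from the leading bytes only.
// http.DetectContentType considers at most the first 512 bytes
// and gives the same answer on every platform.
func sniff(data []byte) string {
	return http.DetectContentType(data)
}

func main() {
	png := []byte("\x89PNG\r\n\x1a\n") // the 8-byte PNG signature
	fmt.Println(sniff(png))            // prints "image/png"

	page := []byte("<html><body></body></html>")
	fmt.Println(sniff(page)) // prints "text/html; charset=utf-8"
}
```

The sniffer only knows a fixed set of signatures, so it is far less complete than a full MIME database; anything it does not recognise comes back as a generic text or binary type.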

Hope this helps :)

markg85 commented 4 years ago

While thinking about this a bit more: why isn't the MIME content type encoded in the hash? It doesn't have to be a super strong hash part. You never know how many MIME types there are, so you'd never know how much space you need to reserve for them. But you can define a list of known MIME types and encode that in the hash too. With just base32 (the current bafy one) you can already encode 1024 content types in a mere 2 characters. If the MIME type is unknown, encode it as a reserved character pair, say 00. You'd get something like:

00 = Unknown MIME type (aka, try to determine it on the receiving node)
01 = application/json
...
50 = image/jpeg
...

This does make the hash 2 characters longer but it also gives you a way of knowing the intended mime type for a file. Also, it only has to be determined once at the point of adding the file to IPFS.
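A minimal sketch of the two-character idea, under stated assumptions: the registry `knownTypes` and the functions `encodeType` and `decodeType` are hypothetical, nothing like this exists in IPFS, and the codes use the lowercase RFC 4648 base32 alphabet, so the reserved "unknown" pair comes out as "aa" rather than the "00" written above. Two base32 characters carry 2 × 5 = 10 bits, hence 1024 codes.

```go
package main

import (
	"fmt"
	"strings"
)

// alphabet is the lowercase RFC 4648 base32 alphabet (an assumption).
const alphabet = "abcdefghijklmnopqrstuvwxyz234567"

// knownTypes is a hypothetical registry; index 0 is reserved for "unknown".
var knownTypes = []string{
	"",                 // 0: unknown, detect on the receiving node
	"application/json", // 1
	"image/jpeg",       // 2
	"text/html",        // 3
}

// encodeType maps a MIME type to two base32 characters (10 bits).
// Types missing from the registry encode as the reserved pair "aa".
func encodeType(mimeType string) string {
	idx := 0
	for i, t := range knownTypes {
		if i > 0 && t == mimeType {
			idx = i
			break
		}
	}
	return string([]byte{alphabet[idx>>5], alphabet[idx&31]})
}

// decodeType reverses encodeType; ok is false for the reserved pair
// and for codes that are malformed or not in the registry.
func decodeType(code string) (mimeType string, ok bool) {
	if len(code) != 2 {
		return "", false
	}
	hi := strings.IndexByte(alphabet, code[0])
	lo := strings.IndexByte(alphabet, code[1])
	if hi < 0 || lo < 0 {
		return "", false
	}
	idx := hi<<5 | lo
	if idx == 0 || idx >= len(knownTypes) {
		return "", false
	}
	return knownTypes[idx], true
}

func main() {
	fmt.Println(encodeType("image/jpeg")) // prints "ac"
	fmt.Println(encodeType("video/mp4"))  // prints "aa" (not in the registry)
	fmt.Println(decodeType("ac"))         // prints "image/jpeg true"
}
```

The registry has to be append-only for this to work: reordering or removing an entry would silently change the meaning of codes that were already published.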

You'd only have to do MIME type checking if the type is unknown, which gives you a nice backwards compatibility path too. Another thing to consider is that IPFS would be exposing file details with this (IPFS already did with the filesize, which is part of the encoding too): you would also know the type of file. For some purposes that might not be ideal. For other purposes (like quickly sorting on MIME type or even "searching for image files") this offers a really simple and fast way to do just that.

Stebalien commented 4 years ago

The issue here is simply: content detection shouldn't be OS dependent. That's it.

Beyond that, we'd ideally have more accurate content detection. However, it's not simple. We don't want to treat "index.html" with the content "Hi, my name is <b>Steven</b>!" as plain text (even if the file command says it is).
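One way to get both properties, sketched under assumptions: ship a fixed extension table inside the binary so every platform gives the same answer, and fall back to content sniffing only when the extension is not listed. The table `pinnedTypes` and the function `detect` are hypothetical and not go-ipfs code. The example also shows the "index.html" case above: Go's own sniffer calls that content plain text, because it does not start with a recognised HTML tag.

```go
package main

import (
	"fmt"
	"net/http"
	"path"
	"strings"
)

// pinnedTypes is a hypothetical, deliberately tiny table compiled into
// the binary, so lookups never depend on /etc/mime.types or the registry.
var pinnedTypes = map[string]string{
	".html": "text/html; charset=utf-8",
	".json": "application/json",
	".jpg":  "image/jpeg",
}

// detect prefers the pinned extension table and sniffs the content
// only when the extension is not listed.
func detect(name string, data []byte) string {
	if t, ok := pinnedTypes[strings.ToLower(path.Ext(name))]; ok {
		return t
	}
	return http.DetectContentType(data)
}

func main() {
	body := []byte("Hi, my name is <b>Steven</b>!")

	// Sniffing alone: the bytes do not start with a known HTML tag.
	fmt.Println(http.DetectContentType(body)) // prints "text/plain; charset=utf-8"

	// Pinned table first: the .html extension wins.
	fmt.Println(detect("index.html", body)) // prints "text/html; charset=utf-8"

	// Unlisted extension (".1"): fall back to sniffing the content.
	fmt.Println(detect("2020-05-20-gossipsub-v1.1", []byte("<!DOCTYPE html><html>"))) // prints "text/html; charset=utf-8"
}
```

The ordering is a design choice: trusting a pinned extension first keeps "index.html" rendering as HTML, at the cost of mislabelling a file whose name lies about its content.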