ipfs / notes

IPFS Collaborative Notebook for Research

metadata and extended attributes (EA, xattr) #335

Open ThomasWaldmann opened 9 years ago

ThomasWaldmann commented 9 years ago

It would be really nice if IPFS had great support for metadata / xattrs.

If you think about it, a lot of stuff boils down to having data AND (lots of) metadata. Most current filesystems are rather restricted in their offerings for metadata (often only name, timestamps, size, mode) and even the ones that support xattrs sometimes do it in a very limited way (like only offering 4KiB for them, so that you cannot really rely on having enough space in there for your metadata).

A lot of problems encountered today are due to the lack of metadata (and due to historical workarounds / quick hacks made to deal with that). E.g. look at the mimetypes library - it has guessing functions that try to guess the mimetype from the filename extension (.txt -> text/plain). Well, that's better than nothing, but you still don't know the encoding (which is really problematic for the single-byte encodings other than ASCII). Also, requiring the name to include a .txt extension for this to work is a bad hack anyway (and a problematic one: you will get asked how to open files like "README", as it does not have this extension). For some extensions, there is not one possible mimetype, but multiple completely different ones. Because of that, we nowadays have to do all sorts of hacks and magic (like using the "file" tool (libmagic) to find out the file type, which also sometimes fails, or other heuristics to find out the correct text file encoding). Web servers and browsers also do weird tricks to find out the correct content-type (more or less successfully).
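The extension-based guessing described above can be reproduced with Python's standard `mimetypes` module. This is only a small illustration; the file names are made up. Note that the second element of the returned tuple is a content-encoding such as gzip, not the character encoding of a text file, so the charset remains unknown either way.

```python
import mimetypes

# The guess is based on the filename extension only;
# the file does not need to exist and its content is never inspected.
txt_guess = mimetypes.guess_type("notes.txt")

# A name without an extension gives the module nothing to work with.
readme_guess = mimetypes.guess_type("README")

print(txt_guess)
print(readme_guess)
```

For "README" the module returns no type at all, which is the situation in which a user gets asked how to open the file.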

I think one should never have to do that kind of guessing; such data should just be definitely KNOWN (and stored in metadata).

There are also other usages for metadata, like putting author, description, keywords, tags, language (for texts), thumbnails (for pictures), hashes, digital signatures, encryption algorithm (if corresponding data is encrypted), or for filenames that are just byte sequences: the encoding of the filename (that is also a widespread issue: you never know...).

jbenet commented 9 years ago

Agreed. See https://github.com/ipfs/go-ipfs/issues/1642

> There are also other usages for metadata, like putting author, description, keywords, tags, language (for texts), thumbnails (for pictures), hashes, digital signatures, encryption algorithm (if corresponding data is encrypted), or for filenames that are just byte sequences: the encoding of the filename (that is also a widespread issue: you never know...).

that's separate, please do this as first-class ipfs objects themselves. rough example here: https://github.com/ipfs/specs/tree/master/keychain

risen commented 9 years ago

While I agree this is necessary and important, I'd like to remind you why git chooses not to store permissions, EA etc: on trees/projects that people like to collaborate on, having all the participants store different uid/gid/perms/EAs/etc would become a source of conflicts and a hassle very quickly. So it'd be nice if this was completely optional.

fiatjaf commented 9 years ago

If this is allowed, I would like to suggest, for people that may arrive here in search of "storing file metadata in IPFS" for personal organization, that they try git-annex with the IPFS special remote. More information on the links.

tv42 commented 8 years ago

On the conflict note from @risen, in the context of actual filesystems, that is FUSE: If you don't support xattr, OS X will just write ._* files and put the xattrs in those. If you add those to IPFS, the conflicts just move there, and don't go away. ipfs add and friends naturally behave differently.

KrzysiekJ commented 8 years ago

To emphasize the importance of mimetype: the object QmeYbSUu7JyXbRT8ppaEvLHpXWUUXM5eBLSjCa9my7WtKv displays text “foo” when viewed as HTML and text “bar” when viewed as PDF. We therefore see that raw file content is generally not sufficient to display it in a browser and for this reason raw IPFS files cannot (theoretically) be used to build web sites. In practice of course heuristics can be used and it will work in many cases, but this is not a sane, simple and generic solution. Maybe a container format which combines mimetype and file content / file hash should be specified? It may need a separate URI scheme (like ipws:) to distinguish from raw objects.
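One way to picture the container idea raised above is a small wrapper record that stores the mimetype next to a hash of the raw content. The sketch below is purely hypothetical: `make_container`, its field names, and the use of a plain SHA-256 hex digest in place of a real IPFS hash are all assumptions for illustration, not an existing IPFS format or API.

```python
import hashlib
import json
from typing import Optional


def make_container(content: bytes, mimetype: str, charset: Optional[str] = None) -> dict:
    """Hypothetical wrapper: explicit metadata plus a hash pointing at the raw content."""
    return {
        "mimetype": mimetype,
        "charset": charset,
        # Assumption: a SHA-256 hex digest stands in for a real IPFS content hash.
        "content_sha256": hashlib.sha256(content).hexdigest(),
        "size": len(content),
    }


raw = b"example bytes"

# The same raw bytes wrapped twice with different declared types.
as_html = make_container(raw, "text/html", "utf-8")
as_pdf = make_container(raw, "application/pdf")

print(json.dumps(as_html, indent=2))
print(json.dumps(as_pdf, indent=2))
```

Both records point at the same content hash but declare different types, so a viewer would no longer need to guess how to interpret the bytes.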

jbenet commented 8 years ago

@KrzysiekJ yeah-- i hear you. We should add MIME types to files, either as an intermediate object, or in the link to them.

0zAND1z commented 5 years ago

Is there a way to extract metadata from a file object stored in IPFS?

Stebalien commented 5 years ago

@0zAND1z we don't currently store any.

Noc2 commented 5 years ago

@0zAND1z For pure metadata extraction you can take a look at https://github.com/ipfs-search/ipfs-tika. But I think the metadata problem needs to be solved in a more efficient, distributed way, with a blockchain-like structure as a supporting layer, and you can solve a lot of other problems around IPFS by doing this (initial loading speed, distributed search including similarity digests as described here https://github.com/ipfs/notes/issues/347, ownership/verifiability, etc.). I'm actively working on it (see e.g. https://github.com/PACTCare/Starlog). I started with IOTA, but switched to Substrate by Parity and the Ethereum non-fungible token standard. I hope to combine my thoughts/research in a paper in the coming weeks. If you are interested in this, we can also have a chat.