ipfs / kubo

An IPFS implementation in Go
https://docs.ipfs.tech/how-to/command-line-quick-start/

Blocks don't have a nice size #2053

Closed robcat closed 8 years ago

robcat commented 8 years ago

Most filesystems have blocks of a standard size of 4k or 16k.

The blocks generated by the ipfs default chunker are stored in files of 2¹⁸ + 14 bytes (262158 bytes). These extra bytes don't really play nice with most filesystems and disks (an extra block gets allocated just for those 14 bytes).

Are these bytes an essential part of the block that can't be moved elsewhere (e.g. in its filename or in the database)?
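
For illustration, a quick back-of-the-envelope sketch (not part of the original report; the 4096-byte filesystem block size is an assumption) of how the 14 extra bytes push each stored chunk into one additional, nearly empty filesystem block:

package main

import "fmt"

func main() {
    const fsBlock = 4096       // assumed filesystem block size
    const chunk = 1 << 18      // 262144 bytes of file data per default chunk
    const wrapped = chunk + 14 // 262158 bytes actually written to disk

    // ceiling division: filesystem blocks allocated for each stored chunk
    plain := (chunk + fsBlock - 1) / fsBlock    // 64 blocks, an exact fit
    padded := (wrapped + fsBlock - 1) / fsBlock // 65 blocks, one extra for 14 bytes
    fmt.Println(plain, padded)                  // prints: 64 65
}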

whyrusleeping commented 8 years ago

good point, we should probably adjust the block size to fit neatly there. the extra 14 bytes is the protobuf wrapping on the data.

jbenet commented 8 years ago

@robcat great observation -- agreed. this will shift a bit with incoming ipld, too. we should chunk as N - overhead.

robcat commented 8 years ago

@jbenet

this will shift a bit with incoming ipld, too. we should chunk as N - overhead

Yes, this would be the obvious fix, but it would still not be compatible with the deduplication features that many filesystems already have (the chunks would have odd offsets and sizes, while the filesystem is probably storing the original file in 2^n-byte chunks).

An alternative solution would be to make a special exception for blobs: use 2^n sizes for chunks, and the blockstore could transparently strip the wrapping when writing and then re-add it when reading it from the filesystem (the special status of the stored block may be encoded in the filename or in one attribute). This sounds like a premature optimization, but would really help in halving the storage requirements when running ipfs in parallel with httpd... (maybe related to #875)
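
A rough sketch of what such a transparent layer could look like (all types and names below are hypothetical illustrations, not go-ipfs code): the store strips the protobuf wrapping from data-carrying blocks before writing, marks them as stripped, and re-adds the wrapping on read so the block still hashes correctly.

package store

// Hypothetical sketch only: none of these types exist in go-ipfs.
// The idea is that leaf blocks hit the disk as their bare 2^n-byte payload,
// while the protobuf envelope is reconstructed on the way back out.

type backing interface {
    put(key string, data []byte, stripped bool) error // "stripped" could live in the filename or an xattr
    get(key string) (data []byte, stripped bool, err error)
}

type unwrappingStore struct {
    disk backing
    // stand-ins for (de)serializing the protobuf envelope around a leaf's payload
    strip func(block []byte) (payload []byte, ok bool)
    wrap  func(payload []byte) []byte
}

func (s *unwrappingStore) Put(key string, block []byte) error {
    if payload, ok := s.strip(block); ok {
        // data-carrying leaf: store only the power-of-two payload
        return s.disk.put(key, payload, true)
    }
    // everything else (links, metadata) is stored verbatim
    return s.disk.put(key, block, false)
}

func (s *unwrappingStore) Get(key string) ([]byte, error) {
    data, stripped, err := s.disk.get(key)
    if err != nil {
        return nil, err
    }
    if stripped {
        // rebuild the original block so its multihash still verifies
        return s.wrap(data), nil
    }
    return data, nil
}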

jbenet commented 8 years ago

An alternative solution would be to make a special exception for blobs: use 2^n sizes for chunks, and the blockstore could transparently strip the wrapping when writing and then re-add it when reading it from the filesystem (the special status of the stored block may be encoded in the filename or in one attribute). This sounds like a premature optimization, but would really help in halving the storage requirements when running ipfs in parallel with httpd... (maybe related to #875)

This makes a lot of sense. and we should definitely investigate how this would play out. Optimizing this will matter significantly, as we're trying to put enormous amounts of data on ipfs :) -- the timeline for implementation can wait (if easy do it sooner), but certainly want to figure it out soon for specs.

I suspect this would get a TON easier if we were to allow the ipld format that's just the raw data (no wrapping) -- obviously strictly for leaves, without any links. This has been requested many times, (and im not opposed to it, we just need to make damn sure that we distinguish it correctly).

robcat commented 8 years ago

I suspect this would get a TON easier if we were to allow the ipld format that's just the raw data (no wrapping) -- obviously strictly for leaves, without any links.

Sorry if I go a bit off-topic, but while I was focused on the file-archival scenario I forgot to think about the links.

Why are the links stored together with the binary data in the first place? Links beg to be stored in a separate database (it would be much easier to navigate the graph, follow backlinks, etc.).

robcat commented 8 years ago

Why are the links stored together with the binary data in the first place? Links beg to be stored in a separate database (it would be much easier to navigate the graph, follow backlinks, etc.).

@jbenet My last post was not just a rant about links: if they were stored separately, this block size issue would be immediately solved.

jbenet commented 8 years ago

@robcat hash links are part of the data model. this is why it's a merkleized data structure. an entire object is one item.

i understand that this seems odd when you think of ipfs as just a way to track posix files. but ipfs is much more beyond that, the point is to make ipfs objects a data model where people can construct arbitrary data structures -- and dump all their existing ones -- and have merkle links as first class primitives.

this is not a great rendering, but take a look:

jbenet commented 8 years ago

btw, a storage model that rips out the merkle links, storing them separately, and then puts them back together when constructing the object logically (and to send it, and hash it to make sure things are ok, etc) screams implementation pain and suffering. this is why #875 will be hard.

caveat is that by extending IPLD to allow just raw data edge nodes, we can make #875 easy to implement, and get most of the way you want here.

also note that none of this prevents us from maintaining indexes of the links in a graph database (for backlinks etc), but that's a different layer. (there's a lot wrapped into these decisions)

robcat commented 8 years ago

i understand that this seems odd when you think of ipfs as just a way to track posix files. but ipfs is much more beyond that, the point is to make ipfs objects a data model where people can construct arbitrary data structures -- and dump all their existing ones -- and have merkle links as first class primitives.

@jbenet I agree. In a certain sense my view is even more radical than yours: to make links first class citizens, blocks have to be put in a lower class.

The current graph structure (as in the IPFS paper):

type IPFSLink struct {
    Name string
    Hash Multihash
    Size int
}

type IPFSObject struct {
    links []IPFSLink
    data []byte
}

// the specific serialization format is an integral part of the object
func (obj *IPFSObject) Serialize() []byte

If I just want to walk the graph, I necessarily have to fetch all the binary blobs and hash them (to confirm each object's integrity), even if I don't really care about them. That's because links are at the same level as data.

But walking the graph is such a common use case! The result is that, in practice, objects become specialized: they usually contain either only links or only binary data. In fact, I have yet to come across a really compelling case of an object that needs to include both links and data.

This is the structure I have in mind:

type IPFSLink struct {
    Name string
    Hash Multihash
    Size int
    Binary bool    // this is a link to an opaque []byte object
}

type IPFSObject []IPFSLink
func (obj *IPFSObject) Serialize() []byte

// the binary Leaves (i.e. []byte) are already flat: no serialization needed

Pro:

Contra:

jbenet commented 8 years ago

We're still on different pages. see https://github.com/ipfs/specs/blob/ipld-spec/merkledag/ipld.md and the talk i linked earlier.

ipfs objects will be arbitrary data objects. the great majority of them will not be posix files nor blocks of them. they will be small pieces of interlinked data in applications.

robcat commented 8 years ago

We're still on different pages. see https://github.com/ipfs/specs/blob/ipld-spec/merkledag/ipld.md and the talk i linked earlier.

Sorry @jbenet, I watched and read all the linked material, but in my view they sit on different levels. IPLD is the definition of a common language for interpreting and canonicalizing the nodes of the graph; here I was instead discussing implementation details (do we really need to specify a serialization format for objects that are already flat?)

ipfs objects will be arbitrary data objects. the great majority of them will not be posix files nor blocks of them. they will be small pieces of interlinked data in applications.

I understand this vision perfectly, but maybe we have "dual" views. In my view, the Data property in a Node object (as the current implementation has it) can only be justified by the unix mantra "everything is a file". Ordinary objects should encode all their information using Links, not some opaque byte stream.

Also, maybe we don't agree on what counts as small. In my view, the content of the Data field is by definition always big (otherwise it could even be encoded as a link with an identity-multihash value) and does not belong in the strongly connected component of the graph (where the small pieces of interlinked data will live). This doesn't force the graph to be a tree at all (as in this example).

But if we are still not understanding each other, I would like to ask:

jbenet commented 8 years ago

Also, maybe we don't agree on what counts as small. In my view, the content of the Data field is by definition always big (otherwise it could even be encoded as a link with an identity-multihash value) and does not belong in the strongly connected component of the graph (where the small pieces of interlinked data will live). This doesn't force the graph to be a tree at all (as in this example).

  • Even today, we already have many intermediate nodes (nodes with lots of links) larger than most file objects (containing data)

  • Also, an object's "data" won't be segregated off in a Data field any more. the non-link properties are interspersed in the object with the links. And some data can be added to the links themselves. See the example about unix directories with modes again.

  • Why does the node object have a special Data field, despite the fact that IPLD doesn't give it any special consideration?

IPLD is not yet deployed; we're moving from the current format to IPLD, and the transition is not complete yet. The old format will continue to be supported -- backwards compatible -- so links do not break.

  • How are this title property and this array of links going to be handled by the go-ipfs implementation?

I don't understand the question. they will be stored in the same serialized object. See https://github.com/ipfs/go-ipld/

  • What is a compelling case of object that has to store links and binary data at the same time?

    • A filesystem directory.
    • Any webapp data models:
      • social webapp: users, posts, relationships, etc.
      • store webapp: users, products, categories, bundles, orders, ...
      • any api that today uses JSON or XML or Protobuf or ...
    • A file with metadata or extended attributes
    • An archive format
    • Versioning objects (git commits, darcs patches, CRDTs, etc)
    • Cryptographic artifacts (keys, signatures, certificates, ...)

It's not "binary data" specifically. it's just "data".

  • How is my proposed implementation of the Node object going to prevent arbitrary data objects from being stored in the graph?

You distinguish "data-carrying nodes" as leaves, and special. IPLD does not; it embeds data directly in the same object. Also, you do not have the same data model as IPLD (a full json-like tree); you have the old merkledag link-list model.
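
As a rough illustration of that difference (illustrative only; the nested map below stands in for IPLD's actual link encoding, which this sketch does not try to reproduce, and the hashes are hypothetical), an IPLD-style object can mix plain properties with links in one json-like tree, whereas the old model only allows a flat link list plus one opaque Data field:

package example

// Illustrative only: a json-like, IPLD-style directory where non-link
// properties (mode) sit next to merkle links in the same object. The
// "link" key is a stand-in, not the draft spec's real link syntax.
var directory = map[string]interface{}{
    "hello.txt": map[string]interface{}{
        "mode": 0644,
        "link": "QmHashOfTheFileContents", // hypothetical hash
    },
    "subdir": map[string]interface{}{
        "mode": 0755,
        "link": "QmHashOfTheSubdirObject", // hypothetical hash
    },
}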


Other points for consideration:

davidar commented 8 years ago

I suspect this would get a TON easier if we were to allow the ipld format that's just the raw data (no wrapping) -- obviously strictly for leaves, without any links. This has been requested many times, (and im not opposed to it, we just need to make damn sure that we distinguish it correctly).

An alternative solution would be to make a special exception for blobs: use 2^n sizes for chunks, and the blockstore could transparently strip the wrapping when writing and then re-add it when reading it from the filesystem (the special status of the stored block may be encoded in the filename or in one attribute). This sounds like a premature optimization, but would really help in halving the storage requirements when running ipfs in parallel with httpd... (maybe related to #875)

This makes a lot of sense. and we should definitely investigate how this would play out. Optimizing this will matter significantly, as we're trying to put enormous amounts of data on ipfs :) -- the timeline for implementation can wait (if easy do it sooner), but certainly want to figure it out soon for specs.

@jbenet I've suggested some changes to the IPLD spec in ipfs/specs#111 to address this issue, comments welcome :)

Kubuxu commented 8 years ago

Here - https://github.com/ipfs/go-ipfs/issues/2122 - we are also talking about limiting the block size to 256KiB to allow better alignment and packing in storage (the filesystem block size is usually 4KiB, but it can be bigger), which was the initial topic of this issue.

kevina commented 8 years ago

Note that it should be possible to adapt my filestore code (#875, #2634) to separate out the payload and store it separately, and basically implement what I think @robcat is proposing. This was actually something I was going to propose, and it could have some nice performance benefits. For example, the garbage collector only cares about the links in a node; by separating out the payload you could implement something like a GetLinks() method on the DAGService, and this method could be satisfied without having to load the payload from disk.
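
A minimal sketch of that idea (the types below are hypothetical stand-ins, not the actual go-ipfs DAGService API): link-only queries are answered from a small link store, and the payload is only read from disk when the full node is requested.

package dag

// Hypothetical sketch: links and payload are stored separately, so link-only
// operations (such as the GC traversal) never touch the payload on disk.

type Link struct {
    Name string
    Hash string // multihash of the target node
    Size uint64
}

type Node struct {
    Links   []Link
    Payload []byte
}

type linkStore interface {
    getLinks(hash string) ([]Link, error) // small records, e.g. in leveldb
}

type payloadStore interface {
    getPayload(hash string) ([]byte, error) // large payloads, e.g. one file each
}

type dagService struct {
    links    linkStore
    payloads payloadStore
}

// GetLinks serves graph walks and GC without loading any payload.
func (d *dagService) GetLinks(hash string) ([]Link, error) {
    return d.links.getLinks(hash)
}

// Get assembles the full node only when the payload is actually needed.
func (d *dagService) Get(hash string) (*Node, error) {
    links, err := d.links.getLinks(hash)
    if err != nil {
        return nil, err
    }
    payload, err := d.payloads.getPayload(hash)
    if err != nil {
        return nil, err
    }
    return &Node{Links: links, Payload: payload}, nil
}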

If this idea is implemented it might not necessarily be a good idea to set the block size to 256KiB, as that means the payload could then have odd sizes, and odd sizes might not be compatible with a filesystem's deduplication. If the payload size is exactly 256KiB then the filesystem has a much better chance of deduplicating the block, if that data is also present in another file.

Something to think about.

Kubuxu commented 8 years ago

It is true, but you have to consider how many nodes will be pure raw data nodes that also happen to be in a file that a user publishes, and how many of them will be linking nodes, storing metadata and using custom structures. Users are mostly consumers these days (the 1% to 99% rule). Your filestore is a nice thing, but for the consuming majority it won't be that useful.

The other thing is that the filesystem will only be able to dedup it if the content of the file begins at a 4KiB boundary, which would require exactly N * 4KiB of metadata before it; I don't see that happening.

On the other hand, if raw nodes were to be introduced (DAG leaves consisting only of raw data that IPFS does not interpret), they would store precisely 256KiB of a given file, aligning perfectly with filesystem boundaries and allowing for filesystem deduplication. And the linking nodes are also 256KiB and align to filesystem boundaries, not wasting any space.

kevina commented 8 years ago

@Kubuxu, my idea was that the metadata would be stored separately from the payload (i.e. the contents of the file). For example, the metadata could be stored in a leveldb (as it will be small) while the payload could be stored in a separate file on disk. The size of the metadata in this case would be irrelevant. (This idea is different from the filestore: each block would still be stored in its own file using the same system the flatfs datastore uses now, but the contents stored on disk would only be the payload.)
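
Sketching the write side of that layout (purely hypothetical names, not the actual flatfs or leveldb datastore code): the small metadata record goes into a key-value store and the payload into its own file, so the on-disk file keeps a clean power-of-two size.

package repo

import (
    "os"
    "path/filepath"
)

// putSplit is an illustrative sketch: metadata (links, sizes) goes to an
// in-memory map standing in for leveldb, while the payload is written to its
// own file, much as flatfs writes whole blocks today -- except the file now
// holds only the payload.
func putSplit(kv map[string][]byte, dir, hash string, meta, payload []byte) error {
    kv[hash] = meta // stand-in for a leveldb put
    return os.WriteFile(filepath.Join(dir, hash+".data"), payload, 0o644)
}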

As for raw nodes without any metadata, that would also solve the problem. Has anyone given any thought to how that would be implemented? Without any sort of header, how would you know what you are getting when you retrieve a block?

kevina commented 8 years ago

To summarize my point from IRC: unless raw nodes are used for all data, at some point in the future it might make sense to store the data separately from the metadata in the repo. If the data is stored separately, then it is the size of the data component of the block that matters as far as optimizing storage goes.

I would propose we limit the size of the data to 256KiB - 4KiB unless raw nodes are being used. The block size can then still be limited to 256KiB.

davidar commented 8 years ago

As for raw nodes without any metadata, that would also solve the problem. Has anyone given any thought to how that would be implemented? Without any sort of header, how would you know what you are getting when you retrieve a block?

@kevina see my suggestions in ipfs/specs#111

whyrusleeping commented 8 years ago

let's get this done alongside the ipld stuff. With the new CIDv1 stuff we will be able to address completely raw blocks!
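
For context, a simplified sketch of why CIDv1 answers the "how do you know what you're getting without a header" question (illustrative layout only, not the go-cid package; real CIDs are varint- and multibase-encoded): the CID itself carries a content-type codec, so a raw leaf is distinguishable from a protobuf merkledag node without any wrapping inside the block.

package cid

// Illustrative only: the fields a CIDv1 carries in front of the multihash.
type cidV1 struct {
    Version   uint64 // 1 for CIDv1
    Codec     uint64 // content type: 0x55 ("raw" leaf) vs 0x70 ("dag-pb" merkledag node)
    Multihash []byte // hash of the block bytes themselves
}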

jbenet commented 8 years ago

@whyrusleeping note this: https://github.com/ipfs/specs/issues/130#issuecomment-241157747

Worth noting that it also depends on whether people want to be able to use those raw blocks as "files" of their own. If yes, then either unixfs must be taught how to do that from a raw block, or the file structure should still exist around the raw block. if not, we then have to generate two objects per block, just the raw data, and a very small object pointing to the raw data that makes it a file. (the important thing on this is being able to aggregate the indexing data structures of files wherever). It ultimately is a discussion trading off the advantages of raw blocks, vs the expressive advantages of "everything is a file", and how to tune those two.

(btw, this is separate from the "raw blocks should be smaller than ( + overhead)" issue, which we discussed previously.)

whyrusleeping commented 8 years ago

Raw blocks are now a thing: running ipfs add --raw-leaves <other args> will result in the use of raw blocks wherever possible. This means the size of a block on disk (for larger files) will now be exactly 256k, not overshooting it by a few bytes.

Closing this issue now, please go ahead and reopen this or open a new issue if further discussion is needed.

whyrusleeping commented 8 years ago

(also worth noting that you can ipfs cat raw blocks directly)