ipfs / specs

Technical specifications for the IPFS protocol stack
https://specs.ipfs.tech

Abstractions in IPLD: high level and low level data representations #91

Open nicola opened 8 years ago

nicola commented 8 years ago

This issue follows on from my original proposal #90 (see the full scheme https://github.com/nicola/interplanetary-paths), discussed this weekend with the team in New York, and from conversations with @stebalien, @jbenet, @diasdavid, @mildred (ping @dignifiedquire )

READ first: because there is so much text already, I wrote #important whenever it is the important part to read (of mine of course - feel free to use this in your posts)

Background

In this issue, I argue that there are two path schemes that we should offer to traverse IPLD objects:

We can abstract the different forms of data in different layers. For example, imagine we have file1.jpg in the folder dir.

Layer 4 (application)

The nice path that an application like unixfs offers to its end users should look like this: /$hash/dir/file1.jpg.

Layer 3 (IPLD object path)

Let's assume that the unixfs application decides to structure their data in this way:

/$hash === {
  dir: {
    files: {
       file1.jpg: Link{@link: hash},
       ...
       file10000.jpg: Link{@link: hash}
    }
  }
}

Note: in the case of unixfs, we could aim at merging Layer 3 with Layer 4, but for the sake of the argument, I made unixfs a bit more complex than it should be.

Layer 2 (IPLD block path)

However, since the folder is very big, our chunker (this can be implemented in many ways; let's assume that The Nicola IPLD Chunker works this way) is going to split the IPLD object into multiple IPLD objects, which we are going to call IPLD blocks:

/$hash === {
  dir: {
    files: {
      shard1: {
        file1: Link{@link: hash},
        ...,
      },
      shard2: {
        file5000: ...,
        ...,
      }
    }
  }
}
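
As a rough illustration of the chunking step above, the following Python sketch splits one large mapping of file names into fixed-size shards. The function name `shard_entries`, the `shard_size` parameter, the `shardN` key naming, and the placeholder link values are all assumptions made for this example; the thread does not specify how a real chunker would work.

```python
from typing import Any, Dict


def shard_entries(entries: Dict[str, Any], shard_size: int) -> Dict[str, Dict[str, Any]]:
    """Split a flat mapping into shards of at most `shard_size` entries.

    Hypothetical helper: the insertion order of `entries` decides which
    shard an entry lands in, and shards are named shard1, shard2, ...
    """
    if shard_size < 1:
        raise ValueError("shard_size must be at least 1")
    shards: Dict[str, Dict[str, Any]] = {}
    for index, (name, value) in enumerate(entries.items()):
        shard_name = f"shard{index // shard_size + 1}"
        shards.setdefault(shard_name, {})[name] = value
    return shards


# Layer 3 view: one logical "files" object (placeholder link values).
files = {f"file{i}.jpg": {"@link": f"hash-of-file{i}"} for i in range(1, 7)}

# Layer 2 view: the same entries, spread over several smaller objects.
sharded = shard_entries(files, shard_size=4)
```

Merging the shards back together yields the original mapping, which is the property that lets Layer 3 hide the split from the application.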

Summary of the 4 layers and their paths

In other words, the two key layers for IPLD are the 2nd and the 3rd. They should have different path schemes because both are important to the final application developer (depending on whether they are writing a higher or lower level application).

The difference between the two is that one traverses the actual IPLD data blocks, while the other one abstracts that. From the previous example:

Also, the way the separator will traverse either layer may have different meaning, for example in Layer 3, maybe there is no need to have transparent links, while it can be important for Layer 2.

So for example

{
  file1: Link({@link: hash, permission: 0777})
}
hash === {
  name: "Nicola"
}

For simplicity call high-ipld high level, and low-ipld the low level (the low level is the current IPLD)
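
One way to picture the two tools is as two functions over the same block store. The sketch below is only an illustration: the in-memory `STORE`, the placeholder hashes, and the function names `low_cat` and `high_cat` are assumptions, and it treats "subfiles" and "shards" as ordered lists of parts to concatenate.

```python
from typing import Dict, Union

Block = Union[bytes, dict]

# Hypothetical in-memory block store: placeholder hash -> block.
STORE: Dict[str, Block] = {
    "QmDoge": {"subfiles": [{"@link": "QmPart1"}, {"@link": "QmPart2"}]},
    "QmPart1": {"data": b"JFIF-first-half;"},
    "QmPart2": {"data": b"second-half"},
}


def low_cat(block_hash: str) -> Block:
    """Return the stored block exactly as it is (the low-level view)."""
    return STORE[block_hash]


def high_cat(block_hash: str) -> bytes:
    """Return the logical bytes, following links (the high-level view)."""
    block = STORE[block_hash]
    if isinstance(block, bytes):
        return block
    if "data" in block:
        return block["data"]
    # Assumed convention: both keys list the parts in reading order.
    parts = block.get("subfiles") or block.get("shards") or []
    return b"".join(high_cat(part["@link"]) for part in parts)
```

Here `low_cat("QmDoge")` exposes the list of links, while `high_cat("QmDoge")` returns the reassembled bytes.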

> low-ipld cat QmCCC...CCC/cat.jpg
{
  "data": "\u0008\u0002\u0012��\u0008����\u0000\u0010JFIF\u0000\u0001\u0001\u0001\u0000H\u0000H..."
}

> high-ipld cat QmCCC...CCC/cat.jpg
\u0008\u0002\u0012��\u0008����\u0000\u0010JFIF\u0000\u0001\u0001\u0001\u0000H\u0000H..."
> low-ipld cat --json QmCCC...CCC/doge.jpg
{
  "subfiles": [
    {
      "@link": "QmPHPs1P3JaWi53q5qqiNauPhiTqa3S1mbszcVPHKGNWRh"
    },
    {
      "@link": "QmPCuqUTNb21VDqtp5b8VsNzKEMtUsZCCVsEUBrjhERRSR"
    },
    {
      "@link": "QmS7zrNSHEt5GpcaKrwdbnv1nckBreUxWnLaV4qivjaNr3"
    }
  ]
}

> high-ipld cat QmCCC...CCC/doge.jpg
\u0008\u0002\u0012��\u0008����\u0000\u0010JFIF\u0000\u0001\u0001\u0001\u0000H\u0000H..."
> low-ipld cat --json QmCCC...CCC/blogpost
{
    "shards": [
        {
          "@link": "QmPHPs1P3JaWi53q5qqiNauPhiTqa3S1mbszcVPHKGNWRh"
        },
        {
          "@link": "QmPCuqUTNb21VDqtp5b8VsNzKEMtUsZCCVsEUBrjhERRSR"
        },
        {
          "@link": "QmS7zrNSHEt5GpcaKrwdbnv1nckBreUxWnLaV4qivjaNr3"
        }
      ]
}

> high-ipld cat QmCCC...CCC/blogpost
"This is a very long blogpost..."

Notes on implementation

Two path options

At the beginning, my perception of IPLD was that pathing would resolve the high level representation, so that if I have a JSON document, I could simply traverse it with /friends/0/name. However, the current IPLD pathing may not allow that.

Also, blocks and objects are named in reference to file system concepts; they are open to better naming.

#important

mildred commented 8 years ago

Thank you for getting to this, as it can become quite complex, especially when we don't know yet exactly what we are talking about.

I don't quite get the difference between layer 2 and layer 3 (IPLD blocks vs IPLD objects). Are you saying that a single logical IPLD object could be composed of multiple physical IPLD blocks that would be composed together to form the logical object?

I would aim for something simpler by merging the two, and make the application layer (4th layer) aware of the physical constraints of the underlying blocks and integrate the chunking into the application layer. I can see the value of having a chunking that comes for free for every application. But wouldn't it come with added complexity and inefficiencies?

As for the path solutions, I quite like the 1st solution with different prefixes for different usages. You could make an analogy with URI schemes. I would imagine something like:

In my opinion, each application might want to have the full flexibility of being able to define its own path system, with a common structure of course. This makes much more sense. Also, some applications might not necessarily want to implement a filesystem hierarchy at all. It should not be mandatory (and we always have the IPLD paths to debug the data).

nicola commented 8 years ago

Thanks for your reply @mildred, this concept is very new in my mind and I found it very hard to explain.

Are you saying that a single logical IPLD objects could be composed of multiple physical IPLD blocks that would be composed together to form the logical object ?

Yes, exactly!

I would aim for something simpler by merging the two, and make the application layer (4th layer) aware of the physical constraints of the underlying blocks and integrate the chunking into the application layer.

So, by merging them (and by merging them I don't mean using two different path schemes in one), when I add a JSON document to IPLD, it could become physically different from what it originally was, so imagine seeing /files/shard1/file1 instead of /files/file1. In this sense, you can consider IPLD objects as an application on top of IPLD blocks (because this is what it is!)

Also, in the paragraph in which I talk about the differences you can see what I mean by these two layers actually being different. One needs paths to be transparent, one needs the urls to be simple (no // on layer 3)

(layer 2 and layer 3 that are identical for me)

They were to me, until I realized that these may need two different path schemes as soon as they diverge in their physical representation

see Note on different path schemes

Of course, don't get me wrong, I am not settled on this idea of separating the two concepts either. I would prefer them to be merged. The way I see the difference between the two is that IPLD objects are a layer (or an application) on top of IPLD blocks. One gives me a very nice path scheme that corresponds to the actual representation of the data (the high level one) and abstracts the fact that an object may be split into multiple objects.

mildred commented 8 years ago

In this sense, you can consider IPLD objects as an application on top of IPLD blocks (because this is what it is!)

Makes complete sense. Not much different than protocols running over TLS instead of TCP. In the end, applications will be able to decide over which layer they want to be implemented. Some might like the abstraction and not having to worry about little details while others might welcome the possibility to be closer to the wire format and limit the overhead (I suspect that unixfs will be of the latter).

nicola commented 8 years ago

@mildred if you want we can take this conversation over IRC/hangout (we need a way to explain this in an easier way - happy to chat)

The reason I would like this not to be just yet another app that you can build on IPFS is that when I add some data, I expect to get this data back with the same representation. In other words, I am describing a different, higher level resolver than the one that we envision for IPLD. IPLD blocks are the low level thing on which everything runs, IPFS resolves to files, IPLD objects resolve to structured data (if that makes sense).

What I am not convinced about is separating them into /ipfs and /ipld, since if we use the high level representation of IPLD, then they are the same

Stebalien commented 8 years ago

Unfortunately, I don't have time to talk much today but I'll try. The distinction between the object layer/block layer isn't the same as TLS versus TCP. Layer 3 is the data model layer and layer 2 is the data representation layer.

The purpose of having two distinct layers is to get application-specific efficiency/control (IPLD blocks) while still maintaining the logical structure of the data (IPLD objects). That is, the split separates the model from the representation. One core idea that you may be missing is that any valid tree of IPLD blocks is a valid IPLD object. This means that UNIXFS can choose to structure its IPLD blocks in a way that makes sense for unix filesystems while ipfs tar can structure its blocks in a way that makes sense for archives, and both resulting merkledags can be interpreted as valid IPLD objects.

For example, a logical IPLD object might be:

{
    "a": "lots of data",
    "big dir": {
        "aa": "first",
        "ab": "second",
        "b": "third"
    }
}

where the actual IPLD blocks might be:

{
    "a": {
        @link: hash(blocks),
        "size": 24, // mandatory? data only?
    },
    "big dir": {
        "a": {
            @prefix: true, // interpret this as a prefix of a directory tree.
            @link: hash(a_dir),
            size: ??,
        },
        "b": "third", // inline file
    }
};

blocks = {
    @blocks: bytes, // Interpret this object as a single byte array.
    0: {
        @link: hash(block_first),
        size: 7,
    },
    7: {
        @link: hash(block_second),
        size: 5
    }
};

// These are, by themselves, valid objects.
block_first = "lots of";
block_second = " data";

// Again, a valid IPLD object.
a_dir = {
    "aa": "first",
    "ab": "second",
};
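
To make the `@blocks` idea above concrete, here is a small Python sketch that rebuilds the logical byte array from offset-keyed chunks. It is only an interpretation of the example: the store is keyed by placeholder names rather than real hashes, and the function name `read_bytes` and its consistency checks are assumptions.

```python
from typing import Dict

# Hypothetical store keyed by placeholder names instead of real hashes.
STORE: Dict[str, bytes] = {
    "block_first": b"lots of",
    "block_second": b" data",
}

# Layer 2 representation: integer keys are byte offsets into the array.
blocks = {
    "@blocks": "bytes",
    0: {"@link": "block_first", "size": 7},
    7: {"@link": "block_second", "size": 5},
}


def read_bytes(blocks_obj: dict, store: Dict[str, bytes]) -> bytes:
    """Concatenate the linked chunks in offset order, checking sizes."""
    out = bytearray()
    offsets = sorted(key for key in blocks_obj if isinstance(key, int))
    for offset in offsets:
        entry = blocks_obj[offset]
        if offset != len(out):
            raise ValueError(f"gap or overlap at offset {offset}")
        chunk = store[entry["@link"]]
        if len(chunk) != entry["size"]:
            raise ValueError(f"size mismatch at offset {offset}")
        out += chunk
    return bytes(out)
```

Because each entry carries its offset and size, a reader could also fetch only the chunk covering a byte range instead of the whole array.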

nicola commented 8 years ago

@Stebalien thanks for joining the conversation, this is really great

Layer 3 is the data model layer and layer 2 is the data representation layer.

I think this is a much better way to explain what I explained above (we could try to use IPLD data model vs IPLD data representation naming convention if blocks and objects don't work).

Your example gives an actual real use case in which the user may define the lower representation, but when they access via IPLD(data repr. scheme), they want to access the actual representation, without taking care of resolving it manually!

mildred commented 8 years ago

Ok, I understand, and perhaps we need to find a name for this layer 3 so we can start writing code for it.

I have a question here: will layer 3 allow links, or will links be resolved at layer 2 only? If links are in both layers, how to specify if a link is to be resolved in layer 2 or in layer 3?

Layer 2 only links would be fine for me.

Stebalien commented 8 years ago

Obviously, these are my opinions.

IPFS links are invisible in layer 3. However, if we want support for mutable links (IPNS, HTTP, etc.), those would have to be visible.

For now, I think it's reasonable to say that turning a layer 3 object into layer 2 blocks is the application's job. In the future, it might be worth it to add an API to IPFS that takes a layer 3 object and automatically chunks it up into layer 2 blocks (and tries to re-use existing layer 2 blocks) but this is a future optimization.

However, the current discussion assumes that we only allow metadata on links. If the purpose of metadata is to describe properties of linked objects without having to modify them, or to describe properties relevant only to the data representation (size of linked content, etc.), it makes sense to support metadata on links only. If the purpose of metadata is to describe relationships, it makes sense to support metadata on all relationships.

Personally, I'd like to support metadata on all relationships. That is:

{
    "joke": "Why did the chicken cross the road?",
    // Describes the relationship "obj1"
    "joke/": { // nicola doesn't like this syntax. I'm open to suggestions.
        "comment": "A funny file"
    },
    "sad story": {
        "@link": hash,
         "size": 99, // Not really metadata. This is a part of the link spec.
    },
    "sad story/": {
        "comment": "A sad file. Do not read."
    }
}

Incidentally, this allows metadata to be stored in its own linked layer 2 block.
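
A minimal sketch of how a reader could separate the data from the relationship metadata under the trailing-slash convention shown above. The function name `split_metadata` is an assumption, and the sketch handles only one level of the object.

```python
from typing import Any, Dict, Tuple


def split_metadata(obj: Dict[str, Any]) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Separate plain entries from "key/" entries describing a relationship."""
    data: Dict[str, Any] = {}
    meta: Dict[str, Any] = {}
    for key, value in obj.items():
        if key.endswith("/"):
            meta[key[:-1]] = value  # metadata about the relationship `key`
        else:
            data[key] = value
    return data, meta


example = {
    "joke": "Why did the chicken cross the road?",
    "joke/": {"comment": "A funny file"},
    "sad story": {"@link": "placeholder-hash", "size": 99},
    "sad story/": {"comment": "A sad file. Do not read."},
}
data, meta = split_metadata(example)
```

With this split, `data` is what a layer 3 reader would traverse, while `meta` could be moved into a separate block without changing the data itself.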

nicola commented 8 years ago

#important

I am not sure I follow. The way I see this happening is the following: we have some descriptive notation (@jbenet has some hints on the direction we should take) to describe the way shards, for example, could happen. IPLD objects, like web pages, should not need to resolve their links. Of course, if you want to resolve an entire object, there must be an API call that allows you to resolve all the links recursively.

The idea here is to keep the same structure, but expose the links. Let me give you some examples:

tl;dr

Without links, with shards
// original data
{
  friends: {
    nicola: {name: "Nicola"},
    ..
    zayan: {name: "Zayan"}
  }
}

// to ipld blocks
{
  friends: {
    @merge: [{@link: hash1}, {@link: hash2}, {@link: hash3}],
    metadata: "something here"
  }
}

// hash1 == { nicola: ..}
// hash2 == { ... }
// hash3 == { .. , zayan: ..}

// to ipld object
{
  friends: {
    nicola: {name: "Nicola"},
    ..
    zayan: {name: "Zayan"}
  }
}
/ipld-blocks/hash/ == { friends: { @merge: ..
/ipld-blocks/hash/friends == { @merge: ..
/ipld-blocks/hash/friends// == [@link, @link]
/ipld-blocks/hash/friends//0 == {@link:..
/ipld-blocks/hash/friends//0// == {nicola: {name: ..
/ipld-blocks/hash/friends//0//nicola == {name: "Nicola"
/ipld-blocks/hash/friends//0//nicola/name == "Nicola"
/ipld-blocks/hash/friends/metadata == "something here"
/ipld-blocks/hash/friends//metadata == undefined

/ipld-objects/hash/ == { friends: { @merge: ..
/ipld-objects/hash/friends/ == { nicola: {name: ..
/ipld-objects/hash/friends/nicola == {name: ..
/ipld-objects/hash/friends/nicola/name == "Nicola"
/ipld-objects/hash/friends.metadata == "something here" (maybe not needed)

With links, with shards
// original data
{
  friends: {
    nicola: {@link: "hash1", meta1: "data on this link!"},
    ..
    zayan: {@link: "hash2", meta2: "more about this!"}
  }
}

// to ipld blocks
{
  friends: {
    @merge: [hash1, hash2, hash3]
  }
}

// hash1 == { nicola: { @link: .. 
// hash2 == { ... }
// hash3 == { .. , zayan: {@link: ..

// to ipld object
{
  friends: {
    nicola: {@link: "hash1"},
    ..
    zayan: {@link: "hash2"}
  }
}

// to ipld object recursively explored
{
  friends: {
    nicola: {name: "Nicola"},
    ..
    zayan: {name: "Zayan"}
  }
}
/ipld-blocks/hash/ == { friends: { @merge: ..
/ipld-blocks/hash/friends == { @merge: ..
/ipld-blocks/hash/friends// == [@link, @link]
/ipld-blocks/hash/friends//0 == {@link:..
/ipld-blocks/hash/friends//0// == {nicola: {@link: ..
/ipld-blocks/hash/friends//0//nicola == {@link: ..
/ipld-blocks/hash/friends//0//nicola/meta1 == "data on this link!"
/ipld-blocks/hash/friends//0//nicola// == {name: ..
/ipld-blocks/hash/friends//0//nicola//meta1 == undefined
/ipld-blocks/hash/friends//0//nicola//name == "Nicola"

/ipld-objects/hash/ == { friends: { @merge: ..
/ipld-objects/hash/friends/ == { nicola: {@link..
/ipld-objects/hash/friends/nicola == {name: ..
/ipld-objects/hash/friends/nicola.meta1 == "data on this link!" (maybe we dont want this)
/ipld-objects/hash/friends/nicola/name == "Nicola"

note: in this example, in ipld-objects I sometimes use the dot . to describe accessing link and merge data (in other words, it allows you to access IPLD blocks from an object path)
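
The two path schemes can be sketched as two small resolvers over the "Without links, with shards" example. This is an interpretation under stated assumptions, not a reference implementation: the in-memory `STORE`, the placeholder hashes, all function names, and the choice of `None` to stand for `undefined` are made up for illustration, and the dot syntax is not covered.

```python
from typing import Any, Dict

STORE: Dict[str, Any] = {
    "root": {"friends": {"@merge": [{"@link": "hash1"}, {"@link": "hash2"}],
                         "metadata": "something here"}},
    "hash1": {"nicola": {"name": "Nicola"}},
    "hash2": {"zayan": {"name": "Zayan"}},
}


def step_through(node: Any) -> Any:
    """Dereference one layer 2 construct: a merge list or a link."""
    if isinstance(node, dict) and "@merge" in node:
        return node["@merge"]
    if isinstance(node, dict) and "@link" in node:
        return STORE[node["@link"]]
    return None


def child(node: Any, key: str) -> Any:
    """Plain key or index lookup; None stands in for `undefined`."""
    if isinstance(node, dict):
        return node.get(key)
    if isinstance(node, list) and key.isdigit() and int(key) < len(node):
        return node[int(key)]
    return None


def resolve_block_path(root_hash: str, path: str) -> Any:
    """Layer 2: an empty segment (from `//`) steps through a merge or link."""
    node = STORE[root_hash]
    segments = path.split("/") if path else []
    for segment in segments:
        node = step_through(node) if segment == "" else child(node, segment)
        if node is None:
            return None
    return node


def logical(node: Any) -> Any:
    """Layer 3 view of one node: merges and links are applied transparently."""
    while isinstance(node, dict) and ("@merge" in node or "@link" in node):
        if "@merge" in node:
            merged: Dict[str, Any] = {}
            for part in node["@merge"]:
                merged.update(logical(part))  # assumes each shard is a mapping
            node = merged
        else:
            node = STORE[node["@link"]]
    return node


def resolve_object_path(root_hash: str, path: str) -> Any:
    """Layer 3: every segment is a plain key; shards stay invisible."""
    node = logical(STORE[root_hash])
    segments = path.split("/") if path else []
    for segment in segments:
        node = child(node, segment)
        if node is None:
            return None
        node = logical(node)
    return node
```

For example, `resolve_block_path("root", "friends//0//nicola/name")` and `resolve_object_path("root", "friends/nicola/name")` reach the same value by different routes, while `resolve_block_path("root", "friends//metadata")` yields `None`.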

Stebalien commented 8 years ago

IPLD objects like web pages, should not need to resolve their links.

To me, layer 3 is really a convention or a way of thinking about the data on top of which we can build APIs. For example, to avoid recursively fetching objects, you could ask ipfs for, e.g., a "directory listing" of an object.

However, you bring up a good point. In general, there is a distinction between data included in an object, and linked data. The discussion so far (or, at least my interpretation of it) conflates the two.

Maybe it would be a good idea to distinguish between links and includes (I believe I mentioned this off-line but don't think we got very far in that discussion). That is, merges, blocks, includes, etc. (things that allow splitting large objects into smaller reusable chunks) would be layer 2 constructs that would disappear in layer 3 but true links could still remain. Links would demarcate clear boundaries between distinct objects that happen to be related in some way while an include would mean "this content is logically part of this object but is stored elsewhere".

However, what about distinct objects that should be stored together for efficiency purposes? That is, given the above we have a way of splitting large objects but no way to combine small objects. I'm imagining a deep directory tree such as:

{
  "a": {
    "b" : {
      "c": {
        "d": "stuff",
      }
    }
  }
}

Each "directory" here is a distinct logical object but splitting this up into multiple blocks is wasteful.

One way to solve this is through metadata (which could be stored as a tag in CBOR). That is, in layer 2, the above object would look like:

{
  "a": {
    "b" : {
      "c": {
        "d": "stuff",
        "d/": { "distinct": true }, // Could be stored as a tag in CBOR for efficiency.
      },
      "c/": { "distinct": true },
    },
    "b/": { "distinct": true },
  },
  "a/": { "distinct": true },
}

In this way, a "link" would just be a "distinct" include.
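
As a sketch of how a reader might recover the logical object boundaries from such a packed block, the following Python function lists every path whose relationship metadata is marked "distinct". The function name `distinct_paths` and the leading-slash path format are assumptions for this example.

```python
from typing import Any, Dict, List


def distinct_paths(node: Dict[str, Any], prefix: str = "") -> List[str]:
    """Collect paths of children flagged as distinct logical objects."""
    found: List[str] = []
    for key, value in node.items():
        if key.endswith("/"):
            continue  # "key/" entries are metadata, not children
        path = f"{prefix}/{key}"
        if node.get(key + "/", {}).get("distinct"):
            found.append(path)
        if isinstance(value, dict):
            found.extend(distinct_paths(value, path))
    return found


packed = {
    "a": {
        "b": {
            "c": {"d": "stuff", "d/": {"distinct": True}},
            "c/": {"distinct": True},
        },
        "b/": {"distinct": True},
    },
    "a/": {"distinct": True},
}
boundaries = distinct_paths(packed)
```

All four logical objects live in a single block here, yet their boundaries remain recoverable from the metadata alone.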

Note: my terminology is probably confusing.

Stebalien commented 8 years ago

To make sure we're all on the same page, the purpose of the layer 2/layer 3 distinction as I see it is that logical object boundaries should not be dictated by performance requirements. That is, separation of model and representation. Enumerated:

  1. Efficient lookup in large logical objects. It should be possible to look up a child of a large logical object without having to download the entire logical object.
  2. Efficient streaming of large logical objects. We can download the logical object chunk by chunk and stream the pieces we are interested in instead of having to download it all at once.
  3. Efficient packing of tiny logical objects.

Did I miss anything?
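
The first objective can be illustrated with a sketch in which the root block records the smallest key of each shard, so a lookup fetches a single shard. The sorted-shard layout, the `ROOT` index, the placeholder hashes, and the `fetch` and `lookup` names are all assumptions; the thread does not prescribe any particular index.

```python
import bisect
from typing import Any, Dict, List, Tuple

# Hypothetical shards, each holding a contiguous range of sorted keys.
STORE: Dict[str, Dict[str, Any]] = {
    "shard-a": {"alice": 1, "bob": 2},
    "shard-m": {"mallory": 3, "nicola": 4},
    "shard-t": {"trent": 5, "zayan": 6},
}

# Root block: (smallest key in shard, shard hash), sorted by key.
ROOT: List[Tuple[str, str]] = [
    ("alice", "shard-a"),
    ("mallory", "shard-m"),
    ("trent", "shard-t"),
]

fetched: List[str] = []  # records which blocks were downloaded


def fetch(block_hash: str) -> Dict[str, Any]:
    """Stand-in for a network fetch of one block."""
    fetched.append(block_hash)
    return STORE[block_hash]


def lookup(key: str) -> Any:
    """Find `key` by fetching only the shard whose range could contain it."""
    first_keys = [first for first, _ in ROOT]
    position = bisect.bisect_right(first_keys, key) - 1
    if position < 0:
        return None  # key sorts before every shard
    return fetch(ROOT[position][1]).get(key)
```

Looking up one key touches one shard, regardless of how many shards the logical object has.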

mildred commented 8 years ago

However, what about distinct objects that should be stored together for efficiency purposes?

I would say: just don't make a link and include the data directly in place of the link. This is a very simple solution with no need to add another concept.

edit: by include I mean put the data there without any indirection.

And if you need to reference this partial object from another place, this is where you should use IPLD paths in IPLD links.

nicola commented 8 years ago

@mildred I am not sure if by include you mean @merge or @link. @merge will include the links, @link. Of course one should be given an option to resolve all the links as well (recursively). One of the issues that I see resolving all the links by default is that some links may have relative paths, so you will create loops :(

@Stebalien I agree with your 3 objectives

jbenet commented 8 years ago

Glad to see this discussion happening. have a lot to say, of course. sorry i haven't gotten to write. Some points for discussion:


I was reminded through recent conversations that these are really just functional data structures, and that there likely is very good literature on implementing these efficiently. We can look into that for some time to get clarity on layer2 and layer3.


The @merge concept is an attractive one, but i would caution you both not to pursue it like that. BTW this is effectively what we do already with ipfs's pre-ipld objects (the merge operation for files is concatenating all the data segments of the file objects). Instead, think much more generally, as any kind of computation. It's not a merge, it's a transformation. We go from one graph of objects to another, and in some cases, to other representations, like bytestreams.

Trivial computations (like byte-stream merge (for files), or object merge (for json sharding)) are fine to hardcode, but what would be really powerful is to make sure this model is extensible with proper computation. Think of being able to implement CRDTs over this trivially.
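
One way to read "it's a transformation" is as a table of named functions that turn a list of layer 2 parts into a layer 3 value. The sketch below is purely illustrative: the registry `TRANSFORMS`, its key names, and `apply_transform` are assumptions, and the grow-only set union is included only as the simplest CRDT-style merge.

```python
from typing import Any, Callable, Dict, List

Transform = Callable[[List[Any]], Any]


def concat_bytes(parts: List[bytes]) -> bytes:
    """Byte-stream merge, as used for file chunks."""
    return b"".join(parts)


def merge_objects(parts: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Object merge, as used for sharded mappings (later parts win)."""
    merged: Dict[str, Any] = {}
    for part in parts:
        merged.update(part)
    return merged


def gset_union(parts: List[List[Any]]) -> frozenset:
    """Grow-only set merge: the union of all replicas."""
    result: set = set()
    for part in parts:
        result |= set(part)
    return frozenset(result)


# Hypothetical registry; a real system might reference executable
# functions by hash instead of by name.
TRANSFORMS: Dict[str, Transform] = {
    "concat": concat_bytes,
    "merge": merge_objects,
    "gset-union": gset_union,
}


def apply_transform(name: str, parts: List[Any]) -> Any:
    """Produce the logical value from its parts with the named function."""
    return TRANSFORMS[name](parts)
```

Adding a new kind of computation then means adding a function to the table, without changing the resolver that calls `apply_transform`.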

Please both take a look at these preliminary discussions/experiments:

This is an extremely powerful way to think about data. AND IT IS VERY HARD TO DO RIGHT. And it has been written about extensively. We (I) need to spend a good amount of time searching the literature to gain understanding of results important to this endeavor. Please for now only use "merge" as a tool for thought, or to express how layer3 may be created out of layer2, but do not assume @merge needs to or will be a key itself, instead we will likely reference executable functions directly.


We can repurpose the word blocks for layer 2, but I would caution against it and encourage us to find other words, as "blocks" historically are dumb byte sequences, not nice objects, and it already means that in ipfs:

block := the serialized representation of an ipfs (ipld) object

jbenet commented 8 years ago

@nginnever can you post the sketches @nicola and I put in your notebook here?

qxotk commented 8 years ago

I have for a long time used physical / logical to describe the 2 main concerns for data.

Physical: the domain of hardware and practical engineering constraints
Logical: the domain of software and computational or dynamic constraints

Does anyone have specific concrete examples or precedents that would require IPFS to diverge into different nomenclature for these domains?

On Mon, Apr 18, 2016 at 2:19 AM, Juan Benet notifications@github.com wrote:

@nginnever can you post the sketches @nicola and I put in your notebook here?


james mcfarland james@jamesmcfarland.com http://jamesmcfarland.com

nginnever commented 8 years ago

@jbenet

IPLD Sketches - https://gateway.ipfs.io/ipfs/QmUz898hhH2Z8X3c8Jd6V1DiJqhSqLNi5u45oNcZ2qWcFp

IPLD Pad Agenda - https://pad.riseup.net/p/EByKoWmYHrjz