Pinning new cbor object doesn't appear to work

ianopolous commented 7 years ago

Version information:

go-ipfs version: 0.4.5-dev-4cb236c Repo version: 4 System version: amd64/linux Golang version: go1.7.1

Type:

Bug

Priority:

P0

Description:

Pinning a new cbor object created using block.put doesn't appear to work. To reproduce:

>> echo -e "\x4b\x67\x27\x64\x61\x79\x20\x49\x50\x46\x53\x21" | ipfs block put --format=cbor
zdpuAue4NBRG6ZH5M7aJvvdjdNbFkwZZCooKWM1m2faRAodRe
>> echo -e "\xd9\x01\x02\x58\x25\xa5\x03\x22\x12\x20\x65\x96\x50\xfc\x34\x43\xc9\x16\x42\x80\x48\xef\xc5\xba\x45\x58\xdc\x86\x35\x94\x98\x0a\x59\xf5\xcb\x3c\x4d\x84\x86\x7e\x6d\x31" | ipfs block put --format=cbor
zdpuApNFmG7PZ53BWxwix4HztiVDHomrvdJLTegycZb8YU5Qr
>> ipfs pin add -r zdpuApNFmG7PZ53BWxwix4HztiVDHomrvdJLTegycZb8YU5Qr
>> ipfs repo gc
>> ipfs block get zdpuApNFmG7PZ53BWxwix4HztiVDHomrvdJLTegycZb8YU5Qr
>> ipfs block get zdpuAue4NBRG6ZH5M7aJvvdjdNbFkwZZCooKWM1m2faRAodRe

The gc should NOT remove the two blocks added (it currently removes both). And the subsequent gets should succeed. The first block is just a cbor byte array of 'gday IPFS!' The second is just a cbor merkle link to /ipfs/zdpuAue4N...

N.B. I may not have the correct serialization for the merkle link, but as far as I can tell it is correct (a cbor tag of 258 for the multiaddr)

ianopolous commented 7 years ago

@whyrusleeping I believe it is a multihash multiaddr (/ipfs/Qmjfnrjnf), rather than a cid. I can do that in ~8 hours time if that suits?

whyrusleeping commented 7 years ago

@ianopolous Yeah, I think we're expecting it to not have the /ipfs/ prefix on it. I could be wrong though.

ianopolous commented 7 years ago

@whyrusleeping it doesn't have the /ipfs in the byte[]. The byte[] is whatever the multiaddress serialization is returning

whyrusleeping commented 7 years ago

It shouldnt be a multiaddr, it should be a raw multihash or a cid. The spec is wrong there when it says multiaddr.

ianopolous commented 7 years ago

I was just going from the spec, if it isn't a multiaddr, but just a cid then that's also fine. That did puzzle me. (What about /ip4/127.0.0.1) :-)

kevina commented 7 years ago

Note: I unassigned myself from this issue and now turned off notifications.

ianopolous commented 7 years ago

@whyrusleeping Assuming my Cid serialization is correct, this should be able to replace the 2nd object, and point to the first (it is a tagged byte[] of the cid1.bytes()):

echo -e "\xd9\x01\x02\x58\x24\x01\x71\x12\x20\x88\xf4\xb9\xd6\x63\x68\x92\x2f\x57\x81\x9a\xce\xb4\x83\x0c\x26\xef\x15\x1a\x17\x2d\x5e\xb3\x48\x2b\x23\xb3\x36\x88\xfc\x74\xa5" | ipfs block put --format=cbor

jbenet commented 7 years ago

Should any valid CBOR object count?

Yes. We want to use any CBOR object, like "Foobar".

Should the link be a map-type with the tag?

My original thinking: Yes. (along with @mildred and @davidar)

Last week, @whyrusleeping brought to my attention that this was missed by the js and go ipld implementations. (@diasdavid: i think we tried telling you last week in person, too). It is an unfortunate omission, sorry you had to find it the hard way @ianopolous.

@diasdavid brought up also that this may create problems for people encoding data and that arguably we should be encoding with a map-type, the same way json does. In particular, it made me wonder whether round-tripping an ipld object through CBOR -> JSON -> CBOR (with non-ipld-specific serializers) might drop the special map-type, or how it would be handled. It may be that roundtripping CBOR -> JSON -> CBOR is already problematic for other reasons...

My thinking is that:

we should verify whether this special tag is an issue for round-tripping CBOR -> JSON -> CBOR
- and whether it is THE ONLY issue.
we should strive to follow the "special tag" approach, as it produces more compact link objects, which IPLD will be full of (think of the bytes)
we should strive for simplicity (what's simpler here? what's more understandable? is the "do what json does?" actually simpler? or is it just "simpler to those assuming json"?)

Why is multiaddr in https://github.com/ipld/specs/tree/master/ipld#serialised-cbor-with-tags ? is that right?

Yes, that's right. Multiaddr is indeed what is intended there. There was -- and still is -- the intention to allow valid IPFS and IPLD paths to be valid multiaddrs. the benefit of this is that anything that supports a multiaddr would natively support IPFS and IPLD paths.
We have not created the "IPFS Path" and "IPLD path" multiaddr support. We should do that.
If you guys think this is confusing, we can revisit the concerns and change things for now to be only IPFS and IPLD paths (note! not just a CID).

multiaddr is often touted as "self-describing network addresses", isn't an IPFS path NOT a network address? it doesn't point to a specific program/process endpoint...

No, it is indeed an address. it is not a process address, it is a content address for a particular dataum in a network -- the IPFS network. multiaddr can support content addressing -- why shouldn't it?

eg /ndn/parc.com/videos/WidgetA.mpg
eg /bitcoin/<bitcoin-block-hash> or /ipld/<cid-of-bitcoin-block-hash> -- we would use the second, but someone might have reasons to create the first
eg /bittorrent/<torrent-hash>
eg `/uri/

jbenet commented 7 years ago

updated the above to add:

If you guys think this is confusing, we can revisit the concerns and change things for now to be only IPFS and IPLD paths (note! not just a CID).

ianopolous commented 7 years ago

Am I correct in thinking that ipld will also restrict maps to having string keys, as opposed to a general cbor object as a key (which is allowed in cbor)? Then the implication is that any recursive sequence of map keys is a valid ipld path in an object (as long as the root object is a map also).

ianopolous commented 7 years ago

What would it mean if an ipld object had a merkle link with a multiaddr that was not content addressed, e.g. /ip4/8.8.8.8/tcp/8080 ? Or do you mean that all ipld paths are valid multi-addresses, but not the other way around?

daviddias commented 7 years ago

Catching up with this thread

CBOR serialization

With our options being:

a) Serializing {"/": ""} without tags would make it very simple, it is what is today and any CBOR decoder would be able to get it
b) Serializing {"/": ""} with a cbor-tag would make the serialized format smaller (by @whyrusleeping, exactly 5 bytes smaller), which can have a great impact in the network, but also means that now all the dag-cbor implementations need to check that case + we would want to register that with IANA.

Although saving 5 bytes per object is valuable, I'm more worried about having to force every cbor parser to understand our tag, things like ipfs dag get <cid> | cbor | json -someKey would not work out of the box without patching that cbor deserializer. For the sake of all the data pipelines that might be constructed, I'm leaning towards option a) for simplicity.

Multiaddrs and other mutable pointers

I was under the impression that we deferred the design of having something that looks like a mutable link to another layer in the IPLD stack. Right now, we have a restriction that an IPLD link should be a valid CID in the resolver. Adding mutable pointers resolution at the main IPLD resolved shouldn't be an issue.

On the argument of what is simple: It was discussed several times that a mutable pointer would have some kind of special key, there is not technical reason to, but it definitely makes it way more easy to glance through a graph and be able to pick right up what are mutable pointers and which of them aren't.

nicola commented 7 years ago

Conversation on mutable links here: https://github.com/ipld/specs/issues/9

nicola commented 7 years ago

Let me reframe this conversation so that we have a common vocabulary on this! There is an ipld object that can be represented via CBOR or JSON. However, without properly parsing the representations, they are simply CBOR/JSON objects.

In (a), the CBOR and JSON representation are identical, which makes it easy to convert across those.
In (b), they are different representations, but they are the same ipld object, once parsed. The issue with this case is that converting across formats needs a special knowledge of IPLD (that the tag transforms to a new level of key-attribute). Also, creating and modifying JSON or CBOR objects in their raw format without the IPLD library might be not simple anymore.

My opinion

Think of IPLD representations as something that are only touchable via an IPLD library: some future formats might be impossible to edit in their raw representation, or in some cases it would not make sense to have a / extra key/value. (I can give an example if you need it).
Parsers with knowledge?: I think that the fundamental question is

do we want anyone to properly parse raw cbor or only clients that know that that cbor should be read in the ipld way?
Solution (a) does not take advantage of the CBOR tagging system if someone is trying to read CBOR without an IPLD library, then they are probably not interested in the IPLD features, so it is fine if the tags disappear or are flattened out (as many libraries might do) with a generic CBOR converter.

Please, tell me if you see some concerns in this.

whyrusleeping commented 7 years ago

@nicola I agree, ipld objects really should be intended to be dealt with by ipld libraries. Dealing with the tag stuff isnt that much work, and any cbor tool will convert it to a 'tag' object in json (which will properly roundtrip). As an example, @ianopolous did all the tag stuff correctly reading the spec before i even knew about it. It was then pretty simple for me to switch my code over to using the tags. @diasdavid If you really feel strongly about this, i'd love to hear a case where using the tag bites the user in a way thats really hard to deal with.

whyrusleeping commented 7 years ago

However, i will ask that when we move forward with the tag approach, we select a value between 40 and 95 (the first unassigned range). using 258 consumes one extra byte per link and i'm all about saving bytes

whyrusleeping commented 7 years ago

Is anyone opposed to using a tag of 42? Its in the unallocated range here: https://www.iana.org/assignments/cbor-tags/cbor-tags.xhtml and should be fair game. If no opposition is voiced, i'll move forward with that tomorrow.

Also, we need to reach consensus on the multiaddr thing...

jbenet commented 7 years ago

On the multiaddr thing:

What would it mean if an ipld object had a merkle link with a multiaddr that was not content addressed, e.g. /ip4/8.8.8.8/tcp/8080 ? Or do you mean that all ipld paths are valid multi-addresses, but not the other way around?

The last part. i meant that "all ipld paths are valid multi-addresses, but not the other way around", and for now no mutable links. (this may relax in the future, but at this time, no mutable links).

we have a restriction that an IPLD link should be a valid CID in the resolver

Why? IPLD links should be able to be paths, like: /ipld/<cid>/a/b/c. not just a cid.

jbenet commented 7 years ago

Though, since it looks like @whyrusleeping and @diasdavid both assumed that links were only CIDs (not sure why ;P), i'm ok to relax the spec constraints there, and say that for now it's only a CID, and not a full path. if people want to do a full path and enforce it now, speak now! (i prefer paths)

ianopolous commented 7 years ago

@whyrusleeping Indeed, given a standard cbor library, it was easy to implement an ipld parser using the tags.

In my opinion, mutable (authenticated) links should be a layer up. Otherwise even a simple pinning operation could require arbitrary ipns lookups each with their own validity scheme, which would combine into something very hard to reason about. If the links are restricted to cids (or cid+path), then the resulting structures are very easy to understand and reason about.

I say this even though my use case would be a lot easier with ipns links in ipld (I have a hierarchy of write access controlled directories, each with their own ipns key). For me it would great to be able to pin the root and be done, but I can also handle this on the application level, and keep the ipld structure simple.

whyrusleeping commented 7 years ago

Alright, going to move forward with all this tomorrow. The cid's put in the links will be prefixed with a multibase code for binary (a single zero byte) for consistency. Doing this will allow us to easily upgrade to multiaddr paths later on (if buf[0] != 0 )

daviddias commented 7 years ago

@diasdavid If you really feel strongly about this, i'd love to hear a case where using the tag bites the user in a way thats really hard to deal with.

I don't feel strongly about it, I just raised the concern for the question made "How can we make this simple". We are going to document a lot anyway, it is just more a bunch of notes saying "use these libraries".

we have a restriction that an IPLD link should be a valid CID in the resolver Why? IPLD links should be able to be paths, like: /ipld//a/b/c. not just a cid.

The resolver knows how to resolve paths, but I haven't made tests for links to be paths (missed that one). Nothing that can't be added :)

whyrusleeping commented 7 years ago

go-ipfs will not support paths in links for a long time. please don't implement that, it won't interop.

jbenet commented 7 years ago

@ianopolous did you implement path support, or just cid support?

dignifiedquire commented 7 years ago

if people want to do a full path and enforce it now, speak now! (i prefer paths)

Please do paths, even if go doesn't support them currently it should at least be able to ignore them such that other implementations can use paths, they are one of the big features of IPLD.

ianopolous commented 7 years ago

@jbenet Originally I used general multi-addresses, then changed it to cids, but it's easy to change back. It's just a parser.

whyrusleeping commented 7 years ago

The spec for multiaddr ipld paths doesnt even exist yet. When we get to the point where we know what we're wanting to implement there then maybe we can think about implementing it

whyrusleeping commented 7 years ago

Alright, heres my PR to the CBOR code: https://github.com/ipfs/go-ipld-cbor/pull/5

I changed the cbor tag to 42, and changed links to contain a binary-multibase (the byte 0) prefixes binary CIDs. Please review.

ianopolous commented 7 years ago

It's great to see the answer to the ultimate question of life, the universe and everything making it into the ipld spec. :-)

whyrusleeping commented 7 years ago

@ianopolous I've merged the changes into master, and i've tested pinning a few different cbor objects, as well as added a test for it. Mind testing it independently and reporting back?

I'm going to optimistically close this issue (open it again if i'm wrong)

ianopolous commented 7 years ago

Hmm, with 0.4.5-pre2-416f025 it hangs if I try and pin the first object in this issue (the cbor byte[]) through the cli, e.g.:

echo -e "\x4b\x67\x27\x64\x61\x79\x20\x49\x50\x46\x53\x21" | ipfs block put --format=cbor
ipfs pin add zdpuAue4NBRG6ZH5M7aJvvdjdNbFkwZZCooKWM1m2faRAodRe

ianopolous commented 7 years ago

@whyrusleeping I also don't seem to have the permissions to reopen the issue.

whyrusleeping commented 7 years ago

@ianopolous hrm... doing that gets me 'merkledag not found'. Is that object a link?

whyrusleeping commented 7 years ago

For some reason ipfs thinks that object has a link to: zdpuAsFzbmVjkDjySRJNAw2ndtAogPUVqCSZgP4MhxVRxVcE4... Unclear why. investigating.

ianopolous commented 7 years ago

It's just a cbor byte array. No links. The byte array contents are the UTF8 bytes of "g'day IPFS!"

whyrusleeping commented 7 years ago

huh, it doesnt appear the cbor code likes roundtripping that byte array...

whyrusleeping commented 7 years ago

@ianopolous lol, i found the issue. Using echo adds an extra newline to the end of the input, which when using 'block put', gets added as part of the block. But when the cbor library parses this block, it ignores that newline since its not part of the actual format, resulting in a different hash.

If you use printf instead of echo there, you should get the expected behaviour.

I'm not sure if i'm calling this a bug yet...

ianopolous commented 7 years ago

@whyrusleeping Sorry, my bad. I can confirm that pinning a cbor byte[] works now. As well as pinning a merkle link to said byte[]. Cheers! Now to figure out why the http api is giving me multihashes not cids.

ipfs / kubo