ipld / specs

Content-addressed, authenticated, immutable data structures
Other
592 stars 108 forks source link

docs: deprecate daglink tsize property #234

Closed achingbrain closed 4 years ago

achingbrain commented 4 years ago

The Tsize property of dag links is unreliable and expensive to calculate.

It also does not reflect the size of the file that the dag represents so is of limited use to users.

We cannot remove it as it will break existing IPFS implemenations but the field is optional so we can deprecate it and encourage implementations to ignore it.

Follows on from https://github.com/ipfs/specs/issues/238

achingbrain commented 4 years ago

This will also improve ipfs.add performance in js-IPFS by resolving https://github.com/ipld/js-ipld-dag-pb/issues/154

warpfork commented 4 years ago

I agree, we should absolutely do this. New library code is already without this property. Therefore I think this is an ask of our docs and specs to be much more clear about this as much as anything else...

Stebalien commented 4 years ago

The Tsize property of dag links is unreliable and expensive to calculate.

It's unreliable because we never validated it. However, it's not expensive to calculate. It's there because it's not expensive to calculate assuming the linked nodes calculate it correctly (while it would be expensive to calculate the size of an entire tree).

rvagg commented 4 years ago

related #233

it will break existing IPFS implemenations

... but ...

encourage implementations to ignore it

so does that mean that we could force a full removal of it at some point in the future and existing IPFS implementations could do without it? or is it baked in to implementations in a way that's impossible to unwind? I'm wondering how useful a deprecation is, if we can't actually get rid of it then maybe it's pointless and we should push harder on unixfsv2 as a fix to this and other dag-pb concerns?

achingbrain commented 4 years ago

It's unreliable because we never validated it

The problem is that you can't validate it until you have the whole DAG, at which point you have the whole DAG so you didn't need to validate it because you can get the size directly from the DAG.

it's not expensive to calculate

The way the IPLD APIs have gone, you pass in a data object and get a CID back. You don't get the serialized size back, so the only way to know is to serialize it first, store the size, then pass it to IPLD which serializes it again and passes the resulting buffer to the block store. https://github.com/ipld/js-ipld-dag-pb/issues/154

This is our only semi-reliable source of progress information

Why would you not decode the UnixFS entry from the root node and get the file size from that?

I guess you could be requesting random chunks of DAG but I don't think that's something we do often? Certainly not with the ipfs.cat, ipfs.ls, ipfs.files.* type operations.

is it baked in to implementations in a way that's impossible to unwind

Sort of. The Tsize property is the last property of the PBLink message but it's optional, so if we remove it, it should result in the same bytes as if you didn't specify it. Old implementations won't crash, but default values in protobufs mean it'll be read as 0.

If, however, we later decide to add a new field to the PBLink message, old implementations will read that field as the Tsize and at that point all bets are off.

we should push harder on unixfsv2 as a fix to this and other dag-pb concerns

UnixFSv1 will still be around for a long time and with no arrival date for UnixFSv2 on the horizon, I'd rather fix up what we have if possible.

mikeal commented 4 years ago

This is our only semi-reliable source of progress information

Why would you not decode the UnixFS entry from the root node and get the file size from that?

This was the same question I had. If you have the file size, from the root, your progress is based on the bytes received of the underlying data, you don’t need sizes from the links. Is there a need that we have for progress information of every individual block relative to the expected block size?

Stebalien commented 4 years ago

I'm going to table this discussion. We need to keep generating these values as they are a part of dag-pb. Removing support would have unknown repercussions and is not worth the hassle.

The answer is not to try to fix dag-pb, the answer is to push on UnixFSv2 so we can move on. Unless there is a pressing reason to touch link sizes, this just creates more work.

achingbrain commented 4 years ago

The pressing reasons are elaborated above and on https://github.com/ipfs/specs/issues/238, indeed there seems to be a consensus here in favour so it's a shame this issue was closed.

We need to keep generating these values as they are a part of dag-pb.

Do we though? The only reason I've seen is for progress information and that is achievable in a way that's more useful for the user by using the file size from the unixfs metadata in the DAG root (e.g. when I'm downloading a file I care as much about including protocol buffer/UnixFS overhead in the reported size as I do TCP packet headers).

The schema allows for the Tsize field to be omitted from the serialized protocol buffer so it seems neatly backwards compatible.

The answer is not to try to fix dag-pb, the answer is to push on UnixFSv2 so we can move on

That's been the answer for rather a while but we're no closer to shipping anything and one of Juan's slides at lab week had UnixFSv2 on the 'maybe we don't concentrate on this in 2020' list, plus even if we ship v2 we're still going to have to support UnixFSv1 for the foreseeable future so we should really fix up what we have.

Stebalien commented 4 years ago

The pressing reasons are elaborated above and on ipfs/specs#238, indeed there seems to be a consensus here in favour so it's a shame this issue was closed.

None of those are pressing reasons. That is, we have no sudden/new motivation to remove this and keeping it isn't going to make our lives massively simpler.

What I don't want to do is to start yet another major refactor just because we think we could simplify something. We have enough on our plates at the moment. Auditing all of go/js-ipfs for uses of link sizes and replacing them with something else is not a small undertaking.

That's been the answer for rather a while but we're no closer to shipping anything and one of Juan's slides at lab week had UnixFSv2 on the 'maybe we don't concentrate on this in 2020' list, plus even if we ship v2 we're still going to have to support UnixFSv1 for the foreseeable future so we should really fix up what we have.

And we're not going to get any closer if we keep starting new refactors.

warpfork commented 4 years ago

I don't think anyone or any of the content in this diff is actually arguing for new refactors...?

The diff even explicitly says:

It is left in the protobuf definition in order to maintain backwards compatibility.

We do not have these size properties in links in the Data Model specifications that are the cornerstone of all work we've been doing on IPLD since 2018, and these link sizes were something that was very explicitly pointed out to me as "kill this with fire" by @whyrusleeping before I even got going on IPLD. It seems to me having this stated in more places throughout the docs and specs is strictly better and more honest.

And again, is not a refactor and is not adding new code work.

warpfork commented 4 years ago

It's possible that we could iterate on more clear ways to phrase our stance and future intentions around this property, but even if this PR is closed (and I'm not sure I agree with that), I remain in support of something being said about this. Our current docs make too many uncritical references to these size properties and it creates ongoing confusion that I tire of rectifying in one conversation after another without good reflections in the docs and specs.

Stebalien commented 4 years ago

It is left in the protobuf definition in order to maintain backwards compatibility.

That's not the issue. The issue is that we (at least in go) have a bunch of code that expects sizes to be reasonable. While that code shouldn't completely break if the sizes aren't there/defined, I'm not entirely sure what will happen. However:

Basically, this field has been there since the very beginning and has actually been pretty damn reliable.

Saying "don't rely on this but you MUST still generate it" is fine. That may ease the transition to UnixFSv2 where no such information will be available. However, the "MUST still generate it" is the critical part and, as far as I can tell, goes against the main motivation for this PR.

We do not have these size properties in links in the Data Model specifications that are the cornerstone of all work we've been doing on IPLD since 2018, and these link sizes were something that was very explicitly pointed out to me as "kill this with fire" by @whyrusleeping before I even got going on IPLD. It seems to me having this stated in more places throughout the docs and specs is strictly better and more honest.

This has nothing to do with the data model. Sizes aren't a property of CIDs, they're a property of DagPB Links. The model for DagPB is:

{
  "Data": ...,
  "Links": [
    {
      "Hash": ...,
      "Name": ...,
      "Size": ...,
     },
  ]
}

Note: The fact that go-ipld nodes return "Links" with sizes, regardless of the format, is a bug. We absolutely shouldn't do that and we should fix that.

warpfork commented 4 years ago

Alright, I'll agree that this diff didn't have enough explanation of what the deprecation would mean in practice. I do still want to remark that I appreciated the effort, though, because I think we can all agree something about these properties causes confusion in newcomers to the ecosystem on a fairly regular basis. And I think some remarks about the limitations of the link properties could be good to include in this document, even if this diff ain't it.

One constructive takeaway I'm making from this is we need much clearer language around Data Model standard semantics vs things a Codec does, and to put a lot more effort into using that language consistently. (We probably already knew that, but this is another scenario that underscores it even further.) It wouldn't fix everything, but it might help: some documentation which notes the DagPB-specialness of this property could be less controversial than "deprecation", perhaps, and also provide useful clarity about where it fits (and also doesn't!) in the bigger picture. That in turn could introduce a heading in the doc that's an appropriate place to talk about other drawbacks in a factual and useful manner. And it goes without saying I'd be happy to review more PRs which try to nail that balance. :)

mikeal commented 4 years ago

This was closed prematurely given our governance policy https://github.com/ipld/specs#governance as there does not appear to be a consensus either way on the issue. Re-opening.

I’m inclined to agree with @Stebalien on this one. There has to be some point at which we stop taking improvements to UnixFSv1 that change the hash of the encoded data.

I see and agree with the concerns regarding encoding information we can’t trust. It’s a security hazard that every implementation can trip over. But I don’t believe we’re encouraging additional implementations of UnixFSv1 anymore so I don’t see why we’d want to take another change to UnixFSv1 that alters the final file hashes.

If this is just informational (this property exists, you should encode it, but ignore it entirely on read) then let’s get the PR changed to this effect and I’m +1 on that change.

achingbrain commented 4 years ago

I think this is a question of pragmatism. There's no expected date for UnixFSv2 to arrive. In the meantime metadata in v1 is shipping imminently and we'll move to v1 CIDs in the near future, both of which will change CIDs (mostly, metadata is opt-in).

As currently coded both go-IPFS and js-IPFS will switch to raw leaf nodes in UnixFS with v1 CIDs which will change every CID in every UnixFS DAG. The change proposed here will only affect root and intermediate nodes, not to mention make the code simpler and the data more reliable.

In the absence of a better solution, it has to be ok to make backwards compatible changes.

Stebalien commented 4 years ago

I think this is a question of pragmatism. There's no expected date for UnixFSv2 to arrive. In the meantime metadata in v1 is shipping imminently and we'll move to v1 CIDs in the near future, both of which will change CIDs (mostly, metadata is opt-in).

We're switching to V1 by default? We're planning on outputting V1 base32 CIDs by default but still using V0 CIDs internally, unless I'm missing a discussion somewhere. That is, we'd convert the V0 CID to a V1 CID on the fly. That won't change the underlying hashes and won't break reduplication in any way.

As currently coded both go-IPFS and js-IPFS will switch to raw leaf nodes in UnixFS with v1 CIDs which will change every CID in every UnixFS DAG. The change proposed here will only affect root and intermediate nodes, not to mention make the code simpler and the data more reliable.

Go-ipfs, at least, uses raw leaves when --cid-version=1 is passed. It does not use raw leaves when you just upgrade V0 CIDs to V1 CIDs when using base32.

In the absence of a better solution, it has to be ok to make backwards compatible changes.

That's the issue. Saying "don't rely on this but we still require that it be generated" is backwards compatible. Deprecating it entirely is not. It may be marked as an optional field in the protobuf schema but it's defacto required because all tools expect it to be there.

achingbrain commented 4 years ago

Maybe we can solve this in a different way. If we refactor js-ipfs-unixfs-importer to not use js-ipld and do the serialization/blockstore interactions directly (something we've talked about elsewhere), we can avoid the double-serialization problem and so it becomes cheap to calculate Tsize. Sort of like it used to be pre-IPLD.

achingbrain commented 4 years ago

It doesn't help that Tsize is unreliable and should not be used but it at least means outdated IPFS clients & tools will behave in the same way.

mikeal commented 4 years ago

Maybe we can solve this in a different way. If we refactor js-ipfs-unixfs-importer to not use js-ipld and do the serialization/blockstore interactions directly (something we've talked about elsewhere), we can avoid the double-serialization problem and so it becomes cheap to calculate Tsize. Sort of like it used to be pre-IPLD.

If you use the new @ipld/block API there’s internal code to cache the serializations as well, so you can be assured you won’t suffer extra serializations. I’m using it here on top of this importer for that exact reason https://github.com/mikeal/unixfsv1-part-importer

warpfork commented 4 years ago

Sync meeting cliff notes:

warpfork commented 4 years ago

Also @rvagg mentioned this other issue earlier, but I'd like to re-mention it: https://github.com/ipld/specs/issues/233#issuecomment-581922110

If we write a bunch of docs about "interesting" (ahem) properties of this field, at least some of it is very likely to be general to "indicators" as a pattern.

ribasushi commented 4 years ago

@Stebalien I will necropost here a bit: higher up you said:

We're switching to V1 by default? We're planning on outputting V1 base32 CIDs by default but still using V0 CIDs internally, unless I'm missing a discussion somewhere. That is, we'd convert the V0 CID to a V1 CID on the fly. That won't change the underlying hashes and won't break reduplication in any way.

This does not seem to be the case at present:

admin:~/devel/go-ipfs$ dd if=/dev/zero bs=$((1024*1024)) count=64 2>/dev/null | cmd/ipfs/ipfs add -nq
QmfKyrN38oa8CrSYirAYZq4X4Hvq73JtVkgSgeVzHCkAuT
admin:~/devel/go-ipfs$ dd if=/dev/zero bs=$((1024*1024)) count=64 2>/dev/null | cmd/ipfs/ipfs add -nq --cid-version=1 --raw-leaves=0
bafybeifug6aqizmepe6fa4k7onpb5kv2jlp52hmeg7t4aoicwpws3o52yq
admin:~/devel/go-ipfs$ cmd/ipfs/ipfs cid base32 QmfKyrN38oa8CrSYirAYZq4X4Hvq73JtVkgSgeVzHCkAuT
bafybeih4nk64ikaynttdewuiryc4t3h466fhhqe74qbtqrfpcl3xivxbzy

I am also not observing any use of v0 when working on convergence with go-ipfs in dagger. Am I missing something?

Stebalien commented 4 years ago

ipfs --upgrade-cidv0-in-output add .... That will output the same hash as the third command.

ribasushi commented 4 years ago

Ahhhh.... Visible only under ipfs --help rather than ipfs add --help is why I missed it 🤦‍♂