Closed · achingbrain closed this issue 4 years ago
This will also improve ipfs.add performance in js-IPFS by resolving https://github.com/ipld/js-ipld-dag-pb/issues/154
I agree, we should absolutely do this. New library code already omits this property. Therefore I think this is, as much as anything else, an ask for our docs and specs to be much clearer about it...
The Tsize property of dag links is unreliable and expensive to calculate.
It's unreliable because we never validated it. However, it's not expensive to calculate. It's there because it's not expensive to calculate assuming the linked nodes calculate it correctly (while it would be expensive to calculate the size of an entire tree).
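To illustrate that distinction, here is a toy sketch (hypothetical block-store shape, not any real dag-pb API): each block maps to its encoded bytes plus links of the form (child CID, claimed Tsize). Filling in Tsize at encode time only needs the claims already on the children's links; verifying them needs the whole tree.

```python
# Toy sketch: why Tsize is cheap to record but expensive to verify.

def tsize(blocks, cid):
    """Recompute the cumulative size by walking the whole DAG (expensive)."""
    data, links = blocks[cid]
    return len(data) + sum(tsize(blocks, child) for child, _claimed in links)

def link_tsize(blocks, cid):
    """What an encoder records: own size plus the *claimed* child Tsizes (cheap)."""
    data, links = blocks[cid]
    return len(data) + sum(claimed for _child, claimed in links)
```

The cheap path trusts whatever the children claimed, which is exactly why the value is unreliable without a full walk.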
related #233
it will break existing IPFS implementations
... but ...
encourage implementations to ignore it
So does that mean we could force a full removal of it at some point in the future and existing IPFS implementations could do without it? Or is it baked into implementations in a way that's impossible to unwind? I'm wondering how useful a deprecation is; if we can't actually get rid of it then maybe it's pointless and we should push harder on UnixFSv2 as a fix to this and other dag-pb concerns?
It's unreliable because we never validated it
The problem is that you can't validate it until you have the whole DAG, at which point you have the whole DAG so you didn't need to validate it because you can get the size directly from the DAG.
it's not expensive to calculate
The way the IPLD APIs have gone, you pass in a data object and get a CID back. You don't get the serialized size back, so the only way to know is to serialize it first, store the size, then pass it to IPLD which serializes it again and passes the resulting buffer to the block store. https://github.com/ipld/js-ipld-dag-pb/issues/154
This is our only semi-reliable source of progress information
Why would you not decode the UnixFS entry from the root node and get the file size from that?
I guess you could be requesting random chunks of a DAG but I don't think that's something we do often? Certainly not with the ipfs.cat, ipfs.ls, ipfs.files.* type operations.
is it baked into implementations in a way that's impossible to unwind
Sort of. The Tsize property is the last property of the PBLink message but it's optional, so if we remove it, it should result in the same bytes as if you didn't specify it. Old implementations won't crash, but default values in protobufs mean it'll be read as 0.
If, however, we later decide to add a new field to the PBLink message, old implementations will read that field as the Tsize and at that point all bets are off.
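For reference, the proto2 schema in question, essentially as it appears in the dag-pb spec and go-merkledag; Tsize is the third and last field of PBLink and is marked optional:

```protobuf
// dag-pb wire format (proto2 syntax)
message PBLink {
  // multihash of the target object
  optional bytes Hash = 1;
  // UTF-8 string name
  optional string Name = 2;
  // cumulative size of the target object
  optional uint64 Tsize = 3;
}

message PBNode {
  // refs to other objects
  repeated PBLink Links = 2;
  // opaque user data
  optional bytes Data = 1;
}
```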
we should push harder on unixfsv2 as a fix to this and other dag-pb concerns
UnixFSv1 will still be around for a long time and with no arrival date for UnixFSv2 on the horizon, I'd rather fix up what we have if possible.
This is our only semi-reliable source of progress information
Why would you not decode the UnixFS entry from the root node and get the file size from that?
This was the same question I had. If you have the file size from the root, your progress is based on the bytes received of the underlying data; you don't need sizes from the links. Is there a need for progress information on every individual block relative to the expected block size?
I'm going to table this discussion. We need to keep generating these values as they are a part of dag-pb. Removing support would have unknown repercussions and is not worth the hassle.
The answer is not to try to fix dag-pb, the answer is to push on UnixFSv2 so we can move on. Unless there is a pressing reason to touch link sizes, this just creates more work.
The pressing reasons are elaborated above and on https://github.com/ipfs/specs/issues/238, indeed there seems to be a consensus here in favour so it's a shame this issue was closed.
We need to keep generating these values as they are a part of dag-pb.
Do we though? The only reason I've seen is for progress information and that is achievable in a way that's more useful for the user by using the file size from the unixfs metadata in the DAG root (e.g. when I'm downloading a file I care as much about including protocol buffer/UnixFS overhead in the reported size as I do TCP packet headers).
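A sketch of that alternative (hypothetical names, not an existing API): progress is reported against the file size taken from the root's UnixFS metadata, counting only leaf payload bytes actually received, with per-link Tsize ignored entirely.

```python
# Hypothetical progress calculation based on the UnixFS file size from the
# DAG root rather than per-link Tsize values. Only payload bytes count, so
# protobuf/UnixFS framing overhead is excluded from the reported progress.

def progress(total_file_size, received_leaf_payloads):
    """Return the fraction of the file fetched, in [0.0, 1.0]."""
    if total_file_size <= 0:
        return 1.0  # empty file: nothing left to fetch
    got = sum(len(p) for p in received_leaf_payloads)
    return min(got / total_file_size, 1.0)
```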
The schema allows for the Tsize field to be omitted from the serialized protocol buffer so it seems neatly backwards compatible.
The answer is not to try to fix dag-pb, the answer is to push on UnixFSv2 so we can move on
That's been the answer for rather a while but we're no closer to shipping anything and one of Juan's slides at lab week had UnixFSv2 on the 'maybe we don't concentrate on this in 2020' list, plus even if we ship v2 we're still going to have to support UnixFSv1 for the foreseeable future so we should really fix up what we have.
The pressing reasons are elaborated above and on ipfs/specs#238, indeed there seems to be a consensus here in favour so it's a shame this issue was closed.
None of those are pressing reasons. That is, we have no sudden/new motivation to remove this and keeping it isn't going to make our lives massively simpler.
What I don't want to do is to start yet another major refactor just because we think we could simplify something. We have enough on our plates at the moment. Auditing all of go/js-ipfs for uses of link sizes and replacing them with something else is not a small undertaking.
That's been the answer for rather a while but we're no closer to shipping anything and one of Juan's slides at lab week had UnixFSv2 on the 'maybe we don't concentrate on this in 2020' list, plus even if we ship v2 we're still going to have to support UnixFSv1 for the foreseeable future so we should really fix up what we have.
And we're not going to get any closer if we keep starting new refactors.
I don't think anyone or any of the content in this diff is actually arguing for new refactors...?
The diff even explicitly says:
It is left in the protobuf definition in order to maintain backwards compatibility.
We do not have these size properties in links in the Data Model specifications that are the cornerstone of all work we've been doing on IPLD since 2018, and these link sizes were something that was very explicitly pointed out to me as "kill this with fire" by @whyrusleeping before I even got going on IPLD. It seems to me having this stated in more places throughout the docs and specs is strictly better and more honest.
And again, is not a refactor and is not adding new code work.
It's possible that we could iterate on more clear ways to phrase our stance and future intentions around this property, but even if this PR is closed (and I'm not sure I agree with that), I remain in support of something being said about this. Our current docs make too many uncritical references to these size properties and it creates ongoing confusion that I tire of rectifying in one conversation after another without good reflections in the docs and specs.
It is left in the protobuf definition in order to maintain backwards compatibility.
That's not the issue. The issue is that we (at least in go) have a bunch of code that expects sizes to be reasonable. While that code shouldn't completely break if the sizes aren't there/defined, I'm not entirely sure what will happen. However:
Basically, this field has been there since the very beginning and has actually been pretty damn reliable.
Saying "don't rely on this but you MUST still generate it" is fine. That may ease the transition to UnixFSv2 where no such information will be available. However, the "MUST still generate it" is the critical part and, as far as I can tell, goes against the main motivation for this PR.
We do not have these size properties in links in the Data Model specifications that are the cornerstone of all work we've been doing on IPLD since 2018, and these link sizes were something that was very explicitly pointed out to me as "kill this with fire" by @whyrusleeping before I even got going on IPLD. It seems to me having this stated in more places throughout the docs and specs is strictly better and more honest.
This has nothing to do with the data model. Sizes aren't a property of CIDs, they're a property of DagPB Links. The model for DagPB is:
{
  "Data": ...,
  "Links": [
    {
      "Hash": ...,
      "Name": ...,
      "Size": ...,
    },
  ]
}
Note: The fact that go-ipld nodes return "Links" with sizes, regardless of the format, is a bug. We absolutely shouldn't do that and we should fix that.
Alright, I'll agree that this diff didn't have enough explanation of what the deprecation would mean in practice. I do still want to remark that I appreciated the effort, though, because I think we can all agree something about these properties causes confusion in newcomers to the ecosystem on a fairly regular basis. And I think some remarks about the limitations of the link properties could be good to include in this document, even if this diff ain't it.
One constructive takeaway I'm making from this is we need much clearer language around Data Model standard semantics vs things a Codec does, and to put a lot more effort into using that language consistently. (We probably already knew that, but this is another scenario that underscores it even further.) It wouldn't fix everything, but it might help: some documentation which notes the DagPB-specialness of this property could be less controversial than "deprecation", perhaps, and also provide useful clarity about where it fits (and also doesn't!) in the bigger picture. That in turn could introduce a heading in the doc that's an appropriate place to talk about other drawbacks in a factual and useful manner. And it goes without saying I'd be happy to review more PRs which try to nail that balance. :)
This was closed prematurely given our governance policy https://github.com/ipld/specs#governance as there does not appear to be a consensus either way on the issue. Re-opening.
I’m inclined to agree with @Stebalien on this one. There has to be some point at which we stop taking improvements to UnixFSv1 that change the hash of the encoded data.
I see and agree with the concerns regarding encoding information we can’t trust. It’s a security hazard that every implementation can trip over. But I don’t believe we’re encouraging additional implementations of UnixFSv1 anymore so I don’t see why we’d want to take another change to UnixFSv1 that alters the final file hashes.
If this is just informational (this property exists, you should encode it, but ignore it entirely on read) then let’s get the PR changed to this effect and I’m +1 on that change.
I think this is a question of pragmatism. There's no expected date for UnixFSv2 to arrive. In the meantime metadata in v1 is shipping imminently and we'll move to v1 CIDs in the near future, both of which will change CIDs (mostly, metadata is opt-in).
As currently coded both go-IPFS and js-IPFS will switch to raw leaf nodes in UnixFS with v1 CIDs, which will change every CID in every UnixFS DAG. The change proposed here will only affect root and intermediate nodes, not to mention make the code simpler and the data more reliable.
In the absence of a better solution, it has to be ok to make backwards compatible changes.
I think this is a question of pragmatism. There's no expected date for UnixFSv2 to arrive. In the meantime metadata in v1 is shipping imminently and we'll move to v1 CIDs in the near future, both of which will change CIDs (mostly, metadata is opt-in).
We're switching to V1 by default? We're planning on outputting V1 base32 CIDs by default but still using V0 CIDs internally, unless I'm missing a discussion somewhere. That is, we'd convert the V0 CID to a V1 CID on the fly. That won't change the underlying hashes and won't break reduplication in any way.
As currently coded both go-IPFS and js-IPFS will switch to raw leaf nodes in UnixFS with v1 CIDs, which will change every CID in every UnixFS DAG. The change proposed here will only affect root and intermediate nodes, not to mention make the code simpler and the data more reliable.
Go-ipfs, at least, uses raw leaves when --cid-version=1 is passed. It does not use raw leaves when you just upgrade V0 CIDs to V1 CIDs when using base32.
In the absence of a better solution, it has to be ok to make backwards compatible changes.
That's the issue. Saying "don't rely on this but we still require that it be generated" is backwards compatible. Deprecating it entirely is not. It may be marked as an optional field in the protobuf schema but it's de facto required because all tools expect it to be there.
Maybe we can solve this in a different way. If we refactor js-ipfs-unixfs-importer to not use js-ipld and do the serialization/blockstore interactions directly (something we've talked about elsewhere), we can avoid the double-serialization problem and so it becomes cheap to calculate Tsize. Sort of like it used to be pre-IPLD.
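A sketch of the shape that refactor could take (illustrative only; a sha256 hex digest stands in for real CID construction): the importer serializes each node exactly once and reuses those bytes for both storage and size bookkeeping.

```python
import hashlib

def put_node(blocks, encode, node):
    """Serialize once, store the bytes, and return both the CID stand-in
    and the serialized size so the caller can record Tsize for free."""
    data = encode(node)                      # single serialization
    cid = hashlib.sha256(data).hexdigest()   # stand-in for real CID creation
    blocks[cid] = data                       # same bytes go to the block store
    return cid, len(data)
```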
It doesn't help that Tsize is unreliable and should not be used, but it at least means outdated IPFS clients & tools will behave in the same way.
Maybe we can solve this in a different way. If we refactor js-ipfs-unixfs-importer to not use js-ipld and do the serialization/blockstore interactions directly (something we've talked about elsewhere), we can avoid the double-serialization problem and so it becomes cheap to calculate Tsize. Sort of like it used to be pre-IPLD.
If you use the new @ipld/block API there's internal code to cache the serializations as well, so you can be assured you won't suffer extra serializations. I'm using it here on top of this importer for that exact reason: https://github.com/mikeal/unixfsv1-part-importer
broad agreement to document these issues and warnings around the reliability of this number
we should keep generating the value as we do now, until a more concrete plan is formed
new code, however, should start simply ignoring this value when reading it
tangent: checking up on where tsize gets used now:
additional discussion around js APIs (I didn't follow closely; someone else might want to make remarks on this)
it may be worth documenting these issues not just in the specs, but in more user-facing docs as well (we'd like to decrease the odds of folks throughout the community building new things that rely on this not-very-reliable field)
some conversation about longer-term migration strategies between UnixFSv1 and UnixFSv2
Also @rvagg mentioned this other issue earlier, but I'd like to re-mention it: https://github.com/ipld/specs/issues/233#issuecomment-581922110
If we write a bunch of docs about "interesting" (ahem) properties of this field, at least some of it is very likely to be general to "indicators" as a pattern.
@Stebalien I will necropost here a bit: higher up you said:
We're switching to V1 by default? We're planning on outputting V1 base32 CIDs by default but still using V0 CIDs internally, unless I'm missing a discussion somewhere. That is, we'd convert the V0 CID to a V1 CID on the fly. That won't change the underlying hashes and won't break reduplication in any way.
This does not seem to be the case at present:
admin:~/devel/go-ipfs$ dd if=/dev/zero bs=$((1024*1024)) count=64 2>/dev/null | cmd/ipfs/ipfs add -nq
QmfKyrN38oa8CrSYirAYZq4X4Hvq73JtVkgSgeVzHCkAuT
admin:~/devel/go-ipfs$ dd if=/dev/zero bs=$((1024*1024)) count=64 2>/dev/null | cmd/ipfs/ipfs add -nq --cid-version=1 --raw-leaves=0
bafybeifug6aqizmepe6fa4k7onpb5kv2jlp52hmeg7t4aoicwpws3o52yq
admin:~/devel/go-ipfs$ cmd/ipfs/ipfs cid base32 QmfKyrN38oa8CrSYirAYZq4X4Hvq73JtVkgSgeVzHCkAuT
bafybeih4nk64ikaynttdewuiryc4t3h466fhhqe74qbtqrfpcl3xivxbzy
I am also not observing any use of v0 when working on convergence with go-ipfs in dagger. Am I missing something?
ipfs --upgrade-cidv0-in-output add .... That will output the same hash as the third command.
Ahhhh....
It's visible only under ipfs --help rather than ipfs add --help, which is why I missed it 🤦♂️
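The on-the-fly conversion discussed above is purely an envelope change: the sha2-256 multihash inside the CID is untouched, so the underlying blocks and deduplication are unaffected. A hand-rolled sketch for illustration (real code would use a CID library):

```python
import base64

B58_ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"

def cidv0_to_v1_base32(cid_v0):
    """Re-wrap a CIDv0 (base58btc multihash) as a CIDv1 dag-pb CID in base32."""
    # base58btc-decode the v0 CID to recover the raw multihash bytes
    n = 0
    for ch in cid_v0:
        n = n * 58 + B58_ALPHABET.index(ch)
    multihash = n.to_bytes((n.bit_length() + 7) // 8, "big")
    # CIDv1 = <version 0x01><dag-pb codec 0x70><multihash>, multibase prefix 'b'
    raw = bytes([0x01, 0x70]) + multihash
    return "b" + base64.b32encode(raw).decode().lower().rstrip("=")
```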
The Tsize property of dag links is unreliable and expensive to calculate. It also does not reflect the size of the file that the dag represents, so it is of limited use to users.
We cannot remove it as it will break existing IPFS implementations, but the field is optional so we can deprecate it and encourage implementations to ignore it.
Follows on from https://github.com/ipfs/specs/issues/238