gochain / gofs

GoFS - GoChain FileSystem tools and documentation.
https://gofs.io

Recursive objects #5

Open treeder opened 5 years ago

treeder commented 5 years ago

I think we should just handle recursive objects automatically for pinning and related operations. If the files behind the links aren't there, then the objects with the links are useless.

So when someone pins a file with a payment to bump the expiry date, it should apply to all the files. They could choose to only pin a particular file from the set if they'd like, but if the object they pin has "Links", it should update those expiries too. The cost should be based on the cost of all the files, and luckily the link objects carry the size data, so we don't need to download the files until payment has been made.

```
$ ipfs object get QmZ41tFbwURWWnjg5FSibe2DWdRZm4L3YAgvu812NJAZvW
{
  "Links": [
    {"Name": "Goodbye.abi", "Hash": "Qme4KFmv8xMqTmGsLfigCh83tVwzPevASfdVqgjoEncrDb", "Size": 663},
    {"Name": "Goodbye.bin", "Hash": "QmWjzUYpWNH4YWwbCBgnu9Vb1UETU4rjfuaF8zvqhfjmkr", "Size": 2221},
    {"Name": "Hello.abi", "Hash": "Qme4KFmv8xMqTmGsLfigCh83tVwzPevASfdVqgjoEncrDb", "Size": 663},
    {"Name": "Hello.bin", "Hash": "QmPPVVzTtspqNUBPhR2oBwuoBKo7u7fkqUsPUmG1VWdPFx", "Size": 2221},
    {"Name": "goodbye.sol", "Hash": "Qmcfrv6381EsgivKRE9SPdbb9cjUbRAMiy7tdJFRa46Cg4", "Size": 362},
    {"Name": "hello.sol", "Hash": "Qmdh2XjJPcP9NMoHDgzjdz6gcFGD53CUEB6ec6Hnqw2JLa", "Size": 359}
  ],
  "Data": "\u0008\u0001"
}
```

That example doesn't have any further recursion, but we could follow the links throughout the tree, adding up all the sizes and using that total as the size for the cost/expiry calculation. If the user pays, then we go through and "add" the specified time to each expiry (rather than explicitly setting the time), since someone may have paid more for a particular file in this set.

jmank88 commented 5 years ago

IMO this should be an explicit flag. Pinning only a directory could make sense in some cases, e.g. if the contents are already pinned.

Even when recursion is explicitly requested, some contents already being pinned complicates things. Do we still split the payment to advance all the expirations the same length? Or would it make more sense to advance some expirations more or less than others? E.g. if one large file in the set is already pinned for a year, and the user pays for a month, then they may not want to put any funds towards the large pre-pinned file at all, and only want to pay the minimum required to ensure each file is pinned at least through one month from now. (This is one reason why I think it is best to keep this complexity on the client side, so the on-chain contract stays simple and straightforward, with a single CID per tx.)

Maybe it should be possible (for regular and recursive adds) to specify the date you want to advance the expiration to, rather than the raw duration to add. That may be more intuitive for users, and then the CLI can sort out how many GBHs to purchase per file.

and luckily the link objects have the size data on there, so we don't need to download the files until payment has been made.

From my testing this did not actually hold true. For example, I believe that files added with --nocopy <url> were reporting their stored size (just the URL itself), not the content size, so I had to rework the server to fetch the data first. I could be wrong, but regardless, there is also a potential attack vector here, since we would be depending on anonymous nodes to tell the truth about file sizes.

treeder commented 5 years ago

With regards to being explicit about recursion: if I pin a directory object like the one I posted, I would expect my entire directory to be stored equally. It's 100% useless to pin an object like the one above without the files it links to.

As for how to split the funds, I think the most obvious, and probably the only expected, way would be to advance every expiry equally. For instance, say I upload my new website or app and want to pay for it to be stored for the next month: I pin that root hash with the payment. I can't see any other way someone would expect that to work.

the on-chain contract stays simple and straightforward, with a single CID per tx

I don't think the contract has to change at all. I'd still just give it a hash, but if that hash is an object with links in it, then we recurse through it, not unlike what ipfs does when dealing with an object like that: for instance, ipfs get QmZ41tFbwURWWnjg5FSibe2DWdRZm4L3YAgvu812NJAZvW will fetch all the objects recursively.

Maybe it should be possible (for regular and recursive adds) to specify the date you want to advance the expiration to, rather than the raw duration to add. That may be more intuitive for the users, and then the CLI can sort out how many GBHs to purchase per file.

I was kind of expecting the cost command to do something like this: give it a hash and it tells me how much a month or a year will cost, so I know how much to send to the contract and don't have to figure it out myself.

Regarding the attack vector, I'm sure that's an acceptable risk, unless we believe someone would go through the trouble of gaining control of the IPFS network to mess with file sizes just to trick us. And if they went through that effort, they deserve some free storage for all the work they put into the attack.

jmank88 commented 5 years ago

It's 100% useless to pin an object like I posted above without the files it links to.

Why? I can think of plenty of use cases for pinning a list of references to other files that a user doesn't want to pay to pin. They could e.g. be publishing an index of other content that is too large for them to want to pay to pin, or that is already pinned. Maybe having the recursive option default to true makes sense, but I don't think it makes sense to choose to be inflexible and diverge from the existing IPFS model by making it implicit and fixed.

I don't think the contract has to change any

The interface may not have to change in this scenario, but the data's meaning fundamentally changes, and e.g. you lose the ability to filter logs for a CID which was implicitly recursively pinned.

Give it a hash and tell me how much it will cost for a month or a year, then I'd know how much to send to the contract.

That is what it does currently. I am proposing that we add the option to instead specify a future date, and have it calculate the difference internally, based on the current status. This is pretty trivial for single files, but would be very powerful for recursive calls.

jmank88 commented 5 years ago

Regarding the attack vector, I'm sure that's an acceptable risk, unless we believe someone would go through the trouble of gaining control of the IPFS network to mess with file sizes just to trick us. And if they went through that effort, they deserve some free storage for all the work they put into the attack.

It would actually be pretty easy to do. We'd know right away, but it would be messy to correct the database. But if we can't get the correct size in the first place, then it doesn't matter (I'll try to figure this out). If we can get the correct size, then I still think it's cleaner and clearer to do the recursion client side and issue contract calls for each file; then there is no need to trust anyone anyway.

treeder commented 5 years ago

Wouldn't that be incredibly slow and expensive in gas fees?

treeder commented 5 years ago

And how would the contract get all the information it would need?

jmank88 commented 5 years ago

Wouldn't that be incredibly slow and expensive in gas fees?

It would be one tx per CID, but they are very cheap txs in the first place (no state involved, <30k gas on test contract: example), and the user would have explicit receipts for each CID and how much storage corresponds to each one. Speed wouldn't be a factor until directories are large enough to fill whole blocks.

And how would the contract get all the information it would need?

It doesn't need any information. It would operate exactly as it does now, as would the backend process. The CLI would just send multiple txs. The tricky part is how the CLI recurses, since it needs to talk to an IPFS endpoint. That could be a local daemon, our cluster, a public gateway, a custom service in front of our cluster (<- seems simplest/safest), etc.