Does it know how not to choke itself? (i.e. trying to batch 10GB at the same time?)
This is great but I can't help but think it doesn't belong at this layer in the stack, better in ipld or the blockstore itself?
> Does it know how not to choke itself? (i.e. trying to batch 10GB at the same time?)
It should never batch large amounts of data unless the user changes the chunker, and even then it would be the same as without batching. The default interval in this PR is 50ms, so it will accumulate small blocks (according to the chunker) for 50ms and batch put them in one single transaction. The timer for batching also adjusts itself so it doesn't choke even if a single transaction takes longer than the interval.
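To make the mechanism concrete, here is a minimal sketch of that accumulate-and-flush loop. Everything here is illustrative: the class and the blockstore's `putMany` are assumptions, not this PR's actual code.

```js
// Minimal sketch of the batching described above. `BatchQueue` and the
// blockstore's `putMany` are assumptions for illustration only.
class BatchQueue {
  constructor (blockstore, batchInterval = 50) {
    this.blockstore = blockstore
    this.batchInterval = batchInterval
    this.queue = []
    this.timer = null
  }

  put (block) {
    return new Promise((resolve, reject) => {
      this.queue.push({ block, resolve, reject })

      if (!this.timer) {
        // only one timer at a time: puts arriving mid-flush wait for the
        // flush to finish before a new interval is scheduled
        this.timer = setTimeout(() => this._flush(), this.batchInterval)
      }
    })
  }

  async _flush () {
    const batch = this.queue
    this.queue = []

    try {
      // everything accumulated during the interval goes in one transaction
      await this.blockstore.putMany(batch.map(({ block }) => block))
      batch.forEach(({ resolve }) => resolve())
    } catch (err) {
      batch.forEach(({ reject }) => reject(err))
    } finally {
      // rearm only after the write finished, so a transaction slower than
      // the interval delays the next batch instead of overlapping it
      this.timer = this.queue.length
        ? setTimeout(() => this._flush(), this.batchInterval)
        : null
    }
  }
}
```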
> This is great but I can't help but think it doesn't belong at this layer in the stack, better in ipld or the blockstore itself?
I know what you mean and agree. I tried to make it like that, but with the current code this is the only place we can do it without adding complexity.

If we extract the blockstore to unixfs it would be much easier.
> If we extract the blockstore to unixfs it would be much easier.
After a quick look it seems that js-ipld is only used for `get()` and `put()`. Those two operations could easily be done without js-ipld, directly on the BlockService. I would actually prefer that; I don't like the js-ipld changes that much, as they leak abstractions. To quote from the js-ipld README:
> The IPLD API works strictly with CIDs and deserialized IPLD Nodes. Interacting with the binary data happens on lower levels.
Adding `putBatch()` to js-ipld would break that abstraction, as we would directly put binary data.
Also I don't see much benefit in using js-ipld for unixfs. One of the points of js-ipld is that it is codec agnostic: it can work with any provided codec and will do the right thing. In unixfs we know exactly which codec it is, and it is always only DagPB. So I would be in favour of using the BlockService directly.
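As a rough sketch of that suggestion, writing a DagPB node through the BlockService without js-ipld could look something like this (assuming the `ipld-dag-pb` and `ipfs-block` APIs of this era; this is not code from the PR):

```js
// Sketch: bypass js-ipld and persist a DagPB node via the BlockService.
const dagPB = require('ipld-dag-pb')
const Block = require('ipfs-block')

async function putNode (blockService, node) {
  // unixfs always knows the codec (DagPB), so serialize directly
  const serialized = dagPB.util.serialize(node)
  const cid = await dagPB.util.cid(serialized)

  await blockService.put(new Block(serialized, cid))

  return cid
}
```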
I think it's important to have boundaries and defined responsibilities in a system. If we have everything depend on everything else it becomes a mess.
I would like this module to be responsible for creating dags from streams of buffers, I would like other modules to be responsible for serializing and persisting nodes and they are then free to apply whatever storage optimisations they see fit, including batching writes.
> I would like this module to be responsible for creating dags from streams of buffers, I would like other modules to be responsible for serializing and persisting nodes and they are then free to apply whatever storage optimisations they see fit, including batching writes.
I understand and agree, but it's called unixfs; it's hard to have boundaries with the current situation.

Can we decouple ipld from blockservice, and in unixfs serialize manually without ipld and use the blockservice directly to persist? I'll port this to the blockservice. What do you think?
https://github.com/ipfs/js-ipfs-unixfs-importer/pull/38#issuecomment-540638015
> After a quick look it seems that js-ipld is only used for `get()` and `put()`. Those two operations could easily be done without js-ipld, directly on the BlockService.
@vmx IPLD should be making things like Unixfs importers & exporters easier to do, not telling people not to use it.

There are multiple operations here that Unixfs has to do manually (fetching node by node, checking which nodes it points to) that can be done by js-ipld, with its knowledge of the graph and its understanding of GraphSync.
> can we decouple ipld from blockservice,
That won't be easy. That's planned for the next generation js-ipld, which won't be a thing soon. Is that actually needed? Why not just passing the BlockService into unixfs instead of js-ipld?
> > can we decouple ipld from blockservice,
>
> That won't be easy. That's planned for the next generation js-ipld, which won't be a thing soon. Is that actually needed? Why not just passing the BlockService into unixfs instead of js-ipld?
That's what I meant ^^
> @vmx IPLD should be making things like Unixfs importers & exporters easier to do, not telling people not to use it.
I would agree for > UnixFSv1. But as UnixFSv1 is now optimized for perf, it just doesn't make sense to use js-ipld, as DagPB isn't even a format that supports the full IPLD Data Model.
> There are multiple operations here that Unixfs has to do manually (fetching node by node, checking which nodes it points to) that can be done by js-ipld, with its knowledge of the graph and its understanding of GraphSync.
Those things used to be considered part of IPLD, but that's not the case anymore. IPLD has a different (smaller?) scope these days. Those things would now be on a different layer (which then IPLD will probably use).
The benefit of a performance improvement of this magnitude will impact adoption in significant ways. Thanks for digging into it!
@vmx and @achingbrain should I go forward with this https://github.com/ipfs/js-ipfs-unixfs-importer/pull/38#issuecomment-540645759 ?
> should I go forward with this #38 (comment) ?
Do you mean:
> I'll port this to the blockservice
If so, yes please.
> The benefit of a performance improvement of this magnitude will impact adoption
It's true, but we want the performance improvement to remain even after this module is retired for `unixfsv2-importer` or whatever, which is why it shouldn't live here.
The responsibilities are:

- `unixfs-importer`: turns a stream of buffers into nodes in a dag of some shape
- `ipld`: turns the dag nodes into blocks
- `block-service`: stores blocks

We should not be optimising how we store blocks at a high level. We should pass a stream of blocks to ipld, which should pass a stream of blocks to the block service, which should pull them into batches and write as appropriate, if that's the fastest way to write them to the available storage.
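A minimal sketch of that layering using async iterables. All names are illustrative: `serializeToBlock` is an assumed ipld-level helper and `putMany` an assumed block-service method, not existing APIs in this repo.

```js
// Illustrative layering: the importer yields dag nodes, an ipld-level step
// turns them into blocks, and only the block service decides how to batch.
async function * toBlocks (nodes) {
  for await (const node of nodes) {
    yield await serializeToBlock(node) // assumed ipld-level helper
  }
}

async function store (blocks, blockService, batchSize = 100) {
  let batch = []

  for await (const block of blocks) {
    batch.push(block)

    if (batch.length >= batchSize) {
      await blockService.putMany(batch) // batching lives at this layer only
      batch = []
    }
  }

  if (batch.length) {
    await blockService.putMany(batch)
  }
}
```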
An implication of this change is that importing files is no longer atomic, because we start returning CIDs before they've been written to the block store.
The time window is small but applications will be able to receive CIDs from blocks that they've just added, request the block with the CID and find that it doesn't exist. If the application crashes or otherwise can't write the blocks (disk full, solar flare, I don't know), future users of the block store may not find the blocks they expect to be there.
> An implication of this change is that importing files is no longer atomic, because we start returning CIDs before they've been written to the block store.
🔦 Great thing to highlight! Before merging this PR, there must be a test with a faulty repo that fails some of the writes, and then the test compares the CIDs that got returned with the CIDs that are in the repo. All of them should be there.
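The test could take roughly this shape. `createFaultyRepo`, `createIpld` and `files` are hypothetical helpers/fixtures, and the `repo.blocks.has()` call assumes the `ipfs-repo` API of the time:

```js
// Sketch of the test vmx describes; assumes mocha + chai (dirty-chai style).
it('only yields CIDs for blocks that were actually persisted', async () => {
  const repo = createFaultyRepo({ failEveryNthWrite: 3 }) // hypothetical
  const ipld = await createIpld(repo) // hypothetical wiring
  const cids = []

  try {
    for await (const { cid } of importer(files, ipld)) {
      cids.push(cid)
    }
  } catch (err) {
    // a failed import is acceptable here; dangling CIDs are not
  }

  // every CID we handed out must exist in the repo
  for (const cid of cids) {
    expect(await repo.blocks.has(cid)).to.be.true()
  }
})
```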
A few quick notes on `js-unixfsv2` since it’s far enough along that the design is much clearer.
We put a lot of thought into how we can produce libraries that are useful both in IPFS as dependencies but also useful independently in other libraries or even in something like `ipfs-lite`.
There’s actually 2 libraries I’ll be producing.
**js-unixfsv2**

This library contains the core read and write interfaces. The read interfaces take a single function, `async get (cid) => Block`, which accepts a CID and returns a block. The write interfaces are just generators that produce new Block instances.
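For illustration, consuming those interfaces might look like this; `blockstore`, `fileStream` and the `unixfsv2.write`/`unixfsv2.read` calls are hypothetical, not the actual library API:

```js
// Hypothetical usage of the described interfaces.
const get = async cid => blockstore.get(cid) // reads need only this function

// the write interface is a generator producing Block instances;
// persisting them is entirely up to the caller
for await (const block of unixfsv2.write(fileStream)) {
  await blockstore.put(block)
}

const file = await unixfsv2.read(rootCid, get)
```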
**js-unixfsv2-mfs**
This library does the low level work of a mutable file system. It will not implement the full IPFS MFS API; it’ll only provide the necessary functionality for that to be built on top.
The reason I’m writing this as well is because it’s non-trivial to provide transactional integrity of the filesystem without being quite close to the data. It’s also very useful functionality to be able to use independent of IPFS.
This library provides an interface for reads and writes to a mutable filesystem. This library will accept a bulk write function and a block get function and all read/write operations will be behind a mutex that can provide transaction integrity of the reads and writes and will batch all writes by default. This way, anyone writing to it in any way will get batched write behavior by default, even when the writes are being done concurrently. This means that, as far as these recent perf changes making their way into v2, you just need to hand me a bulk write function for whatever underlying data store you want and everything will be optimally batched and we can throw proper errors when concurrent actors try to update the same file.
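A minimal sketch of that contract, with hypothetical names throughout (this is not `js-unixfsv2-mfs` code): all mutations queue behind a simple promise-chain mutex and flush through the injected bulk write function.

```js
// Sketch: injected get/bulkWrite functions, one mutex for all mutations.
class MutableFs {
  constructor ({ get, bulkWrite }) {
    this.get = get             // async (cid) => Block
    this.bulkWrite = bulkWrite // async (blocks) => void
    this.lock = Promise.resolve()
  }

  _withLock (fn) {
    const run = this.lock.then(fn)
    // keep the chain alive even if an operation fails
    this.lock = run.catch(() => {})
    return run
  }

  write (path, content) {
    return this._withLock(async () => {
      const blocks = buildNodesFor(path, content) // hypothetical helper
      await this.bulkWrite(blocks) // one batched, transactional flush
    })
  }
}
```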
Some other simple perf improvements I have planned for this: LRU block cache to speed up reads/writes in the same parts of the graph, and a bloom filter to speed up writes of the same data (much faster re-imports).
> An implication of this change is that importing files is no longer atomic, because we start returning CIDs before they've been written to the block store.
Is this also going to break back-pressure during large file imports? If we’re continuing to import data without flushing to disc then we’re going to eventually run out of memory once the disc is under enough load.
This is probably the wrong place to get into `unixfsv2` implementation discussions, but..
> There’s actually 2 libraries I’ll be producing.
It might minimise disruption to add `unixfsv2` resolvers to the existing exporter at `js-ipfs-unixfs-exporter/src/resolvers`. That way there'll be no extra work required to traverse through a graph that contains `unixfsv1` nodes, generic `dag-cbor` nodes and `unixfsv2` nodes. I don't think `unixfsv1` is going away any time soon.
> js-unixfsv2-mfs
Interesting stuff. We already have something to prevent re-writing existing blocks in `ipfs-repo`, and if @hugomrdias ports the changes here into the blockstore, we'll have batch writes at the block level, which may be sufficient or better performance-wise, though it's hard to say without measuring it.
The existing mfs implementation uses the same unixfs importer/exporter as `ipfs.add`. Will your new module do the same thing? I ask because go-IPFS has a different import implementation for `mfs` than for `ipfs.add`, which has led to inconsistencies between the two. This is something I'm very keen to avoid.
> I would agree for > UnixFSv1. But as UnixFSv1 is now optimized for perf, it just doesn't make sense to use js-ipld, as DagPB isn't even a format that supports the full IPLD Data Model.
@vmx this is the kind of pain point that any user of IPLD will hit, and the reason why our Ethereum friends had to create https://github.com/ipld/js-ipld-graph-builder: the perf of using js-ipld directly to add lots of dag nodes was terrible (which is a similar problem to what @hugomrdias is trying to fix; the only difference is that the graph-builder is designed for graphs that suffer a lot of changes, while unixfs is just dealing with big graphs).

graph-builder is what led to the consideration of https://github.com/ipld/js-ipld/issues/66
@daviddias I agree. The problem with the current js-ipld is that it didn't get the layers of abstractions quite right. That will be fixed with future iterations of the js ipld stack.
what is the plan on this PR?
@hugomrdias was going to make these changes in the blockstore instead so the optimisation benefits other codecs and can live on with UnixFSv2.
Also I don't think we got to the bottom of the conversation around whether we are ok with no longer having atomic writes. There's code in the filestore ensuring all writes are atomic; it'd all be redundant if this is how we want to do things going forward.
I'm going to close this because importing speed has massively increased with #41 and will increase further with ipfs/js-ipfs#2771 neither of which compromise atomicity of writes.
This PR adds support for batch puts of blocks into the datastore.
I duplicated the `buildFile` function with batch support to make it easier to benchmark; we can dedupe before merging this PR.
Two new options were added: `batch` (Boolean) and `batchInterval` (Number).

- `batch` is just a toggle to enable batching
- `batchInterval` is the time interval to accumulate blocks before putting them in the datastore

Batching improves performance and below you will see some numbers, but please run the benchmark files I added to the repo, test on your machines and provide feedback.
[Benchmark results for browser and node not preserved.]
Needs: https://github.com/ipld/js-ipld/pull/238