dirkmc opened 4 years ago
@Stebalien & @magik6k, this analysis came from spelunking through the source code. Please let me know if there's anything here that doesn't sound right. In particular, I'd like a second opinion on the (untested) theory that branches of the directory tree may be missed when writing the directories to the blockstore.
This is mostly correct.
> The IPLD Node root for a file is added to the blockstore twice
This should be slightly mitigated by the fact that we check if we already have a block before writing it (in the blockservice itself).
> The second recursion over the directory structure will therefore miss these branches and they will not be written to the blockstore
Not quite. We flush everything first. Unfortunately, this does mean that we end up storing a bunch of intermediate directories that we don't need.
For some context, we primarily use MFS to make it easy to pause for GC. However, honestly, that probably isn't worth it given the performance hit. I agree we should explore using unixfs directly.
> We flush everything first
Do you mean when we call `root.FlushMemFree()`? It calls `dir.Flush()`, but AFAICT that will only flush the root dir (not any child dirs).
It will recursively flush. `Flush` calls `dir.GetNode()`, which calls `dir.sync()`, which walks all the cached entries, calling `GetNode` on each. Finally, it calls `GetNode()` on the unixfs directory itself and adds that to the DAGService.
(unless I'm missing something)
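The recursion described above can be sketched with simplified stand-in types (the method names `Flush`/`GetNode` mirror go-mfs, but the types and bodies here are illustrative only):

```go
package main

import "fmt"

// Dir is a toy stand-in for an mfs directory with cached child entries.
type Dir struct {
	name     string
	children []*Dir
	flushed  bool
}

// GetNode serializes the directory, syncing cached children first:
// this mirrors dir.GetNode -> dir.sync in go-mfs, where sync walks
// all cached entries and calls GetNode on each.
func (d *Dir) GetNode() string {
	for _, c := range d.children {
		c.GetNode()
	}
	d.flushed = true // stands in for "added to the DAGService"
	return d.name
}

// Flush on the root therefore reaches every modified descendant.
func (d *Dir) Flush() { d.GetNode() }

func main() {
	fish := &Dir{name: "fish.txt"}
	ocean := &Dir{name: "ocean", children: []*Dir{fish}}
	animals := &Dir{name: "animals", children: []*Dir{ocean}}
	animals.Flush()
	fmt.Println(fish.flushed, ocean.flushed) // true true
}
```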
Ah yes you're right, thanks 👍
I'll remove that section from the original issue above.
@Stebalien given this analysis, my uneducated guess is that changing `ipfs add` to use the UnixFS API directly is going to give us a moderate performance improvement, maybe on the order of 10%-20%. Would your guess fall somewhere in that range?
We should measure how many datastore calls are made when adding stuff with different layouts.
My bet would be that for large directories / lots of directories, `Has` calls will create significant overhead.
Another note on using unixfs directly: it would make it easier to add small files in parallel. Each call to `addFile` could try to buffer, say, 100kb of data, and if an entire file fits in that amount, add it async (which should help with waiting on IO when adding lots of small files).
I was thinking along those lines - maybe a queue for incoming files and a queue for outgoing with backpressure in each
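A minimal sketch of that queue-plus-backpressure idea (all names hypothetical; a real worker would write chunks to the DAGService rather than just counting bytes):

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

const smallFileLimit = 100 * 1024 // buffer up to ~100kb per file

// addAll fans file sizes out to a fixed pool of workers and returns
// the total number of bytes "added". The bounded channel provides
// backpressure: the producer blocks once the workers fall behind.
func addAll(sizes []int, workers int) int64 {
	files := make(chan int, workers) // bounded queue = backpressure
	var added int64
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for size := range files {
				if size <= smallFileLimit {
					// Small file: it fit entirely in the buffer, so it
					// can be added asynchronously. A large file would
					// fall back to a sequential path (omitted here).
					atomic.AddInt64(&added, int64(size))
				}
			}
		}()
	}
	for _, s := range sizes {
		files <- s
	}
	close(files)
	wg.Wait()
	return added
}

func main() {
	sizes := make([]int, 1000)
	for i := range sizes {
		sizes[i] = 10 * 1024 // 10k files, as in the benchmarks below
	}
	fmt.Println(addAll(sizes, 8)) // 10240000
}
```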
> we primarily use MFS to make it easy to pause for GC
@Stebalien can we just add the current root for any in-progress `ipfs add` to `bestEffortRoots`?
Probably? We really do need some kind of internal in-memory pinning system (which shouldn't be too hard). The primary issue is that the importer would need to be GC aware so it can pause.
We can just do something similar to how `ipfs add` works now, right? I.e. check periodically whether GC has been requested, and if so, create a temporary fake tree and pin the root.
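That periodic check could be sketched roughly like this (all names are hypothetical; the real adder would pin the current root of its in-progress tree before letting GC proceed):

```go
package main

import "fmt"

// addAllFiles polls a channel between files; if GC has been requested,
// it "pins" the current root (the temporary fake tree) so the
// in-progress add survives, then lets GC run. Returns how many times
// it paused for GC.
func addAllFiles(files []string, gcRequested <-chan struct{}, pinRoot func()) int {
	pauses := 0
	for range files {
		select {
		case <-gcRequested:
			pinRoot() // protect in-progress work before GC runs
			pauses++
		default:
			// no GC pending, keep adding
		}
		// ... add the file ...
	}
	return pauses
}

func main() {
	gc := make(chan struct{}, 1)
	gc <- struct{}{} // GC requested once, mid-add
	n := addAllFiles([]string{"a", "b", "c"}, gc, func() {})
	fmt.Println(n) // 1
}
```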
Yeah, you're right.
If there is a significant `Has` performance hit, the bloom filter cache in blockstore will help a lot.
In most cases where we call `Has` while adding, the thing is already in the blockstore (likely in all cases), so the Bloom filter won't really help here, since we still need to check the underlying blockstore.
(Hack / PoC idea: check whether `Has` always returns true when adding. If that's the case, create a blockstore wrapper which replaces `Has` with a hardcoded `return true`. If that works, we can quickly measure the overhead of all the `Has` calls and see if we're fixing the right thing.)
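The wrapper hack could look something like this (a trimmed-down, hypothetical stand-in for the real blockstore interface; the reply below notes the catch with a hardcoded true on the write path):

```go
package main

import "fmt"

// Blockstore is a minimal stand-in for the real interface.
type Blockstore interface {
	Has(key string) bool
	Put(key string, data []byte)
}

// mapstore is a toy in-memory blockstore that counts Has calls.
type mapstore struct {
	m        map[string][]byte
	hasCalls int
}

func (s *mapstore) Has(key string) bool        { s.hasCalls++; _, ok := s.m[key]; return ok }
func (s *mapstore) Put(key string, data []byte) { s.m[key] = data }

// alwaysHas wraps a blockstore and short-circuits Has with a
// hardcoded true, so the cost of real Has calls disappears entirely.
type alwaysHas struct{ Blockstore }

func (alwaysHas) Has(string) bool { return true }

func main() {
	inner := &mapstore{m: map[string][]byte{}}
	bs := alwaysHas{inner}
	bs.Put("k", []byte("v"))
	fmt.Println(bs.Has("missing"), inner.hasCalls) // true 0: inner Has never runs
}
```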
`Has` is called before adding a block to the blockstore (always). If you always return true, it won't add the files to the datastore.
Yeah, that happens in blockservice, and it's easy to disable that check there (or just override `Has` in blockservice).
(https://github.com/ipfs/go-blockservice/blob/master/blockservice.go#L141, https://github.com/ipfs/go-blockservice/blob/master/blockservice.go#L173)
These data were generated with this script running on my laptop. It creates `count x <file size>` files of random data in a single directory (in memory), creates an Adder with an in-memory datastore, and calls `AddAllAndPin()` on the directory.
| Name | Has | Get | Put | Objects | In-Memory | Badger |
| --- | --- | --- | --- | --- | --- | --- |
| 1 x 100M | 1230 | 3 | 411 | 2 | 281.674222ms | 301.512398ms |
| 10 x 10M | 1248 | 12 | 417 | 11 | 250.425109ms | 393.516821ms |
| 100 x 1M | 1518 | 102 | 507 | 101 | 260.626669ms | 404.573155ms |
| 1024 x 100k | 3090 | 1026 | 1031 | 1025 | 375.101811ms | 647.932807ms |
| 10240 x 10k | 30738 | 10242 | 10247 | 10241 | 1.314768727s | 3.186459614s |
| 20480 x 5k | 61458 | 20482 | 20487 | 20481 | 3.366398431s | 6.938569625s |
| 51200 x 2k | 153618 | 51202 | 51207 | 51201 | 27.019718374s | 36.833479179s |
| 68266 x 1.5k | 204816 | 68268 | 68273 | 68267 | 1m2.91561757s | 1m38.99733823s |
| 81920 x 1.25k | 245778 | 81922 | 81927 | 81921 | 2m5.529349576s | 2m52.524424918s |
| 102400 x 1k | 307218 | 102402 | 102407 | 102401 | 4m6.188296442s | † |
Edit: I added the timings for Badger DB. (†) The final Badger test timed out after 10 minutes.
`Objects` is the number of Objects output to the response stream.
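Counts like these can be gathered with a simple wrapper datastore. A hedged sketch (hypothetical trimmed-down interface, not the actual script used above):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Datastore is a minimal stand-in for the real interface.
type Datastore interface {
	Has(key string) bool
	Get(key string) ([]byte, bool)
	Put(key string, value []byte)
}

// memDS is a toy in-memory datastore.
type memDS struct{ m map[string][]byte }

func (d *memDS) Has(k string) bool           { _, ok := d.m[k]; return ok }
func (d *memDS) Get(k string) ([]byte, bool) { v, ok := d.m[k]; return v, ok }
func (d *memDS) Put(k string, v []byte)      { d.m[k] = v }

// countingDS wraps a datastore and tallies Has/Get/Put calls made
// during an add run.
type countingDS struct {
	Datastore
	has, get, put int64
}

func (c *countingDS) Has(k string) bool { atomic.AddInt64(&c.has, 1); return c.Datastore.Has(k) }
func (c *countingDS) Get(k string) ([]byte, bool) {
	atomic.AddInt64(&c.get, 1)
	return c.Datastore.Get(k)
}
func (c *countingDS) Put(k string, v []byte) { atomic.AddInt64(&c.put, 1); c.Datastore.Put(k, v) }

func main() {
	ds := &countingDS{Datastore: &memDS{m: map[string][]byte{}}}
	ds.Has("a")
	ds.Put("a", []byte("x"))
	ds.Get("a")
	fmt.Println(ds.has, ds.get, ds.put) // 1 1 1
}
```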
We really need tracing in those code paths; it would let us know the precise cause.
It might be worth comparing that with badger.
@Kubuxu good call 👍
I'm continuing to work on generating some useful profiling information; I just wanted to get this out here as a start, as @magik6k had some questions about how many calls we're making to `Has`/`Get`/`Put`.
Also please let me know if there are other particular things you'd like to see stats about; you guys are more familiar with where bottlenecks are most likely to be.
You might also want to get stats from `localhost:5001/debug/metrics/prometheus`; it will have stats on how much total time was spent in the datastore for a given call.
I added the timings for Badger DB to the table in my comment above
Very interesting.
I wanted to test the impact of adding many files in a directory tree with a varying branch out factor. For example 16 files in a directory tree with
It doesn't seem to have a big performance impact.
In these graphs the branch-out factor (items / directory) is on the x-axis and duration is on the y-axis, where each line represents a `<total number of files> x <file size>`:
These graphs show Has / Get / Put / Object counts for 68,266 x 1.5k files and for 102,400 x 1k files, for several different branch-out factors (x-axis).
(Spreadsheet / Code)
I ran some simulations for varying numbers of files of size 1k, with a few different branch-out factors. Again branch-out factor doesn't appear to make a difference, and it appears that time complexity increases polynomially (Spreadsheet - See Sheet 2).
Are there any good comparisons we could make with things that are non-ipfs, to get some other baselines to contrast things with?
I'm loving these graphs, and they're super awesome for helping us see some things about our own code paths, but after ogling for a while, I realized the number of zeros on most of the axes are hard for my brain to parse -- are these comparable to how fast a plain filesystem would flush? If there's a constant factor difference, that's useful to know; the rough order of mag of that constant factor would also be useful to know.
(plain filesystem flush might not be a fair comparison -- we're hashing, etc, And That's Different; maybe 'git add' or operations like that would be better? There's also a pure golang git implementation out there, which could be amusing if we want to avoid comparing directly to one of the most hyper-optimized c things on the globe.)
That's a good question - I like the idea of comparing with git, and actually I think the fastest implementation is probably a good baseline.
With these graphs I'm mostly trying to figure out where the performance bottlenecks are with our implementation, eg
But I think you're right it would be good to compare against realistic maximum performance of a comparable system
Can we re-run those tests on badger?
`rsync` is probably a useful tool for comparison. Maybe `scp` as well.
@Stebalien good idea - I will re-run the tests for the last graph (File Count vs Duration for files of size 1k) against Badger overnight
I re-ran the tests against Badger (red line). It seems like performance is also "gently" polynomial, but with a large number of files it's actually better than in-memory storage (blue line), perhaps because there is some overhead to keeping that much data in an in-memory map in Go?
(Spreadsheet - See Sheet 2 / Code)
@dirkmc this is looking great. Would you be able to also see how long it takes to do an `ipfs add` using the Filestore?
As a Proof of Concept I replaced the use of MFS with direct calls to UnixFS (but ignored GC for now). I ran the tests against Badger with various branch-out factors; it appears to perform significantly better than the MFS average graphed above (Spreadsheet - See Sheet 2 / Code).
The purple line is the average with MFS, the red line is the average calling UnixFS directly
(Note: edited to show a graph with the average of several different branch-out factors. The old graph only showed the results for a branch-out factor of 32.)
By 'use unixfs directly' do you mean 'don't create any directories'? I'm willing to believe there's some inefficiency in the mfs code here, but it seems we're comparing apples to oranges; of course it's going to be faster if we just add each file without tracking where it lives.
Ah, I see it now. All the commented-out stuff threw me for a loop.
Are these tests being run with or without a running daemon?
So far I'm just hooking directly into the Adder 🐍 so as to reduce the amount of noise from other stuff going on in IPFS.
I compared the output hashes of the new code vs the old code and it looks like it's producing the correct output, although the hashes are in a different order: the new code outputs the directory hash as soon as the directory's contents have been added, whereas the old code output all the directories at the end. It's still not quite apples to apples, because for the time being I'm ignoring GC.
The real test will be to compare adding a package manager scale directory tree to a running IPFS instance with the old code vs the new code. I'd like to compare that to rsync and git as people suggested above, and also to test with Filestore.
> I compared the output hashes of the new code vs the old code and it looks like it's producing the correct output
ah, perfect. that was my primary concern.
> although the hashes are in a different order as the new code outputs the directory hash as soon as the directory's contents has been added
So, here's a bit of weirdness. One of the main motivators behind using mfs was the fact that the api accepts files in any order. So you could receive `foo/bar/baz`, then `abc/def/cat.txt`, then `foo/bar/bop`, and the current code would handle that fine. That's just something to keep in mind here; don't let it get in the way of your optimizations and profiling :)
This is unfortunate. As far as I can tell, js-ipfs will actually do this, as you can pass files as a series of `{path: "/foo/bar.txt", content: "..."}`. @alanshaw, is this correct? Specifically, will js-ipfs-http-client ever send `/foo/1`, then `/bar/0`, then `/foo/2`?
@andrew very helpfully created a histogram for a real package manager directory tree, albeit a small one. To make a more realistic simulation I modified the performance test to create a directory tree on disk (instead of in-memory) that looks (very roughly) like the histogram:
I tried profiling the Adder but I found that profiling was too sharp a tool (or at least I couldn't make sense of the output) for a whole sub-system. Currently we read files off the disk and process them one at a time so I decided to try parallelizing file adding.
I tested this with Badger as the datastore for varying numbers of files from 1k - 32k with
The file count / total tree sizes tested were:
| File count | Total tree size |
| --- | --- |
| 1k | 3G |
| 2k | 4.8G |
| 4k | 12G |
| 8k | 23G |
| 16k | 46G |
| 32k | 93G |
My system:
$ sysctl hw.physicalcpu hw.logicalcpu
hw.physicalcpu: 4
hw.logicalcpu: 8
Performance appears to be roughly linear for this shape of directory structure. For my system the optimal number of threads seems to be around 8 - 16, where it runs about twice as fast as the current Adder.
(Spreadsheet - Code)
Please note that as I've tried different experiments the code has become increasingly embarrassing 🙈 The parallelization code is very much proof-of-concept: it panics instead of handling errors, and probably has some races and bugs. It has some way to go before it's production-ready.
For a real-world benchmark I ran @dirkmc's dirks branch against a copy of the main Arch linux repository and also saw a 2x speed improvement 🚀
$ time $GOPATH/bin/ipfs add -r --progress --offline --quieter ~/arch
47.50 GiB / 47.50 GiB [========================================================] 100.00%
QmY5C4FXTDtmbyyKX1KixVMedJQ2nk9RgJMv6j9pDzrUaR
real 5m4.095s
user 7m27.847s
sys 1m26.448s
$ time $GOPATH/bin/ipfs add -r --progress --offline --quieter ~/arch
27.07 TiB / 47.50 GiB [======================================================] 58362.03%
QmY5C4FXTDtmbyyKX1KixVMedJQ2nk9RgJMv6j9pDzrUaR
real 2m33.295s
user 6m54.554s
sys 1m44.620s
note: the progress bar is currently a little broken 📈
Andrew also ran a test against the same Arch linux repo, where the repo was stored on an external hard drive. He found that there was no performance improvement of the new branch over the old branch for this use case:
time $GOPATH/bin/ipfs add -r --progress --offline --quieter /Volumes/5TB/arch/
47.50 GiB / 47.50 GiB [===============================================] 100.00%
QmY5C4FXTDtmbyyKX1KixVMedJQ2nk9RgJMv6j9pDzrUaR
real 11m2.681s
user 7m11.504s
sys 1m25.566s
$ time $GOPATH/bin/ipfs add -r --progress --offline --quieter /Volumes/5TB/arch/
2.90 TiB / 47.50 GiB [========================================================] 6255.61%
5179323 milli-seconds add
674354 milli-seconds cumulative
QmY5C4FXTDtmbyyKX1KixVMedJQ2nk9RgJMv6j9pDzrUaR
real 11m15.050s
user 5m45.568s
sys 1m25.589s
This implies that there are probably two bottlenecks:
I'd suggest we try adding a second queue to parallelize calls to DAGService. This is a little more complicated to implement as the calls to DAGService happen in several places (not just in go-ipfs)
> I'd suggest we try adding a second queue to parallelize calls to DAGService. This is a little more complicated to implement as the calls to DAGService happen in several places (not just in go-ipfs)
We already pipeline and batch writes to the dag service. Unfortunately, we flush the pipeline on every file, IIRC.
Ok, so @alanshaw and I have discussed this in https://github.com/ipfs/interface-js-ipfs-core/pull/509 and adding files out-of-order is technically supported and, really, there's no theoretical reason why it shouldn't be performant. I think the right path forward here is to fix/rewrite go-mfs.
@Stebalien it seems like if we allow out-of-order adds we'd need to keep the whole directory tree in a cache, whether we use MFS or not. Do you have an idea that will allow us to avoid that?
> @Stebalien it seems like if we allow out-of-order adds we'd need to keep the whole directory tree in a cache, whether we use MFS or not. Do you have an idea that will allow us to avoid that?
Explicitly open/close directories and eager-flush on close. In the best case, the overhead would be O(depth) memory (trivial). In the worst case, the overhead would be a ton of extra hashing, as we'd have to re-open previously closed (and hashed) directories, modify them, and then re-hash them.
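A sketch of the open/close idea for in-order adds: only the directories on the path to the current file stay open, so memory is O(depth). Closing a directory eagerly "flushes" it (here that just means recording it; names are hypothetical):

```go
package main

import "fmt"

// builder keeps a stack of open directories, root..current.
type builder struct {
	stack   []string
	flushed []string
}

// enter opens a directory; memory grows only with tree depth.
func (b *builder) enter(dir string) { b.stack = append(b.stack, dir) }

// leave closes the current directory, eagerly flushing it. In the real
// scheme, re-opening a closed directory later (out-of-order add) would
// mean re-hashing it.
func (b *builder) leave() {
	last := b.stack[len(b.stack)-1]
	b.stack = b.stack[:len(b.stack)-1]
	b.flushed = append(b.flushed, last)
}

func main() {
	b := &builder{}
	b.enter("animals")
	b.enter("ocean")
	// add animals/ocean/fish.txt here
	b.leave() // ocean hashed and flushed
	b.leave() // then animals
	fmt.Println(b.flushed) // [ocean animals]
}
```

Note the flush order: children close before their parents, so a parent's links are final when it is hashed.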
I see, so in that case we'd be optimizing for in-order adds, and doing some thrashing if adds are supplied out-of-order. That seems like a reasonable trade-off to me 👍
Yep. Ideally, we'd also make this configurable. That is, MFS could be configured to keep an LRU cache of open directories.
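The configurable LRU variant might look like this (stdlib `container/list`; eviction stands in for flush-and-close, and an evicted directory would need re-opening and re-hashing if touched again):

```go
package main

import (
	"container/list"
	"fmt"
)

// dirLRU keeps up to cap directories open; the least recently used
// one is flushed and closed when a new directory pushes it out.
type dirLRU struct {
	cap     int
	order   *list.List // front = most recently used
	entries map[string]*list.Element
	evicted []string // directories that were flushed and closed
}

func newDirLRU(capacity int) *dirLRU {
	return &dirLRU{cap: capacity, order: list.New(), entries: map[string]*list.Element{}}
}

// Open marks a directory as in use, evicting the LRU entry if needed.
func (l *dirLRU) Open(path string) {
	if e, ok := l.entries[path]; ok {
		l.order.MoveToFront(e)
		return
	}
	l.entries[path] = l.order.PushFront(path)
	if l.order.Len() > l.cap {
		back := l.order.Back()
		l.order.Remove(back)
		p := back.Value.(string)
		delete(l.entries, p)
		l.evicted = append(l.evicted, p) // flush + close
	}
}

func main() {
	lru := newDirLRU(2)
	lru.Open("/a")
	lru.Open("/b")
	lru.Open("/a") // refresh /a
	lru.Open("/c") // evicts /b, the least recently used
	fmt.Println(lru.evicted) // [/b]
}
```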
Had a good discussion with @achingbrain. Takeaways and questions:
One open question: are we actually saturating disk IO? `iotop` should tell us. If we are, then we may be barking up the wrong tree. @andrew, could you test this?

We also discussed the filestore a bit and the terrible UX. I believe a bunch of the bad UX comes from the fact that we're mixing pinning with the filestore.
Alternatively, we could not do that. That is, we could say that all pinned blocks must live in the datastore, and "filestore" blocks can only live in MFS (`ipfs files`) or in some special set of "synced" directories. That way overwriting or removing a file in the filestore wouldn't really break anything. (Note: fixing this would be a pretty large task.)
Calling `ipfs add` on a directory with a large tree of sub-directories and files is slow. This use case is particularly important for file-system based package managers.

Background
IPFS deals with immutable blocks. Blocks are stored in the blockstore. The UnixFS package breaks files up into chunks, and converts them to IPLD objects. The DAG Service stores IPLD objects in the blockstore.
The Mutable File System (MFS) is an abstraction that presents IPFS as a file system. For example consider the following directory structure with associated hashes:
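As a toy stand-in for that example (hypothetical file contents, with truncated sha256 in place of real CIDs, for a tree `animals/ocean/fish.txt`):

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// h is a toy "CID": the first 8 hex chars of a sha256 hash.
func h(data string) string {
	return fmt.Sprintf("%x", sha256.Sum256([]byte(data)))[:8]
}

func main() {
	// animals/
	// └── ocean/
	//     └── fish.txt
	fish := h("glub glub")
	ocean := h("fish.txt=" + fish) // a directory hash covers its links
	animals := h("ocean=" + ocean)
	fmt.Println(animals, ocean, fish)

	// Change fish.txt: every hash on the path to the root changes too.
	fish2 := h("blub blub")
	ocean2 := h("fish.txt=" + fish2)
	animals2 := h("ocean=" + ocean2)
	fmt.Println(animals != animals2, ocean != ocean2, fish != fish2) // true true true
}
```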
If the contents of `fish.txt` change, the CID for `fish.txt` will also change. The link from `ocean → fish` will change, so the CID for `ocean` will change. The link from `animals → ocean` will change, so the CID for `animals` will change. MFS manages those links and the propagation of changes up the tree.

Algorithm
`ipfs add` uses the MFS package to add files and directories to the IPFS blockstore. To add a directory with a large tree of sub-directories and files:

1. Create the root MFS directory (`animals` in the example above)
2. For each file:
   - Ensure the file's directory exists, e.g. `animals/ocean`. This adds the empty directory to the blockstore.
   - Add the file `animals/ocean/fish.txt`
   - Ensure the directory `animals/ocean` exists again (†)
   - Get the root IPLD node for the file (`animals/ocean/fish.txt`). Note: This again adds the IPLD Node root to the blockstore.
3. Recurse over the directory tree, calling `directory.GetNode()` on each directory. Note that at this stage, the links to files in the directories have been created, so the directory created here will have a different CID than the empty directory created before the files were added. Calling `directory.GetNode()` (confusingly) writes the directory, with its links to files, to the blockstore.

(†) Although we've already created the directory, it's necessary to again ensure it exists before adding the file, because after processing every 256k files, the MFS internal directory cache structure is dereferenced to allow for golang garbage collection.
Areas for Improvement
Proposed Improvements
The above issues would be mitigated if we interact directly with the UnixFS API instead of with the MFS API:
Future Work
It has been noted that disk throughput and CPU usage are not close to maximum while adding large numbers of files. Future work should focus on analyzing these findings.