dirkmc opened 4 years ago
@Stebalien & @magik6k, this analysis came from spelunking through the source code. Please let me know if there's anything here that doesn't sound right. In particular, I'd like a second opinion on the (untested) theory that branches of the directory tree may be missed when writing the directories to the blockstore.
This is mostly correct.
> The IPLD Node root for a file is added to the blockstore twice
This should be slightly mitigated by the fact that we check if we already have a block before writing it (in the blockservice itself).
> The second recursion over the directory structure will therefore miss these branches and they will not be written to the blockstore
Not quite. We flush everything first. Unfortunately, this does mean that we end up storing a bunch of intermediate directories that we don't need.
For some context, we primarily use MFS to make it easy to pause for GC. However, honestly, that probably isn't worth it given the performance hit. I agree we should explore using unixfs directly.
> We flush everything first
Do you mean when we call `root.FlushMemFree()`? It calls `dir.Flush()`, but AFAICT that will only flush the root dir (not any child dirs).
It will recursively flush. `Flush` calls `dir.GetNode()`, which calls `dir.sync()`, which walks all the cached entries, calling `GetNode` on each. Finally, it calls `GetNode()` on the unixfs directory itself and adds that to the DAGService.
(unless I'm missing something)
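The recursion described above can be sketched with simplified stand-in types (the method names `Flush`/`GetNode` mirror go-mfs, but the types and bodies here are illustrative only):

```go
package main

import "fmt"

// Dir is a toy stand-in for an mfs directory with cached child entries.
type Dir struct {
	name     string
	children []*Dir
	flushed  bool
}

// GetNode serializes the directory, syncing cached children first:
// this mirrors dir.GetNode -> dir.sync in go-mfs, where sync walks
// all cached entries and calls GetNode on each.
func (d *Dir) GetNode() string {
	for _, c := range d.children {
		c.GetNode()
	}
	d.flushed = true // stands in for "added to the DAGService"
	return d.name
}

// Flush on the root therefore reaches every modified descendant.
func (d *Dir) Flush() { d.GetNode() }

func main() {
	fish := &Dir{name: "fish.txt"}
	ocean := &Dir{name: "ocean", children: []*Dir{fish}}
	animals := &Dir{name: "animals", children: []*Dir{ocean}}
	animals.Flush()
	fmt.Println(fish.flushed, ocean.flushed) // true true
}
```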
Ah yes you're right, thanks 👍
I'll remove that section from the original issue above.
@Stebalien given this analysis, my uneducated guess is that changing `ipfs add` to use the UnixFS API directly is going to give us a moderate performance improvement, maybe on the order of 10%-20%. Would your guess fall somewhere in that range?
We should measure how many datastore calls are made when adding stuff with different layouts.
My bet would be that for large directories / lots of directories, `Has` calls will create significant overhead.
Another note on using unixfs directly: it would make it easier to add small files in parallel. Each call to `addFile` could try to buffer, say, 100kb of data, and if an entire file fits in that amount, add it async (which should help with waiting on IO when adding lots of small files).
I was thinking along those lines - maybe a queue for incoming files and a queue for outgoing with backpressure in each
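A minimal sketch of that queue-plus-backpressure idea (all names hypothetical; a real worker would write chunks to the DAGService rather than just counting bytes):

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

const smallFileLimit = 100 * 1024 // buffer up to ~100kb per file

// addAll fans file sizes out to a fixed pool of workers and returns
// the total number of bytes "added". The bounded channel provides
// backpressure: the producer blocks once the workers fall behind.
func addAll(sizes []int, workers int) int64 {
	files := make(chan int, workers) // bounded queue = backpressure
	var added int64
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for size := range files {
				if size <= smallFileLimit {
					// Small file: it fit entirely in the buffer, so it
					// can be added asynchronously. A large file would
					// fall back to a sequential path (omitted here).
					atomic.AddInt64(&added, int64(size))
				}
			}
		}()
	}
	for _, s := range sizes {
		files <- s
	}
	close(files)
	wg.Wait()
	return added
}

func main() {
	sizes := make([]int, 1000)
	for i := range sizes {
		sizes[i] = 10 * 1024 // 10k files, as in the benchmarks below
	}
	fmt.Println(addAll(sizes, 8)) // 10240000
}
```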
> we primarily use MFS to make it easy to pause for GC
@Stebalien can we just add the current root for any in-progress `ipfs add` to `bestEffortRoots`?
Probably? We really do need some kind of internal in-memory pinning system (which shouldn't be too hard). The primary issue is that the importer would need to be GC aware so it can pause.
We can just do something similar to how `ipfs add` works now, right? I.e. check periodically whether GC has been requested, and if so, create a temporary fake tree and pin the root.
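That periodic check could be sketched roughly like this (all names are hypothetical; the real adder would pin the current root of its in-progress tree before letting GC proceed):

```go
package main

import "fmt"

// addAllFiles polls a channel between files; if GC has been requested,
// it "pins" the current root (the temporary fake tree) so the
// in-progress add survives, then lets GC run. Returns how many times
// it paused for GC.
func addAllFiles(files []string, gcRequested <-chan struct{}, pinRoot func()) int {
	pauses := 0
	for range files {
		select {
		case <-gcRequested:
			pinRoot() // protect in-progress work before GC runs
			pauses++
		default:
			// no GC pending, keep adding
		}
		// ... add the file ...
	}
	return pauses
}

func main() {
	gc := make(chan struct{}, 1)
	gc <- struct{}{} // GC requested once, mid-add
	n := addAllFiles([]string{"a", "b", "c"}, gc, func() {})
	fmt.Println(n) // 1
}
```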
Yeah, you're right.
If there is a significant `Has` performance hit, the bloom filter cache in blockstore will help a lot.
In most cases where we call `Has` while adding, the thing is already in the blockstore (likely in all cases), so the Bloom filter won't really help here, since we still need to check the underlying blockstore.
(Hack / PoC idea: check whether `Has` always returns true when adding. If that's the case, create a blockstore wrapper which replaces `Has` with a hardcoded `return true`. If that works, we can quickly measure the overhead of all the `Has` calls and see if we're fixing the right thing.)
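The wrapper hack could look something like this (a trimmed-down, hypothetical stand-in for the real blockstore interface; the reply below notes the catch with a hardcoded true on the write path):

```go
package main

import "fmt"

// Blockstore is a minimal stand-in for the real interface.
type Blockstore interface {
	Has(key string) bool
	Put(key string, data []byte)
}

// mapstore is a toy in-memory blockstore that counts Has calls.
type mapstore struct {
	m        map[string][]byte
	hasCalls int
}

func (s *mapstore) Has(key string) bool        { s.hasCalls++; _, ok := s.m[key]; return ok }
func (s *mapstore) Put(key string, data []byte) { s.m[key] = data }

// alwaysHas wraps a blockstore and short-circuits Has with a
// hardcoded true, so the cost of real Has calls disappears entirely.
type alwaysHas struct{ Blockstore }

func (alwaysHas) Has(string) bool { return true }

func main() {
	inner := &mapstore{m: map[string][]byte{}}
	bs := alwaysHas{inner}
	bs.Put("k", []byte("v"))
	fmt.Println(bs.Has("missing"), inner.hasCalls) // true 0: inner Has never runs
}
```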
`Has` is called before adding a block to the blockstore (always). If you always return true, it won't add the files to the datastore.
Yeah, that happens in blockservice, and it's easy to disable that check there (or just override `Has` in blockservice).
(https://github.com/ipfs/go-blockservice/blob/master/blockservice.go#L141, https://github.com/ipfs/go-blockservice/blob/master/blockservice.go#L173)
These data were generated with this script running on my laptop. It creates `count x <file size>` files of random data in a single directory (in memory), creates an Adder with an in-memory datastore, and calls `AddAllAndPin()` on the directory.
| Name | Has | Get | Put | Objects | In-Memory | Badger |
| --- | --- | --- | --- | --- | --- | --- |
| 1 x 100M | 1230 | 3 | 411 | 2 | 281.674222ms | 301.512398ms |
| 10 x 10M | 1248 | 12 | 417 | 11 | 250.425109ms | 393.516821ms |
| 100 x 1M | 1518 | 102 | 507 | 101 | 260.626669ms | 404.573155ms |
| 1024 x 100k | 3090 | 1026 | 1031 | 1025 | 375.101811ms | 647.932807ms |
| 10240 x 10k | 30738 | 10242 | 10247 | 10241 | 1.314768727s | 3.186459614s |
| 20480 x 5k | 61458 | 20482 | 20487 | 20481 | 3.366398431s | 6.938569625s |
| 51200 x 2k | 153618 | 51202 | 51207 | 51201 | 27.019718374s | 36.833479179s |
| 68266 x 1.5k | 204816 | 68268 | 68273 | 68267 | 1m2.91561757s | 1m38.99733823s |
| 81920 x 1.25k | 245778 | 81922 | 81927 | 81921 | 2m5.529349576s | 2m52.524424918s |
| 102400 x 1k | 307218 | 102402 | 102407 | 102401 | 4m6.188296442s | † |
Edit: I added the timings for Badger DB. (†) The final Badger test timed out after 10 minutes.
`Objects` is the number of Objects output to the response stream.
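Counts like these can be gathered with a simple wrapper datastore. A hedged sketch (hypothetical trimmed-down interface, not the actual script used above):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Datastore is a minimal stand-in for the real interface.
type Datastore interface {
	Has(key string) bool
	Get(key string) ([]byte, bool)
	Put(key string, value []byte)
}

// memDS is a toy in-memory datastore.
type memDS struct{ m map[string][]byte }

func (d *memDS) Has(k string) bool           { _, ok := d.m[k]; return ok }
func (d *memDS) Get(k string) ([]byte, bool) { v, ok := d.m[k]; return v, ok }
func (d *memDS) Put(k string, v []byte)      { d.m[k] = v }

// countingDS wraps a datastore and tallies Has/Get/Put calls made
// during an add run.
type countingDS struct {
	Datastore
	has, get, put int64
}

func (c *countingDS) Has(k string) bool { atomic.AddInt64(&c.has, 1); return c.Datastore.Has(k) }
func (c *countingDS) Get(k string) ([]byte, bool) {
	atomic.AddInt64(&c.get, 1)
	return c.Datastore.Get(k)
}
func (c *countingDS) Put(k string, v []byte) { atomic.AddInt64(&c.put, 1); c.Datastore.Put(k, v) }

func main() {
	ds := &countingDS{Datastore: &memDS{m: map[string][]byte{}}}
	ds.Has("a")
	ds.Put("a", []byte("x"))
	ds.Get("a")
	fmt.Println(ds.has, ds.get, ds.put) // 1 1 1
}
```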
We really need tracing in those code paths; it would let us know the precise cause.
It might be worth comparing that with badger.
@Kubuxu good call 👍
I'm continuing to work on generating some useful profiling information; I just wanted to get this out here as a start, as @magik6k had some questions about how many calls we're making to `Has`/`Get`/`Put`.
Also please let me know if there are other particular things you'd like to see stats about; you guys are more familiar with where bottlenecks are most likely to be.
You might also want to get stats from `localhost:5001/debug/metrics/prometheus`; it will have stats on how much total time was spent in the datastore for a given call.
I added the timings for Badger DB to the table in my comment above
Very interesting.
I wanted to test the impact of adding many files in a directory tree with a varying branch out factor. For example 16 files in a directory tree with
It doesn't seem to have a big performance impact.
In these graphs the branch-out factor (items / directory) is on the x-axis and duration is on the y-axis, where each line represents a `<total number of files> x <file size>`:
These graphs show Has / Get / Put / Object counts for 68,266 x 1.5k files and for 102,400 x 1k files, for several different branch-out factors (x-axis).
(Spreadsheet / Code)
I ran some simulations for varying numbers of files of size 1k, with a few different branch-out factors. Again branch-out factor doesn't appear to make a difference, and it appears that time complexity increases polynomially (Spreadsheet - See Sheet 2).
Are there any good comparisons we could make with things that are non-ipfs, to get some other baselines to contrast things with?
I'm loving these graphs, and they're super awesome for helping us see some things about our own code paths, but after ogling for a while, I realized the number of zeros on most of the axes are hard for my brain to parse -- are these comparable to how fast a plain filesystem would flush? If there's a constant factor difference, that's useful to know; the rough order of mag of that constant factor would also be useful to know.
(plain filesystem flush might not be a fair comparison -- we're hashing, etc, And That's Different; maybe 'git add' or operations like that would be better? There's also a pure golang git implementation out there, which could be amusing if we want to avoid comparing directly to one of the most hyper-optimized c things on the globe.)
That's a good question - I like the idea of comparing with git, and actually I think the fastest implementation is probably a good baseline.
With these graphs I'm mostly trying to figure out where the performance bottlenecks are with our implementation, eg
But I think you're right it would be good to compare against realistic maximum performance of a comparable system
Can we re-run those tests on badger?
`rsync` is probably a useful tool for comparison. Maybe `scp` as well.
@Stebalien good idea - I will re-run the tests for the last graph (File Count vs Duration for files of size 1k) against Badger overnight
I re-ran the tests against Badger (red line). It seems like performance is also "gently" polynomial, but with a large number of files it's actually better than in-memory storage (blue line), perhaps because there is some overhead to keeping that much data in an in-memory map in Go?
(Spreadsheet - See Sheet 2 / Code)
@dirkmc this is looking great. Would you be able to also see how long it takes to do an `ipfs add` using the Filestore?
As a Proof of Concept I replaced the use of MFS with direct calls to UnixFS (but ignored GC for now). I ran the tests against Badger with various branch-out factors; it appears to perform significantly better than the MFS average graphed above (Spreadsheet - See Sheet 2 / Code).
The purple line is the average with MFS, the red line is the average calling UnixFS directly
(Note: edited to show a graph with the average of several different branch-out factors. The old graph only showed the results for a branch-out factor of 32.)
By 'use unixfs directly' do you mean 'don't create any directories'? I'm willing to believe there's some inefficiency in the mfs code here, but it seems we're comparing apples to oranges; of course it's going to be faster if we just add each file without tracking where it lives.
Ah, I see it now. All the commented-out stuff threw me for a loop.
Are these tests being run with or without a running daemon?
So far I'm just hooking directly into the Adder 🐍 so as to reduce the amount of noise from other stuff going on in IPFS.
I compared the output hashes of the new code vs the old code and it looks like it's producing the correct output, although the hashes are in a different order: the new code outputs the directory hash as soon as the directory's contents have been added, whereas the old code output all the directories at the end. It's still not quite apples to apples, because for the time being I'm ignoring GC.
The real test will be to compare adding a package manager scale directory tree to a running IPFS instance with the old code vs the new code. I'd like to compare that to rsync and git as people suggested above, and also to test with Filestore.
> I compared the output hashes of the new code vs the old code and it looks like it's producing the correct output
ah, perfect. that was my primary concern.
> although the hashes are in a different order as the new code outputs the directory hash as soon as the directory's contents has been added
So, here's a bit of weirdness. One of the main motivators behind using mfs was the fact that the api accepts files in any order. So you could receive `foo/bar/baz`, then `abc/def/cat.txt`, then `foo/bar/bop`, and the current code would handle that fine. That's just something to keep in mind here; don't let it get in the way of your optimizations and profiling :)
This is unfortunate. As far as I can tell, js-ipfs will actually do this, as you can pass files as a series of `{path: "/foo/bar.txt", content: "..."}`. @alanshaw, is this correct? Specifically, will js-ipfs-http-client ever send `/foo/1`, then `/bar/0`, then `/foo/2`?
@andrew very helpfully created a histogram for a real package manager directory tree, albeit a small one. To make a more realistic simulation I modified the performance test to create a directory tree on disk (instead of in-memory) that looks (very roughly) like the histogram:
I tried profiling the Adder but I found that profiling was too sharp a tool (or at least I couldn't make sense of the output) for a whole sub-system. Currently we read files off the disk and process them one at a time so I decided to try parallelizing file adding.
I tested this with Badger as the datastore for varying numbers of files from 1k - 32k with
The file count / total tree sizes tested were:
| File count | Total tree size |
| --- | --- |
| 1k | 3G |
| 2k | 4.8G |
| 4k | 12G |
| 8k | 23G |
| 16k | 46G |
| 32k | 93G |
My system:
$ sysctl hw.physicalcpu hw.logicalcpu
hw.physicalcpu: 4
hw.logicalcpu: 8
Performance appears to be roughly linear for this shape of directory structure. For my system the optimal number of threads seems to be around 8 - 16, where it runs about twice as fast as the current Adder.
(Spreadsheet - Code)
Please note that as I've tried different experiments the code has become increasingly embarrassing 🙈 The parallelization code is very much proof-of-concept: it panics instead of handling errors, and probably has some races and bugs. It has some way to go before it's production-ready.
For a real-world benchmark I ran @dirkmc's dirks branch against a copy of the main Arch linux repository and also saw a 2x speed improvement 🚀
$ time $GOPATH/bin/ipfs add -r --progress --offline --quieter ~/arch
47.50 GiB / 47.50 GiB [========================================================] 100.00%
QmY5C4FXTDtmbyyKX1KixVMedJQ2nk9RgJMv6j9pDzrUaR
real 5m4.095s
user 7m27.847s
sys 1m26.448s
$ time $GOPATH/bin/ipfs add -r --progress --offline --quieter ~/arch
27.07 TiB / 47.50 GiB [======================================================] 58362.03%
QmY5C4FXTDtmbyyKX1KixVMedJQ2nk9RgJMv6j9pDzrUaR
real 2m33.295s
user 6m54.554s
sys 1m44.620s
note: the progress bar is currently a little broken 📈
Andrew also ran a test against the same Arch linux repo, where the repo was stored on an external hard drive. He found that there was no performance improvement of the new branch over the old branch for this use case:
time $GOPATH/bin/ipfs add -r --progress --offline --quieter /Volumes/5TB/arch/
47.50 GiB / 47.50 GiB [===============================================] 100.00%
QmY5C4FXTDtmbyyKX1KixVMedJQ2nk9RgJMv6j9pDzrUaR
real 11m2.681s
user 7m11.504s
sys 1m25.566s
$ time $GOPATH/bin/ipfs add -r --progress --offline --quieter /Volumes/5TB/arch/
2.90 TiB / 47.50 GiB [========================================================] 6255.61%
5179323 milli-seconds add
674354 milli-seconds cumulative
QmY5C4FXTDtmbyyKX1KixVMedJQ2nk9RgJMv6j9pDzrUaR
real 11m15.050s
user 5m45.568s
sys 1m25.589s
This implies that there are probably two bottlenecks:
I'd suggest we try adding a second queue to parallelize calls to DAGService. This is a little more complicated to implement as the calls to DAGService happen in several places (not just in go-ipfs)
> I'd suggest we try adding a second queue to parallelize calls to DAGService. This is a little more complicated to implement as the calls to DAGService happen in several places (not just in go-ipfs)
We already pipeline and batch writes to the dag service. Unfortunately, we flush the pipeline on every file, IIRC.
Ok, so @alanshaw and I have discussed this in https://github.com/ipfs/interface-js-ipfs-core/pull/509 and adding files out-of-order is technically supported and, really, there's no theoretical reason why it shouldn't be performant. I think the right path forward here is to fix/rewrite go-mfs.
@Stebalien it seems like if we allow out-of-order adds we'd need to keep the whole directory tree in a cache, whether we use MFS or not. Do you have an idea that will allow us to avoid that?
> @Stebalien it seems like if we allow out-of-order adds we'd need to keep the whole directory tree in a cache, whether we use MFS or not. Do you have an idea that will allow us to avoid that?
Explicitly open/close directories and eager-flush on close. In the best case, the overhead would be O(depth) memory (trivial). In the worst case, the overhead would be a ton of extra hashing, as we'd have to re-open previously closed (and hashed) directories, modify them, and then re-hash them.
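A sketch of the open/close idea for in-order adds: only the directories on the path to the current file stay open, so memory is O(depth). Closing a directory eagerly "flushes" it (here that just means recording it; names are hypothetical):

```go
package main

import "fmt"

// builder keeps a stack of open directories, root..current.
type builder struct {
	stack   []string
	flushed []string
}

// enter opens a directory; memory grows only with tree depth.
func (b *builder) enter(dir string) { b.stack = append(b.stack, dir) }

// leave closes the current directory, eagerly flushing it. In the real
// scheme, re-opening a closed directory later (out-of-order add) would
// mean re-hashing it.
func (b *builder) leave() {
	last := b.stack[len(b.stack)-1]
	b.stack = b.stack[:len(b.stack)-1]
	b.flushed = append(b.flushed, last)
}

func main() {
	b := &builder{}
	b.enter("animals")
	b.enter("ocean")
	// add animals/ocean/fish.txt here
	b.leave() // ocean hashed and flushed
	b.leave() // then animals
	fmt.Println(b.flushed) // [ocean animals]
}
```

Note the flush order: children close before their parents, so a parent's links are final when it is hashed.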
I see, so in that case we'd be optimizing for in-order adds, and doing some thrashing if adds are supplied out-of-order. That seems like a reasonable trade-off to me 👍
Yep. Ideally, we'd also make this configurable. That is, MFS could be configured to keep an LRU cache of open directories.
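The configurable LRU variant might look like this (stdlib `container/list`; eviction stands in for flush-and-close, and an evicted directory would need re-opening and re-hashing if touched again):

```go
package main

import (
	"container/list"
	"fmt"
)

// dirLRU keeps up to cap directories open; the least recently used
// one is flushed and closed when a new directory pushes it out.
type dirLRU struct {
	cap     int
	order   *list.List // front = most recently used
	entries map[string]*list.Element
	evicted []string // directories that were flushed and closed
}

func newDirLRU(capacity int) *dirLRU {
	return &dirLRU{cap: capacity, order: list.New(), entries: map[string]*list.Element{}}
}

// Open marks a directory as in use, evicting the LRU entry if needed.
func (l *dirLRU) Open(path string) {
	if e, ok := l.entries[path]; ok {
		l.order.MoveToFront(e)
		return
	}
	l.entries[path] = l.order.PushFront(path)
	if l.order.Len() > l.cap {
		back := l.order.Back()
		l.order.Remove(back)
		p := back.Value.(string)
		delete(l.entries, p)
		l.evicted = append(l.evicted, p) // flush + close
	}
}

func main() {
	lru := newDirLRU(2)
	lru.Open("/a")
	lru.Open("/b")
	lru.Open("/a") // refresh /a
	lru.Open("/c") // evicts /b, the least recently used
	fmt.Println(lru.evicted) // [/b]
}
```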
Had a good discussion with @achingbrain. Takeaways and questions:
One open question: are we actually saturating disk IO? `iotop` should tell us. If we are, then we may be barking up the wrong tree. @andrew, could you test this?

We also discussed the filestore a bit and the terrible UX. I believe a bunch of the bad UX comes from the fact that we're mixing pinning with the filestore.
Alternatively, we could not do that. That is, we could say that all pinned blocks must live in the datastore, and "filestore" blocks can only live in MFS (`ipfs files`) or in some special set of "synced" directories. That way overwriting or removing a file in the filestore wouldn't really break anything. (Note: fixing this would be a pretty large task.)
Calling `ipfs add` on a directory with a large tree of sub-directories and files is slow. This use case is particularly important for file-system based package managers.

Background
IPFS deals with immutable blocks. Blocks are stored in the blockstore. The UnixFS package breaks files up into chunks, and converts them to IPLD objects. The DAG Service stores IPLD objects in the blockstore.
The Mutable File System (MFS) is an abstraction that presents IPFS as a file system. For example consider the following directory structure with associated hashes:
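As a toy stand-in for that example (hypothetical file contents, with truncated sha256 in place of real CIDs, for a tree `animals/ocean/fish.txt`):

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// h is a toy "CID": the first 8 hex chars of a sha256 hash.
func h(data string) string {
	return fmt.Sprintf("%x", sha256.Sum256([]byte(data)))[:8]
}

func main() {
	// animals/
	// └── ocean/
	//     └── fish.txt
	fish := h("glub glub")
	ocean := h("fish.txt=" + fish) // a directory hash covers its links
	animals := h("ocean=" + ocean)
	fmt.Println(animals, ocean, fish)

	// Change fish.txt: every hash on the path to the root changes too.
	fish2 := h("blub blub")
	ocean2 := h("fish.txt=" + fish2)
	animals2 := h("ocean=" + ocean2)
	fmt.Println(animals != animals2, ocean != ocean2, fish != fish2) // true true true
}
```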
If the contents of `fish.txt` change, the CID for `fish.txt` will also change. The link from `ocean → fish` will change, so the CID for `ocean` will change. The link from `animals → ocean` will change, so the CID for `animals` will change. MFS manages those links and the propagation of changes up the tree.

Algorithm
`ipfs add` uses the MFS package to add files and directories to the IPFS blockstore. To add a directory with a large tree of sub-directories and files:

1. Create the root MFS directory (`animals` in the example above)
2. For each file:
   - Ensure the file's directory exists, e.g. `animals/ocean`. This adds the empty directory to the blockstore.
   - Add the file `animals/ocean/fish.txt`
   - Ensure the directory `animals/ocean` exists again (†)
   - Get the root IPLD node for the file (`animals/ocean/fish.txt`). Note: This again adds the IPLD Node root to the blockstore.
3. Recurse over the directory tree, calling `directory.GetNode()` on each directory. Note that at this stage, the links to files in the directories have been created, so the directory created here will have a different CID than the empty directory created before the files were added. Calling `directory.GetNode()` (confusingly) writes the directory, with its links to files, to the blockstore.

(†) Although we've already created the directory, it's necessary to again ensure it exists before adding the file, because after processing every 256k files, the MFS internal directory cache structure is dereferenced to allow for golang garbage collection.
Areas for Improvement
Proposed Improvements
The above issues would be mitigated if we interact directly with the UnixFS API instead of with the MFS API:
Future Work
It has been noted that disk throughput and CPU usage are not close to maximum while adding large numbers of files. Future work should focus on analyzing these findings.