ipfs-inactive / js-ipfs-unixfs-importer

[ARCHIVED] JavaScript implementation of the UnixFs importer used by IPFS
MIT License

perf: concurrent file import #41

Closed achingbrain closed 4 years ago

achingbrain commented 4 years ago

Adds two new options:

fileImportConcurrency

This controls the number of files that are imported concurrently. You may wish to set this high if you are importing lots of small files.

blockWriteConcurrency

This controls how many blocks from each file we hash and write to disk at the same time. Setting this high when writing large files will significantly increase import speed, though having it high when fileImportConcurrency is also high can swamp the process.

'High' is relative to your hardware speed (CPU and disk).
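
To make the interaction between the two options concrete, here is a minimal sketch, not the importer's actual code. It assumes each file is just an array of blocks and uses a short timer as a stand-in for hashing and writing a block. The names mapWithConcurrency, simulateImport and writeBlock are hypothetical and exist only for this illustration. It shows why the worst-case number of in-flight block writes is roughly fileImportConcurrency multiplied by blockWriteConcurrency.

```javascript
// Illustrative only: a toy model of nested concurrency limits.
// Runs `fn` over `items` with at most `limit` calls in flight; results keep input order.
async function mapWithConcurrency (items, limit, fn) {
  const results = new Array(items.length)
  let next = 0

  // Each worker pulls the next unclaimed index until the list is exhausted.
  async function worker () {
    while (next < items.length) {
      const index = next++
      results[index] = await fn(items[index], index)
    }
  }

  const workers = Array.from({ length: Math.min(limit, items.length) }, () => worker())
  await Promise.all(workers)
  return results
}

// Hypothetical model: every "file" is an array of blocks.
// Returns the peak number of block writes that were in flight at once.
async function simulateImport (fileList, options) {
  const { fileImportConcurrency, blockWriteConcurrency } = options
  let inFlight = 0
  let peak = 0

  const writeBlock = async (block) => {
    inFlight++
    peak = Math.max(peak, inFlight)
    await new Promise(resolve => setTimeout(resolve, 1)) // stand-in for hash + disk write
    inFlight--
    return block
  }

  // Outer limit: files in progress. Inner limit: blocks in progress per file.
  await mapWithConcurrency(fileList, fileImportConcurrency, (blocks) =>
    mapWithConcurrency(blocks, blockWriteConcurrency, writeBlock))

  return peak
}

const exampleFiles = Array.from({ length: 8 }, () => [1, 2, 3, 4, 5, 6])

simulateImport(exampleFiles, { fileImportConcurrency: 2, blockWriteConcurrency: 3 })
  .then(peak => console.log('peak concurrent block writes:', peak)) // 6 here (2 x 3)
```

With both values set high, that product is what ends up competing for CPU and disk, which is the "swamping" described above.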

It also:

  1. Flattens the module options, because validating deep objects was clunky and access to the config sub-objects isn't well separated within this module
  2. Replaces superstruct and deep-extend with merge-options, which is better suited to merging options and is smaller
  3. Replaces async-iterator-* modules with versions from the more zeitgeisty it-* namespace
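
As a rough illustration of why flat options are easier to handle, the sketch below merges caller options over a flat set of defaults and validates them in a single pass. It uses plain object spread as a stand-in rather than the merge-options module itself, resolveOptions is a hypothetical helper, and the default values are the ones proposed in the benchmarks in this thread, not necessarily what ships.

```javascript
// Assumed defaults: taken from the "Proposed default values" benchmark in this PR.
const defaultOptions = {
  fileImportConcurrency: 50,
  blockWriteConcurrency: 10
}

// Hypothetical helper: merge flat options and check them in one loop.
function resolveOptions (options = {}) {
  const merged = { ...defaultOptions, ...options }

  for (const key of ['fileImportConcurrency', 'blockWriteConcurrency']) {
    if (!Number.isInteger(merged[key]) || merged[key] < 1) {
      throw new TypeError(`${key} must be a positive integer`)
    }
  }

  return merged
}

console.log(resolveOptions({ blockWriteConcurrency: 1 }))
```

Because every key sits at the top level, there is no need for a recursive merge or a nested schema to validate against.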

Supersedes #38, sort of. No batching but atomicity guarantees are maintained and performance gains are broadly similar with the right tuning.

alanshaw commented 4 years ago

Got any perf stats we can ogle at?

achingbrain commented 4 years ago

> Got any perf stats we can ogle at?

Data generation: 1,000 files, each between 1 KiB and 1 MiB in size, filled with random(ish) data:

$ cat gen-data.sh 
#!/bin/bash

mkdir -p data

for n in {1..1000}; do
    dd if=/dev/urandom of=data/file-$( printf %03d "$n" ).bin bs=1 count=$( shuf -i 1024-1048576 -n 1 )
done

Generated data:

$ du -hs ./data
525M    ./data

go-IPFS (IDK why it reports 501.99 MiB when du says 525M):

$ time ipfs add -r ./data
...
added QmSgH3zFNiMNyH2es2TrkZmjPTETVXvdGQHkA1hPS2n6sL data
 501.99 MiB / 501.99 MiB [============================] 100.00%
real    0m55.654s
user    0m4.931s
sys     0m5.502s

js-IPFS master:

$ time jsipfs add -r ./data
added QmSgH3zFNiMNyH2es2TrkZmjPTETVXvdGQHkA1hPS2n6sL data

real    0m52.505s
user    0m13.482s
sys     0m5.760s

File & block concurrency = 1 (i.e. how it was before this PR):

$ time jsipfs add -r --file-import-concurrency=1 --block-write-concurrency=1 ./data
...
added QmSgH3zFNiMNyH2es2TrkZmjPTETVXvdGQHkA1hPS2n6sL data

real    0m52.650s
user    0m14.849s
sys     0m5.410s

Proposed default values:

$ time jsipfs add -r --file-import-concurrency=50 --block-write-concurrency=10 ./data
...
added QmSgH3zFNiMNyH2es2TrkZmjPTETVXvdGQHkA1hPS2n6sL data

real    0m25.814s
user    0m11.552s
sys     0m5.262s
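
For a quick comparison of the runs above, the wall-clock ("real") times can be converted to seconds and divided. This is only arithmetic on the figures already posted: parseTime is a hypothetical helper, and the 501.99 MiB size is the one go-IPFS reported for the same data set.

```javascript
// Hypothetical helper: turn `time` output such as "0m52.650s" into seconds.
function parseTime (text) {
  const match = /^(\d+)m(\d+(?:\.\d+)?)s$/.exec(text)

  if (!match) {
    throw new Error(`unrecognised time: ${text}`)
  }

  return Number(match[1]) * 60 + Number(match[2])
}

const before = parseTime('0m52.650s') // file & block concurrency = 1
const after = parseTime('0m25.814s') // file concurrency = 50, block concurrency = 10
const sizeMiB = 501.99 // size reported by `ipfs add` above

console.log(`speedup: ${(before / after).toFixed(2)}x`)
console.log(`throughput before: ${(sizeMiB / before).toFixed(2)} MiB/s`)
console.log(`throughput after: ${(sizeMiB / after).toFixed(2)} MiB/s`)
```

On these figures the proposed defaults come out at roughly twice as fast as concurrency 1, for this machine and this data set only.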