Closed by andrew 4 years ago
Success, total time: 12 hours
NumObjects: 5824574
RepoSize (MiB): 1086
StorageMax (MiB): 9536
RepoPath: /data/.ipfs
Version: fs-repo@7
However, when trying to publish the name, I get an error:
andrew@sd-48607:/data$ IPFS_PATH=/data/.ipfs ipfs name publish QmU7durGPbuyvjJPfwkszWiKg3rh2xgWX5RgqjSbgpPXJC
Error: can't publish while offline: pass `--allow-offline` to override
(the last error is because your ipfs node is offline)
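For reference, that error can be worked around either by passing the flag it suggests or by starting the daemon first; a minimal sketch using the CIDs from this thread:

```shell
# Option 1: publish without a running daemon, as the error suggests.
# With --allow-offline the record is stored locally and only announced
# to the network once the node comes online.
IPFS_PATH=/data/.ipfs ipfs name publish --allow-offline QmU7durGPbuyvjJPfwkszWiKg3rh2xgWX5RgqjSbgpPXJC

# Option 2: start the daemon (e.g. in another terminal), then publish normally.
# IPFS_PATH=/data/.ipfs ipfs daemon &
# IPFS_PATH=/data/.ipfs ipfs name publish QmU7durGPbuyvjJPfwkszWiKg3rh2xgWX5RgqjSbgpPXJC
```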
This is super cool! @magik6k @Stebalien @Kubuxu - what could we do differently to make this ...less slow. =]
What filesystem do you use? Are the disks in raid 1 or 0 or independent?
12h for 1.18TB seems to be around 28MB/s, which probably isn't anywhere near to what the drives can get to.
I'd recommend initializing the node with the badger datastore - ipfs init --profile=badgerds (it's possible to convert between datastores, but it will likely be faster to just re-add the data). That should bring the speed closer to what the drives can do.
It also looks like the CPU usage is pretty low. We may need to finally implement parallel hashing.
Name successfully published, thanks for the tip @momack2
$ ipfs name publish QmU7durGPbuyvjJPfwkszWiKg3rh2xgWX5RgqjSbgpPXJC
Published to QmcESswYyg3R3YdLWbBN71iAYErJfQgk8NPB2yZcH9nKLY: /ipfs/QmU7durGPbuyvjJPfwkszWiKg3rh2xgWX5RgqjSbgpPXJC
@magik6k looks like the setup defaulted to raid 1, will try with badger next
Re-ran against the same /data/apt directory with a fresh /data/.ipfs directory using the --profile=badgerds option:
$ export IPFS_PATH=/data/.ipfs
$ export IPFS_FD_MAX=4096
$ ipfs init --profile=badgerds
$ ipfs config Reprovider.Interval "0"
$ ipfs config --json Datastore.NoSync true
$ ipfs config --json Experimental.ShardingEnabled true
$ ipfs config --json Experimental.FilestoreEnabled true
$ time ipfs add -r --progress --offline --fscache --quieter --raw-leaves --nocopy /data/apt
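For anyone following along, roughly what those add flags do (--fscache and --nocopy are the experimental filestore features, hence the Experimental.FilestoreEnabled config above):

```shell
# -r            recurse into the directory
# --progress    show a progress bar
# --offline     don't announce added blocks to the network
# --fscache     check the filestore for pre-existing blocks (experimental)
# --quieter     only print the final root CID
# --raw-leaves  store leaf data as raw blocks rather than wrapped unixfs nodes
# --nocopy      reference file data in place via the filestore instead of
#               copying it into the repo (experimental, needs FilestoreEnabled)
time ipfs add -r --progress --offline --fscache --quieter --raw-leaves --nocopy /data/apt
```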
Same CID produced, this time it took 18 hours:
badger 2019/03/10 04:30:04 INFO: Storing value log head: {Fid:6 Len:48 Offset:150054695}
1.18 TiB / 1.18 TiB [==============================================================================================================] 100.00%
QmU7durGPbuyvjJPfwkszWiKg3rh2xgWX5RgqjSbgpPXJC
badger 2019/03/10 06:00:07 INFO: Storing value log head: {Fid:7 Len:48 Offset:43557748}
badger 2019/03/10 06:00:11 INFO: Force compaction on level 0 done
real 1089m48.271s
user 114m58.540s
sys 23m14.405s
ipfs stats repo --human
NumObjects: 5824574
RepoSize (MiB): 1825
StorageMax (MiB): 9536
RepoPath: /data/.ipfs
Version: fs-repo@7
/data/.ipfs/ -> 1.8G (previous run without badger resulted in 2.4G)
Now updating the mirror about 60 hours after the first rsync:
andrew@sd-48607:~$ rsync --recursive --times --links --safe-links --hard-links --stats --exclude "Packages*" --exclude "Sources*" --exclude "Release*" --exclude "InRelease" rsync://archive.ubuntu.com/ubuntu /data/apt/
This is an Ubuntu mirror - treat it kindly
Number of files: 983,593 (reg: 910,732, dir: 58,765, link: 14,096)
Number of created files: 7,438 (reg: 7,438)
Number of deleted files: 0
Number of regular files transferred: 11,316
Total file size: 1,295,095,654,503 bytes
Total transferred file size: 10,286,628,082 bytes
Literal data: 7,919,775,085 bytes
Matched data: 2,366,852,997 bytes
File list size: 17,960,443
File list generation time: 0.001 seconds
File list transfer time: 0.000 seconds
Total bytes sent: 8,081,081
Total bytes received: 7,963,813,509
sent 8,081,081 bytes received 7,963,813,509 bytes 36,484,643.43 bytes/sec
total size is 1,295,095,654,503 speedup is 162.46
andrew@sd-48607:~$ rsync --recursive --times --links --safe-links --hard-links --stats --delete --delete-after rsync://archive.ubuntu.com/ubuntu /data/apt/
This is an Ubuntu mirror - treat it kindly
Number of files: 985,691 (reg: 912,830, dir: 58,765, link: 14,096)
Number of created files: 259 (reg: 259)
Number of deleted files: 7,443 (reg: 7,441, link: 2)
Number of regular files transferred: 2,423
Total file size: 1,295,812,731,198 bytes
Total transferred file size: 1,324,885,090 bytes
Literal data: 471,646,863 bytes
Matched data: 853,238,227 bytes
File list size: 36,770,417
File list generation time: 3.452 seconds
File list transfer time: 0.000 seconds
Total bytes sent: 3,129,540
Total bytes received: 510,474,614
sent 3,129,540 bytes received 510,474,614 bytes 18,676,514.69 bytes/sec
total size is 1,295,812,731,198 speedup is 2,522.98
Have started another add, this time with the existing badgerds .ipfs directory, to see how long it takes to check and add the ~7,438 changed files.
Failed after 3 hours 30 mins:
andrew@sd-48607:~$ time ipfs add -r --progress --offline --fscache --quieter --raw-leaves --nocopy /data/apt
badger 2019/03/10 13:41:56 INFO: All 8 tables opened in 1.692s
badger 2019/03/10 13:41:56 INFO: Replaying file id: 7 at offset: 43557796
badger 2019/03/10 13:41:56 INFO: Replay took: 12.022µs
1.18 TiB / 1.18 TiB [==================================================================================================================================] 100.00%
QmQsQ9mtDXu5NTeXpinXuPUjy3nMbCi5rLfrycbf9rDdvh
Error: failed to get block for zb2rhjn4bxfqtxZrzfNYyQgm1EvKHcRket2TbdR6Y2L46zax3: data in file did not match. apt/dists/disco/universe/debian-installer/binary-i386/Packages.gz offset 0
badger 2019/03/10 17:02:14 INFO: Storing value log head: {Fid:7 Len:48 Offset:56345495}
badger 2019/03/10 17:02:17 INFO: Force compaction on level 0 done
real 200m23.176s
user 101m42.635s
sys 12m36.793s
Interestingly, adding the updated non-badger .ipfs directory also failed, on a similar file in a different path:
andrew@sd-48607:~$ time ipfs add -r --progress --offline --fscache --quieter --raw-leaves --nocopy /data/apt
1.18 TiB / 1.18 TiB [==================================================================================================================================] 100.00%
QmQsQ9mtDXu5NTeXpinXuPUjy3nMbCi5rLfrycbf9rDdvh
Error: failed to get block for zb2rhjn4bxfqtxZrzfNYyQgm1EvKHcRket2TbdR6Y2L46zax3: data in file did not match. apt/dists/disco/universe/debian-installer/binary-i386/Packages.gz offset 0
real 301m42.881s
user 130m18.937s
sys 15m29.819s
@warpfork suggested doing a run with export IPFS_PROF=yas in place to output more useful info for debugging; will kick another one off tomorrow.
Error: failed to get block for zb2rhjn4bxfqtxZrzfNYyQgm1EvKHcRket2TbdR6Y2L46zax3: data in file did not match. apt/dists/disco/universe/debian-installer/binary-i386/Packages.gz offset 0
The file was probably updated. The filestore expects files to be immutable once added, which can be problematic in this case. I'm not sure what the best workaround is here, but you need to somehow make ipfs not add those files to the filestore.
We could change the filestore to only add read-only files (possibly behind a flag, adding read-write content to the normal blockstore instead).
@magik6k ideally this would happen transparently to users, so they don't have to declare which files may change; then keeping an IPFS mirror up to date with rsync would be easily scriptable without needing to know exactly how future rsync updates will behave.
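As a stop-gap under the current behaviour, the filestore can at least report which backing files have changed since they were added; a hedged sketch using the experimental filestore subcommands (the exact output format of verify may differ between versions):

```shell
# "ipfs filestore verify" re-reads each referenced file region and prints a
# status per object (ok / changed / no-file / error); filter out the good ones
# to see which backing files rsync has touched since they were added.
ipfs filestore verify | grep -v '^ok'

# The affected directory can then be re-added so the filestore references
# point at the updated file contents, e.g.:
# ipfs add -r --raw-leaves --nocopy /data/apt/dists
```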
I thought I'd try again with a slightly smaller dataset, the https://clojars.org/ maven repository, which has an rsync server and is about 60GB.
Kicking off the same set of commands with a fresh /data/.ipfs folder comes up with an estimate of 6 hours, which is quite surprising given that it took 12 hours to do 1.2TB; I would expect roughly 1/20th of the time (~36 mins). I guess it's due to the differing folder structure - clojars has many more folders at the top level?
I'd say many small files too. Can you run:
find . -type d | wc -l
find . -type f | wc -l
on both datasets?
/data/apt:
find . -type d | wc -l
=> 58,765
find . -type f | wc -l
=> 912,830
/data/clojars:
find . -type d | wc -l
=> 159,473
find . -type f | wc -l
=> 1,634,435
clojars has 2.7x the folder count and 1.8x the file count of apt
I've documented some of the blockers found here in this PR: https://github.com/protocol/package-managers/pull/21/files
In the meantime, I've kicked off a fresh ipfs add of the apt data that doesn't use the filestore, to see how long it takes to re-add everything after an rsync update while avoiding the error caused by the filestore expecting added files to be immutable.
If that successfully completes, I'll put together a blog post for https://blog.ipfs.io detailing how to use apt-transport-ipfs and how to set up your own IPFS mirror of an ubuntu/debian mirror.
Update on the current attempt at mirroring apt without the filestore: the initial offline import took around 36 hours and completed successfully; as expected, /data/.ipfs grew to 1.2TB, slightly bigger than the size of /data/apt.
After an rsync run to update /data/apt from the mirror with the past few days' worth of changes, running ipfs add -r --progress --offline --quieter --raw-leaves /data/apt again only took 5 hours, and completed successfully.
I've now run a third rsync update (pulling in only 12 hours' worth of changes); I also started the ipfs daemon this time, although still passing the --offline flag when adding changes. The reported time for ipfs add -r --progress --offline --quieter --raw-leaves /data/apt is now ~2 hours.
That seems like it's possibly successful enough to actually be usable as an apt mirror!
Also going to test to see how much slower it is without the --offline flag after this last run.
Third offline ipfs add (with daemon running) completed in 3 hours 20 minutes.
Another rsync run done (minimal updates since 3 hours ago); going to attempt an ipfs add without --offline now (initial estimate from the CLI is 2.5 hours).
In the meantime, you can browse the mirror here: https://cloudflare-ipfs.com/ipfs/QmThJ4k554iT3B7SZonmVMwspHALiHGThhqaem7nij1wQD
n.b. using cloudflare as I'm getting internalWebError: operation not supported from https://ipfs.io/ipfs/QmThJ4k554iT3B7SZonmVMwspHALiHGThhqaem7nij1wQD
Updating and adding without --offline (with the daemon running) took the same amount of time, 3 hours. rsync downloaded about 130MB of changes.
Have published an ipns for it: https://cloudflare-ipfs.com/ipns/QmTeHfjrEfVDUDRootgUF45eZoeVxKCy3mjNLA8q5fnc1B
Now working on a docker file for a consumer using https://github.com/JaquerEspeis/apt-transport-ipfs and a script to keep the ipfs mirror up to date with regular rsyncs.
Just published a docker image that uses the IPFS mirror I've got running to install things: https://github.com/andrew/apt-on-ipfs
Example output from docker build, which sets things up and installs jq from IPFS:
It's not very quick, and the IPFS daemon is very chatty in the logs but it's working!
Moved apt-on-ipfs into the ipfs-shipyard org: https://github.com/ipfs-shipyard/apt-on-ipfs
Also added a mirror.sh script that is currently running every 4 hours via cron to keep the mirror up to date: https://github.com/ipfs-shipyard/apt-on-ipfs/blob/master/mirror.sh
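For reference, the general shape of such a cron-driven update is roughly the following sketch (paths and flags are taken from earlier in this thread, but this is a hypothetical outline, not the actual mirror.sh):

```shell
#!/bin/sh
# Hypothetical mirror update loop: pull the latest packages with rsync,
# re-add the tree, then point the node's IPNS name at the new root CID.
set -e

MIRROR=rsync://archive.ubuntu.com/ubuntu   # upstream mirror
DATA=/data/apt                             # local copy
export IPFS_PATH=/data/.ipfs

rsync --recursive --times --links --safe-links --hard-links \
      --delete --delete-after "$MIRROR" "$DATA"/

# Re-add the tree; with most data unchanged this is mostly re-hashing,
# and unchanged blocks dedupe against what's already in the repo.
CID=$(ipfs add -r --quieter --raw-leaves "$DATA")

# Publish the new root under the node's IPNS name.
ipfs name publish "/ipfs/$CID"
```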
Does anyone have data on client bandwidth, for installing single package over IPFS versus the same package over apt?
Strange, I'm also getting internalWebError: operation not supported on my own IPFS instance. I reported it in https://github.com/ipfs/go-ipfs/issues/6203.
As for the IPNS hash, I'm personally getting ipfs resolve -r /ipns/QmTeHfjrEfVDUDRootgUF45eZoeVxKCy3mjNLA8q5fnc1B: context deadline exceeded. I also used this tool to check for IPNS availability, but it seems like no one's able to get it.
@NatoBoram I've been off sick for a couple of weeks and it looks like the ipfs daemon fell over on the mirror server at some point. I've restarted it now; that ipns name should start resolving again after the next successful cron job run, within a few hours.
FYI I'm going to be turning off the machine that was running the experimental mirror on 2019-07-01
@andrew what's the status of this experiment? Could we (literal 'we', not royal 'we', lol) summarize, share, and archive, or is there more here to do?
"summarize, share, and archive" sounds like a good next step; this will form the basis for a lot of benchmarking and testing of file-system based package managers this quarter.
@andrew I asked this above:
Does anyone have data on client bandwidth, for installing single package over IPFS versus the same package over apt?
It'd be good to see client data in general, to help us understand the pros and cons of using IPFS in this scenario for a client.
cc @Mikaela
I thought it'd be interesting to try to replicate what was attempted in this thread just over a year ago, to see if similar performance problems exist when adding a large file-system based package manager to IPFS.
Steps outlined below:
Spun up a "Store-1-S" Online.net dedicated server in France with Ubuntu 18.04.
Followed https://wiki.ubuntu.com/Mirrors/Scripts to rsync a mirror into /data/apt. Output from rsync:
At this point /data/apt is about 1.2TB. Then installed ipfs:
Based on notes from https://github.com/ipfs/notes/issues/212, made the following config changes:
Then ran the following command to add /data/apt to IPFS:
I then took the dog for a walk.
Output from dstat at various times over the next few hours:
/data/.ipfs dir was 441MB for 345GB uploaded. Had some lunch.
/data/.ipfs dir was 862MB for 532GB uploaded after about 3 hours.
Status as of 3pm - /data/.ipfs: 1.3GB
3:15pm: progress slowed again, lots of writing, no reading
4:15pm: back to full speed again; /data/.ipfs: 1.6GB
5:30pm: progress slowed again, lots of writing, no reading; seems to be happening every hour. /data/.ipfs: 1.9GB
6:20pm: back to full speed again; /data/.ipfs: 2.1GB
7:20pm: stuck at 100%, similar disk writing pattern, no CID returned yet; /data/.ipfs: 2.3GB
8:15pm: still stuck at 100%, similar disk writing pattern, no CID returned yet
Will update this issue as it continues.
Drop a comment with any commands you'd like to see the output for during or after.