Experiment: Setting up an Ubuntu mirror on IPFS #18

Closed · andrew closed this 4 years ago

andrew commented 5 years ago

I thought it'd be interesting to try to replicate what was attempted in this thread just over a year ago, to see if similar performance problems still exist when adding a large file-system-based package manager to IPFS.

Steps outlined below:

Spun up a "Store-1-S" Online.net dedicated server in France with Ubuntu 18.04:

[Screenshot: Online.net "Store-1-S" server specs, 2019-03-08 13:16]

Followed https://wiki.ubuntu.com/Mirrors/Scripts to rsync a mirror into /data/apt:

$ rsync --recursive --times --links --safe-links --hard-links --stats --exclude "Packages*" --exclude "Sources*" --exclude "Release*" --exclude "InRelease" rsync://archive.ubuntu.com/ubuntu /data/apt/
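For anyone replaying this, a rough breakdown of those flags (standard rsync semantics; the excludes hold back the apt index files so the mirror never advertises packages that haven't arrived yet):

--recursive    # recurse into directories
--times        # preserve mtimes, so later runs can skip unchanged files
--links        # copy symlinks as symlinks
--safe-links   # ignore symlinks pointing outside the copied tree
--hard-links   # preserve hard links between files
--stats        # print the transfer statistics shown below
--exclude ...  # hold back Packages*/Sources*/Release*/InRelease for a later pass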

Output from rsync:

Number of files: 983,559 (reg: 910,696, dir: 58,765, link: 14,098)
Number of created files: 781,684 (reg: 721,325, dir: 47,528, link: 12,831)
Number of deleted files: 0
Number of regular files transferred: 722,538
Total file size: 1,297,227,684,984 bytes
Total transferred file size: 913,052,882,058 bytes
Literal data: 912,253,989,263 bytes
Matched data: 798,892,795 bytes
File list size: 41,354,902
File list generation time: 0.001 seconds
File list transfer time: 0.000 seconds
Total bytes sent: 16,633,996
Total bytes received: 912,544,392,481

sent 16,633,996 bytes  received 912,544,392,481 bytes  72,204,852.35 bytes/sec
total size is 1,297,227,684,984  speedup is 1.42

At this point /data/apt is about 1.2TB

Then installed ipfs:

$ wget https://dist.ipfs.io/go-ipfs/v0.4.19/go-ipfs_v0.4.19_linux-amd64.tar.gz
$ tar xvfz go-ipfs_v0.4.19_linux-amd64.tar.gz
$ cd go-ipfs
$ ./install.sh

Based on notes from https://github.com/ipfs/notes/issues/212, made the following config changes:

$ export IPFS_PATH=/data/.ipfs
$ export IPFS_FD_MAX=4096

$ ipfs init

$ ipfs config Reprovider.Interval "0"
$ ipfs config --json Datastore.NoSync true
$ ipfs config --json Experimental.ShardingEnabled true
$ ipfs config --json Experimental.FilestoreEnabled true

$ ipfs daemon
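Roughly what those settings do, as I understand the go-ipfs 0.4.19 config (worth double-checking against the docs): IPFS_FD_MAX raises the file-descriptor limit go-ipfs sets for itself; Reprovider.Interval "0" stops the node from periodically re-announcing every block to the DHT, which would be punishing with millions of blocks; Datastore.NoSync skips some fsyncs in the flatfs datastore, trading crash-safety for write speed; Experimental.ShardingEnabled turns on HAMT directory sharding so that directories with huge numbers of entries can be represented; and Experimental.FilestoreEnabled permits --nocopy adds, which store references to the original files on disk instead of copying the data into the repo.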

Then ran the following command to add /data/apt to IPFS:

$ ipfs add -r --progress --offline --fscache --quieter --raw-leaves --nocopy /data/apt
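My reading of those flags, for reference: -r recurses into the directory; --progress draws the progress bar; --offline avoids announcing each block to the network during the add; --fscache checks the filestore for already-present blocks before re-adding; --quieter prints only the final root CID; --raw-leaves stores leaf data as raw blocks rather than wrapping it in unixfs nodes; and --nocopy uses the filestore, so the repo holds references into /data/apt rather than a second copy of the data (which is why the repo stays in the hundreds of megabytes below while hundreds of gigabytes are "added").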

[Screenshot: ipfs add progress, 2019-03-08 09:40]

I then took the dog for a walk 🐩 🚶

Output from dstat at various times over the next few hours:

[Screenshot: dstat output, 2019-03-08 11:40] (sorry about the colours here)

[Screenshot: dstat output, 2019-03-08 12:27] It got "stuck" here around 30% (~345GB): the progress bar showed no changes while slowly writing to disk.

/data/.ipfs dir was 441MB for 345GB added

Had some lunch 🍜

[Screenshot: dstat output, 2019-03-08 13:17] Came back to life whilst I was at lunch.

/data/.ipfs dir was 862MB for 532GB added after about 3 hours.

Status as of 3pm:

[Screenshot: ipfs add progress, 2019-03-08 15:11]


[Screenshot: dstat output, 2019-03-08 15:17]

3:15pm: progress slowed again; lots of writing, no reading


4:15pm back to full speed again:

[Screenshot: dstat output, 2019-03-08 16:15]

5:30pm: progress slowed again; lots of writing, no reading. This seems to be happening every hour.

[Screenshot: dstat output, 2019-03-08 17:27]

6:20pm back to full speed again:


7:20pm: stuck at 100%, similar disk-writing pattern, no CID returned yet

8:15pm: still stuck at 100%, similar disk writing pattern, no CID returned yet


Will update this issue as it continues.

Drop a comment with any commands you'd like to see the output for during or after.

andrew commented 5 years ago

Success, total time: 12 hours

NumObjects:       5824574
RepoSize (MiB):   1086
StorageMax (MiB): 9536
RepoPath:         /data/.ipfs
Version:          fs-repo@7

Although when trying to publish the name, I get an error:

andrew@sd-48607:/data$ IPFS_PATH=/data/.ipfs ipfs name publish QmU7durGPbuyvjJPfwkszWiKg3rh2xgWX5RgqjSbgpPXJC
Error: can't publish while offline: pass `--allow-offline` to override
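(For the record, the fix the error suggests would look like the line below; --allow-offline stores the IPNS record locally so it can be announced once the node is connected. The simpler route taken further down was to publish with the daemon running.)

$ ipfs name publish --allow-offline QmU7durGPbuyvjJPfwkszWiKg3rh2xgWX5RgqjSbgpPXJC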
momack2 commented 5 years ago

(the last one is cause your ipfs node is offline)

This is super cool! @magik6k @Stebalien @Kubuxu - what could we do differently to make this ...less slow. =]

magik6k commented 5 years ago

What filesystem are you using? Are the disks in RAID 1, RAID 0, or independent?

12h for 1.18TB works out to around 28MB/s, which probably isn't anywhere near what the drives can manage.

I'd recommend initializing the node with the Badger datastore: ipfs init --profile=badgerds (it's possible to convert between datastores, but it will likely be faster to just re-add the data). It should bring the speed closer to that of the drives.
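(For the conversion route there is a standalone ipfs-ds-convert tool; a rough sketch, assuming its CLI hasn't changed, is to edit Datastore.Spec in the config to the badger layout and then run:

$ go get -u github.com/ipfs/ipfs-ds-convert
$ IPFS_PATH=/data/.ipfs ipfs-ds-convert convert

against the repo.)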

Stebalien commented 5 years ago

It also looks like the CPU usage is pretty low. We may need to finally implement parallel hashing.

andrew commented 5 years ago

Name successfully published, thanks for the tip @momack2

$ ipfs name publish QmU7durGPbuyvjJPfwkszWiKg3rh2xgWX5RgqjSbgpPXJC
Published to QmcESswYyg3R3YdLWbBN71iAYErJfQgk8NPB2yZcH9nKLY: /ipfs/QmU7durGPbuyvjJPfwkszWiKg3rh2xgWX5RgqjSbgpPXJC
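With the record published, anyone should be able to resolve the mirror from that peer ID once it has propagated, e.g.:

$ ipfs name resolve QmcESswYyg3R3YdLWbBN71iAYErJfQgk8NPB2yZcH9nKLY
/ipfs/QmU7durGPbuyvjJPfwkszWiKg3rh2xgWX5RgqjSbgpPXJC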

@magik6k looks like the setup defaulted to RAID 1, will try with badger next

andrew commented 5 years ago

Re-ran against the same /data/apt directory with a fresh /data/.ipfs directory using the --profile=badgerds option:

$ export IPFS_PATH=/data/.ipfs
$ export IPFS_FD_MAX=4096

$ ipfs init --profile=badgerds

$ ipfs config Reprovider.Interval "0"
$ ipfs config --json Datastore.NoSync true
$ ipfs config --json Experimental.ShardingEnabled true
$ ipfs config --json Experimental.FilestoreEnabled true

$ time ipfs add -r --progress --offline --fscache --quieter --raw-leaves --nocopy /data/apt

Same CID produced; this time it took 18 hours:

badger 2019/03/10 04:30:04 INFO: Storing value log head: {Fid:6 Len:48 Offset:150054695}
 1.18 TiB / 1.18 TiB [==============================================================================================================] 100.00% QmU7durGPbuyvjJPfwkszWiKg3rh2xgWX5RgqjSbgpPXJC
badger 2019/03/10 06:00:07 INFO: Storing value log head: {Fid:7 Len:48 Offset:43557748}
badger 2019/03/10 06:00:11 INFO: Force compaction on level 0 done

real    1089m48.271s
user    114m58.540s
sys     23m14.405s
$ ipfs stats repo --human

NumObjects:       5824574
RepoSize (MiB):   1825
StorageMax (MiB): 9536
RepoPath:         /data/.ipfs
Version:          fs-repo@7

/data/.ipfs/ -> 1.8G (previous run without badger resulted in 2.4G)

andrew commented 5 years ago

Now updating the mirror about 60 hours after the first rsync:

andrew@sd-48607:~$ rsync --recursive --times --links --safe-links --hard-links --stats --exclude "Packages*" --exclude "Sources*" --exclude "Release*" --exclude "InRelease" rsync://archive.ubuntu.com/ubuntu /data/apt/
This is an Ubuntu mirror - treat it kindly

Number of files: 983,593 (reg: 910,732, dir: 58,765, link: 14,096)
Number of created files: 7,438 (reg: 7,438)
Number of deleted files: 0
Number of regular files transferred: 11,316
Total file size: 1,295,095,654,503 bytes
Total transferred file size: 10,286,628,082 bytes
Literal data: 7,919,775,085 bytes
Matched data: 2,366,852,997 bytes
File list size: 17,960,443
File list generation time: 0.001 seconds
File list transfer time: 0.000 seconds
Total bytes sent: 8,081,081
Total bytes received: 7,963,813,509

sent 8,081,081 bytes  received 7,963,813,509 bytes  36,484,643.43 bytes/sec
total size is 1,295,095,654,503  speedup is 162.46
andrew@sd-48607:~$ rsync --recursive --times --links --safe-links --hard-links   --stats --delete --delete-after rsync://archive.ubuntu.com/ubuntu /data/apt/
This is an Ubuntu mirror - treat it kindly

Number of files: 985,691 (reg: 912,830, dir: 58,765, link: 14,096)
Number of created files: 259 (reg: 259)
Number of deleted files: 7,443 (reg: 7,441, link: 2)
Number of regular files transferred: 2,423
Total file size: 1,295,812,731,198 bytes
Total transferred file size: 1,324,885,090 bytes
Literal data: 471,646,863 bytes
Matched data: 853,238,227 bytes
File list size: 36,770,417
File list generation time: 3.452 seconds
File list transfer time: 0.000 seconds
Total bytes sent: 3,129,540
Total bytes received: 510,474,614

sent 3,129,540 bytes  received 510,474,614 bytes  18,676,514.69 bytes/sec
total size is 1,295,812,731,198  speedup is 2,522.98

Have started another add, this time with the existing badgerds .ipfs directory, to see how long it takes to check and add the ~7,438 changed files.

andrew commented 5 years ago

Failed after 3 hours 30 mins:

andrew@sd-48607:~$ time ipfs add -r --progress --offline --fscache --quieter --raw-leaves --nocopy /data/apt
badger 2019/03/10 13:41:56 INFO: All 8 tables opened in 1.692s
badger 2019/03/10 13:41:56 INFO: Replaying file id: 7 at offset: 43557796
badger 2019/03/10 13:41:56 INFO: Replay took: 12.022µs
 1.18 TiB / 1.18 TiB [==================================================================================================================================] 100.00% QmQsQ9mtDXu5NTeXpinXuPUjy3nMbCi5rLfrycbf9rDdvh
Error: failed to get block for zb2rhjn4bxfqtxZrzfNYyQgm1EvKHcRket2TbdR6Y2L46zax3: data in file did not match. apt/dists/disco/universe/debian-installer/binary-i386/Packages.gz offset 0
badger 2019/03/10 17:02:14 INFO: Storing value log head: {Fid:7 Len:48 Offset:56345495}
badger 2019/03/10 17:02:17 INFO: Force compaction on level 0 done

real    200m23.176s
user    101m42.635s
sys     12m36.793s
andrew commented 5 years ago

Interestingly, the add using the updated non-badger .ipfs directory also failed, on the same file:

andrew@sd-48607:~$ time ipfs add -r --progress --offline --fscache --quieter --raw-leaves --nocopy /data/apt
 1.18 TiB / 1.18 TiB [==================================================================================================================================] 100.00% QmQsQ9mtDXu5NTeXpinXuPUjy3nMbCi5rLfrycbf9rDdvh
Error: failed to get block for zb2rhjn4bxfqtxZrzfNYyQgm1EvKHcRket2TbdR6Y2L46zax3: data in file did not match. apt/dists/disco/universe/debian-installer/binary-i386/Packages.gz offset 0

real    301m42.881s
user    130m18.937s
sys     15m29.819s
andrew commented 5 years ago

@warpfork suggested doing a run with export IPFS_PROF=yas in place to output more useful info for debugging, will kick another one off tomorrow

magik6k commented 5 years ago
Error: failed to get block for zb2rhjn4bxfqtxZrzfNYyQgm1EvKHcRket2TbdR6Y2L46zax3: data in file did not match. apt/dists/disco/universe/debian-installer/binary-i386/Packages.gz offset 0

The file was probably updated. The filestore expects files to be immutable once added, which can be problematic in this case. I'm not sure what the best workaround is here, but you need to somehow make ipfs not add those files to the filestore.

We could change the filestore to only add read-only files (possibly behind a flag), and add read-write content to the normal blockstore.
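(go-ipfs does ship tooling for spotting this situation: a sketch, assuming ipfs filestore verify still prefixes each object with a status word, "ok" for intact entries:

$ ipfs filestore verify | grep -v '^ok'

should list the objects whose backing files no longer match, and the offending blocks can then be removed with ipfs block rm <cid> before re-adding the tree.)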

andrew commented 5 years ago

@magik6k ideally this would happen transparently to users, so they don't have to declare which files may change; then keeping an IPFS mirror up to date with rsync would be easily scriptable without needing to know exactly how future rsync updates will behave.

andrew commented 5 years ago

I thought I'd try again with a slightly smaller dataset: the https://clojars.org/ Maven repository, which has an rsync server and is about 60GB.

Kicking off the same set of commands with a fresh /data/.ipfs folder produces an estimate of 6 hours, which is quite surprising: given that 1.2TB took 12 hours, I'd expect roughly 1/20th of that (about 36 mins) for 60GB.

I guess it's due to the differing folder structure; clojars has many more folders at the top level?

magik6k commented 5 years ago

I'd say many small files too. Can you run:

find . -type d | wc -l
find . -type f | wc -l 

on both datasets?

andrew commented 5 years ago

/data/apt:

find . -type d | wc -l => 58,765
find . -type f | wc -l => 912,830

/data/clojars:

find . -type d | wc -l => 159,473
find . -type f | wc -l => 1,634,435

clojars has 2.7x the folder count and 1.8x the file count of apt

andrew commented 5 years ago

I've documented some of the blockers found here in this PR: https://github.com/protocol/package-managers/pull/21/files

andrew commented 5 years ago

In the meantime, I've kicked off a fresh ipfs add of the apt data that doesn't use the filestore, to see how long re-adding everything after an rsync update takes while avoiding the filestore's immutable-files error.

If that completes successfully, I'll put together a blog post for https://blog.ipfs.io detailing how to use apt-transport-ipfs and how to set up your own IPFS Ubuntu/Debian mirror.

andrew commented 5 years ago

Update on the current attempt at mirroring apt without the filestore: the initial offline import took around 36 hours and completed successfully; as expected, /data/.ipfs grew to 1.2TB, slightly bigger than /data/apt itself.

After an rsync run to update /data/apt with the past few days' worth of changes, running ipfs add -r --progress --offline --quieter --raw-leaves /data/apt again took only 5 hours and completed successfully.

I've now run a third rsync update (pulling in only 12 hours' worth of changes); I also started the ipfs daemon this time, although still passing the --offline flag when adding changes. The reported time for ipfs add -r --progress --offline --quieter --raw-leaves /data/apt is now ~2 hours.

That seems like it's possibly successful enough to actually be usable as an apt mirror 🎉

Also going to test to see how much slower it is without the --offline flag after this last run.
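The update loop is now simple enough to script; a minimal sketch built from the commands above (illustrative paths, and none of the locking or error handling a real cron job would want):

#!/bin/sh
set -e
export IPFS_PATH=/data/.ipfs

# two-stage mirror refresh, per the Ubuntu mirror scripts
rsync --recursive --times --links --safe-links --hard-links \
  --exclude "Packages*" --exclude "Sources*" \
  --exclude "Release*" --exclude "InRelease" \
  rsync://archive.ubuntu.com/ubuntu /data/apt/
rsync --recursive --times --links --safe-links --hard-links \
  --delete --delete-after \
  rsync://archive.ubuntu.com/ubuntu /data/apt/

# re-add the tree; unchanged blocks deduplicate, so only the deltas are new
CID=$(ipfs add -r --offline --quieter --raw-leaves /data/apt)

# point the IPNS name at the new root
ipfs name publish "$CID"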

andrew commented 5 years ago

Third offline ipfs add (with daemon running) completed in 3 hours 20 minutes.

Another rsync run done (minimal updates since 3 hours ago); going to attempt an ipfs add without --offline now (initial estimate from the CLI is 2.5 hours)

andrew commented 5 years ago

In the meantime, you can browse the mirror here: https://cloudflare-ipfs.com/ipfs/QmThJ4k554iT3B7SZonmVMwspHALiHGThhqaem7nij1wQD

n.b. using Cloudflare's gateway as I'm getting internalWebError: operation not supported from https://ipfs.io/ipfs/QmThJ4k554iT3B7SZonmVMwspHALiHGThhqaem7nij1wQD

[Screenshot: directory listing on the Cloudflare gateway, 2019-03-21 13:12]
andrew commented 5 years ago

Updating and adding without --offline (with the daemon running) took the same amount of time, 3 hours. rsync downloaded about 130MB of changes.

Have published an IPNS name for it: https://cloudflare-ipfs.com/ipns/QmTeHfjrEfVDUDRootgUF45eZoeVxKCy3mjNLA8q5fnc1B

Now working on a docker file for a consumer using https://github.com/JaquerEspeis/apt-transport-ipfs and a script to keep the ipfs mirror up to date with regular rsyncs.
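On the consumer side, as I understand apt-transport-ipfs, installing the transport gives apt an ipfs:// URI method backed by the local daemon, and sources.list then points at the mirror. The exact entry syntax should be checked against the project README, but conceptually it's something like:

deb ipfs://QmTeHfjrEfVDUDRootgUF45eZoeVxKCy3mjNLA8q5fnc1B/ disco main

with package fetches resolved through IPFS rather than HTTP.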

andrew commented 5 years ago

Just published a docker image that uses the IPFS mirror I've got running to install things: https://github.com/andrew/apt-on-ipfs 🎉

Example output from docker build which sets things up and installs jq from IPFS:

[Screenshot: docker build output installing jq over IPFS, 2019-03-21 17:58]

It's not very quick, and the IPFS daemon is very chatty in the logs, but it's working!

andrew commented 5 years ago

Moved apt-on-ipfs into the ipfs-shipyard org: https://github.com/ipfs-shipyard/apt-on-ipfs

Also added a mirror.sh script that is currently running every 4 hours via cron to keep the mirror up to date: https://github.com/ipfs-shipyard/apt-on-ipfs/blob/master/mirror.sh
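"Every 4 hours via cron" is just a crontab entry along these lines (the paths here are illustrative):

0 */4 * * * /home/andrew/apt-on-ipfs/mirror.sh >> /var/log/apt-mirror.log 2>&1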

makew0rld commented 5 years ago

Does anyone have data on client bandwidth for installing a single package over IPFS versus the same package over plain apt?

NatoBoram commented 5 years ago

Strange, I'm also getting internalWebError: operation not supported on my own IPFS instance. I reported it in https://github.com/ipfs/go-ipfs/issues/6203.

As for the IPNS hash, I'm personally getting ipfs resolve -r /ipns/QmTeHfjrEfVDUDRootgUF45eZoeVxKCy3mjNLA8q5fnc1B: context deadline exceeded.

I also used this tool to check the IPNS availability, but it seems like no one's able to get it.

andrew commented 5 years ago

@NatoBoram I've been off sick for a couple of weeks, and it looks like the ipfs daemon fell over on the mirror server at some point. I've restarted it now; that IPNS name should start resolving again after the next successful cron run, within a few hours.

andrew commented 5 years ago

FYI, I'm going to be turning off the machine that was running the experimental mirror on 2019-07-01

meiqimichelle commented 5 years ago

@andrew what's the status of this experiment? Could we (literal 'we', not royal 'we', lol) summarize, share, and archive, or is there more here to do?

andrew commented 5 years ago

"summarize, share, and archive" sounds like a good next step, this will form the basis for a lot of bench-marking and testing of file-system based package managers this quarter.

makew0rld commented 5 years ago

@andrew I asked this above:

Does anyone have data on client bandwidth for installing a single package over IPFS versus the same package over plain apt?

It'd be good to see client data in general, to help us understand the pros and cons of using IPFS in this scenario for a client.

cc @Mikaela