ipfs-inactive / archives

[ARCHIVED] Repo to coordinate archival efforts with IPFS
https://awesome.ipfs.io/datasets

Testing for archival purposes #136

Open meyerzinn opened 7 years ago

meyerzinn commented 7 years ago

I'm testing IPFS for potential use across >500TB of federal climate data. I used the Rabin chunking mechanism and a sample 30GB dataset, which is available online at ftp://ladsweb.nascom.nasa.gov//allData/5/MYD021KM/2016/362.
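
For anyone reproducing this, the add step looks something like the following (the path and the explicit Rabin parameters are illustrative, not the exact values I used):

ipfs add -r --chunker=rabin /mnt/data/362
ipfs add -r --chunker=rabin-262144-524288-1048576 /mnt/data/362   # same thing with explicit min/avg/max block sizes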

Mirror the Data Initially

OK, thing #1 -- I need >60GB of disk space to hash 30GB of data because IPFS copies the data into its own directory. It'd be great if IPFS could "consume" the data as it converts it to blocks, or otherwise allow a streaming hash.
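
One possible workaround, assuming your go-ipfs build has the experimental filestore enabled (I haven't verified this on this dataset): it lets IPFS reference the original files in place instead of copying blocks into the repo. The path below is a placeholder:

ipfs config --json Experimental.FilestoreEnabled true
ipfs add -r --nocopy --chunker=rabin /mnt/data/362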

I used wget --mirror ftp://ladsweb.nascom.nasa.gov//allData/5/MYD021KM/2016/362 to mirror the remote to the drive.

Start IPFS

Attempts 1-3 or so

I, being a genius, was using a DigitalOcean $5/mo droplet (0.5GB RAM, 1 CPU) to try to hash 30GB of data. Naturally, each attempt was estimated to take about an hour, and each time my internet decided that I was making too much progress and cut out. I kept forgetting to start a new screen session (sketch below), but there should also be a way to resume hashing with the more expensive algorithms (i.e. keep a temporary file somewhere with the Rabin progress info).
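
For the connection problem at least, running the add inside a named screen session survives a dropped SSH connection; a minimal sketch (the session name is arbitrary):

screen -S ipfs-add          # start a named session, then run the ipfs add inside it
# Ctrl-A then D detaches; screen -r ipfs-add reattaches after the connection drops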

Attempt 4

After floundering around on the small droplet, I had a drink of water and planned out my approach. I created a temporary droplet with 32GB RAM and 12 CPUs, and this time it only took 15 minutes. I also ran it in a screen session so my internet couldn't mess it up.

I should note that this was on an auxiliary volume attached to the droplet (extra disk), so I could scale the droplet back down after hashing.
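
If you also want the IPFS repo itself on the extra disk rather than the droplet's root volume, pointing IPFS_PATH at the mounted volume before initializing should work (the mount point below is illustrative):

export IPFS_PATH=/mnt/archive-volume/ipfs
ipfs init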

It worked; I got everything added, but the repo takes about the same amount of space as the original data.

There's also no tool to check de-duplication (i.e. follow a hash and all of its links and count how many blocks are identical).
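
A rough stand-in until such a tool exists: ipfs refs -r lists the linked blocks and, as I understand it, repeats a block each time it is referenced (it has a --unique flag precisely to suppress that), so counting repeats gives a crude view of intra-DAG de-duplication. The root hash is a placeholder:

ipfs refs -r <root-hash> | sort | uniq -c | sort -rn | head   # blocks listed more than once are shared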

Summary

The Rabin method benefited hugely from the temporary high-CPU, high-RAM machine, so maybe it could be accelerated with better concurrency (:heart: Go) or something. Anyway, I completed a preliminary test of IPFS with Rabin chunking vs. the normal data on a 30GB dataset (ipfs/QmZSVcKoAdsjmYBu1ZEfP1sDWZhUFM44zMudM3oMYajc7w).
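
Anyone who wants to poke at the result can inspect it without downloading the whole thing, e.g.:

ipfs ls QmZSVcKoAdsjmYBu1ZEfP1sDWZhUFM44zMudM3oMYajc7w            # list the top-level links
ipfs object stat QmZSVcKoAdsjmYBu1ZEfP1sDWZhUFM44zMudM3oMYajc7w   # CumulativeSize of the whole DAG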

meyerzinn commented 7 years ago

I would like to note that I did not observe any de-duping.
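
A coarse way to sanity-check this is to compare the total repo size against the raw mirrored data; roughly equal sizes mean little to no de-duplication (the path is a placeholder):

du -sh /mnt/data/362    # size of the raw mirrored data
ipfs repo stat          # RepoSize and NumObjects of the IPFS repo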

flyingzumwalt commented 7 years ago

This is great to see!

Regarding this:

I need >60GB of disk space to hash 30GB of data because IPFS copies the data into its own directory. It'd be great if IPFS could "consume" the data as it converts it to blocks, or otherwise allow a streaming hash.

I wonder if you can use ipfs-pack, which was written last week, to do this. @whyrusleeping is it possible to run ipfs-pack make with rabin fingerprinting? Does ipfs-pack make accept flags like --chunker? Of course, you wouldn't get deduplication of the data within the pack, but the DAG in the object store would be deduplicated, which means anyone consuming the dataset would benefit from the deduplication, especially if they're consuming multiple versions of a dataset from multiple sources that all use rabin fingerprinting (at least in theory).
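
For reference, the workflow I have in mind is roughly the following; the tool is still changing, so treat the exact invocation as tentative:

cd /path/to/dataset
ipfs-pack make     # builds a pack manifest alongside the data without copying blocks into a repo
# whether make accepts something like --chunker=rabin is exactly the question for @whyrusleeping above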

meyerzinn commented 7 years ago

I looked at ipfs-pack but for some reason the make command kept failing IIRC. @flyingzumwalt

flyingzumwalt commented 7 years ago

Try grabbing it again and re-running. It was still being written when you tested it out. The build process was broken for some of that time.

flyingzumwalt commented 7 years ago

I'm working on an ipfs-pack tutorial here: https://github.com/ipfs/ipfs-pack/pull/8/files. I'm going to try to have it merged by midday tomorrow.