meyerzinn opened this issue 7 years ago (Open)
I would like to note that I did not observe any de-duping.
This is great to see!
Regarding this:
I need >60GB disk space to hash 30GB of data because IPFS operates in its own directory. It'd be great if IPFS could "consume" data as it converts to blocks or otherwise allow a streaming hash.
I wonder if you can use ipfs-pack, which was written last week, to do this. @whyrusleeping is it possible to run `ipfs-pack make` with rabin fingerprinting? Does `ipfs-pack make` accept flags like `--chunker`? Of course, you wouldn't get deduplication of the data within the pack, but the DAG in the object store would be deduplicated, which means anyone consuming the dataset would benefit from it, especially if they're consuming multiple versions of a dataset from multiple sources that all use rabin fingerprinting (at least in theory).
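For what it's worth, plain `ipfs add` does accept a `--chunker` flag, so the Rabin behaviour can at least be reproduced outside ipfs-pack even if `ipfs-pack make` doesn't expose it. A minimal sketch (the directory path is a placeholder):

```sh
# Recursively add a directory, chunking with Rabin fingerprinting.
ipfs add -r --chunker=rabin /path/to/dataset

# The Rabin min/avg/max block sizes (in bytes) can also be set explicitly.
ipfs add -r --chunker=rabin-262144-524288-1048576 /path/to/dataset
```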
I looked at ipfs-pack but for some reason the `make` command kept failing IIRC. @flyingzumwalt
Try grabbing it again and re-running. It was still being written when you tested it out. The build process was broken for some of that time.
I'm working on an ipfs-pack tutorial here: https://github.com/ipfs/ipfs-pack/pull/8/files. I'm going to try to have it merged by midday tomorrow.
I'm testing IPFS for potential use across >500TB of federal climate data. I used the Rabin chunking mechanism on a sample 30GB dataset, which is available online at ftp://ladsweb.nascom.nasa.gov//allData/5/MYD021KM/2016/362.
Mirror the Data Initially
OK, thing #1 -- I need >60GB disk space to hash 30GB of data because IPFS operates in its own directory. It'd be great if IPFS could "consume" data as it converts to blocks or otherwise allow a streaming hash.
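(One possible workaround, assuming the experimental filestore in go-ipfs is available in your build: add the mirrored files with `--nocopy`, so blocks reference the files in place instead of being copied into the repo. A rough sketch, with a placeholder path:)

```sh
# Enable the experimental filestore, then add without copying block data into the repo;
# blocks point back at the original files on disk, avoiding the 2x disk requirement.
ipfs config --json Experimental.FilestoreEnabled true
ipfs add -r --nocopy --chunker=rabin /path/to/mirrored/data
```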
I used

wget --mirror ftp://ladsweb.nascom.nasa.gov//allData/5/MYD021KM/2016/362

to mirror the remote dataset to the drive.

Start IPFS
Attempts 1-3 or so
I, being a genius, was using a DigitalOcean $5/mo droplet (0.5 GB RAM, 1 CPU) to try to hash 30GB of data. Naturally, it was estimated to take an hour each time, and each time my internet decided that I was making too much progress and cut out. I kept forgetting to start a new screen session, but there should be a way to resume hashing with the more expensive algorithms (e.g. keep a temporary file somewhere with the Rabin progress info).
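For anyone repeating this, the simple guard against a flaky connection is to kick the add off inside a detachable terminal session, roughly like this (the session name and path are placeholders):

```sh
# Start a named screen session so the add survives a dropped SSH connection
# (detach with Ctrl-A D, reattach later with `screen -r ipfs-add`).
screen -S ipfs-add
ipfs add -r --chunker=rabin /mnt/extra/ladsweb.nascom.nasa.gov/allData/5/MYD021KM/2016/362
```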
Attempt 4
After floundering around on the droplet, I had a drink of water and planned out my approach. I created a temporary droplet with 32 GB RAM and 12 CPUs, and this time it only took 15 minutes. I also put it into a screen session so my internet wouldn't mess it up.
I should note, this was on an auxiliary volume attached to the droplet (extra disk) so I could scale the droplet back down after hashing.
It works: I got everything hashed, but it takes up about the same amount of space as the original data.
There's no tool to check de-duplication (i.e. follow this hash and all links and count how many blocks are identical).
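A rough way to approximate that check with existing commands (just a sketch, not a dedicated tool) is to compare total vs. unique block references in the DAG:

```sh
# Total block references reachable from the root (duplicates counted each time they appear).
ipfs refs -r QmZSVcKoAdsjmYBu1ZEfP1sDWZhUFM44zMudM3oMYajc7w | wc -l

# Distinct blocks only; the gap between the two counts is the intra-DAG duplication.
ipfs refs -r -u QmZSVcKoAdsjmYBu1ZEfP1sDWZhUFM44zMudM3oMYajc7w | wc -l
```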
Summary
So the Rabin method benefitted hugely from the temporary high-CPU, high-RAM droplet; maybe the chunking could be accelerated with better concurrency (:heart: Go) or something. Anyways, I completed a preliminary test of IPFS Rabin chunking vs. normal data on a 30GB dataset (ipfs/QmZSVcKoAdsjmYBu1ZEfP1sDWZhUFM44zMudM3oMYajc7w).
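To make the Rabin-vs-default comparison concrete, one rough follow-up (a sketch; the second hash would be whatever a default-chunker add of the same data returns) is to count how many blocks the two DAGs actually share:

```sh
# Unique block lists for the Rabin-chunked add and a default-chunker add of the same data.
ipfs refs -r -u QmZSVcKoAdsjmYBu1ZEfP1sDWZhUFM44zMudM3oMYajc7w | sort > rabin.refs
ipfs refs -r -u <default-chunker-root-hash> | sort > default.refs

# Blocks present in both DAGs (comm requires sorted input).
comm -12 rabin.refs default.refs | wc -l
```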