ipfs-inactive / archives

[ARCHIVED] Repo to coordinate archival efforts with IPFS
https://awesome.ipfs.io/datasets

CERN #15

Open ghost opened 8 years ago

ghost commented 8 years ago

http://opendata.cern.ch

Since the end of 2014, CERN has been serving some fraction of the colossal amount of data captured from particle collisions at the LHC (with detectors like CMS, ATLAS, ALICE), summing up to 60,000,000 GB.

With the help of a small Python crawler, I've compiled an index of all CMS-detector primary datasets (all .root files, totaling ca. 27.4 TB), plus an index of indexes. Indexes for the other detectors' datasets and for derivative datasets are to come :)
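The crawler itself is not shown in this thread, so the following is only a minimal sketch of what a per-dataset index and an "index of indexes" could look like. The record layout `(dataset_name, file_url, size_bytes)`, the function names, and the `example.org` URLs are all assumptions made for illustration; none of them come from the original crawler or from the CERN Open Data portal.

```python
from collections import defaultdict


def build_index(records):
    """Group .root file records by dataset.

    `records` is an iterable of (dataset_name, file_url, size_bytes) tuples.
    This layout is an assumption made for illustration only.
    """
    index = defaultdict(lambda: {"files": [], "total_bytes": 0})
    for dataset, url, size in records:
        if not url.endswith(".root"):
            continue  # skip anything that is not a ROOT file
        index[dataset]["files"].append(url)
        index[dataset]["total_bytes"] += size
    return dict(index)


def build_index_of_indexes(index):
    """Summarize each per-dataset index as a file count and a total size."""
    return {
        name: {"file_count": len(entry["files"]), "total_bytes": entry["total_bytes"]}
        for name, entry in index.items()
    }


# Hypothetical sample records; real entries would come from crawling the portal.
sample = [
    ("DatasetA", "http://example.org/DatasetA/0001.root", 1_500_000_000),
    ("DatasetA", "http://example.org/DatasetA/0002.root", 2_000_000_000),
    ("DatasetA", "http://example.org/DatasetA/README.txt", 1_024),
    ("DatasetB", "http://example.org/DatasetB/0001.root", 750_000_000),
]

index = build_index(sample)
summary = build_index_of_indexes(index)
```

Keeping the summary separate from the per-dataset file lists means the top-level index stays small even when individual datasets contain many files.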

To use all that data, a special environment is required. CERN's OpenData normally recommends their CernVM, which is basically Scientific Linux + ROOT, a data analysis framework (hence the .root files). Without ROOT, these historical milestones cannot be used directly as computable data, so the tool must be preserved/archived along with the collision data. There's also a mirror right here at GitHub.

Oh, and thanks for the amazing project!

davidar commented 8 years ago

Wow, that's a lot of data! Unless you know someone with that much storage available, we might have to do this in cooperation with CERN? @jbenet Thoughts?

jbenet commented 8 years ago

i think we're not ready to ingest much CERN data. that's the biggest data cache there is. the metadata is possible now, but still big. here's what i propose:

Kubuxu commented 8 years ago

Crawler and index hashes are gone. Any hope of restoring them?

davidar commented 8 years ago

Hmm, seems I forgot to pin this to one of the storage nodes. We really need a better system for mirroring archives stuff - I'll make this a priority for the new year.