global repository for data (private cloud, maybe S3, across projects, ...)

haraldschilly commented 11 years ago

Add a global file-based repository of interesting scientific and mathematical data. This could be a Ceph-based file-system share, which sits in /data. It contains directories (or sub-volumes (?)) for each set of data.

EDIT: or a private cloud, for more than one project, and based on the S3 protocol.

Inside of each of these, to do this in an orderly fashion and allow automatic processing, there should be

the data files
scripts, libraries or worksheets that make it easy to access them (e.g. routines to read the data into sage matrices or pandas data frames to make life easy)
documentation: what the data is about, who contributed it, licensing (preferably CC0 or equivalent)
to automatically create a nice index page of all data sets, there could be a dedicated manifest.yaml file, which contains structured information (name, type, author, topic, versions+updates, the script file or even the script itself, ...) which can be parsed and processed automatically.

An example could be a large collection of zeta zeros, public data (data.gov?), ...

Implementation details/ideas:

Each contributor for data receives write access to their own sub-directory. This can be done easily, either by creating a new sub-volume for each one of them or enabling ACL on the file-system level.
Define a global environmental variable SMC_DATA which points to /data to make moving this directory possible.

williamstein commented 11 years ago

One issue to think through is that NFS is a single point of failure and doesn't scale well to a potentially large number of clients... Something like you describe could be very useful for LMFDB though.

From a scalability and data management perspective it would be much nicer to just use Cassandra. Then everything is highly scalable, robust, redundant, etc. The drawback is that all data must be something you can put into the database, which is a potentially substantial hurdle. This would be basically an analogue of the Amazon S3 API, but for our cloud. If you think about scaling up to potentially a million users, a big-data database (with good docs) seems more manageable than a massive NFS share...

haraldschilly commented 11 years ago

What about replacing NSF with "Ceph" in that idea? I fully get your point that just NFS will neither scale nor be reliable enough ... I have the impression Ceph could be the one for the job. (It's just not so easy to set up, but doable to run without a spof - at least the documentation makes me belief that)

Ceph also provides such a S3 API, besides traditional filesystems. It would also open up the possibility to offer general block devices per project, etc.

And since you mentioned LMFDB: There was even a workshop where others from fields like astronomy were invited to share how they manage their data. My strong impression is, most of them will not adapt to storing this in any kind of database (or even care to post-process it) and are very much attached to a traditional file-based system. That's also the target for their scripts and software. That's why I think this could be sort of a great selling point - and channels/embraces what the users want and do.

Cassandra is #32

williamstein commented 11 years ago

Ceph sounds very interesting. Keith Clawson has been using ceph a lot lately for the new VM's (to replace boxen.math.washington.edu), so he will have input. Great idea!

tangentspace commented 11 years ago

Lets do it. I have Ceph running on 2 nodes and I'm expanding that cluster to 4 nodes. My main concern has been that it's relatively easy to use and to scale up, but it's rather complicated internally and it will take some time and experience to feel confident about operating it for a large number of users. Setting up a data repository would be a great next step because it could be as large as want and it could get some real world use, but it's not mission critical. Most of the data wouldn't need to be backed up if it's publicly available elsewhere, and it should be easy to back up user data using the snapshot feature.

We don't have a collection of free drives to start this right now, but there are a lot of 2 TB drives in the older cloud servers that could be used for Ceph since we're already planning on upgrading to 4 TB drives. It would be best to spread the data over as many servers as possible for the trial to maximize redundancy and to get a feel for any scaling issues. One limitation could be the 2 gigabit network connection, which can definitely become saturated when you write large amounts of data to a replicated partition or replace a failed partition. For a read only archive we should be fine, and it could potentially be be much faster than NFS because read operations can be distributed among all nodes that have copies of the data.

haraldschilly commented 11 years ago

@kclawson that sounds great to me. we can easily just start with a small data set from an open machine learning data library. nothing fancy, just a demo. then we can could add a few TB from that mentioned LMFDB project, also merely as a test.

Since you work on that, what would be the best way to sort out

quotas: I've read that quotas for CephFS don't exist yet. So, it is be better to create several rdb block devices and treat them independently for each set of data, right? That would make management of them, snapshots+backup, deletion, etc. easier, too.
permissions: one or more want to have read-write access, all the others just read-access, or no access. ACL and xfs should do this.

williamstein commented 11 years ago

We also need a backup strategy...

haraldschilly commented 11 years ago

Keith will know how to backup best? There are several possibilities, below a list what could be done:

Part of ceph:
- underlying blocks are replicated: the policy for the pool could say that at each site (data center) is at least one copy?
- Each of these rbl block devices allow for snapshots, which can be layered on top of each other (due to COW). This means, it's possible to rollback if the file-system fails.
Outside: These snapshots can be cloned read-only, mount them, create a backup: e.g. let bup index the files and store the actual content - not just the raw image? (or however it's done best)

haraldschilly commented 10 years ago

Here is another take on this: Riak. That's a distributed p2p fault tolerant/self repairing apache 2 licensed key/value store. On top of it, there is S3 support: Riak Cloud Storage with different backends (e.g. LevelDB)

I haven't really looked at the details, and I also don't know enough about the actual requirements, but I think it's worth to have a look. For example, it mentions that it does support user accounts...

Also, there seems to be a major update to 2.0. It's maybe risky, but just for testing there are packages for this technical preview for ubuntu, too.

sagemath / cloud

global repository for data (private cloud, maybe S3, across projects, ...) #31