data format - Githubissues

cscheid commented 12 years ago

We need to decide on the data format. Some terminology first:

imp will be a (key, value) store, where a (key, value) pair is called a datum. A key is a /-separated string of path elements, where each path element is an identifier string made of the following characters: [A-Za-z0-9_.-]
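A minimal sketch of how a client might validate and split such keys. The function names are hypothetical, and it assumes that every path element is non-empty and that a key has no leading or trailing slash, neither of which is settled above.

```python
import re

# Assumed grammar: one or more path elements separated by "/", each a
# non-empty run of letters, digits, "_", "-" or ".".
PATH_ELEMENT = r"[A-Za-z0-9_.-]+"
KEY_RE = re.compile(rf"{PATH_ELEMENT}(?:/{PATH_ELEMENT})*")


def is_valid_key(key: str) -> bool:
    """Return True if key is a /-separated sequence of valid path elements."""
    return KEY_RE.fullmatch(key) is not None


def split_key(key: str) -> list[str]:
    """Split a valid key into its path elements; raise ValueError otherwise."""
    if not is_valid_key(key):
        raise ValueError(f"invalid key: {key!r}")
    return key.split("/")
```

With these assumptions, "foo/bar_1/baz-2.txt" is accepted, while "foo//bar", "/foo" and "foo bar" are rejected.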

We now need to decide on the supported value types. Jacob indicated he would rather have simple formats, for example, that a value be one of the following:

an array of single-precision IEEE 754 floating point numbers
a dense matrix of single-precision IEEE 754 floating point numbers
an array of double-precision IEEE 754 floating point numbers
a dense matrix of double-precision IEEE 754 floating point numbers

I think that if we're going to use a BSON variant for the transport, there might be advantages in supporting a more complicated format as well. For example, why not support BSON itself? The main advantage is that we're easily future-proofing the protocol:

R data frames are easily represented in the following JSON object: { key1:array1, key2:array2, key3:array3 }
Sparse matrices are easily represented as a record which respects a certain convention. CSC matrices would, for example, be { val: array1, row_ind: array2, col_ptr: array3 }
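To make the CSC convention concrete, here is a small sketch that builds the { val, row_ind, col_ptr } record from a dense matrix and expands it back. The function names are hypothetical and plain Python lists stand in for the typed arrays.

```python
def dense_to_csc(matrix):
    """Convert a dense matrix (list of rows) into a CSC record."""
    n_rows = len(matrix)
    n_cols = len(matrix[0]) if n_rows else 0
    val, row_ind, col_ptr = [], [], [0]
    for j in range(n_cols):          # walk column by column
        for i in range(n_rows):
            if matrix[i][j] != 0:
                val.append(matrix[i][j])
                row_ind.append(i)
        col_ptr.append(len(val))     # end of column j in val/row_ind
    return {"val": val, "row_ind": row_ind, "col_ptr": col_ptr}


def csc_to_dense(record, n_rows):
    """Expand a CSC record back into a dense matrix with n_rows rows."""
    n_cols = len(record["col_ptr"]) - 1
    dense = [[0.0] * n_cols for _ in range(n_rows)]
    for j in range(n_cols):
        for k in range(record["col_ptr"][j], record["col_ptr"][j + 1]):
            dense[record["row_ind"][k]][j] = record["val"][k]
    return dense
```

Note that the three arrays alone do not carry the number of rows, so the sketch passes it separately; a real convention would need to store it somewhere.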

The argument against a complicated file format is that we might want to store complicated objects as hierarchical objects within the naming system itself. The advantage of forcing one such convention is that we would be able to change individual pieces of the larger object separately (by storing new files in "sublocations").

I guess if the typical use case is always going to be relatively shallow objects, then this discussion is not that important. If we expect deeper objects, then it could be trouble.

cscheid commented 12 years ago

I thought about this some more, and in particular the interaction of the data format with issue #2, and this comment from Jacob:

In that case, since we are always sending the data first we can handle alignment pretty easily no? We just decide we will always have an 8 or 16 byte preamble and things are automatically aligned.

I think this is going to make our lives easier for now. Let's assume that the data on the wire will come with a fixed-size header (16 bytes ought to do for now) and then the payload.

Let's also assume only simple data like what I said above, and that each individual request can only get a single simple datum. When requesting "foo/*" queries, we mandate that each result in the set starts aligned to a 16-byte boundary by padding with zeros, and the protocol is then really super simple.
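A rough sketch of that framing, using Python's struct module. The header layout here (a 4-byte type code, 4 reserved bytes and an 8-byte payload length, little-endian) is purely an assumption for illustration; only the 16-byte header size and the zero padding to a 16-byte boundary come from the discussion.

```python
import struct

ALIGN = 16
# Hypothetical header layout: type code, reserved, payload length in bytes.
HEADER = struct.Struct("<IIQ")
assert HEADER.size == ALIGN


def pad_len(n, align=ALIGN):
    """Number of zero bytes needed to round n up to a multiple of align."""
    return (-n) % align


def pack_datum(type_code, payload):
    """Header, then payload, then zero padding up to the next boundary."""
    header = HEADER.pack(type_code, 0, len(payload))
    return header + payload + b"\x00" * pad_len(len(payload))


def pack_result_set(data):
    """Concatenate data so that every header starts on a 16-byte boundary."""
    return b"".join(pack_datum(t, p) for t, p in data)


def unpack_result_set(buf):
    """Walk the buffer, skipping the padding after each payload."""
    out, offset = [], 0
    while offset < len(buf):
        type_code, _reserved, length = HEADER.unpack_from(buf, offset)
        start = offset + HEADER.size
        out.append((type_code, buf[start:start + length]))
        offset = start + length + pad_len(length)
    return out
```

Because the header is itself 16 bytes, each payload also starts on a 16-byte boundary relative to the start of the buffer.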

Using the naming system for storing structure will, however, mean that we won't be able to distinguish heterogeneous lists from objects. For example, how would we store

[[array1, array2], [array3, array4]]

in the naming system?

basepath/0/0/array1
basepath/0/1/array2
basepath/1/0/array3
basepath/1/1/array4

would work, but it would mean a conflict with

{0: {0: array1, 1: array2}, 1: {0: array3, 1: array4}}
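The conflict can be reproduced with a few lines. This is only a sketch: flatten is a hypothetical helper, strings stand in for the arrays, and the path stops at the last index rather than appending an array name.

```python
def flatten(obj, prefix="basepath"):
    """Map a nested list or dict to {path: leaf}, using indices or keys
    as path elements."""
    if isinstance(obj, list):
        items = enumerate(obj)
    elif isinstance(obj, dict):
        items = obj.items()
    else:
        return {prefix: obj}          # a leaf, i.e. a simple datum
    paths = {}
    for name, child in items:
        paths.update(flatten(child, f"{prefix}/{name}"))
    return paths


nested_list = [["array1", "array2"], ["array3", "array4"]]
nested_dict = {0: {0: "array1", 1: "array2"}, 1: {0: "array3", 1: "array4"}}

# Both structures shred into exactly the same set of paths, so the
# naming system alone cannot tell which one was stored.
same = flatten(nested_list) == flatten(nested_dict)
```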

jacobhinkle commented 12 years ago

There is another point to be made. Support for more complicated formats comes at a cost actually, assuming people use these features. If an application decides to store its data in a complicated format then we need to make sure it can still be used by more simple-minded clients. For instance, I think it may be unsatisfactory if each client ends up sending data in its preferred format, so that only certain other clients can actually understand the results.

In my opinion an R data frame is convenient for programming in R, but it is a collection of data, not a single datum. I think your solution of having a tree representation solves the problem of expressing interdependency of elements. When you have something resembling a struct, just put it in a directory. A sparse matrix is definitely a single datum and as you say, storage in CSC format would require reinterpret casting the row/col elements, but it could still be represented as a float vector. Alternatively, and this may or may not be more desirable, a sparse matrix could consist of a directory with three data objects inside: vals (Float32 vector), rows (UInt32 vector), cols (UInt32 vector).

jacobhinkle commented 12 years ago

While we're on this topic, should we discuss metadata? If we allow a string type then metadata can be handled by the user within the "subdirectory" style. But it may be nice for certain things like printable descriptions of data to come along with them whenever they are requested. And of course mtime should be handled directly by imp.

The things I can currently foresee wanting to attach to data are:

mtime
size in bytes
permissions info (using ACL kind of format or something??)
description
creator (user, publisher, are these all the same?)
GPG signature
measurement units

We should be able to retrieve these things without downloading the entire object. I don't think we should get into complex queries and such. I don't wanna code a database. But we should be able to explore the data, with ls.
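As a sketch of what "metadata without the payload" and ls could look like, here is a toy in-memory store. The class and method names (Datum, Store, stat, ls) and the choice of metadata fields are assumptions for illustration, not a proposed API.

```python
import time
from dataclasses import dataclass, field


@dataclass
class Datum:
    payload: bytes
    description: str = ""
    units: str = ""
    mtime: float = field(default_factory=time.time)  # set by the store side


class Store:
    def __init__(self):
        self._data = {}

    def put(self, key, datum):
        self._data[key] = datum

    def stat(self, key):
        """Return metadata only; the payload itself is never included."""
        d = self._data[key]
        return {
            "mtime": d.mtime,
            "size": len(d.payload),
            "description": d.description,
            "units": d.units,
        }

    def ls(self, prefix):
        """Return the sorted names of the immediate children of prefix."""
        head = prefix.rstrip("/") + "/"
        children = {
            key[len(head):].split("/", 1)[0]
            for key in self._data
            if key.startswith(head)
        }
        return sorted(children)
```

For example, after storing exp/run1/x, exp/run1/y and exp/run2/x, ls("exp") lists run1 and run2, and stat("exp/run1/x") reports the size and description without touching the payload.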

jacobhinkle commented 12 years ago

Your last comment (which didn't show up on my end until just now) illustrates another point: we should decide how to handle overwriting data and conflicting names. I'm of a mind to just overwrite by default without warning, but maybe having separate PUBLISH and FORCE_PUBLISH commands could be useful. The first would crap out if the datum already exists. Of course we'll need UNPUBLISH and some other stuff probably too, but that's the subject of another issue.
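The two commands could behave roughly as follows. This is a sketch against a toy in-memory store; the class name, the exception type and the method names are hypothetical.

```python
class DatumExistsError(Exception):
    """Raised when PUBLISH targets a key that is already taken."""


class ToyStore:
    def __init__(self):
        self._data = {}

    def publish(self, key, value):
        """PUBLISH: refuse to overwrite an existing datum."""
        if key in self._data:
            raise DatumExistsError(key)
        self._data[key] = value

    def force_publish(self, key, value):
        """FORCE_PUBLISH: overwrite silently."""
        self._data[key] = value

    def get(self, key):
        return self._data[key]
```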

As to the heterogeneous list issue: I think there are two options. Either we accept that there may be conflicts sometimes, or we support more complex (struct-based) types, at the expense of being able to examine individual parts of each datum without custom code. Because what you're saying is we'd need to name patterns of collections of data, and that's just a struct definition. I'd vote that we don't care about conflicts.

I think it's easy for us to fall into a trap of making imp into a permanent data store that serializes all data types and can reconstruct your stack whenever you want. I call it a trap because it leads to a lot of complexity being able to handle all this stuff, and I don't frankly think that's useful. The idea that the data store is agnostic of how the data was generated sort of dictates we need to aim for the least common denominator, meaning rectangular nd arrays of primitive types. I don't see this as a weakness.

cscheid commented 12 years ago

You probably couldn't hear me throwing up at the sight of reinterpret_cast :) I would really much rather have integer and floating-point types be separate from each other. Have four basic data types: int32, int64, float32, float64, and then have rectangular n-d arrays of those.
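A sketch of what "four basic types, rectangular n-d arrays" could mean for the payload, using only the standard library. The type names, little-endian byte order and row-major flattening are assumptions; none of them is decided in this thread.

```python
import struct
from math import prod

# Hypothetical type names mapped to struct format characters.
DTYPES = {"int32": "i", "int64": "q", "float32": "f", "float64": "d"}


def pack_array(dtype, shape, flat_values):
    """Pack a rectangular n-d array given as a flat, row-major sequence."""
    count = prod(shape)
    if len(flat_values) != count:
        raise ValueError(f"shape {shape} needs {count} values")
    return struct.pack(f"<{count}{DTYPES[dtype]}", *flat_values)


def unpack_array(dtype, shape, payload):
    """Inverse of pack_array; returns the flat, row-major values."""
    count = prod(shape)
    return list(struct.unpack(f"<{count}{DTYPES[dtype]}", payload))
```

A 2 x 3 int32 array then occupies 24 bytes and a 2 x 2 float64 array 32 bytes, with no reinterpretation between integer and floating-point types.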

I agree with you about complexity, and concede the point. We need to get from zero to prototype as fast as possible, and this is a place where cutting the corner makes sense.

But, just for the record, I predict that every single client will eventually have a subroutine that will "shred" deep, hierarchical objects into lots of tiny pieces to send into imp, and that querying these will be a bottleneck :)

cscheid commented 12 years ago

Mixing fine-grained ACL with anything but the simplest queries will get really messy (I think this is a known lesson from RDBMS). I suggest limiting user control to the "database" level, and then allowing for the creation of read-only users. Of course, we then need to define what we mean by "database". It could simply be that on a server, we either can write everywhere, or we can write nowhere.

In fact, we could solve this by convention, and say for example that every path must have at least two path elements, and we call the first path element the "database". Then the ACL would only be applicable to this top-level path.
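In code, the convention might look like this. It is a sketch only: database_of and can_write are hypothetical helpers, and a user's permissions are modelled as a plain set of writable database names.

```python
def database_of(key):
    """Return the first path element, enforcing at least two elements."""
    parts = key.split("/")
    if len(parts) < 2 or not all(parts):
        raise ValueError(f"key needs at least two path elements: {key!r}")
    return parts[0]


def can_write(writable_databases, key):
    """Access is decided by the top-level path element alone."""
    return database_of(key) in writable_databases
```

A read-only user is then simply one whose set of writable databases is empty.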

If you were thinking of ACLs to limit accidental overwriting of data, I think that is better solved by commands like your PUBLISH and FORCE_PUBLISH. I think something similar to HTTP makes sense in our context: http://en.wikipedia.org/wiki/HTTP#Request_methods

jacobhinkle commented 12 years ago

I agree reinterpreting is crap, but is there a problem with subdirectory-ifying structs?

Also yeah ACL is dumb, so mark that off the list. Per-"database" access is fine, and I can't think of a reason not to make it the first level directory.

About the clients shredding their deep hierarchical data, this will have to be done at some point anyway if it's to be plotted or visualized in pieces. I could possibly be convinced about more complex data structures, but I'd want to see an example that a subdirectory of basics couldn't handle.

Totally agree on the basic four types. I am thinking about strings but any examples I can think of would be handled fine with metadata strings. Maybe support for vectors of strings has some purpose for labelling things? Like say you want a bar chart of some vector, and each entry is for a different type of car or something. Then you'd wanna label each component of the vector. Computing the sizes of these string arrays might lead to some stupid bugs and shit though, so let's hold off and focus on the big four.

jacobhinkle commented 10 years ago

Resurrecting this issue. I agree, I like the idea of enforcing at least two path elements. I'd call the first element the "resource". In the user-facing library, we could provide a simple function set_resource("jacobs_awesome_program"). Then when I publish a chunk of data called ugly_image it shows up in the datastore as jacobs_awesome_program/ugly_image. So we can start with this representation in the library, along with the 4 types of nd regular arrays.
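A sketch of how the user-facing library could prepend the resource. Only the set_resource name and the resulting key come from the comment above; the Client class, its publish method and the dict standing in for the datastore are assumptions.

```python
class Client:
    def __init__(self):
        self._resource = None
        self.datastore = {}   # stand-in for the remote imp server

    def set_resource(self, name):
        """Set the first path element used for everything published later."""
        if not name or "/" in name:
            raise ValueError(f"resource must be a single path element: {name!r}")
        self._resource = name

    def publish(self, name, value):
        """Store value under <resource>/<name> and return the full key."""
        if self._resource is None:
            raise RuntimeError("call set_resource() before publishing")
        key = f"{self._resource}/{name}"
        self.datastore[key] = value
        return key


client = Client()
client.set_resource("jacobs_awesome_program")
key = client.publish("ugly_image", [0.0, 1.0])
```

Here key comes out as jacobs_awesome_program/ugly_image, so every published datum automatically satisfies the two-element rule.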

jacobhinkle / imp

data format #3