Semantics of document storage

Sophia looks pretty great, so I'm trying to figure out if I can make it work well for my requirements.

If I understood correctly, multi-part keys mean I can store more than one key and value in a document/row. The examples all set additional keys such as "db.test.index.key_b" before writing the row.

That brings up a few questions, which should ideally be added to the documentation:

Is it possible to store extra "fields" (multi-part values) that are not in the index?
Is "db.test.index.key" always a hardcoded key name or can it be replaced?
Are the strings "key" and "value" stored for each row, or does Sophia transparently remove them?
If I add extra keys, will I need to match all key parts to overwrite an existing row?
What happens if on subsequent runs, I want to define an extra key (or value) that has not been written to existing rows?
What happens if on subsequent runs, I leave out an extra key from the index definition that has previously been written to existing rows?

Definitely looks like a powerful feature, but it's a bit hard to tell how it actually works. Appreciate your time and work, thanks!

Maybe also one more question,

Can I update a row with a changed index field in one go, without first doing a delete and then an update? E.g. I might have an "id" key and an "is_orphan" index, and I want to change "is_orphan" without writing an entirely different row.

Ok, Thanks for your Interest :)

"Multi-part key" might be a confusing name. The idea behind it is to provide a 'compound key': a key that consists of two or more simple keys that together uniquely identify an entity occurrence. For example:

db.test.key = u64
db.test.key_b = string

This defines a compound key where 'key' or 'key_b' on its own can be non-unique, but together they identify a unique entity. A compound key can have many parts, but each row has a single value.

This is very similar to this SQL statement:

CREATE TABLE voting (
  QuestionID NUMERIC,
  MemberID NUMERIC,
  PRIMARY KEY (QuestionID, MemberID)
);
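
To make the "unique only together" idea concrete, here is a minimal sketch of how two compound keys of the shape (u64, string) could be compared. The struct and function names are hypothetical, and this is not Sophia's internal comparator; it only illustrates that rows are equal when every part matches.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical in-memory view of one compound key: (key, key_b). */
struct compound_key {
    uint64_t    key;    /* first part, u64 */
    const char *key_b;  /* second part, NUL-terminated string */
};

/* Order by the first part, then by the second part.
 * Returns <0, 0 or >0, like strcmp(). */
int compound_key_cmp(const struct compound_key *a,
                     const struct compound_key *b)
{
    assert(a != NULL && b != NULL);
    if (a->key != b->key)
        return a->key < b->key ? -1 : 1;
    return strcmp(a->key_b, b->key_b);
}
```

Under this sketch, (1, "x") and (1, "y") are different entities even though they share the first part; only a result of 0 means "same row".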

Another great feature of multi-part keys is that you can turn on key-part duplicate compression. This allows a repeated key part to be stored only once per page. Example: { some@domain.com, 1 } { some@domain.com, 2 } ... { some@domain.com, 1000 }

some@domain.com will very likely be stored only once on disk.
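
A rough way to estimate the effect, as a sketch only: the two helpers below are hypothetical and do not reflect Sophia's actual page format. They compare the bytes spent on the first key part when every row keeps its own copy with the bytes spent when a run of equal, adjacent parts is kept once.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Bytes needed if every row stores its own copy of the first key part. */
size_t plain_size(const char *const *parts, size_t n)
{
    size_t total = 0;
    assert(parts != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        total += strlen(parts[i]);
    return total;
}

/* Bytes needed if a run of equal, adjacent key parts is stored only once.
 * Assumes `parts` is sorted, so duplicates are adjacent. */
size_t deduplicated_size(const char *const *parts, size_t n)
{
    size_t total = 0;
    assert(parts != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (i == 0 || strcmp(parts[i], parts[i - 1]) != 0)
            total += strlen(parts[i]);
    }
    return total;
}
```

For a page holding a thousand rows that all start with some@domain.com, the second function counts the address once instead of a thousand times.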

1. Is it possible to store extra "fields" (multi-part values) that are not in the index? You can store them in value using your own format or maybe try document format.

Sophia has two storage formats: key-value and document. With the document format, Sophia expects all keys to be part of the value. When defining 'key parts', they should point exactly to where the keys are stored inside the 'value' buffer. Sophia will not copy those keys, but will reference them inside the value. This might be very helpful with bson, msgpack and other document formats, since you don't need to assemble/reassemble the document every time.

2. Is "db.test.index.key" always a hardcoded key name or can it be replaced? Right now it is hardcoded, but this might change in the future.

3. Are the strings "key" and "value" stored for each row, or does Sophia transparently remove them? Yes, they will be removed and translated to compact storage format.

4. If I add extra keys, will I need to match all key parts to overwrite an existing row? Yes.

5. What happens if on subsequent runs, I want to define an extra key (or value) that has not been written to existing rows? Altering the schema is not supported at the moment, but it might be added in future versions together with secondary index support.

6. What happens if on subsequent runs, I leave out an extra key from the index definition that has previously been written to existing rows? Sophia expects all key-parts to be defined in every CRUD operation.

7. Can I update a row with a changed index field in one go, without doing first delete and then update? I'm not sure I understood. Could you please provide more details? :)

[7] Can I update a row with a changed index field in one go, without doing first delete and then update? I'm not sure I understood. Could you please provide more details? :)

You mentioned secondary index support, I believe that's going to address that question. Rephrased that way, I was looking to specify one key as primary key while the other ones are only indexes for searching without being part of a "composite key".

Thank you for the answers! I'll leave the issue open for you to handle since I don't know what your process is for transitioning information from issues to the "proper" documentation on sphia.org.

Is it possible to store extra "fields" (multi-part values) that are not in the index?

You can store them in value using your own format or maybe try document format.

Sophia has two storage formats: key-value and document. With the document format, Sophia expects all keys to be part of the value. When defining 'key parts', they should point exactly to where the keys are stored inside the 'value' buffer. Sophia will not copy those keys, but will reference them inside the value. This might be very helpful with bson, msgpack and other document formats, since you don't need to assemble/reassemble the document every time.

I wonder how this actually works out in practice. I'm not too familiar with bson, but I know msgpack: integers get encoded depending on their size, e.g. uint 1 will be stored in a single byte together with its msgpack type prefix (which in itself is variable bit length). So pointing to a byte inside a serialized msgpack buffer is not all that useful, especially since you're unlikely to know where in the buffer your value is going to be located.

If you require a single value buffer then this also doesn't work with e.g. msgpack-c's variant object, which stores a tree of objects and string/raw/ext buffers but not as contiguous memory. Each node in that tree might have its allocation elsewhere, and the node metadata is not in serialized format.

A contiguous buffer with pointers to key parts would work with simpler schemes, where every key starts at byte alignment and is neither encoded in some way nor variable size. Like this,

u32 (size) | string (value part 1) | u32 (size) | string (value part 2) | ...

That makes it unusable for msgpack but usable for custom formats whose goal is not absolute space efficiency.
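
A minimal sketch of such a length-prefixed layout, assuming the u32 sizes are written in host byte order and the buffer is only read back on the same machine. The function names lp_append and lp_find are made up for illustration and are not part of Sophia.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Append one part as: u32 size (host byte order) | raw bytes.
 * Returns the new write offset, or 0 if the part does not fit. */
size_t lp_append(char *buf, size_t cap, size_t off,
                 const void *part, uint32_t size)
{
    assert(buf != NULL && part != NULL);
    if (off > cap || cap - off < sizeof(size) ||
        cap - off - sizeof(size) < size)
        return 0;
    memcpy(buf + off, &size, sizeof(size));
    memcpy(buf + off + sizeof(size), part, size);
    return off + sizeof(size) + size;
}

/* Find part number `index` (0-based) in a buffer of `len` bytes.
 * On success stores the part's size and returns a pointer into `buf`
 * (no copy); returns NULL if the buffer ends first. */
const char *lp_find(const char *buf, size_t len, size_t index,
                    uint32_t *size_out)
{
    size_t off = 0;
    assert(buf != NULL && size_out != NULL);
    for (;;) {
        uint32_t size;
        if (len - off < sizeof(size))
            return NULL;
        memcpy(&size, buf + off, sizeof(size));
        off += sizeof(size);
        if (len - off < size)
            return NULL;
        if (index == 0) {
            *size_out = size;
            return buf + off;
        }
        off += size;
        index--;
    }
}
```

The pointer returned by lp_find points directly at the part's bytes inside the buffer, which is the property a key part inside a document value would need.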

Possibly related, I have a structure that goes like, type: enum (u8), data: blob where type is never queried (doesn't need to be part of the index) but I don't have the two values in contiguous memory storage.

Right now, Sophia offers me sp_setstring(o, "value", data, size) to set the whole value. This requires me to do something like this:

char* buffer = malloc(data.size + 1); // error handling is left as an exercise to the reader
buffer[0] = (char) type;
memcpy(buffer + 1, data.ptr, data.size);
sp_setstring(o, "value", buffer, data.size + 1);

Either that or I define "type" as indexed key, which also has an impact on (space) efficiency.

It would be much more efficient if Sophia let me assemble the value from multiple regions in memory, like this:

// [edit: modified proposal for better consistency with existing Sophia APIs]
sp_set(o, "value.parts", "type");
sp_set(o, "value.parts", "data");
assert(sp_getint(o, "value.offset.type") == 0); // default offset for all parts
sp_setstring(o, "value.part.type", &type, 1);
sp_setint(o, "value.offset.data", 1);
sp_setstring(o, "value.part.data", data.ptr, data.size);

When reading the data, I have to know the offsets (I could store them as part of value_0 if they're variable) but I can point my various pointers to the different locations in the contiguous value returned by the database getter.
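
Until something like that exists, the copying can at least be factored into one user-side helper. This is a sketch only: struct value_part and value_gather are hypothetical names, not Sophia APIs, and the helper still performs the extra allocation and copy that the proposal is trying to avoid.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* One source region; a hypothetical helper type, not part of Sophia. */
struct value_part {
    const void *ptr;
    size_t      size;
};

/* Copy all parts back to back into one freshly allocated buffer.
 * Stores the total size in *total and returns the buffer (caller frees),
 * or NULL on allocation failure. */
char *value_gather(const struct value_part *parts, size_t n, size_t *total)
{
    size_t size = 0;
    assert(total != NULL && (parts != NULL || n == 0));
    for (size_t i = 0; i < n; i++)
        size += parts[i].size;

    char *buf = malloc(size ? size : 1);
    if (buf == NULL)
        return NULL;

    size_t off = 0;
    for (size_t i = 0; i < n; i++) {
        if (parts[i].size)
            memcpy(buf + off, parts[i].ptr, parts[i].size);
        off += parts[i].size;
    }
    *total = size;
    return buf;
}
```

The resulting buffer and total size can then be handed to sp_setstring(o, "value", ...) in the same way as the manually assembled buffer earlier in this comment.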

I wonder how this actually works out in practice

The document and key-value formats always assume that your keys/value or whole document are serialized in one contiguous region. That is also true for internal documents. Otherwise we would start making a document-oriented database out of a key-value storage, which might not be a bad thing after all.

What 'document storage' format can actually do (example):

struct document {
        uint32_t value;
        char used0[89];
        uint32_t key_a;
        char used1[15];
        uint32_t key_b;
        char used2[10];
};
struct document doc;

doc.key_a = 1;
doc.key_b = 2;
doc.value = 3;
void *o = sp_document(db);
sp_setstring(o, "key", &doc.key_a, sizeof(doc.key_a));
sp_setstring(o, "key_b", &doc.key_b, sizeof(doc.key_b));
sp_setstring(o, "value", &doc, sizeof(doc));
sp_set(db, o);

After that, only doc will be stored in Sophia, and the keys will be references into the value.
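
Since the key parts must lie inside the value buffer for this to work, a caller might want to check that before storing. The helper below is a hypothetical sketch, not a Sophia function; the struct is repeated from the example above so the snippet compiles on its own.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct document {
    uint32_t value;
    char     used0[89];
    uint32_t key_a;
    char     used1[15];
    uint32_t key_b;
    char     used2[10];
};

/* Return 1 if [part, part + part_size) lies entirely inside
 * [value, value + value_size), otherwise 0. */
int part_inside_value(const void *value, size_t value_size,
                      const void *part, size_t part_size)
{
    uintptr_t v = (uintptr_t)value;
    uintptr_t p = (uintptr_t)part;
    assert(value != NULL && part != NULL);
    return p >= v && part_size <= value_size &&
           p - v <= value_size - part_size;
}
```

The exact offsets of key_a and key_b depend on the compiler's padding, which is one more reason to take the addresses from the struct itself (&doc.key_a) rather than hardcoding byte positions.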

taken from this test: https://github.com/pmwkaa/sophia/blob/master/test/multithread/multithread_be.test.c#L306

The example with the variable sizes of msgpack is a good one. But index keys have strictly defined key types (u32, u64, ...), so they are expected to be in that format inside a document buffer too. As far as I know, msgpack numbers can even be stored in a different byte order, which could be a bigger issue.
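
To illustrate both points, here is a small sketch of how msgpack encodes unsigned integers according to its specification: the encoded width depends on the value, and multi-byte payloads are big-endian. The function name is made up for this example, and it covers unsigned integers only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode an unsigned integer the way msgpack does, using the shortest
 * form. `out` must have room for 9 bytes. Returns the encoded length. */
size_t msgpack_put_uint(unsigned char *out, uint64_t v)
{
    unsigned char tag;
    size_t width;

    assert(out != NULL);
    if (v <= 0x7f) {                /* positive fixint: value is the byte */
        out[0] = (unsigned char)v;
        return 1;
    } else if (v <= 0xff) {
        tag = 0xcc; width = 1;      /* uint 8 */
    } else if (v <= 0xffff) {
        tag = 0xcd; width = 2;      /* uint 16 */
    } else if (v <= 0xffffffff) {
        tag = 0xce; width = 4;      /* uint 32 */
    } else {
        tag = 0xcf; width = 8;      /* uint 64 */
    }
    out[0] = tag;
    for (size_t i = 0; i < width; i++)  /* big-endian payload */
        out[1 + i] = (unsigned char)(v >> (8 * (width - 1 - i)));
    return 1 + width;
}
```

The value 1 takes one byte while 0x01020304 takes five, and on a little-endian host neither matches the in-memory layout of a uint32_t, so a pointer into such a buffer cannot be used directly as a fixed-width u32 key part.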

multi-values

I had this idea some time ago as well. We could indeed have several value pointers, which Sophia would assemble into one automatically, like `sp_setstring(o, "value[1]", ptr, size)`, then `value[2]` and so on.

But once again, this raises questions about good document store support, which involves much more than that: which document format to use (bson, json, custom), which key-path format to use (e.g. json-path), whether we need an internal document builder for adding/deleting fields, a query language, and many more.

pmwkaa / sophia

Semantics of document storage #109