me-box / databox

Databox container manager and dashboard server

Store API design #184

Open jptmoore opened 6 years ago

jptmoore commented 6 years ago

The API in the new store currently looks as follows.

The current implementation supports POST/GET of JSON, text and binary data.

Suggestions welcome on changes/additions.

Key/Value API

Write entry

URL: /kv/<key>
Method: POST
Parameters: JSON body of data, replace <key> with a key
Notes: store data using given key

Read entry

URL: /kv/<key>
Method: GET
Parameters: replace <key> with a key
Notes: return data for given key
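
For illustration, a minimal Go client sketch of the two calls above (the store address and plain-HTTP transport are assumptions for the example, and auth is omitted):

package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
)

const store = "http://127.0.0.1:8080" // hypothetical store address

func main() {
	// Write: POST a JSON body to /kv/<key>.
	body := bytes.NewBufferString(`{"temperature": 21.5}`)
	if _, err := http.Post(store+"/kv/livingroom", "application/json", body); err != nil {
		panic(err)
	}

	// Read: GET the stored value back from /kv/<key>.
	resp, err := http.Get(store + "/kv/livingroom")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	data, _ := io.ReadAll(resp.Body)
	fmt.Println(string(data)) // {"temperature": 21.5}
}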

Time series API

Write entry

URL: /ts/<id>
Method: POST
Parameters: JSON body of data, replace <id> with an identifier
Notes: add data to time series with given identifier

Read latest entry

URL: /ts/<id>/latest
Method: GET
Parameters: replace <id> with an identifier
Notes: return the latest entry

Read last n entries

URL: /ts/<id>/last/<n>
Method: GET
Parameters: replace <id> with an identifier, replace <n> with the number of entries
Notes: return the requested number of entries

Read all entries since a time

URL: /ts/<id>/since/<from>
Method: GET
Parameters: replace <id> with an identifier, replace <from> with epoch seconds
Notes: return all entries from the time provided

Read all entries in a time range

URL: /ts/<id>/range/<from>/<to>
Method: GET
Parameters: replace <id> with an identifier, replace <from> and <to> with epoch seconds
Notes: return all entries in the time range provided
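
A similar hedged sketch for the time-series endpoints (same assumed address and transport as the key/value sketch above; the series id "heartrate" and the payload are made up):

package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
)

const store = "http://127.0.0.1:8080" // hypothetical store address

// get fetches a store path and returns the response body as a string.
func get(path string) string {
	resp, err := http.Get(store + path)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	data, _ := io.ReadAll(resp.Body)
	return string(data)
}

func main() {
	// Append an entry; the store assigns the timestamp.
	if _, err := http.Post(store+"/ts/heartrate", "application/json",
		bytes.NewBufferString(`{"bpm": 62}`)); err != nil {
		panic(err)
	}

	fmt.Println(get("/ts/heartrate/latest"))                      // latest entry
	fmt.Println(get("/ts/heartrate/last/10"))                     // last 10 entries
	fmt.Println(get("/ts/heartrate/since/1509564000"))            // entries since epoch seconds
	fmt.Println(get("/ts/heartrate/range/1509564000/1509570000")) // entries in a range
}
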
cgreenhalgh commented 6 years ago

For info, my current use case is importing rides from Strava, but it would be similar for importing tweets, fitbit activities, etc., that all have a well defined time of occurrence (started, posted, ...).

For the time series part of the API

A few clarifications to the details of the current proposal above, assuming it follows store-timeseries:

Currently the most-used store is the store-json (API documented there) which differs in a few ways including:

time series proposal

I would propose:

In addition I suggest:

key value API

I haven't really used this so I won't make a concrete proposal, but it has always seemed odd to me that from a datasource perspective the datasource id IS the key, so it is really just a single value store, not a key-value store! It would seem to make more sense if there were, e.g.:

And consequently also:

I would probably look at redis for inspiration.

Toshbrown commented 6 years ago

The next driver I'm writing is an IMAP email driver. A number of apps will want to process the stored data, filtering for addresses and/or keywords in the subject and body. The only way to achieve this with the current store would be to retrieve all the emails and parse the data in the app.

So the question is should the store support some kind of query beyond filtering on timestamps?

jptmoore commented 6 years ago

Thanks Chris. Everything looks reasonable. I will comment further when I tackle the points in more detail in the new store.

mor1 commented 6 years ago

@jptmoore @cgreenhalgh Thoughts (some of which repeat Chris' comments):

(Other suggestions all sound reasonable though.)

Toshbrown commented 6 years ago

@mor1 I was going to use this https://github.com/emersion/go-imap, but if you really want an OCaml version, let me know so I won't waste any time implementing it. It looks like email may form part of the risk awareness/communication studies we will be starting here at some point soon.

mor1 commented 6 years ago

@mor1 Not so bothered about an OCaml version per se; there's an OCaml IMAP implementation (a couple, I think, in fact), but it's the MIME parsing that I was mentioning particularly. It's insanely complicated to get right, but absolutely necessary if you want to robustly process mail contents (rather than just transport mail to/from/between servers).

Toshbrown commented 6 years ago

@mor1 so you're thinking about doing MIME parsing in the store?

I was going to do it in the driver and then link using UUIDs for the binary parts.

This is getting a bit off topic; I will create an issue to discuss the details of the IMAP driver.

mor1 commented 6 years ago

@Toshbrown not in the store per se, but you may want to explicitly put the results of having extracted content from the mail into the store (attachments, mail headers, etc.), or perhaps in a derived store rather than the one associated with the IMAP (email) driver. Even just extracting attachments robustly is a surprising PITA. (Agree this is off-topic though.)

cgreenhalgh commented 6 years ago

See also me-box/core-export-service#28 on a possible job queue store API that might also be supported.

Also worth noting that the current store-json subscription API isn't shown here, and it (or something like it) will need to be supported.

jptmoore commented 6 years ago

@cgreenhalgh

I made some changes below (needs testing and error handling, e.g. reporting path errors back to the client).

You can try out the changes from the Docker client/server.

choosing a standard time representation that allows better than second accuracy, e.g. float64 seconds since UNIX epoch

Using milliseconds since epoch now

allowing explicit timestamps to be specified when writing values, perhaps using the approach in the current store-json (will increase code compatibility)

You can post with URL: /ts/[id]/at/[time] to specify your own time

returning values with explicit timestamps as in store-json

Timestamps are returned with data like this [1509564588450, [1,2,3,4,5]]

confirm if range end time is exclusive

The end range is now inclusive.
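
Concretely (the series id and payload here are made up for illustration), a write with an explicit time and a subsequent read would look like:

POST /ts/sensor1/at/1509564588450   body: [1,2,3,4,5]
GET  /ts/sensor1/latest             returns: [1509564588450, [1,2,3,4,5]]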

Toshbrown commented 6 years ago

You can post with URL: /ts/[id]/at/[time] to specify your own time

does this overwrite the internal timestamp? and are there any constraints on its format?

jptmoore commented 6 years ago

does this overwrite the internal timestamp? and are there any constraints on its format?

Yes, it overwrites the internal one. It is an integer of epoch milliseconds.

cgreenhalgh commented 6 years ago

@jptmoore On the updated API...

Returning the time-value in a heterogeneous array (aka an array representing a tuple, rather than an object with named fields) makes it problematic to type in some languages and more complicated to marshal/unmarshal (e.g. in Go, which I'm using at the moment). It's not impossible but it is a hassle. It may also reduce consistency with the notification type/values??
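
(For concreteness, a sketch of what consuming the tuple form takes in Go; the Entry type and its field names are made up for the example:)

package main

import (
	"encoding/json"
	"fmt"
)

// Entry is a hypothetical client-side type for one time-series reading.
type Entry struct {
	Timestamp int64
	Data      json.RawMessage // the value can be arbitrary JSON
}

// UnmarshalJSON decodes the tuple form, e.g. [1509564588450, [1,2,3,4,5]].
func (e *Entry) UnmarshalJSON(b []byte) error {
	var parts []json.RawMessage
	if err := json.Unmarshal(b, &parts); err != nil {
		return err
	}
	if len(parts) != 2 {
		return fmt.Errorf("expected [timestamp, data], got %d elements", len(parts))
	}
	if err := json.Unmarshal(parts[0], &e.Timestamp); err != nil {
		return err
	}
	e.Data = parts[1]
	return nil
}

func main() {
	var e Entry
	if err := json.Unmarshal([]byte(`[1509564588450, [1,2,3,4,5]]`), &e); err != nil {
		panic(err)
	}
	fmt.Println(e.Timestamp, string(e.Data)) // 1509564588450 [1,2,3,4,5]
}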

The example value wasn't ever so clear but I believe (hope) [1,2,3,4,5] is a single value and every value has its own timestamp.

@mor1 On the type of time: I know ns since (an) epoch was mentioned, but a caution I would give about that is that a sufficient range of values can't be exactly represented in a float64, which is all that some languages will use for numbers (e.g. JavaScript, max exact integer 2^53 - 1). Milliseconds is OK with me. I think microseconds might also fit but is rather non-standard.

When you say we can try them, I thought the new store only supported the zeromq transport, but afaik the go client library for this doesn't exist/isn't complete yet? ( @Toshbrown ?)

Toshbrown commented 6 years ago

@cgreenhalgh I've started updating the Go library Toshbrown/lib-go-databox. I've got basic KV and TS reads and writes working with tokens inside the databox; example code is in Toshbrown/driver-tplink-smart-plug.

It's not ready to go yet. I need to add the observe API, think about the API exposed to app/driver developers, and turn the handle on the rest of the endpoints once they are stable. I'm thinking it will be mid next week before I get a chance to finish it (working on other projects until the 7th of Nov).

By trying, I think @jptmoore is referring to the client and server he uses outside of the databox for testing here. It's all wrapped in Docker containers and allows all the functionality to be tested.

jptmoore commented 6 years ago

@cgreenhalgh

Returning the time-value in a heterogeneous array (aka an array representing a tuple, rather than an object with named fields) makes it problematic to type in some languages and more complicated to marshal/unmarshal (e.g. in Go, which I'm using at the moment). It's not impossible but it is a hassle. It may also reduce consistency with the notification type/values??

Do you have an example of some JSON you would like to be returned?

The example value wasn't ever so clear but I believe (hope) [1,2,3,4,5] is a single value and every value has its own timestamp.

Yeh, [1,2,3,4,5] is the JSON data POSTed. The API takes any JSON as the value.

cgreenhalgh commented 6 years ago

@jptmoore

Do you have an example of some JSON you would like to be returned?

Current store-json uses (for your example) {"timestamp":1509564588450, "data":[1,2,3,4,5]}. Or you could use shorter property names ("ts"/"t","data"/"d"??) if you are worried about the byte count, not compressing and not too bothered about readability :-) Obviously a little more overhead than the array/tuple encoding, so that's the trade-off.

jptmoore commented 6 years ago

@cgreenhalgh

I pushed a new image which returns in this format:

{"timestamp":1509626879783,"data":[1,2,3]}

Toshbrown commented 6 years ago

@jptmoore While updating lib-go-databox I was trying the new /ts/<id>/at/<t>

and requested permissions like this from the arbiter:

{"target":"driver-tplink-smart-plug-core-store","path":"/ts/tosh/*","method":"POST"}

These are granted by the arbiter but rejected by the store. Do you parse wildcards in the macaroon caveats?

requesting permissions like this:

{"target":"driver-tplink-smart-plug-core-store","path":"/ts/tosh/at/1509641953193165","method":"POST"}

works fine, but this means that the macaroons can't be cached when using this endpoint.

jptmoore commented 6 years ago

@Toshbrown

works fine, but this means that the macaroons can't be cached when using this endpoint.

Yeh, currently it is matching the exact path, so I will need to implement wildcards.
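
(The store itself is OCaml, but as a sketch of the intended semantics, a trailing * caveat just needs a prefix match; this Go function and its name are made up for illustration:)

package main

import (
	"fmt"
	"strings"
)

// pathMatches reports whether a request path satisfies a caveat path,
// treating a trailing "*" in the caveat as "any suffix".
func pathMatches(caveat, request string) bool {
	if strings.HasSuffix(caveat, "*") {
		return strings.HasPrefix(request, strings.TrimSuffix(caveat, "*"))
	}
	return caveat == request
}

func main() {
	fmt.Println(pathMatches("/ts/tosh/*", "/ts/tosh/at/1509641953193165")) // true
	fmt.Println(pathMatches("/ts/tosh/*", "/kv/tosh"))                     // false
}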

jptmoore commented 6 years ago

@Toshbrown

I have pushed a new image which should support wildcards.

cgreenhalgh commented 6 years ago

@jptmoore I'd like to push hard for a bulk add operation in the timeseries API. I know @Toshbrown hit this as a (speed) limitation with the Google Takeout import, and I was struggling with a simple performance test (adding 1,000s of items in a reasonable time; even activity/heartrate at 1/minute = 1440/day...). Having this in the API amortises the overhead of the request/response communication and also opens the option of handling the set of values within a single transaction/commit in the datastore for further optimisation.

Perhaps

I'm not sure about generating events: should each value generate an event, or only the last value, or should it generate a distinct import event, or nothing? For my current use cases I don't need any events (I'd be updating a parallel KV store with source metadata and could use the event from that).

It also raises a question (in my mind, at least) about whether the existing write entry point should change, e.g. to POST ts/[id]/value.

jptmoore commented 6 years ago

@cgreenhalgh could you give me a sample of the bulk JSON data you have to test with, please?

cgreenhalgh commented 6 years ago

I assume the same kind of thing as you get back from a range query, e.g. for a simple value

[
  { "timestamp": 1509626879783, "data": 14.5 },
  { "timestamp": 1509626880783, "data": 14.8 },
  { "timestamp": 1509626881783, "data": 16.0 },
  { "timestamp": 1509626882783, "data": 16.5 },
]

or for a complex value

[
  { "timestamp": 1509626879783, "data": { "event": "event type 1", "value": 42, "content": "something" } },
  { "timestamp": 1509626880783, "data": { "event": "event type 1", "value": 44, "content": "nothing" } },
  { "timestamp": 1509626881783, "data": { "event": "event type 2", "content": "smells a bit" } },
  { "timestamp": 1509626882783, "data": { "event": "event type 1", "value": 48, "content": "something again" } },
]
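
A driver could then import a whole batch in one request. A hedged sketch, where the /ts/<id>/bulk endpoint name is purely an assumption (the thread doesn't settle on one):

package main

import (
	"bytes"
	"net/http"
)

const store = "http://127.0.0.1:8080" // hypothetical store address, as before

func main() {
	// One POST carries many timestamped entries, amortising per-request
	// overhead and letting the store commit them in a single transaction.
	entries := `[
	  {"timestamp": 1509626879783, "data": 14.5},
	  {"timestamp": 1509626880783, "data": 14.8}
	]`
	if _, err := http.Post(store+"/ts/readings/bulk", "application/json",
		bytes.NewBufferString(entries)); err != nil {
		panic(err)
	}
}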