me-box / databox

Databox container manager and dashboard server

Store API design #184

Open jptmoore opened 6 years ago

jptmoore commented 6 years ago

The API in the new store currently looks as follows.

The current implementation supports POST/GET of JSON, text and binary data.

Suggestions welcome on changes/additions.

Key/Value API

Write entry

URL: /kv/<key>
Method: POST
Parameters: JSON body of data, replace <key> with a key
Notes: store data using given key

Read entry

URL: /kv/<key>
Method: GET
Parameters: replace <key> with a key
Notes: return data for given key
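
For illustration, a minimal Go client sketch of the two calls above (the store address and plain-HTTP transport are assumptions for the example, and auth is omitted):

package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
)

const store = "http://127.0.0.1:8080" // hypothetical store address

func main() {
	// Write: POST a JSON body to /kv/<key>.
	body := bytes.NewBufferString(`{"temperature": 21.5}`)
	if _, err := http.Post(store+"/kv/livingroom", "application/json", body); err != nil {
		panic(err)
	}

	// Read: GET the stored value back from /kv/<key>.
	resp, err := http.Get(store + "/kv/livingroom")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	data, _ := io.ReadAll(resp.Body)
	fmt.Println(string(data)) // {"temperature": 21.5}
}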

Time series API

Write entry

URL: /ts/<id>
Method: POST
Parameters: JSON body of data, replace <id> with an identifier
Notes: add data to time series with given identifier

Read latest entry

URL: /ts/<id>/latest
Method: GET
Parameters: replace <id> with an identifier
Notes: return the latest entry

Read last n entries

URL: /ts/<id>/last/<n>
Method: GET
Parameters: replace <id> with an identifier, replace <n> with the number of entries
Notes: return the requested number of entries

Read all entries since a time

URL: /ts/<id>/since/<from>
Method: GET
Parameters: replace <id> with an identifier, replace <from> with epoch seconds
Notes: return all entries from the time provided

Read all entries in a time range

URL: /ts/<id>/range/<from>/<to>
Method: GET
Parameters: replace <id> with an identifier, replace <from> and <to> with epoch seconds
Notes: return all entries in the time range provided
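
A similar hedged sketch for the time-series endpoints (same assumed address and transport as the key/value sketch above; the series id "heartrate" and the payload are made up):

package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
)

const store = "http://127.0.0.1:8080" // hypothetical store address

// get fetches a store path and returns the response body as a string.
func get(path string) string {
	resp, err := http.Get(store + path)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	data, _ := io.ReadAll(resp.Body)
	return string(data)
}

func main() {
	// Append an entry; the store assigns the timestamp.
	if _, err := http.Post(store+"/ts/heartrate", "application/json",
		bytes.NewBufferString(`{"bpm": 62}`)); err != nil {
		panic(err)
	}

	fmt.Println(get("/ts/heartrate/latest"))                      // latest entry
	fmt.Println(get("/ts/heartrate/last/10"))                     // last 10 entries
	fmt.Println(get("/ts/heartrate/since/1509564000"))            // entries since epoch seconds
	fmt.Println(get("/ts/heartrate/range/1509564000/1509570000")) // entries in a range
}
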
cgreenhalgh commented 6 years ago

For info, my current use case is importing rides from Strava, but it would be similar for importing tweets, fitbit activities, etc., that all have a well defined time of occurrence (started, posted, ...).

For the time series part of the API

A few clarifications to the details of the current proposal above, assuming it follows store-timeseries:

Currently the most-used store is the store-json (API documented there) which differs in a few ways including:

time series proposal

I would propose:

In addition I suggest:

key value API

I haven't really used this so I won't make a concrete proposal, but it has always seemed odd to me that from a datasource perspective the datasource id IS the key, so it is really just a single value store, not a key-value store! It would seem to make more sense if there were, e.g.:

And consequently also:

I would probably look at redis for inspiration.

Toshbrown commented 6 years ago

The next driver I'm writing is an IMAP email driver. A number of apps will want to process the stored data, filtering for addresses and/or keywords in the subject and body. The only way to achieve this with the current store would be to retrieve all the emails and parse the data in the app.

So the question is should the store support some kind of query beyond filtering on timestamps?

jptmoore commented 6 years ago

Thanks Chris. Everything looks reasonable. I will comment further when I tackle the points in more detail in the new store.

mor1 commented 6 years ago

@jptmoore @cgreenhalgh Thoughts (some of which repeat Chris' comments):

(Other suggestions all sound reasonable though.)

Toshbrown commented 6 years ago

@mor1 I was going to use this https://github.com/emersion/go-imap, but if you really want an OCaml version, let me know so I won't waste any time implementing it. It looks like email may form part of the risk awareness/communication studies we will be starting here at some point soon.

mor1 commented 6 years ago

@mor1 Not so bothered about an OCaml version per se; there's an OCaml IMAP implementation (a couple, I think, in fact), but it's the MIME parsing that I was mentioning particularly. It's insanely complicated to get right, but absolutely necessary if you want to robustly process mail contents (rather than just transport mail to/from/between servers).

Toshbrown commented 6 years ago

@mor1 so you're thinking about doing MIME parsing in the store?

I was going to do it in the driver and then link using UUIDs for the binary parts.

This is getting a bit off topic; I will create an issue to discuss the details of the IMAP driver.

mor1 commented 6 years ago

@Toshbrown not in the store per se, but you may want to explicitly put the results of having extracted content from the mail into the store (attachments, mail headers, etc.), or perhaps in a derived store rather than the one associated with the IMAP (email) driver. Even just extracting attachments robustly is a surprising PITA. (Agree this is off-topic though.)

cgreenhalgh commented 6 years ago

See also me-box/core-export-service#28 on a possible job queue store API that might also be supported.

Also worth noting that the current store-json subscription API isn't shown here, and it (or something like it) will need to be supported.

jptmoore commented 6 years ago

@cgreenhalgh

I made some changes below (needs testing and error handling, e.g. reporting path errors back to the client).

You can try out the changes from the Docker client/server.

choosing a standard time representation that allows better than second accuracy, e.g. float64 seconds since UNIX epoch

Using milliseconds since epoch now

allowing explicit timestamps to be specified when writing values, perhaps using the approach in the current store-json (will increase code compatibility)

You can post with URL: /ts/[id]/at/[time] to specify your own time

returning values with explicit timestamps as in store-json

Timestamps are returned with data like this [1509564588450, [1,2,3,4,5]]

confirm if range end time is exclusive

The end range is now inclusive.
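
Concretely (the series id and payload here are made up for illustration), a write with an explicit time and a subsequent read would look like:

POST /ts/sensor1/at/1509564588450   body: [1,2,3,4,5]
GET  /ts/sensor1/latest             returns: [1509564588450, [1,2,3,4,5]]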

Toshbrown commented 6 years ago

You can post with URL: /ts/[id]/at/[time] to specify your own time

does this overwrite the internal timestamp? and are there any constraints on its format?

jptmoore commented 6 years ago

does this overwrite the internal timestamp? and are there any constraints on its format?

Yes, it overwrites the internal one. It is an integer of epoch milliseconds.

cgreenhalgh commented 6 years ago

@jptmoore On the updated API...

Returning the time-value in a heterogeneous array (aka an array representing a tuple, rather than an object with named fields) makes it problematic to type in some languages and more complicated to marshal/unmarshal (e.g. in Go, which I'm using at the moment). It's not impossible but it is a hassle. It may also reduce consistency with the notification type/values??
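
(For concreteness, a sketch of what consuming the tuple form takes in Go; the Entry type and its field names are made up for the example:)

package main

import (
	"encoding/json"
	"fmt"
)

// Entry is a hypothetical client-side type for one time-series reading.
type Entry struct {
	Timestamp int64
	Data      json.RawMessage // the value can be arbitrary JSON
}

// UnmarshalJSON decodes the tuple form, e.g. [1509564588450, [1,2,3,4,5]].
func (e *Entry) UnmarshalJSON(b []byte) error {
	var parts []json.RawMessage
	if err := json.Unmarshal(b, &parts); err != nil {
		return err
	}
	if len(parts) != 2 {
		return fmt.Errorf("expected [timestamp, data], got %d elements", len(parts))
	}
	if err := json.Unmarshal(parts[0], &e.Timestamp); err != nil {
		return err
	}
	e.Data = parts[1]
	return nil
}

func main() {
	var e Entry
	if err := json.Unmarshal([]byte(`[1509564588450, [1,2,3,4,5]]`), &e); err != nil {
		panic(err)
	}
	fmt.Println(e.Timestamp, string(e.Data)) // 1509564588450 [1,2,3,4,5]
}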

The example value wasn't ever so clear but I believe (hope) [1,2,3,4,5] is a single value and every value has its own timestamp.

@mor1 On the type of time: I know ns since (an) epoch was mentioned, but a caution I would give about that is that a sufficient range of values can't be exactly represented in a float64, which is all that some languages will use for numbers (e.g. JavaScript, max exact integer 2^53 - 1). Milliseconds is OK with me. I think microseconds might also fit but is rather non-standard.

When you say we can try them, I thought the new store only supported the zeromq transport, but afaik the go client library for this doesn't exist/isn't complete yet? ( @Toshbrown ?)

Toshbrown commented 6 years ago

@cgreenhalgh I've started updating the Go library Toshbrown/lib-go-databox. I've got basic KV and TS reads and writes working with tokens inside the databox; example code is in Toshbrown/driver-tplink-smart-plug.

It's not ready to go yet. I need to add the observe API, think about the API exposed to app/driver developers, and turn the handle on the rest of the endpoints once they are stable. I'm thinking it will be mid next week before I get a chance to finish it (working on other projects until the 7th of Nov).

By trying, I think @jptmoore is referring to the client and server he uses outside of the databox for testing here. It's all wrapped in Docker containers and allows all the functionality to be tested.

jptmoore commented 6 years ago

@cgreenhalgh

Returning the time-value in a heterogeneous array (aka an array representing a tuple, rather than an object with named fields) makes it problematic to type in some languages and more complicated to marshal/unmarshal (e.g. in Go, which I'm using at the moment). It's not impossible but it is a hassle. It may also reduce consistency with the notification type/values??

Do you have an example of some JSON you would like to be returned?

The example value wasn't ever so clear but I believe (hope) [1,2,3,4,5] is a single value and every value has its own timestamp.

Yeh, [1,2,3,4,5] is the JSON data POSTed. The API takes any JSON as the value.

cgreenhalgh commented 6 years ago

@jptmoore

Do you have an example of some JSON you would like to be returned?

Current store-json uses (for your example) {"timestamp":1509564588450, "data":[1,2,3,4,5]}. Or you could use shorter property names ("ts"/"t","data"/"d"??) if you are worried about the byte count, not compressing and not too bothered about readability :-) Obviously a little more overhead than the array/tuple encoding, so that's the trade-off.

jptmoore commented 6 years ago

@cgreenhalgh

I pushed a new image which returns in this format:

{"timestamp":1509626879783,"data":[1,2,3]}

Toshbrown commented 6 years ago

@jptmoore While updating lib-go-databox I was trying the new /ts/<id>/at/<t>

and requested permissions like this from the arbiter:

{"target":"driver-tplink-smart-plug-core-store","path":"/ts/tosh/*","method":"POST"}

These are granted by the arbiter but rejected by the store. Do you parse wildcards in the macaroon caveats?

requesting permissions like this:

{"target":"driver-tplink-smart-plug-core-store","path":"/ts/tosh/at/1509641953193165","method":"POST"}

works fine, but this means that the macaroons can't be cached when using this endpoint.

jptmoore commented 6 years ago

@Toshbrown

works fine, but this means that the macaroons can't be cached when using this endpoint.

Yeh, currently it is matching the exact path, so I will need to implement wildcards.
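
(The store itself is OCaml, but as a sketch of the intended semantics, a trailing * caveat just needs a prefix match; this Go function and its name are made up for illustration:)

package main

import (
	"fmt"
	"strings"
)

// pathMatches reports whether a request path satisfies a caveat path,
// treating a trailing "*" in the caveat as "any suffix".
func pathMatches(caveat, request string) bool {
	if strings.HasSuffix(caveat, "*") {
		return strings.HasPrefix(request, strings.TrimSuffix(caveat, "*"))
	}
	return caveat == request
}

func main() {
	fmt.Println(pathMatches("/ts/tosh/*", "/ts/tosh/at/1509641953193165")) // true
	fmt.Println(pathMatches("/ts/tosh/*", "/kv/tosh"))                     // false
}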

jptmoore commented 6 years ago

@Toshbrown

I have pushed a new image which should support wildcards.

cgreenhalgh commented 6 years ago

@jptmoore I'd like to push hard for a bulk add operation in the timeseries API. I know @Toshbrown hit this as a (speed) limitation with the Google Takeout import, and I was struggling with a simple performance test (adding 1,000s of items in a reasonable time; even activity/heartrate at 1/minute = 1440/day...). Having this in the API amortises the overhead of the request/response communication and also opens the option of handling the set of values within a single transaction/commit in the datastore for further optimisation.

Perhaps

I'm not sure about generating events: should each value generate an event, or only the last value, or should it generate a distinct import event, or nothing? For my current use cases I don't need any events (I'd be updating a parallel KV store with source metadata and could use the event from that).

It also raises a question (in my mind, at least) about whether the existing write entry point should change, e.g. to POST ts/[id]/value.

jptmoore commented 6 years ago

@cgreenhalgh could you give me a sample of the bulk JSON data you have to test with, please?

cgreenhalgh commented 6 years ago

I assume the same kind of thing as you get back from a range query, e.g. for a simple value

[
  { "timestamp": 1509626879783, "data": 14.5 },
  { "timestamp": 1509626880783, "data": 14.8 },
  { "timestamp": 1509626881783, "data": 16.0 },
  { "timestamp": 1509626882783, "data": 16.5 },
]

or for a complex value

[
  { "timestamp": 1509626879783, "data": { "event": "event type 1", "value": 42, "content": "something" } },
  { "timestamp": 1509626880783, "data": { "event": "event type 1", "value": 44, "content": "nothing" } },
  { "timestamp": 1509626881783, "data": { "event": "event type 2", "content": "smells a bit" } },
  { "timestamp": 1509626882783, "data": { "event": "event type 1", "value": 48, "content": "something again" } },
]
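
A driver could then import a whole batch in one request. A hedged sketch, where the /ts/<id>/bulk endpoint name is purely an assumption (the thread doesn't settle on one):

package main

import (
	"bytes"
	"net/http"
)

const store = "http://127.0.0.1:8080" // hypothetical store address, as before

func main() {
	// One POST carries many timestamped entries, amortising per-request
	// overhead and letting the store commit them in a single transaction.
	entries := `[
	  {"timestamp": 1509626879783, "data": 14.5},
	  {"timestamp": 1509626880783, "data": 14.8}
	]`
	if _, err := http.Post(store+"/ts/readings/bulk", "application/json",
		bytes.NewBufferString(entries)); err != nil {
		panic(err)
	}
}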