Open jptmoore opened 6 years ago
For info, my current use case is importing rides from Strava, but it would be similar for importing tweets, fitbit activities, etc., that all have a well defined time of occurrence (started, posted, ...).
A few clarifications to the details of the current proposal above, assuming it follows store-timeseries:
Currently the most-used store is the store-json (API documented there) which differs in a few ways including:
data (the value) and timestamp (explicit timestamp, optional) or raw value
data and timestamp (and also datasource_id, but that isn't so useful externally except for subscriptions)
I would propose:
In addition I suggest:
I haven't really used this so I won't make a concrete proposal, but it has always seemed odd to me that from a datasource perspective the datasource id IS the key, so it is really just a single value store, not a key-value store! It would seem to make more sense if there were, e.g.:
And consequently also:
I would probably look at redis for inspiration.
The next driver I'm writing is an IMAP email driver. A number of apps will want to process the stored data filtering for addresses and/or keywords in subject and body. The only way to achieve this with the current store would be to retrieve all the emails and parse the data in the app.
So the question is should the store support some kind of query beyond filtering on timestamps?
Thanks Chris. Everything looks reasonable. I will comment further when I tackle the points in more detail in the new store.
@jptmoore @cgreenhalgh Thoughts (some of which repeat Chris' comments):
Following discussion of time, presumably /latest, /last, /since, /range must refer to the receipt-times rather than the datum-times? (I think this is another argument for having the driver able to interpret at least the timestamp of the source to normalise it.)
A well-defined timestamp type seems like a good idea. I would avoid floats myself though-- eg, Windows file timestamps are uint64 counts of 100-nanosecond intervals since 1/1/1601 IIRC. There was some possibly-related discussion about this in the Mirage/OCaml worlds a while back, see eg https://github.com/mirage/mirage-clock/issues/1 and documentation for Ptime at http://erratique.ch/software/ptime/doc/index. If space considerations were a concern, some kind of packed 2*32 bit representation where a bit is used to indicate if there's a second 32-bit value or not, perhaps. (But that doesn't make any sense at all if we continue using an ASCII encoded JSON representation for everything.)
On the last point, are we going to enforce JSON encoding everywhere? For things like email, it seems that encodings will stack up pretty badly in that case. (Though perhaps this is just a use case for a BLOB store, to avoid fiddling with the bits in the mails.)
@Toshbrown I haven't forgotten about sorting out my offlineimap config (and related) to a container. Will get back to it soon. To expose mail as a useful datasource, will want some robust MIME etc parsing too-- there's a newish OCaml library that's good for this (processed my personal store of ~850k mails fine, and reasonably quickly), let me know if you want the URL (can't recall it right now)...
I see the need for pagination from a UI/web point of view -- how will that fit in with datasources where data is to be streamed? (I guess the "store" in question will look different?)
(Other suggestions all sound reasonable though.)
@mor1 I was going to use this https://github.com/emersion/go-imap, but if you really want an OCaml version, let me know and I won't waste any time implementing it. It looks like email may form part of the risk awareness/communication studies we will be starting here at some point soon.
@mor1 Not so bothered about OCaml version per se-- there's an OCaml IMAP implementation (a couple I think in fact), but it's the MIME parsing that I was mentioning particularly. It's insanely complicated to get right, but absolutely necessary if you want to robustly process mail contents (rather than just transport mails to/from/between servers).
@mor1 so you're thinking about doing MIME parsing in the store?
I was going to do it in the driver then link using UUIDs for the binary parts
This is getting a bit off topic; I will create an issue to discuss the details of the IMAP driver.
@Toshbrown not in the store per se, but you may want to explicitly put the results of having extracted content from the mail into the store (attachments, mail headers, etc). or perhaps in a derived store rather than that associated with the imap (email) driver. even just extracting attachments robustly is a surprising pita. (agree this is off-topic though.)
See also me-box/core-export-service#28 on a possible job queue store API that might also be supported, e.g. job queue store API.
Also worth noting that the current store-json subscription API isn't shown here, and will need to be supported (or something like it)
@cgreenhalgh
I made some changes below (needs testing and error handling, e.g. reporting path errors back to client etc)
You can try out the changes from the docker client/server
choosing a standard time representation that allows better than second accuracy, e.g. float64 seconds since UNIX epoch
Using milliseconds since epoch now
allowing explicit timestamps to be specified when writing values, perhaps using the approach in the current store-json (will increase code compatibility)
You can post with URL: /ts/[id]/at/[time] to specify your own time
returning values with explicit timestamps as in store-json
Timestamps are returned with data like this [1509564588450, [1,2,3,4,5]]
confirm if range end time is exclusive
The end range is now inclusive.
You can post with URL: /ts/[id]/at/[time] to specify your own time
does this overwrite the internal timestamp? and are there any constraints on its format?
does this overwrite the internal timestamp? and are there any constraints on its format?
Yes, it overwrites the internal one. It is an integer of epoch milliseconds.
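As a rough illustration of the explicit-timestamp write described above, the sketch below converts a timezone-aware datetime to integer epoch milliseconds and builds the /ts/[id]/at/[time] path. The helper names `to_epoch_ms` and `at_path` are hypothetical and not part of any client library; only the path shape and the millisecond unit come from the thread.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_ms(when: datetime) -> int:
    """Convert a timezone-aware datetime to integer milliseconds since the UNIX epoch.

    Integer timedelta division avoids the rounding that float seconds can introduce.
    """
    return (when - EPOCH) // timedelta(milliseconds=1)


def at_path(datasource_id: str, when: datetime) -> str:
    """Build the write path for an entry with an explicit timestamp (hypothetical helper)."""
    return f"/ts/{datasource_id}/at/{to_epoch_ms(when)}"


ride_start = datetime(2017, 11, 1, 19, 29, 48, 450000, tzinfo=timezone.utc)
print(at_path("tosh", ride_start))
```

This suits imports such as Strava rides, where the time of occurrence is known and should replace the receipt time.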
@jptmoore On the updated API...
Returning the time-value in a heterogeneous array (aka array representing a tuple, rather than an object with named fields) makes it problematic to type in some languages and more complicated to marshal/unmarshal (e.g. in go, which I'm using at the moment). It's not impossible but it is a hassle. It may also reduce consistency with the notification type/values??
The example value wasn't ever so clear but I believe (hope) [1,2,3,4,5] is a single value and every value has its own timestamp.
@mor1 On the type of time I know ns since (an) Epoch was mentioned but a caution I would give about that is that a sufficient range of values can't be exactly represented in a float64 which is all that some languages will use for numbers (e.g. JavaScript, max int 2^53-1). Milliseconds is OK with me. I think microseconds might also fit but is rather non-standard.
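The float64 concern can be checked with a little arithmetic. The sketch below takes the millisecond timestamp quoted earlier in the thread, scales it to microseconds and nanoseconds (the nanosecond reading is a made-up value for illustration), and tests whether each survives a round trip through a float64.

```python
MAX_SAFE_INT = 2**53 - 1  # largest integer below which every integer is exact in a float64

ms = 1509564588450           # milliseconds since epoch, from the thread
us = ms * 1_000              # the same instant in microseconds
ns = ms * 1_000_000 + 1      # hypothetical nanosecond reading, one ns later


def survives_float64(n: int) -> bool:
    """True if n converts to a float64 and back without changing."""
    return int(float(n)) == n


for name, value in (("ms", ms), ("us", us), ("ns", ns)):
    print(name, value <= MAX_SAFE_INT, survives_float64(value))
```

Millisecond and microsecond values for current dates stay under 2^53-1, while nanosecond values are roughly two orders of magnitude above it and lose their low bits.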
When you say we can try them, I thought the new store only supported the zeromq transport, but afaik the go client library for this doesn't exist/isn't complete yet? ( @Toshbrown ?)
@cgreenhalgh I've started updating the go library Toshbrown/lib-go-databox. I've got basic KV and TS reads and writes working with tokens inside the databox; example code is in Toshbrown/driver-tplink-smart-plug.
It's not ready to go yet. I need to add the observe API, think about API exposed to app/driver developers, and turn the handle on the rest of the endpoints once they are stable. I'm thinking it will be mid next week before I get a chance to finish it (working on other projects until the 7th of Nov)
By trying I think @jptmoore is referring to the client and server he uses outside of the databox for testing here. It's all wrapped in docker containers and allows all the functionality to be tested.
@cgreenhalgh
Returning the time-value in a heterogeneous array (aka array representing a tuple, rather than an object with named fields) makes it problematic to type in some languages and more complicated to marshal/unmarshal (e.g. in go, which I'm using at the moment). It's not impossible but it is a hassle. It may also reduce consistency with the notification type/values??
Do you have an example of some JSON you would like to be returned?
The example value wasn't ever so clear but I believe (hope) [1,2,3,4,5] is a single value and every value has its own timestamp.
Yeh, [1,2,3,4,5] is the JSON data POSTed. The API takes any JSON as the value.
@jptmoore
Do you have an example of some JSON you would like to be returned?
Current store-json uses (for your example) {"timestamp":1509564588450, "data":[1,2,3,4,5]}. Or you could use shorter property names ("ts"/"t","data"/"d"??) if you are worried about the byte count, not compressing and not too bothered about readability :-) Obviously a little more overhead than the array/tuple encoding, so that's the trade-off.
@cgreenhalgh
I pushed a new image which returns in this format:
{"timestamp":1509626879783,"data":[1,2,3]}
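To make the trade-off between the two encodings concrete, here is a minimal sketch that decodes both the earlier array/tuple form and the object form into one record type. The `Entry` class and the two decoder functions are hypothetical names used only for illustration; the JSON shapes are the ones quoted in the thread.

```python
import json
from dataclasses import dataclass
from typing import Any


@dataclass
class Entry:
    timestamp: int  # epoch milliseconds
    data: Any       # any JSON value


def from_object(text: str) -> Entry:
    """Decode the store-json style form: named fields, order-independent."""
    obj = json.loads(text)
    return Entry(timestamp=obj["timestamp"], data=obj["data"])


def from_tuple(text: str) -> Entry:
    """Decode the earlier array form: meaning depends on position only."""
    timestamp, data = json.loads(text)
    return Entry(timestamp=timestamp, data=data)


print(from_object('{"timestamp":1509626879783,"data":[1,2,3]}'))
print(from_tuple('[1509564588450, [1,2,3,4,5]]'))
```

In a dynamically typed language both are easy to decode; the object form mainly helps statically typed clients, which can map named fields onto a struct without custom unmarshalling.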
@jptmoore While updating lib-go-databox I was trying the new /ts/<id>/at/<t> and requested permissions like this from the arbiter:
{"target":"driver-tplink-smart-plug-core-store","path":"/ts/tosh/*","method":"POST"}
These are granted by the arbiter but rejected by the store. Do you parse wildcards in the macaroon caveats?
requesting permissions like this:
{"target":"driver-tplink-smart-plug-core-store","path":"/ts/tosh/at/1509641953193165","method":"POST"}
works fine, but this means that the macaroons can't be cached when using this endpoint.
@Toshbrown
works fine, but this means that the macaroons can't be cached when using this endpoint.
Yeh, currently it is matching the exact path so will need to implement wildcards.
@Toshbrown
I have pushed a new image which should support wildcards.
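For readers following the wildcard discussion, the sketch below shows one simple way a granted path such as /ts/tosh/* could be matched against a requested path. It is an assumption about the semantics (a trailing /* matches anything beneath that prefix, everything else must match exactly), not the store's actual implementation, and `path_allowed` is a hypothetical name.

```python
def path_allowed(granted: str, requested: str) -> bool:
    """Check a requested path against a granted path caveat.

    Assumed semantics: a trailing "/*" grants every path under that prefix;
    any other granted path must match the request exactly.
    """
    if granted.endswith("/*"):
        prefix = granted[:-1]  # drop "*" but keep the slash, so "/ts/tosh/" is the prefix
        return requested.startswith(prefix)
    return granted == requested


print(path_allowed("/ts/tosh/*", "/ts/tosh/at/1509641953193165"))  # granted by wildcard
print(path_allowed("/ts/tosh/*", "/ts/toshiba/at/1"))              # different datasource
```

Keeping the slash in the prefix stops a grant for one datasource id from leaking to ids that merely share its leading characters, and a single wildcard token can then be cached across writes with different timestamps.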
@jptmoore I'd like to push hard for a bulk add operation in the timeseries API. I know @Toshbrown hit this as a (speed) limitation with the google takeout import, and I was struggling with a simple performance test (adding 1,000s items in a reasonable time - even activity/heartrate at 1/minute = 1440/day...). Having this in the API amortises the overhead of the request/response communication and also opens the option of handling the set of values within a single transaction/commit in the datastore for further optimisation.
Perhaps /ts/[id]/import (e.g. [{"timestamp":1509626879783,"data":[1,2,3]},...]). I'm not sure about generating events: should each value generate an event, or only the last value, or should it generate a distinct import event, or nothing? For my current use cases I don't need any events (I'd be updating a KV store with source metadata in parallel and could use the event from that).
It also raises a question (in my mind, at least) about whether the existing write entry point should change, e.g. to POST ts/[id]/value.
@cgreenhalgh could you give me a sample of the bulk JSON data you have to test with please.
I assume the same kind of thing as you get back from a range query, e.g. for a simple value
[
{ "timestamp": 1509626879783, "data": 14.5 },
{ "timestamp": 1509626880783, "data": 14.8 },
{ "timestamp": 1509626881783, "data": 16.0 },
{ "timestamp": 1509626882783, "data": 16.5 }
]
or for a complex value
[
{ "timestamp": 1509626879783, "data": { "event": "event type 1", "value": 42, "content": "something" } },
{ "timestamp": 1509626880783, "data": { "event": "event type 1", "value": 44, "content": "nothing" } },
{ "timestamp": 1509626881783, "data": { "event": "event type 2", "content": "smells a bit" } },
{ "timestamp": 1509626882783, "data": { "event": "event type 1", "value": 48, "content": "something again" } }
]
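A client preparing data for the proposed bulk import might batch entries along these lines. This is only a sketch: the `bulk_bodies` helper and the batch size of 500 are assumptions, and the import endpoint itself is still a proposal in this thread. Each yielded string is a JSON array in the same shape as the samples above.

```python
import json
from typing import Any, Iterable, Iterator, List, Tuple


def bulk_bodies(entries: Iterable[Tuple[int, Any]], batch_size: int = 500) -> Iterator[str]:
    """Group (timestamp_ms, data) pairs into JSON array bodies of at most batch_size entries."""
    batch: List[dict] = []
    for timestamp, data in entries:
        batch.append({"timestamp": timestamp, "data": data})
        if len(batch) == batch_size:
            yield json.dumps(batch)
            batch = []
    if batch:  # flush the final, possibly shorter, batch
        yield json.dumps(batch)


# One day of heart-rate style readings at one per minute (1440 values).
start_ms = 1509626879783
day = [(start_ms + minute * 60_000, 60 + minute % 5) for minute in range(1440)]
bodies = list(bulk_bodies(day))
print(len(bodies), [len(json.loads(body)) for body in bodies])
```

Sending three requests instead of 1440 is where the amortised request/response overhead comes from; whether the store commits each body in a single transaction is a separate server-side choice.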
The API in the new store currently looks like below.
The current implementation supports POST/GET of JSON, text and binary data.
Suggestions welcome on changes/additions.
Key/Value API
Write entry
Read entry
Time series API
Write entry
Read latest entry
Read last number of entries
Read all entries since a time
Read all entries in a time range
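The read operations listed above can be modelled in memory to make their semantics easier to discuss. The sketch below is not the store's code: the class and method names are hypothetical, results are returned oldest first, and /since is treated as inclusive of the given time, all of which are assumptions. The inclusive range end follows the answer given earlier in the thread.

```python
import bisect
from typing import Any, Dict, List, Optional

Entry = Dict[str, Any]


class TimeSeries:
    """Toy in-memory time series kept sorted by timestamp (epoch milliseconds)."""

    def __init__(self) -> None:
        self._times: List[int] = []
        self._values: List[Any] = []

    def write(self, data: Any, timestamp: int) -> None:
        # Insert after any equal timestamps so write order is kept among ties.
        i = bisect.bisect_right(self._times, timestamp)
        self._times.insert(i, timestamp)
        self._values.insert(i, data)

    def _entries(self, lo: int, hi: int) -> List[Entry]:
        return [{"timestamp": t, "data": v}
                for t, v in zip(self._times[lo:hi], self._values[lo:hi])]

    def latest(self) -> Optional[Entry]:
        size = len(self._times)
        return self._entries(size - 1, size)[0] if size else None

    def last(self, n: int) -> List[Entry]:
        size = len(self._times)
        return self._entries(max(0, size - n), size)

    def since(self, start: int) -> List[Entry]:
        # Assumption: entries at exactly `start` are included.
        return self._entries(bisect.bisect_left(self._times, start), len(self._times))

    def range(self, start: int, end: int) -> List[Entry]:
        # Both ends inclusive, matching "The end range is now inclusive."
        lo = bisect.bisect_left(self._times, start)
        hi = bisect.bisect_right(self._times, end)
        return self._entries(lo, hi)


series = TimeSeries()
for offset, value in enumerate([14.5, 14.8, 16.0, 16.5]):
    series.write(value, 1509626879783 + offset * 1000)
print(series.latest())
print(series.range(1509626880783, 1509626881783))
```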