Review sisock.data_node_server API

simonsobs / sisock

Sisock ('saɪsɒk): streaming of Simons Obs. data over websockets for quicklook

Other

2 stars 0 forks source link

Review sisock.data_node_server API #2

Closed ahincks closed 6 years ago

ahincks commented 6 years ago

Review the API of the sisock.data_node_server base class. Currently there are only three methods, but I think they have the majority of the functionality we want.

Very open to proposals for new parameters, or different format for returning data.

There is an example that implements most of the current functionality (example_server_weather.py), and I am working on another (example_server_sensors.py): writing these have been helpful for figuring out some use cases.

ahincks commented 6 years ago

I propose that initial back-and-forth occur as comments here, and then when things get a little more concrete, we use pull requests to implement what we're zeroing in on.

mhasself commented 6 years ago

Great to see this. Sticking just to the API, here are my thoughts:

get_data(start, length, min_stride=None):

I think (start,stop) is a better pattern than (start,length). Consider two consecutive requests:

get_data(time0, time1)
get_data(time1, time2)

Between them, these should give me every sample between time0 and time2. But in the (start, length) API the calls would be:

get_data(time0, time1-time0)
get_data(time1, time2-time1)

It's possible that this, due to rounding, would drop or duplicate items around time1. Do we care about this unlikely corner case? Not usually. But sometimes, yes. When choosing between these seemingly equivalent approaches, float-safety is something to consider.

If we're worried about how to say things like "give me the past 60 seconds of data", that can still be done in the (start,stop) API. If start < 0, that can trigger both the start and stop times to be interpreted relative to now(). Then get_data(-60, 0) is how you retrieve the most recent 60 seconds of data.

get_fields(t):

I think the get_fields function should accept a (start, stop) time range, like get_data. Interval support is necessary in order to reliably navigate a complex data set. Imagine, for example, that I wanted to get a list of all fields in the dataset, covering all times. And then suppose I wanted to determine the first and last time at which a field occurred. With interval support, I can binary-search for the answer to the latter question.

A simple implementation might decline to implement this fully, and just always return the same list of fields (regardless of the time interval specified). That's fine. But for any more advanced functionality, an interval solves the problem completely but a single time point does not.

last_update(field):

Why is this needed? :) Why not just call get_data (since that's what I'm going to do in a second anyway)?

ahincks commented 6 years ago

Thanks for the detailed thoughts. My responses/comments to your comments are below. But also, if you can think of functions that should exist in addition to these 3 (or perhaps 2, given your comments on the third), do chime in. (And finally, I take your silence on the snake-case prototypes as tacit approval :-) .)

get_data() Doesn't the issue you're worrying about also depend on the implementation by the server? What if the natural way for it to measure out the data is with start and length? Then if you provide it with time0 and time1, it's going to calculate a duration by taking time1-time0 and you get the same rounding issue, if I'm not mistaken.

I'm fine with time0 and time1 (although aesthetically and intuitively I prefer start and length), but we should be careful we're not assuming/imposing a design on the data node server.

get_fields() So what would be the return value in the scenario that it's get_fields(start, stop)? If a field has as at least one datum available in the interval, report it?

I'm thinking of a scenario in which a field is available for a short period after start, then doesn't exist, and then comes back on soon before stop. Should we worry about the consumer not being aware that the field was actually not available for the majority of the interval?

last_update() I may have included this because Barth mentioned it on the call! I think you're reasoning is correct that it doesn't provide any independently useful functionality.

Maybe something similar that would be useful would be some kind of query to see if a data source is live, and if so, how deep its buffer is. Something like is_live(field), which returns the oldest and most recent timestamp if its live, and False if its an archive? I could imagine this being useful information for figuring out, e.g., how much data is available to plot, but on the other hand it may be too specific to divide data nodes into 'live' and 'archive'.

mhasself commented 6 years ago

I like snake case... but let's follow PEP 8 in all new code. https://www.python.org/dev/peps/pep-0008 I think you've done that except for the class name data_node_server. Autobahn usese camelCase because it came from C++ (I imagine); that's fine, I actually like that you can tell which functions are for autobahn and which ones are our public API.

get_data() I'm not sure I can imagine a server that would internally use (start, length) in the way you suggest. At least not in the model where the server is returning timestamped data. Can't we assume that each sample that the server can serve has a well-defined, invariant timestamp? Surely the timestamp does not change depending on the request from the client (i.e. the values of start and length).

get_fields() Yes, I think the best rules for get_fields(t0,t1) are:

if the field has 1 or more sample in [t0,t1), the field must be included in the output.
if the field has 0 samples in [t0,t1), it is optional to include the field in the output.

This allows the server to use rough rubricks, if necessary, when assessing presence (e.g. perhaps it has pre-scanned 100 different files, and only knows the names of each field in each file, though the field may come and go a bit within those files).

Ultimately the consumer doesn't really know what they're going to get until they get_data(); that's the best time to deal with the consequences (of sparsity, 0 samples, whatever). But at least they have a (super-)complete set of fields to query.

last_update() Ya... it does seem like data that is updating needs some additional API support. Perhaps it will become more clear once we're doing it... I think is_live(fields) could be useful; or a slightly more informative thing might be called get_polling_interval(fields), which provides a suggestion to the client about how often new data are becoming available.

ahincks commented 6 years ago

Thanks!

Yes on PEP 8.
get_data(): OK. Just wanted to be sure that we're not painting anyone into a corner, but sounds like you're confident the start and end is the way to go, so I'm happy with that.
get_fields(): OK

It occurs to me that something like following function might be useful, and meet the need we have an instinct for:

get_timeline(start, stop)

It returns:

All the available timelines in the interval.
For each timeline it gives the interval between timestamps (which can be approximate if there isn't sub-second precision), or None if it is aperiodic.
For each timeline, it gives the list of fields associated with it. Thus it can be complementary to get_fields(). I was finding writing the JavaScript example that I needed a timeline->field dictionary, rather than field->timeline. But I can easily imagine that the field->timeline dictionary is useful in other contexts. So the complementarity would be handy, without being strictly necessary (you can get people to make the inverse association themselves, as I did in the JS example)..

This is somewhat off the top of my head, so do iterate ... or shoot down.

mhasself commented 6 years ago

I like this get_timeline(start, stop) behavior. Interesting that it's a superset of what get_fields() does... Perhaps this should be what get_fields() does?

Knowing the sampling rate is helpful for a variety of reasons. But also a live plotter needs to know when a time interval is "finalized" -- i.e. when further requests for data will not reveal new data, even after waiting and trying again. Perhaps each "timeline" description should also include a field or flag that describes whether data in the requested interval is subject to change (this is like the "is_live" suggestion). Or, what if each get_data request included a "finalized" field, for each timeline, that gives the timestamp (probably near the end of the requested window) prior to which the caller may assume that all data are no longer subject to change.

For example suppose a plotting code requests all data in the interval (10230, 10290), where 10290 is roughly "now" ("The year was 1970..."). The server is only getting updates every 30 seconds or so, and only has data up to 10280. So it returns the data it has, along with the pair "finalized: 10280" for the timeline. The caller would now know that its next request should start at 10280. A slightly more sophisticated caller could probably figure out the typical latency (based on data so far: 10290 - 10280 = 10) and set the re-query time based on that; i.e. don't bother pinging the server every 1s if new data only seem to appear every 30 s.

The special value "finalized: None" could be used to signal "ya, it's all finalized; this is not a dynamic data source."

ahincks commented 6 years ago

I'm not sure I fully get your first paragraph. Are you suggesting that get_timelines() and get_fields() return the same data, except that one has a timeline -> field dictionary and the other has a field -> timeline dictionary? If so, I agree.
I like finalised because I'm worried there are cases in which is_live is ambiguous. For instance, say I have a data archive that gets updated every 24 hours. If you return is_live = True it's misleading because the consumer will have to wait up to a day for the data; but if you return False a client running long term may just assume that no new data are ever added. I think finalised fixes this. Given the example I've given, though, should it ever return finalised: None?
(Really 2b, but markdown won't let me name an enumerate thus.) I confess that the term 'finalised' threw me off at first: to me it intuitively indicated that the server wasn't expecting any more data, i.e., that the archive was completed. Maybe something along the lines of buffer_end? Though I'm not sure that's much better.

mhasself commented 6 years ago

Actually I was suggesting that get_fields() could return the timeline -> field dictionary, and that get_timelines() is then unnecessary. But I see now that get_fields() returns its own indispensible things. Perhaps get_fields() could return both of these dicts? This does seem like a case where the caller usually needs both pieces of information.
For more clarity, we could say "finalized_to" or "finalized_until" or "finalized_up_to". Important to emphasize that it's a timestamp, not a boolean (i.e. it's not an is_finalized). In your 24-hour update example, a requester might ask for hours of data at a time, and basically the server would reply "my current buffer ends at 3pm yesterday". The client is not given direct information about when the new data might arrive... but an intelligent client can still base the update interval on the current data staleness.

In the case of a server that is not expecting more data, ever.... it returns "finalized_to: None", regardless of the query.

I would like to standardize U.S. English for all new code. Even if everyone reading this is Canadian... we need a lowest-common-denominator standard :)

ahincks commented 6 years ago

I think we've converged sufficiently at this point. I'll work on creating a pull request that implements our discussion and add you as a reviewer. (Once that pull request gets merged I can close out this issue.)

ahincks commented 6 years ago

Issues addressed in merged pull request 8feb58a.