hapi-server / data-specification

HAPI Data Access Specification
https://hapi-server.org

add processing options for servers and for specific datasets #79

Open jvandegriff opened 5 years ago

jvandegriff commented 5 years ago

At the 2019-06-03 telecon, we talked about adding processing options in a generic way that clients could make use of. Here's a summary of that discussion.

Jeremy presented Das2 server options, where a dataset on a Das2 server can have processing flags. Each set of options is for an individual dataset. This practice grew out of the original use for Das2 servers, which was as a somewhat internal protocol between a client and server written by one developer, who understood what all the "secret" options were and could use them to optimize the data transfer for what the client needed. Jeremy advised against this kind of behind-the-scenes options proliferation.

However, some way to allow configuration options for a dataset could be useful. There are at least three classes of processing options:

  1. generic processing options that any HAPI server can optionally offer to do on any of its datasets; these are uniform across all HAPI servers; examples include: binning, interpolation, and spike removal (there is actually a lot to say about keeping these as simple and universal as possible, and even then not everyone will accept or want to use a generic approach); the algorithm for each process would be standardized
  2. customized options provided by a server across all its datasets; if a data provider has a filter or binning approach different than the HAPI standard ones, they can offer that for all datasets as well
  3. dataset-specific processing options that are meant to be used by specialists for a particular dataset; these are not uniform across all HAPI servers, and indeed would be expected to vary even among datasets within a server; they represent configuration settings, processing adjustments, calibration options, specialized filtering, custom data binning, etc., that the data provider needs in order to meet the varied needs of specialist users for their data; HAPI could accommodate this now by having a combinatorial explosion of datasets, each labelled to indicate which combination of options is utilized, but this is unwieldy and confusing to both non-expert users and expert users (a hypothetical request syntax illustrating this appears just after this list)
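To make the third class concrete, one way such an option could be passed is as an extra parameter on the data request itself. The syntax below is purely a strawman: no `options` request parameter exists in the HAPI spec, and the `despike` option name, dataset id, and server URL are made up for illustration.

```
# hypothetical only: 'options' is not a HAPI request parameter
https://example.com/hapi/data?id=MYDATA&time.min=2019-06-01T00:00Z&time.max=2019-06-02T00:00Z&options=despike:false
```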

There are multiple benefits to supporting these kinds of options.

  1. instrument teams are more likely to use HAPI as their primary data access mechanism if it supports specialized access for experts; and since instrument team development money is how most tools get built, if HAPI is not useful to these teams, it will never get wide adoption
  2. large data centers are offering options like this already, so adding common binning / filtering options to HAPI standardizes what is already being done and lets data centers express more of their functionality through HAPI (and in a standard way)

For the algorithms that are to be universal across all HAPI servers, they should be kept as simple as possible and standardized, so that every server implementing them produces the same results.

The list of potential generic services is: binning, interpolation, and spike removal. For binning, the simplest possible method would be: given a start time and a bin width, accumulate the data points in each bin, and then divide by the number of points in each bin. There are options for how to handle empty bins: skip (don't include the bin in the output), use the fill value for that bin, or interpolate using one of a set of algorithms (we might want to specify a maximum time width for interpolation, above which FILL is inserted instead).
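As a concrete illustration of that simplest binning method, here is a minimal sketch; it is not part of any spec, and the function name, argument names, and fill handling are assumptions made for this example:

```python
import math

def bin_average(times, values, start, width, fill=math.nan):
    """Average `values` into uniform bins of `width` (same units as `times`).

    Empty bins receive `fill`; a real server would also offer the
    'skip' and 'interpolate' empty-bin behaviors described above.
    """
    n_bins = int((max(times) - start) // width) + 1
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for t, v in zip(times, values):
        i = int((t - start) // width)
        if 0 <= i < n_bins:  # ignore points before `start`
            sums[i] += v
            counts[i] += 1
    # bin centers and averages; empty bins get the fill value
    centers = [start + (k + 0.5) * width for k in range(n_bins)]
    averages = [s / c if c else fill for s, c in zip(sums, counts)]
    return centers, averages
```

For example, `bin_average([0, 30, 95], [1.0, 3.0, 5.0], start=0, width=60)` returns bin centers `[30.0, 90.0]` and averages `[2.0, 5.0]`.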

There are wording challenges here, since to some people interpolation does not necessarily mean overlaying data onto a regular grid. However, for our purposes, binning and interpolation do refer to a uniform grid, and re-sampling is used to indicate the capturing of points from one dataset at an arbitrary set of other time points that need not be uniformly spaced.
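Under that terminology, re-sampling could look like the following sketch, which also applies the maximum-time-width rule mentioned above; again, every name here (`resample`, `max_gap`, the fill convention) is illustrative, not a spec proposal:

```python
import math
from bisect import bisect_left

def resample(times, values, targets, max_gap, fill=math.nan):
    """Linearly interpolate (times, values) at arbitrary target times.

    `times` must be sorted. When the two bracketing source points are
    more than `max_gap` apart, `fill` is emitted instead of a value
    (the maximum-time-width rule described above).
    """
    out = []
    for t in targets:
        j = bisect_left(times, t)
        if j < len(times) and times[j] == t:
            out.append(values[j])          # exact hit, no interpolation
        elif j == 0 or j == len(times):
            out.append(fill)               # target outside the data range
        elif times[j] - times[j - 1] > max_gap:
            out.append(fill)               # gap too wide to bridge
        else:
            w = (t - times[j - 1]) / (times[j] - times[j - 1])
            out.append(values[j - 1] * (1 - w) + values[j] * w)
    return out
```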

The capabilities mechanism needs to be expanded so that server-wide and dataset-specific options can be described, allowing generic clients to detect that options exist and to present users with enough information to decide if and how to apply any of them.
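As a strawman for how such a description might look in an expanded `capabilities` (or per-dataset `info`) response, shown here as a Python dict so it can carry annotations; none of these keys (`processingOptions`, `scope`, `default`) exist in the HAPI spec today:

```python
# Strawman only: no part of this structure is in the current HAPI spec.
capabilities = {
    "HAPI": "3.0",
    "status": {"code": 1200, "message": "OK"},
    # hypothetical list of options a generic client could present to users
    "processingOptions": [
        {   # a generic, standardized option offered server-wide
            "name": "binning",
            "scope": "server",
            "parameters": {"binWidth": "ISO 8601 duration"},
        },
        {   # a dataset-specific option meant for specialists
            "name": "despike",
            "scope": "dataset",
            "default": True,
            "description": "remove instrumental spikes; disable to inspect raw data",
        },
    ],
}
```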

Instrument teams' tools usually include specialized flags that can be set when reading the data. Jeremy's example of "don't exclude instrumental spikes" is a great one, since the default behavior should be to remove instrumental spikes / glitches, but specialist users may want to see the spikes to make sure the spike-removal algorithm is working (and has not excluded any real data that happens to be jumpy!).

For HAPI, all datasets should be scientifically optimal given the default set of options. Options can introduce or relax restrictions, binning, interpolation, reductions, different calibrations, spike removal, etc., and including these options requires extra work from users to figure out whether they want the data modified in the ways the special options advertise.

rweigel commented 5 years ago

Related: #59

jvandegriff commented 4 years ago

Separate project - move to the wiki and consider as a separate effort. Could be used as a separate "in-between" server to do translation to other resolutions, etc.

jvandegriff commented 4 years ago

Moving out of Milestone 3.0, since this should first be tried out as an extension, to see if it can be captured in a generic way as something for all HAPI servers.