hapi-server / data-specification

HAPI Data Access Specification
https://hapi-server.org

Proposal for availability files #70

Open jbfaden opened 6 years ago

jbfaden commented 6 years ago

There was a brief discussion about how availability might be done with HAPI servers. The gist of the conclusion is that the info would contain a reference to another dataset which describes the availability of the first data set. I'll make a proposal for how this would be done beyond that. The info request for "http://hapi-server.org/hapi/info?id=0B000800408DD710" might return:

"x_availability":"availability/0B000800408DD710"

which would be a dataset id on the same server which should have four columns: time,endTime,code,message

where code would be either "200" or "204". Note there is no requirement of endTime, other than it would be later than time. 200 indicates data will be found in this interval, and 204 (empty response) indicates no data will be found in the interval. When only 200 is returned, one may assume that the opposite intervals do not have data. When only 204 is returned, one may assume that the opposite intervals do have data. I don't believe clients would necessarily be bound to being overly precise. For example, if 90% of a day contains data, and 10% is a typical missing-data rate, then the entire day could be included in the present list.
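A minimal sketch of how a client might read such a dataset, assuming CSV rows in the order time,endTime,code,message and times written as `YYYY-MM-DDTHH:MM:SSZ`. The function names, and the idea of taking the overall span from the dataset's start and stop dates, are assumptions for illustration and not part of the proposal:

```python
import csv
import io
from datetime import datetime


def parse_time(text):
    """Parse a time such as 2018-01-01T00:00:00Z into a naive UTC datetime."""
    return datetime.strptime(text, "%Y-%m-%dT%H:%M:%SZ")


def data_intervals(csv_text, span_start, span_stop):
    """Return (start, stop) pairs in which data is expected.

    Rows coded 200 are returned as-is. If every row is coded 204, the
    complement of the listed gaps within [span_start, span_stop] is returned.
    """
    rows = [
        (parse_time(r[0]), parse_time(r[1]), r[2].strip())
        for r in csv.reader(io.StringIO(csv_text))
        if r
    ]
    codes = {code for _, _, code in rows}
    if codes == {"204"}:
        # Only gaps are listed, so data is assumed everywhere else.
        result = []
        cursor = span_start
        for start, stop, _ in sorted(rows):
            if start > cursor:
                result.append((cursor, start))
            cursor = max(cursor, stop)
        if cursor < span_stop:
            result.append((cursor, span_stop))
        return result
    # Otherwise trust the rows that explicitly report data.
    return [(start, stop) for start, stop, code in sorted(rows) if code == "200"]
```

With only 204 rows the function returns the opposite intervals, which matches the "one may assume" rule above; a server that mixes both codes is read literally.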

Note: existing clients are supported because this is just another dataset.

I would also suggest that availability could refer to another server, like so:

"x_availability":"http://hapi-server.org/availability/hapi?id=0B000800408DD710"

because this would allow a server to be indexed externally.

berniegsfc commented 6 years ago

My current "data inventory" information is simply in the info response, like this example:

"x_spdf_inventory": {
    "intervals": [
        { "begin": "1996-03-20T00:00:06Z", "end": "1996-03-21T21:06:30Z" },
        { "begin": "1996-03-22T00:00:09Z", "end": "1996-03-27T21:08:49Z" },
        ...

But I could re-implement it as your

"x_availability":"https://cdaweb.gsfc.nasa.gov/dataviews/sp_phys/datasets/po_h0_hyd/inventory?format=hapi"

proposal. There are of course a lot of caveats that go with the information I'm returning.

berniegsfc commented 6 years ago

I guess my link to the whole info response got lost above. Here's another attempt https://www.dropbox.com/s/zf806oufxx6wqs5/po_h0_hyd_hapi_info.json?dl=0

jvandegriff commented 6 years ago

On the 2018-06-18 telecon, we decided to try this approach: add an optional capability called 'availability' using the capabilities endpoint.

Just referencing the bare endpoint should result in a list of dataset IDs for which availability info can be obtained. These IDs should match the IDs in the catalog endpoint.

If you give an ID to the availability endpoint, then it will return a list of time ranges in the following format:

column 1: start time
column 2: stop time
column 3: 0 for no data in this interval, 1 for data in this interval

Columns beyond this are optional and can contain other user-specific data, such as what fraction of the interval is filled with data, or a label for the time interval. It was decided not to attempt to regularize any of this, since it really opens up a can of worms trying to figure out how to specify event list type of info, and there are already standards for that.

The availability info format is going to be kept very simple, and HAPI-centric, and will not be made available as other event list formats. Converters to those formats would be simple, and could be included in clients.
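As a rough illustration of how small such a converter could be, here is a sketch that turns the three-column response into a list of dictionaries ready for JSON output. It assumes a CSV response; the output keys (`start`, `stop`, `hasData`, `extra`) are made up for this example and do not follow any particular event list standard:

```python
import csv
import io
import json


def availability_to_events(csv_text):
    """Convert start,stop,flag[,extras...] rows into a list of dicts."""
    events = []
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue  # skip blank lines
        events.append(
            {
                "start": row[0],
                "stop": row[1],
                "hasData": row[2].strip() == "1",
                # Optional server-specific columns are passed through untouched.
                "extra": row[3:],
            }
        )
    return events


sample = (
    "2018-01-01T00:00:00Z,2018-01-02T00:00:00Z,1,0.9\n"
    "2018-01-02T00:00:00Z,2018-01-03T00:00:00Z,0,0.0\n"
)
print(json.dumps(availability_to_events(sample), indent=2))
```

Because the extra columns are not regularized, the sketch only carries them along as strings and makes no attempt to interpret them.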

berniegsfc commented 6 years ago

I would prefer a JSON response like this:

$ curl -s "http://localhost:8084/WS/hapi/availability?id=AC_H3_MFI" | python -mjson.tool
{
    "HAPI": "2.1",
    "availability": [
        {
            "available": 1,
            "startDate": "1998-01-01T00:00:00Z",
            "stopDate": "2008-10-25T23:59:59Z"
        },
        {
            "available": 1,
            "startDate": "2009-01-01T00:00:00Z",
            "stopDate": "2018-03-28T23:59:59Z"
        }
    ],
    "creationDate": "2018-06-20T11:28:20.839Z",
    "status": {
        "code": 1200,
        "message": "OK"
    }
}
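A client could consume a response of this shape with very little code. The sketch below assumes exactly the field names shown in the example response; the function name is hypothetical and the endpoint itself is only a proposal:

```python
import json

# Example response text, copied from the comment above.
SAMPLE = """
{
    "HAPI": "2.1",
    "availability": [
        {"available": 1, "startDate": "1998-01-01T00:00:00Z", "stopDate": "2008-10-25T23:59:59Z"},
        {"available": 1, "startDate": "2009-01-01T00:00:00Z", "stopDate": "2018-03-28T23:59:59Z"}
    ],
    "creationDate": "2018-06-20T11:28:20.839Z",
    "status": {"code": 1200, "message": "OK"}
}
"""


def available_intervals(response_text):
    """Return (startDate, stopDate) pairs for entries flagged as available."""
    response = json.loads(response_text)
    status = response.get("status", {})
    if status.get("code") != 1200:
        raise ValueError(f"unexpected status: {status}")
    return [
        (entry["startDate"], entry["stopDate"])
        for entry in response["availability"]
        if entry["available"] == 1
    ]


for start, stop in available_intervals(SAMPLE):
    print(start, stop)
```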

rweigel commented 6 years ago

One issue to consider with the "0/1" to indicate availability is redundancy. We could communicate everything about availability with only two columns:

/hapi/availability -> list of data set IDs with availability information (same schema as /catalog; endpoint optional)

The minimal requirement for the endpoint would be to indicate intervals where a user can expect to get at least one data record if they used the listed start/stops in a data request.

/hapi/availability?id=ABC

start1,stop1
start2,stop2
start3,stop3

A potential problem with allowing additional columns that are not regularized is how we would communicate what the columns mean. We would need something like

/hapi/availability/info

which is not ideal.
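For the two-column form, a client-side check of whether a planned data request can expect at least one record might look like the sketch below. It assumes one start,stop pair per line, times written as `YYYY-MM-DDTHH:MM:SSZ`, and half-open intervals (start included, stop excluded); none of these details were settled in the discussion, and the function names are illustrative:

```python
from datetime import datetime


def parse_time(text):
    """Parse a time such as 2018-01-01T00:00:00Z into a naive UTC datetime."""
    return datetime.strptime(text.strip(), "%Y-%m-%dT%H:%M:%SZ")


def parse_intervals(listing):
    """Parse lines of 'start,stop' into a list of (start, stop) datetimes."""
    intervals = []
    for line in listing.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        start, stop = line.split(",")
        intervals.append((parse_time(start), parse_time(stop)))
    return intervals


def expects_data(intervals, request_start, request_stop):
    """True if the request overlaps any listed interval (half-open overlap test)."""
    return any(
        start < request_stop and request_start < stop for start, stop in intervals
    )
```

A request that falls entirely inside a gap between two listed intervals returns False, which is the only thing the two-column form needs to communicate.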

jvandegriff commented 4 years ago

I think I favor a simpler approach here like Bob is suggesting. Just a list of intervals where data is present, so-called Good Time Intervals (GTIs).

If we thought it was useful, we could support with an optional third column indicating the fraction of the interval that is filled, but there are lots of potential ways you could calculate this, so we would have to be specific about this.
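To make the ambiguity concrete, here is one of the many possible definitions, sketched under the assumption that the dataset has a known nominal cadence: the fraction is the record count times the cadence, divided by the interval length, capped at 1. This is an illustration only, not a proposed definition:

```python
from datetime import datetime, timedelta


def fill_fraction(start, stop, record_count, cadence):
    """Fraction of [start, stop) covered by record_count records at a fixed cadence."""
    duration = stop - start
    if duration <= timedelta(0):
        raise ValueError("stop must be later than start")
    # timedelta / timedelta yields a float; cap at 1.0 for oversampled intervals.
    return min(1.0, record_count * cadence / duration)


# Half a day of 1-second records in a one-day interval.
fraction = fill_fraction(
    datetime(2018, 1, 1), datetime(2018, 1, 2), 43200, timedelta(seconds=1)
)
print(fraction)
```

Other definitions (for example, the fraction of fixed-size bins containing at least one record) would give different numbers for the same data, which is why the calculation would have to be specified.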

rweigel commented 1 year ago

I think this sort of service is needed. However, I'd (very strongly) prefer that it was a different API because:

  1. It is much simpler than HAPI and may get lost. It can also get quite complex (allowing constraint parameters).
  2. I want people to associate HAPI with being able to get at the numbers with no additional work.

So I propose that we develop a new API standard named FLAPI (File Listing API). It would use parts of the HAPI metadata specification and some of the existing HAPI software could be used.

rweigel commented 1 year ago

See also #116

jvandegriff commented 1 year ago

[adding email discussion, since it's relevant.]

Hi Bob and Jon,

 Could you have a look at 

https://cottagesystems.com/server/esac/hapi/info?id=C4_CP_CIS-CODIF_HS_O1_PEF/availability

and let me know if you find this data scheme agreeable? This is what I was thinking we should use for availability. Autoplot would detect that it starts with two isotimes and display the data as an events bar. (I feel fairly strongly that this should be the scheme. While it's tempting to do something like time and then length for the second column, having two times makes it very clear what the intended use is.) Autoplot doesn't do anything special with this yet, but I plan on adding it in. (See the image below, which is just the number of records vs start time.)

Jeremy

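The detection described in this email could be as simple as inspecting the parameter types in the info response. The sketch below assumes the info response lists its parameters with a "type" field, as HAPI info responses do; the parameter names in the sample and the function name are hypothetical, and this is not Autoplot's actual code:

```python
def looks_like_event_list(info):
    """True if the first two parameters of an info response are both isotime."""
    parameters = info.get("parameters", [])
    return len(parameters) >= 2 and all(
        p.get("type") == "isotime" for p in parameters[:2]
    )


# Hypothetical info response for an availability dataset.
availability_info = {
    "parameters": [
        {"name": "StartTime", "type": "isotime", "length": 24},
        {"name": "StopTime", "type": "isotime", "length": 24},
        {"name": "recordCount", "type": "integer"},
    ]
}

# Hypothetical info response for an ordinary time series.
ordinary_info = {
    "parameters": [
        {"name": "Time", "type": "isotime", "length": 24},
        {"name": "Bx", "type": "double"},
    ]
}

print(looks_like_event_list(availability_info))
print(looks_like_event_list(ordinary_info))
```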
jvandegriff commented 1 year ago

[Bob's response.] We should probably discuss this on the telecon; this is going to take some thought to get right. I think we should come up with a schema for this as you suggest, so that people use the same parameter names and syntax for the id.

You may want to look at Bernie's CDAS rest server, which has /inventory and /orig_data endpoints.