hapi-server / data-specification

HAPI Data Access Specification
https://hapi-server.org

can HAPI also serve a time series of FITS images or other remote sensing data? #116

Closed jvandegriff closed 1 year ago

jvandegriff commented 3 years ago

This came up in the ISWAT meeting 2021-03-17 so I'll add it here to trigger a discussion at a future meeting. People were asking if HAPI could serve a series of image data (e.g., FITS images of the Sun) in a similar way as is done for other multi-dimensional time series data.

The VSO already supports image search and retrieval, and HAPI would just be the retrieval part. The image data is indexed by many other things besides time, so the concept of a dataset would need to be clear about which images are included.

Seems like you would need a separate interface to find the time ranges of the images that meet your spatial or other search criteria. We would also need to preserve all the FITS keywords.

Related to preserving FITS keywords, there was also a question about how to have HAPI still transmit ISTP metadata, much of which is now lost when going through HAPI. The model-to-data-comparison folks tend to use that ISTP content. So I'll create a separate ticket to talk about preserving a dataset's richer metadata.

jvandegriff commented 3 years ago

Spec mods may result, but try first with existing capabilities.

sandyfreelance commented 2 years ago

I'd like to discuss serving images through HAPI. The justification is that images are file-based, discrete, encapsulated entities that are already efficiently packaged. But the risk is non-image users (CDF?) might lazily decide to serve all their (non-image) data this way (which breaks efficiency; one plus of HAPI is users can get their time series data over their chosen interval rather than being restricted to 'data within a file').

Jon had the good idea of making a separate spec for images, e.g. HAPI+. It would still use the core of HAPI and be compatible with HAPI v3: given a time interval and what data item(s) you want, returns them. In this case it allows for the additional data item of "image_url", a direct link to the actual file. Optionally, metadata could also be allowed as data items (keywords extracted, the file type available, etc).

One use case would be GUVI (http://guvitimed.jhuapl.edu/data_products), where I'm making a HAPI server for the spectrograph data, but there's also associated UV images that users want.

Thoughts?

jbfaden commented 2 years ago

My thought is that URLs pointing to image files could be served as string types. This would require that the maximum length of all URLs is known, and software like Autoplot is going to start looking for things that look like URLs, which isn't necessarily a good thing. Perhaps a URL type could be added, and then an optional content type could accompany the field description in the info response.
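A minimal sketch of this idea in Python (the parameter description below is hypothetical, and `x_contentType` is an imagined extension keyword, not part of the HAPI spec): a URL served as an ordinary string type with a declared maximum length.

```python
# Hypothetical HAPI parameter description: a URL served as a string type.
# The spec requires a max "length" for strings; "x_contentType" is an
# imagined x_ extension keyword, not something defined by HAPI.
param = {
    "name": "image_url",
    "type": "string",
    "length": 64,
    "fill": None,
    "description": "URL of the FITS image for this time step",
    "x_contentType": "image/fits",
}

def fits_declared_length(value, parameter):
    """Check that a served string value respects the declared max length."""
    return len(value) <= parameter["length"]

print(fits_declared_length("https://example.org/img/20220101.fits", param))
```

The length check is why Jeremy notes the maximum URL length must be known up front: binary output and C clients need fixed-size fields.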

rweigel commented 2 years ago

We had spoken before about the option of allowing users to determine what files a HAPI response used. We should find that discussion as it seems relevant here. My recollection was that the list of files would be a dataset. I suppose we could also allow a more efficient representation as a URI template.

It is somewhat trivial to serve an image in HAPI. At each time step, serve a size = [10,10,4] dataset (R, G, B, A channels), where the values are integers in the range 0 through 255. The issue really becomes how to communicate to a client what the numbers mean. At present, in the Python and MATLAB clients, I assume that if size = [N] and N >= 10, the parameter should be represented as a spectrogram; if bins info is available, it is used, otherwise the y-axis of the spectrogram is "column number". If N < 10, N time series are plotted, with N legend labels that include the bins info as strings if given.

If size = [N,M], I create M plots following the same logic.
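The numeric-image idea above can be sketched as follows, assuming a 10x10 image with 4 channels (R, G, B, A) per time step; the pixel values here are synthetic.

```python
import numpy as np

# Each HAPI record (one per time step) carries a flattened 10x10x4 RGBA
# image as integers 0-255. Synthetic values stand in for real data.
n_times, h, w, channels = 3, 10, 10, 4
flat = np.random.randint(0, 256, size=(n_times, h * w * channels), dtype=np.uint8)

# A client reshapes each record back into an image using the "size"
# metadata from the info response (assumed here to be [10, 10, 4]).
images = flat.reshape(n_times, h, w, channels)
print(images.shape)  # (3, 10, 10, 4)
```

This works mechanically; the open question in the comment is how a client learns that these particular numbers are pixels rather than a spectrogram.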

We should really understand the API used by others for images. My understanding is that generally a list of files is served given a query, and then there are libraries that do all of the downloading. This seems to work well, save for the one complaint that I've heard, which is that often the file contains far more data than the user wants. Jie Zhang at GMU often works with solar image data and just buys more disks; I did the same when I had students working on a solar product.

HAPI was motivated by the fact that many time series data providers had APIs that were similar in functionality but different in their specification. If this is also the case with data providers who serve images, then one can make a good argument for the use of HAPI. However, I don't know if that is the case. It seems likely that FITS + a standard for the representation of the file list is sufficient.


jvandegriff commented 2 years ago

Talked about this on 2021-09-20 telecon:

Sandy and Jeremy talked about adding a new parameter type specifically for URLs, so that HAPI returns a list of these URLs, each one a reference to an image file. Unlike the arbitrary nature of file boundaries for time series data, an image file is a well-understood, accepted, and usable unit of data, and thus a reference to an image file is indeed useful.

There are similar types of lists. Bobby mentioned that event lists are similar in that the parameters are often not numeric (flare type or CME info, or geomagnetic storm type) or if numeric, still relatively brief. Bob reminded us that the file-listing capability we talked about adding to HAPI would be similar (for a given time range, what files - listed at URLs - are available).

There is a danger of making the HAPI spec weaker if we allow this kind of capability, since if listing the files was all a data provider did for their time series data holdings, they could claim HAPI compliance but then not actually stream any data. Such a HAPI server would not be very functional.

Instead of calling the new parameter type a reference (or URL), we could call it "image-reference" to emphasize that it should only be used for images. This would not allow it to be used for file listings or event lists, though. We could also use a different endpoint besides the data endpoint, such as a references endpoint. And then maybe the spec allows URLs to appear only in the references endpoint? Well, that might get confusing. And if the references data could have other numeric columns (FITS keywords, satellite ephemeris data, image pixel mappings to lat/lon or RA/DEC), then the distinction between data and references becomes not so distinct after all.

Of course, we could just have the image data be numerically provided, as multi-dimensional arrays. This is already being done at the CDAWeb HAPI server for some GUVI data, which has image content as multi-dimensional values in CDF.

One reason HAPI is not perfect for this is that a query for image data usually involves more than just a time range constraint. We might want to support more complex queries by allowing extensions to the query parameters. Currently, this is explicitly not allowed, but it keeps coming up, and we talked about it today with regards to SuperMAG. Additional query parameters could be described in the capabilities endpoint. (see telecon notes for 2021-09-20).

jvandegriff commented 2 years ago

Meant to also mention that we could have multiple ways for image data to be conveyed - both the reference-based approach as well as the numeric approach.

jbfaden commented 2 years ago

There are interesting analogies all over the place here. When forming an image as an MxNx3 array, there is a coordinate system that corresponds to the "3" index. RGB is a coordinate system, as is HSV or GSE. All that a coordinate system does is assert that these numbers have a special relation to one another and that certain operations are valid.

jvandegriff commented 1 year ago

More discussion about this on Sep 19 telecon.

Points discussed:

  1. talked again about adding a URL type
  2. new idea: include a MIME type for any URL content so that clients could know what is in the granules.
  3. there is still concern about diluting HAPI so that someone claims HAPI compliance but only lists files (but eventually we do need to trust providers with this, maybe once HAPI is well established as more than a listing service)
  4. one way to mitigate this weakening would be to have a separate endpoint, filelisting, which emphasizes that this is not the usual HAPI streaming data, just a listing of files (with potential MIME types to help clients know how to interpret the contents)
  5. but this is really not that different from a regular data endpoint, in that it just has a string (or URL) column, with possible other numeric columns (bits of FITS headers extracted for convenience). In fact, this can be fully done already, even including the MIME type as an x_ extension.
  6. Jeremy mentioned that for the spec and for servers, this requires very little extra code, since HAPI can do it already. Bob pointed out that the complexity comes in describing what is allowed.
  7. the real complexity comes in the clients, which would then have to be able to parse the contents of arbitrary files (unless we restricted the MIME types?). The main intent is to serve images, so we could just allow image types (but then listing files is actually also useful)
  8. Bob mentioned that URI templates can remove the need for getting URLs for a time range. So HAPI becomes a kind of complex way to make listings - but in some cases it might be worth it, if for example the template scheme is not sufficient.
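Bob's URI-template point (item 8) can be sketched as follows; the template, host, and daily cadence are made up for illustration:

```python
from datetime import datetime, timedelta

# A URI template plus a known cadence can generate file URLs for a time
# range without any listing service. strftime passes literal characters
# through, so the whole URL can serve as the template.
template = "http://data.org/p1c1/%Y/%Y_%m_%d.png"

def urls_for_range(start, stop, step=timedelta(days=1)):
    t = start
    while t < stop:
        yield t.strftime(template)
        t += step

urls = list(urls_for_range(datetime(2022, 12, 10), datetime(2022, 12, 13)))
print(urls[0])  # http://data.org/p1c1/2022/2022_12_10.png
```

When filenames follow a regular scheme like this, a HAPI listing adds little; the listing approach earns its keep when names are irregular or carry per-file metadata.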

The bottom line is that there is a lot of complexity here, and it might be managed better as a fully separate access interface altogether, or at least explored that way first and then we could see about pulling it into HAPI.

jvandegriff commented 1 year ago

This is related to a file finding service, and we talked about potentially having a file finding capability as a separate service under a different prefix, such as server.org/hapi-ff/ or server.org/hapi-files or server.org/hapi-filelisting

A file finder service like this is actually fairly common. (SPDF has it, CCMC needs it for model output files).

For the image data analysis aspect, a key perspective to engage would be people at the VSO (Ireland, etc.): see what's missing from the VSO interface and the Fido capability. (Maybe it's just not generic enough for multiple domains and servers.)

One issue with serving images is that requests rarely just use time-based constraints because image data is indexed based on multiple other things (wavelength, lat/lon, RA/DEC, altitude, etc).

jvandegriff commented 1 year ago

I think we really do need a listing endpoint for people who want to list URLs, including images. This keeps lists of files separate from HAPI data, and it lets people use the simple HAPI machinery for related tasks, but keeps HAPI data still having a strong, enforced connection to streamed data, i.e., the actual numeric content.

I talked to someone today at the AGU from the Canadian all-sky monitor program. They have 10 different "missions" (they call them programs), each with a set of platforms (physical platforms with all-sky-imagers), and the cameras on each platform produce 1 or more datasets per camera. They really want to make the auroral images available in HAPI. Each image may have different characteristics that their web site exposes via a search engine:

- background subtracted? (yes/no)
- does the image contain feature X (such as pulsating aurora)?
- does the image contain feature Y?
- are clouds in the image?

There are images every 3 seconds, but they also have a 1 per minute version of the data (just the first image taken that minute). There is also a burst mode with faster cadence.

Here's what I told him he could do to make this data available with HAPI:

Each program is a separate HAPI server. (The THEMIS set of all-sky-imagers would be one set - there are a dozen or so platforms.) Then each HAPI server has a bunch of datasets available. Here's the THEMIS one, from all-sky-monitor/THEMIS/hapi/catalog:

p1c1_PT3S (short for platform1camera1 - this is the RGB, visible-light camera)
p1c1_PT1M (the one-per-minute version)
p1c1_burst
p1c2_PT3S (this is an IR camera)
p1c2_PT1M
p1c2_burst
p1c3
p2c1 (now on platform 2)
p2c2
p2c3
p2c4

Something in the catalog has to indicate that these datasets are only available for use with the file listing service. Or maybe every dataset now indicates if it supports data queries or listing queries or both.

Then all the additional characteristics for each images (bg subtr. level, hasClouds?, hasPulsations?, etc) are other columns in the resulting listing, so that it is up to clients to filter the results, based on their interpretation of the column meanings. This is actually rather like what we would expect of numerical data columns - it's up to clients to know what those mean and do any filtering. For FITS images, these columns would represent FITS keywords, each row having that image's values for the set of keywords.

The fact that all images of a dataset are returned (rather than having the server implement filtering by keyword) solves our problem of how to represent image queries by pushing that to the client, and in terms of data volume, we are talking about listings, which are low data volume, so returning a longer listing (that needs filtering on the client) is still going to be fast. Communities can develop tools for doing that filtering, and that's on them but at least everyone (any generic HAPI server) can get the listing in a standard way, and tools like Autoplot could show the description of each keyword, and potentially develop user-driven, but totally generic filtering capability for the rows in the listing.

all-sky-monitor/THEMIS/hapi/listing?id=p1c1_burst&start=2022-12-10T00Z&stop=2022-12-16T00Z and the table returned is then:

time, URL, isCloudy, hasPulsating, backgroundSubtrLevel,
2022-12-10T00:00:00Z, http://data.org/p1c1/2022/2022_12_10.png,true,true,4
2022-12-11T00:00:00Z, http://data.org/p1c1/2022/2022_12_10.png,true,true,4
2022-12-12T00:00:00Z, http://data.org/p1c1/2022/2022_12_10.png,false,true,4
2022-12-13T00:00:00Z, http://data.org/p1c1/2022/2022_12_10.png,false,true,4
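The client-side filtering this implies can be sketched in a few lines of Python; the CSV text below just mirrors the example listing above (a real client would fetch it from the server):

```python
import csv
import io

# Example listing response, copied from the table above. In practice a
# client would fetch this CSV from the listing/data endpoint.
listing = """\
2022-12-10T00:00:00Z,http://data.org/p1c1/2022/2022_12_10.png,true,true,4
2022-12-11T00:00:00Z,http://data.org/p1c1/2022/2022_12_10.png,true,true,4
2022-12-12T00:00:00Z,http://data.org/p1c1/2022/2022_12_10.png,false,true,4
2022-12-13T00:00:00Z,http://data.org/p1c1/2022/2022_12_10.png,false,true,4
"""
fields = ["time", "url", "isCloudy", "hasPulsating", "backgroundSubtrLevel"]
rows = [dict(zip(fields, r)) for r in csv.reader(io.StringIO(listing))]

# Filter on the extra columns; only the URLs that pass would be downloaded.
clear = [r["url"] for r in rows if r["isCloudy"] == "false"]
print(len(clear))  # 2
```

The listing itself is tiny compared to the images, which is the performance argument in the comment: filter first, then fetch only what passed.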
VoyagerPWS commented 1 year ago

But you have a time series of 3D data. Surely you want to be able to stream that as well, right?

(Larry Granroth)

jbfaden commented 1 year ago

Yes, you can already stream this using a "size" like [18,20] for a time series of multi-dimensional arrays.

jvandegriff commented 1 year ago

From 2022-12-19 telecon: before solving this on our own, we really need input from more image-handling experts at VSO, the Canadian ASI group, NOAA, and LASP too. They will understand the problem space. We need to form a group to tackle this, and soon!

jvandegriff commented 1 year ago

We don't need to add another endpoint, we can just add a new data type for URL (alongside time, integer, double, string) and then we can serve actionable images as URLs.

A problem is that image analysis clients often want to offer more than just time queries for images, and HAPI only has start and end time for queries. Here's one approach that supports richer queries by relying on the client to do more filtering:

Image datasets are still only indexed by time in HAPI. But each record contains not just the URL to an image, but other columns with the relevant metadata items (wavelength, cadence, lat/lon, features in the image, etc). Then the client can offer to filter the list of images by those keywords, and then only retrieve actual images for the ones that pass the filters. The listing of the URLs + keywords is not very much data, especially compared to the image data content, so it would still be performant.

This requires a different kind of client code to handle the images, but it strengthens the HAPI brand by allowing both kinds of image serving.

It is possibly also more compatible with other federated query mechanisms like the fido mechanism at the VSO and the EPN-TAP system that ESA uses. This needs to be explored more.

File listings would then also be supported in the same way.

Maybe we need a dataset flavor to indicate if a HAPI dataset is intended to be used as a fully numeric capture of time series numbers (traditional HAPI), or HAPI for images, or HAPI for listings (the latter two being similar). So maybe just indicate in the info response if the content is primarily numeric or fileListing (or granules, or just files).

rweigel commented 1 year ago

I see three use cases

  1. Given a time range, list all files from which data for a dataset and/or parameter was extracted. This would be useful mostly for debugging or for a user to be able to determine what versions of files were used to form a response at any point in time. I see this as being something that we suggest as a possibility to server developers who want to communicate provenance.

  2. Given a dataset, parameter, and time range, dump a list of files needed for the user to deal with (assuming the contents are not available from HAPI). In general, HAPI clients will not be able to fill their array and create a plot programmatically unless someone writes generic readers for all possible scientific file formats, including ASCII (and the schemas used within). This definitely is in conflict with "given a dataset, parameter, and time range, fill my array and provide metadata so a plot can be programmatically created". In addition, there is a risk that people will say "I serve my data via HAPI" when they really only provide directory listings.

I think this is a much-needed capability, and it is integral to Fido, so we should collaborate with them. We've always said that we should punt search tasks to SPASE, and I think part of the reason we were successful is that we've avoided the temptation to solve search problems. Search is complicated. SPASE is 20 years old and ...

  3. Given a dataset, parameter, and time range, list the metadata in each file. This could be quite useful for searching for what files you need to visit, and I can see usefulness in some of the metadata being automatically plotted. But then when you need to read the contents of the file, you can't use HAPI. And would we want to develop a specification for the metadata that is now being served as data? For example, if the metadata is a duration, does it have to be ISO 8601? If not, each client will need to handle all of the ways people express durations in order for queries to work.

I'd rather not stretch the simple and focused objective of "given a dataset, parameter, and time range, fill my array and provide metadata so a plot can be programmatically created".

I'd like to see all of this done, but I don't want the HAPI spec to get bloated. I see risk in trying to solve too many problems with one specification.

jbfaden commented 1 year ago

There's a lot of utility in being able to assert that the item is a URL. You are asserting that the string can be interpreted as a URL and can be downloaded. I think it's something like having a time type. We could have said that times could just be strings, but there's value in asserting a scheme, that it is a special type of string that can be parsed as a time. It's actionable, and likewise saying more specifically that the string is a URL is also actionable. No one is saying that HAPI would have to download and parse the URL target; it's just a way of saying that the target exists. Clients can support this with a trivial change to their code--just interpret URL fields like you would a string.

This also allows file finding to be solved simply as a particular scheme in HAPI. Like we know that three columns of doubles can be a representation of a vector, we can also say startTime, stopTime, and URL is a file finding service. Further, since we're still in HAPI, we can attach all sorts of metadata to the images. I keep meaning to set up an example where I serve the URL for images (https://cottagesystems.com/data/hapi/pics/) along with image sizes and other metadata.
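A sketch of that convention check, using a hypothetical info fragment (parameter names and lengths are illustrative, not spec-defined):

```python
# By convention, a dataset whose parameters are Time (acting as startTime),
# stopTime, and a URL could be recognized as a file-finding service.
# This info fragment is illustrative only.
info = {
    "parameters": [
        {"name": "Time", "type": "isotime", "length": 24, "fill": None},
        {"name": "stopTime", "type": "isotime", "length": 24, "fill": None},
        {"name": "url", "type": "string", "length": 80, "fill": None},
    ]
}

def looks_like_file_listing(info):
    """Heuristic: Time + stopTime + url marks a file-listing dataset."""
    names = {p["name"].lower() for p in info["parameters"]}
    return "url" in names and {"time", "stoptime"} <= names

print(looks_like_file_listing(info))  # True
```

This mirrors the vector analogy in the comment: the scheme lives in shared conventions over ordinary HAPI structures, not in new machinery.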

eelcodoornbos commented 1 year ago

One advantage of the current HAPI is that it can be used to get very fast access to timeseries data, which is why it is so useful for building an interactive (web-based) plotting tool. I think due to the multi-dimensional nature, the use cases described above would not necessarily have this advantage anymore. And that's ok, because there are other uses as well. But I would still really like to make use case #2 work with the interactive timeline viewer. The plan to do so is that the files pointed at are reasonably sized (not more than 1-2 megapixels or so) JPEGs or PNGs. I can imagine that a data provider would want to provide such quick-look image files anyway, for fast browsing by users, in addition to more detailed/precise FITS or NetCDF files for more detailed scientific analyses. These are the assumptions I'm working with at the moment for further developing the timeline viewer tool.

I also agree that there is value in specifying that a string is a URL in an info request, just like the time type. I’m starting to use that info to have the timeline viewer populate drop-down menus for dataset and parameter selection. In addition, it would be very useful for the viewer client to get the file type (MIME type) and pixel resolution of the files pointed to in the URL from the info.

In the timeline viewer, I'm also making a distinction between "images per epoch" and "images per timespan / between two epochs". The first (per epoch) fits in well with use case #2 and can even already be done to some extent making use of the current HAPI spec using URLs in a string parameter. The second (timespan images) would need additional thinking for a good spec/implementation. The images per epoch, such as solar imagery and output of models plotted on a map, could be shown as image sequences (movies) in the viewer app, while images per timespan, such as spectrograms and keograms, will get squashed or stretched horizontally when the user zooms in or out on the timeline.

sandyfreelance commented 1 year ago

Hi all,

I like the idea of starting with requirements and implementations. I have two user stories driving my wish for adoption. Might be useful if someone wrote up those HAPI-variant use cases as well.

** Sandy case: For my EUV-ML dataset, to fetch lists of FITS files in a given timespan I would like either or both cases of: a) a query for a time range returns the CSV of URIs (s3://)

b) a query for a time range returns the CSV of URIs (s3://) and additional data parameters (car_lat, car_lon) that are in the dataset, so I can do later subsetting on my own

It would be nice if there was a catalog or info parameter that enumerated which file type (MIME type) it was in an unambiguous manner, so I didn't have to parse the file stem for .fts/.FTS/.fits/.Fits/.FITS/etc.

So the MIME type is in the catalog or info, not in the CSV itself; the returned CSV data is just:

2018-01-19T02:22:02Z,26.1,15.0,s3://testdir/test25.cdf
2018-01-19T02:23:11Z,25.5,15.0,s3://testdir/test23.cdf
2018-01-19T02:24:33Z,24.4,15.0,s3://testdir/test21.cdf

**

Sandy case 2: HelioCloud repository to fetch lists of files for a time range

Same as above but with some more info:

# starttime, key, filesize, checksum, wavelength, carr_lon, carr_lat
'2010-05-08T12:05:30.000Z','s3://edu-apl-helio-public/euvml/stereo/a/195/20100508_120530_n4euA.fts','0','246000','195','20.4','30.0'
'2010-05-08T12:06:15.000Z','s3://edu-apl-helio-public/euvml/stereo/a/195/20100508_120615_n4euA.fts','1','246000','195','21.8','30.0'
'2010-05-08T12:10:30.000Z','s3://edu-apl-helio-public/euvml/stereo/a/195/20100508_121030_n4euA.fts','0','246000','195','22.4','30.0'

(Overzealous use of 'quotes' can be ignored; not crucial.)

** Eelco timeline viewer case: returns JPEG or PNG quicklook files plus MIME type and pixel resolution, plus a link to the full-sized file.

** Jon’s example (from the ticket): all-sky-monitor/THEMIS/hap/listing?id=p1c1_burst&start=2022-12-10T00Z&stop=2022-12-16T00Z

#time, URL, isCloudy, hasPulsating, backgroundSubtrLevel,
2022-12-10T00:00:00Z, http://data.org/p1c1/2022/2022_12_10.png,true,true,4
2022-12-11T00:00:00Z, http://data.org/p1c1/2022/2022_12_10.png,true,true,4
2022-12-12T00:00:00Z, http://data.org/p1c1/2022/2022_12_10.png,false,true,4
2022-12-13T00:00:00Z, http://data.org/p1c1/2022/2022_12_10.png,false,true,4

Others?

rweigel commented 1 year ago

Don't know why/how I closed this. I think I was writing something, the dog stepped on my keyboard, and then I got up and forgot what I was doing.

jvandegriff commented 1 year ago

Figured it was something like that! :)

jvandegriff commented 1 year ago

We have a meeting today to talk about this, and we should start with making sure we understand all the use cases that have been presented. I think I understand most of them, but would like to be sure.

Bob's use case 1 seems to be about capturing or listing the files used to fulfill a HAPI request. You could do that with existing HAPI features only. If a header is returned with the data (this is not the default, but it can be requested), the info portion in front of the data could include a separate x_constituentFiles keyword with a list of the files used to fulfill the request. If we thought it useful enough, we could also change the spec to make constituentFiles an optional keyword (that users would only see in the header if it was prepended to a data response).

The second use case is the one for serving images: list files associated with a time range. This is the critical one that I think we do need to support somehow.

The third use case seems too far afield - managing and listing just metadata is best handled by a different search mechanism. Note that there are few SPASE-based search engines now, mostly just the Heliophysics Data Portal. But the point is that SPASE or other search options allow you to find the dataset names and potentially time ranges of interest, and then HAPI picks up there to get you the content.

The fact is that HAPI is a pretty generic way to serve lists of things, so the fact that clients so far are focused on absorbing time series data for plotting (or other analysis) is just because that's what is important to us. But HAPI can list events or images, as Eelco has shown. And NOAA would use HAPI if it served images, and the Canadian all-sky observing program (lots of cameras at different locations) has the same need.

We can put the image serving capability in three places:

  1. just endorse its use within regular HAPI; add a URL (and a URI?) type, and add an optional MIME type for URL and URI parameters so you don't have to guess at the content.
  2. add a new endpoint, filelisting, for lists of files where the large digital data is not the product, but the product is in the files being listed; columns other than the URL (or URI) can be listed to allow client filtering based on values or ranges of those other parameters; we would likely need to indicate in the catalog if a dataset was meant for the data or the filelisting endpoint.
  3. create a separate spec for just listing files or images - this would involve replicating a lot of HAPI info - seems too complicated.

I favor the first approach - let HAPI be used for what people want to use it for. Provide guidelines for how to list content that is not raw, voluminous data. The listing endpoint is not as good an approach because it will start to seem like an arbitrary split: what some people view as metadata, others view as data (the wavelength of an image set over time can be plotted, so is that data or metadata?).

sandyfreelance commented 1 year ago

On implementation, for the simple case of allowing HAPI to return URIs as a 'URI' datatype within a normal HAPI query (no plotting/ingest capability):

in the info/*.json, add

mimetype: string (optional). The MIME type; only applies if 'uri' is one of the datatypes in the parameter set.

for Parameter, add 'uri' as a type to the current list of string/double/int/isotime, as a variable-length string. (This means it cannot be served with 'binary' output unless an optional 'length' is given, but 'length' is not required for 'uri' by default.)

e.g.

{
  "name": "s3key",
  "type": "uri",
  "fill": null,
  "description": "s3 location of this EUV file"
}
rweigel commented 1 year ago

For review:

https://docs.sunpy.org/en/stable/guide/acquiring_data/fido.html#fido-guide

rweigel commented 1 year ago

Also, https://www.lmsal.com/heksearch/

jbfaden commented 1 year ago

I get nervous with the "Variable length string" because that is a break from the HAPI spec. Note that even for CSV, the strings must not be longer than what the info specifies. This is because you might be reading data in C where you need to allocate a space to put the data.

jbfaden commented 1 year ago

(And this is where I see an advantage to breaking with the spec, so that only CSV is returned and URL/URIs don't need to have a length limit. But that said I think for now we should just fit the use case into HAPI with this relatively small and useful addition.)

jvandegriff commented 1 year ago

For Sandy's use case of offering Amazon data in S3 buckets with URIs, he was asking for additional metadata about the AWS-specific aspects. That could be included inside an additionalMetadata block:


"additionalMetadata": [
    { "name": "AWSInfo",
      "content": { "awsRegion": "us-east", "checksumAlgorithm": "SHA-256" },
      "aboutURL": "https://heliocloud.org/aws-metadata/about/"
    }
]

All the file-specific attributes can just be parameters on the dataset that has the URIs.

And so we need a URI type as well as URL.

jbfaden commented 1 year ago

So can any Amazon app running on us-east see this bucket, if it is public? Is an s3 bucket more analogous to a URL or a file?

jvandegriff commented 1 year ago

We will use this google doc for today's discussion: https://docs.google.com/document/d/1zK3dvtfM24NUTN5ru8gfctN2t4VsXa5X4ZbTIoJQbOI/edit?usp=sharing

jvandegriff commented 1 year ago

Here is what we decided at the Feb 15 discussion:

To support images, we will allow a new data type for content in a dataset. In addition to integer, double, string, and time, we will add a new data parameter type to represent URI values, and this type will have an optional flag to indicate if it is encoded or not (the default being not encoded). This type is effectively a constrained kind of string type, and hence the max length must be specified, to preserve readability in binary (where padding will be needed to fill out blanks up to the max length).

We will also allow for the optional association of a media-type with a URI parameter. (This is basically the same as the older concept of MIME type.) The media-type allows servers to indicate what the content is behind the URI, so clients can know that without having to retrieve any content (for URLs, the media-type or MIME type is usually in the header, but you have to make a request to get that info). Note that the media-type is optional, because some URI parameters may point to multiple types (not ideal, but someone's probably done it), or it may be specialty content for which there is no standard media-type.

We will not add a new endpoint (i.e., no file listing endpoint), which means the way HAPI gets used is being opened up to be a time series listing mechanism for whatever the content is behind the URIs being listed. If a URI represents some kind of "thingyness", then generic HAPI clients should not be expected to take action or know what to do with the things. Specialized HAPI clients are free to take advantage of the "thingyness" and do something with it.

Of concern is the possibility that some server implementors with time series data will now simply make file listings available, rather than making data content (from within the files) available. This is strongly discouraged. Listing files in addition to having the data is fine, and probably a good idea.

Implementors are encouraged to make the URIs point to something public and accessible that could be expected to be useful. Examples include: image URLs, document DOIs, Amazon Web Services S3 URIs.

There are still some aspects of serving images that this solution does not address. Since HAPI only allows queries based on time, all other image metadata would likely need to be included as other columns after the URI. Any kind of regularization for what to put in those columns is outside the HAPI spec - it would belong in some kind of HAPI schema (a way to specify the structure and meaning of a HAPI response).

Also not addressed is the need to communicate the time duration for the URI content. Images for example, may represent a kind of snapshot (just a start time, probably with a very small integration time), or an extended time range, like in a PNG file representing a one-day summary plot. Also, an event list may contain events that are either a point in time (like a flare, or a shock arrival), or have a duration (like a magnetic storm, or a CME passage). This is definitely relevant now that URIs are allowed, but will be dealt with in a separate issue.

jvandegriff commented 1 year ago

All current parameter types are lower case. I recommend we also have the URI type be lower case, so the types will be: time, double, integer, string, uri

jvandegriff commented 1 year ago

What about the units for a URI parameter? I recommend that it be dimensionless, or if it is an image, then you could give a specific unit, like Jy, for example.

jbfaden commented 1 year ago

What are the units for a string? Seems like when there's a question we should do what string does.

jvandegriff commented 1 year ago

Strings are presumably dimensionless. But now it's different since units in this case could refer to the content of the URI. We should at least mention this in the spec - what we recommend for units on URIs.

jvandegriff commented 1 year ago

Where should we put the `isEncoded` and `mediaType` values? And do we use this capitalization?

Here's an example that puts the info in a separate uriInfo block:

"parameters": 
    [ { "name": "Time",
        "type": "isotime",
        "units": "UTC",
        "fill": null,
        "length": 24
      },
      { "name": "solar_images",
        "description": "full-disk images of the Sun by SDO/AIA at wavelength of 304 angstroms"
        "type": "uri",  // UGH - new type...
        "uriInfo": { "isEncoded": true, "mediaType": "image/fits" },
        "length": 64,
        "units": "Jy",  (probably no do this - units should be null)
        "fill": NaN
      },
      { "name": "solar_images",
        "description": "full-disk images of the Sun by SDO/AIA at wavelength of 304 angstroms"
        "type": "encodedUri" - UGH - even more new types, and combinatorics on the new type
        "mediaType": "image/fits",
        "length": 64,
        "units": null,
        "fill": NaN
      },
-OR-
      { "name": "solar_images",
        "description": "full-disk images of the Sun by SDO/AIA at wavelength of 304 angstroms"
        "type": "uri",   (all must be not encoded)
        "mediaType": "image/fits", // not ideal, since this would only apply if it was a "type": "uri"
        "length": 64,
        "units": null,
        "fill": NaN
      },
-OR-
      { "name": "solar_images",
        "description": "full-disk images of the Sun by SDO/AIA at wavelength of 304 angstroms"
        "type": "uri:mediaType=image/fits",  // adds too much complexity to type interpretation; would confuse client code
        "length": 64,
        "units": null,
        "fill": NaN
      },
-OR-
      { "name": "solar_images",
        "description": "full-disk images of the Sun by SDO/AIA at wavelength of 304 angstroms"
        "type": { "id": "uri", "mediaType": "image/fits", "isEncoded": false }, // very huge change to the "type" definition
        "length": 64,
        "units": null,
        "fill": NaN
      },
-OR-
      { "name": "solar_images",
        "description": "full-disk images of the Sun by SDO/AIA at wavelength of 304 angstroms"
        "type": "string", // just use existing string definition, and the annual an optional block for specialty strings
        "stringInfo": { "type": "uri", "mediaType": "image/fits", "scheme": "https | s3 | ftp | doi | http | other", "x_LASP_scheme": "lisird"}, 
         "x_DougsStringInfo": {},
        "length": 64,
        "units": null,
        "fill": null  OR "" OR "anyString"
      }
     ]

We talked about many of these options at the Feb 20 telecon, and liked the last one the most. We do not want to add the "scheme" option now, since it would be easy to add later but hard to take out.

jvandegriff commented 1 year ago

Discussion on Feb 20:

We will not add a new URI type, but just add a stringInfo keyword that for now is just for indicating that a string contains a URI value. All URI values must NOT be encoded. Strings can contain UTF-8 values, so all possible URI values should be able to be represented in the string.

"parameters": [
      { "name": "Time",
        "type": "isotime",
        "units": "UTC",
        "fill": null,
        "length": 24
      },
      { "name": "solar_images",
        "description": "full-disk images of the Sun by SDO/AIA at wavelength of 304 angstroms"
        "type": "string",
        "stringInfo": { "type": "uri", "mediaType": "image/fits" },
        "length": 64,
        "units": null,
        "fill": null  OR "" OR "anyString"
     }
]

The stringInfo is optional and should only be present if the string is a URI. Within stringInfo, the type is required, and mediaType is optional. No scheme in the stringInfo for now. People can add it now as an "x_" element, and if it's useful, we can add it to the spec.

Because the URIs are string types, the units should be null, like they are for other strings.

The fill values can be:

  1. null - this communicates there are no fill values
  2. "" (empty string) OR "any other string" - this lets clients know when there is no URI for a given row; note that an empty string is just a special case of "anyString"
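A Python sketch of how a client could detect URI-bearing parameters under this decision (the stringInfo keyword is as decided above; the dataset content itself is made up):

```python
import json

# Hypothetical /info response following the stringInfo decision above.
info = json.loads("""
{ "parameters": [
    { "name": "Time", "type": "isotime", "units": "UTC",
      "fill": null, "length": 24 },
    { "name": "solar_images", "type": "string",
      "stringInfo": { "type": "uri", "mediaType": "image/fits" },
      "length": 64, "units": null, "fill": null } ] }
""")

def uri_parameters(info):
    """Return the parameters whose string values are declared to be URIs."""
    return [p for p in info["parameters"]
            if p.get("type") == "string"
            and p.get("stringInfo", {}).get("type") == "uri"]

params = uri_parameters(info)
```

A generic client that does not know about stringInfo simply sees ordinary strings, which is the backward-compatibility point of this design.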

sandyfreelance commented 1 year ago

Agreed. Are we adding "scheme": "https, http, s3, ftp, autoplot, other" to the stringInfo as an option?

eelcodoornbos commented 1 year ago

I like the stringInfo solution.

It would be very helpful to have the scheme in the info for a dataproduct, in addition to the mediaType. When the URI scheme is "https" or "http" and the mediatype is "image/jpeg", "image/png" or "image/gif", a web-based client can then try to fetch the images in the URIs and show them to the user. When they are something else, it can let the user know that it cannot deal with the particular scheme and/or media type, by providing a message, or disabling the user from selecting such a parameter in a selection menu, before ever having to fetch any actual data. Other clients might know how to deal with other schemes and mediatypes, of course.

This all assumes that a parameter in a dataset will contain uniform URIs and content behind those URIs. Otherwise there must perhaps be some way to indicate the scheme and/or media type is "mixed".

For my client and custom HAPI server, I will probably at some point also want to add keywords in the info relating to image resolution, color map and scale extents used, etc., and offer different options for these in different parameters of a dataset, pointing at PNGs or JPEGs generated of the same underlying data. These will then probably be custom keywords for use by my client.
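The gating described here can be sketched in Python. Note the "scheme" key inside stringInfo is still only proposed at this point, so treat it as hypothetical:

```python
# Sketch of the check a web client could do before fetching any data:
# offer inline display only when both the URI scheme and the media type
# are ones it can render. The "scheme" key is hypothetical here.
DISPLAYABLE_SCHEMES = {"http", "https"}
DISPLAYABLE_TYPES = {"image/jpeg", "image/png", "image/gif"}

def can_display_inline(string_info):
    """Decide from metadata alone whether the client can show the URIs."""
    return (string_info.get("scheme") in DISPLAYABLE_SCHEMES
            and string_info.get("mediaType") in DISPLAYABLE_TYPES)

can_display_inline({"type": "uri", "scheme": "https", "mediaType": "image/png"})
can_display_inline({"type": "uri", "scheme": "s3", "mediaType": "image/fits"})
```

This is exactly the "disable the menu entry before ever fetching data" behavior: the decision uses only the /info metadata.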

rweigel commented 1 year ago

This is pretty close.

"type": "string",
"stringInfo": { "type": "uri", "mediaType": "image/fits" },

Two issues:

  1. Info already has a meaning in HAPI as an endpoint, so ideally we would use something else.
  2. It seems that mediaType should be a child of uri. Maybe stringInfo = {"uri": {"mediaType": "image/fits"}}. Not ideal either b/c of the one-deep nesting. We could have uriInfo instead of stringInfo to flatten things. We are fighting against the assumption that the values of each column in a HAPI response are "y-" and "z-" values, so we'll have to pick the option that looks the least weird.
jvandegriff commented 1 year ago

What about stringConstraints instead of stringInfo?

Also, I like having it be nested like you've shown. Even if it is more layers, it keeps things separated, so that a media type is only ever relevant for a URI constraint. The alternative is that we have rules (statements in the spec) that tell people how to keep things separate ("don't use media type unless you have a URI"), so the extra layer seems better since it naturally enforces the relevance rules.

I think we should go ahead and include the scheme as an optional element. It does help clients know exactly how they could access the content behind the URI.

We could make it a list or a dictionary. Here's what the list would look like.

"type": "string",
"stringConstraints": [ "uri": { "mediaType": "image/png", "scheme": "https" } ]
rweigel commented 1 year ago

That works. It would need to be (no square braces):

{ "uri": { "mediaType": "image/png", "scheme": "https" } }

But what if I want to avoid specifying a media type and/or scheme? Maybe

{ 
"type": "uri", 
"uriConstraints" { 
   "mediaType": "image/png", "scheme": "https"
   }   
}

This isn't great, but it parallels the type="string" with a "stringConstraints" at the same level.

jbfaden commented 1 year ago

Is an empty dictionary allowed in JSON?

{ "uri": {  } }
rweigel commented 1 year ago

@jbfaden https://jsonlint.com/ says yes.

I missed a colon after "uriConstraints".

{ "type": "uri", "uriConstraints": { "mediaType": "image/png", "scheme": "https" }
}
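A quick Python check of Jeremy's question above (an empty object is valid JSON):

```python
import json

# An empty object is indeed valid JSON, so { "uri": {} } can serve as
# "this string is a URI, with all defaults".
parsed = json.loads('{ "uri": {} }')
```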

jvandegriff commented 1 year ago

Jeremy's question seems relevant - what if there are no constraints other than saying it is a URI - we should think about that more.

Or maybe we should require the mediaType to be there since it is sort of like the data type, and we want to encourage people to really only put in content that someone would reasonably be expected to consume, and requiring the mediaType emphasizes this. There is presumably a way to say "unknown" or "other" that servers could put in there if the URI content really is something unusual or non-standard.

jbfaden commented 1 year ago

Wouldn't this work (using a dictionary/map rather than an array):

"type": "string",
"stringConstraints": { "uri": { } }

This asserts that string is a uri, and all the defaults for uri are to be used.

rweigel commented 1 year ago

We could have stringConstraint be a string or an object. We do that with label, which can be a string or array of strings.

So

stringConstraint: "uri" (simplest, and probably will be most often used)

or

stringConstraint: { ... }
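A Python sketch of how a client might normalize this string-or-object form (a hypothetical helper, mirroring how label already accepts either a string or an array):

```python
# Hypothetical helper for the proposed string-or-object stringConstraint.
def normalize_string_constraint(value):
    """Accept "uri" or {"uri": {...}} and return a (kind, options) pair."""
    if isinstance(value, str):
        return value, {}
    if isinstance(value, dict) and len(value) == 1:
        kind, options = next(iter(value.items()))
        return kind, options
    raise ValueError("unrecognized stringConstraint: %r" % (value,))

normalize_string_constraint("uri")
normalize_string_constraint({"uri": {"mediaType": "image/png"}})
```

Normalizing at the parse boundary keeps the rest of the client code from having to branch on the two forms.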

rweigel commented 1 year ago

So here is the issue I was worried about from the outset regarding this getting complex. I was working on creating a file listing. And I want to

  1. include the base URI in the info response so that one does not need to repeat it in the data and
  2. indicate the file size, start and stop of data in the file, and last modified in the data response. Maybe even the MD5. Also that each file has an associated preview image.

We could add "stringPrefix" for 1.

For 2., we should really require that if this info were available, the parameters would have names from a controlled vocabulary. Or there would be a mapping in the info metadata from the controlled vocabulary to the parameter name.

We could suggest that you have two servers if you have file listings and also serve the data in the files from a HAPI server. For example, you run two servers

CDAWeb/

CDAWeb-files/ (Provides a list of files from which a given parameter in CDAWeb is drawn.)

One could also have

CDAWeb-file-images/ For plots associated with each file in CDAWeb-files.

Otherwise, we'd have a /catalog response that had something like

{"id": "AC_H2_MFI",
{"id": "AC_H2_MFI-some-name-to-indicate-file-list",
{"id": " AC_H2_MFI-some-name-to-indicate-gifwalks",
  ...

but most HAPI users would only want AC_H2_MFI data and would have to filter out the other stuff (or we would need to add some way of indicating that some datasets are only metadata datasets). I recall a similar issue in CDAWeb's all.xml. I think that when Nand first set up the server, he had dataset ids in his /catalog response corresponding to gifwalk datasets.

We could indicate the type of dataset at the catalog level, but again we are stretching the scope of HAPI.

jvandegriff commented 1 year ago

But this need for a controlled vocabulary for file listings is no different than for data.

If I have a magnetometer dataset, I need to communicate which fields are for the actual magnetic field data, whether or not there are uncertainties, which column has the quality flags, and what those mean.

If I have an energetic particle dataset, I want to communicate which parameters have the pitch angle, which spectra correspond to protons, which to alphas, which to He+, and which to a range of heavy elements (CNO, for example).

Also, for any in-situ data, sometimes there are columns with the spacecraft position, so it would be nice to indicate that too.

HAPI lets people get to the numbers, but doesn't assign any meaning to them.

This is what Jeremy and I keep talking about by saying the next thing is to have a semantic capability on top of HAPI that lets people know what the parameters are.

For now, I would suggest an additional metadata block that has a controlled vocabulary for specific kinds of known quantities, and in there you can identify which columns can be used for specific purposes. It's easiest for magnetometer data, and also for your file listing example:

"additionalMetadata: [
   { "name": "hapiSemantics",
     "aboutURL": "http://ref.to.info/about/ways/tolabel/your/parameters"
     "content": { "fileSize": "file_sz_kB",
                           "fileLastModified": "last_mod_time",
                           "fileChecksumMD5": "data_md5",
                           "fileDataStartTime": "first_fime",
                           "fileDataStopTime": "last_fime" }
   }
]
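A sketch of how a client might use such a hapiSemantics block to locate columns (all names here come from the made-up example, not from the HAPI spec):

```python
# Resolve a controlled-vocabulary quantity to a column in the dataset.
# The mapping and parameter names are hypothetical, per the example above.
semantics_content = {
    "fileSize": "file_sz_kB",
    "fileLastModified": "last_mod_time",
    "fileChecksumMD5": "data_md5",
}

# Parameter names in the order they appear in the dataset's /info response.
parameters = ["Time", "file_uri", "file_sz_kB", "last_mod_time", "data_md5"]

def column_index(quantity, content, parameters):
    """Return the column index serving a known quantity, or None if absent."""
    name = content.get(quantity)
    return parameters.index(name) if name in parameters else None

column_index("fileSize", semantics_content, parameters)
```

The point is that the semantics block only names parameters; the data layout itself stays plain HAPI.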

For mag data, it might look something like this:

"additionalMetadata: [
   { "name": "hapiSemantics",
     "aboutURL": "http://ref.to.info/about/ways/tolabel/your/parameters"
     "content": { "magVector": "mag_GSE",
                           "magUncert": null,
                           "magQualityFlagString": "mag_qual" }
   }
]

Since there may be more than one mag vector in a file (or more than one file listing?), I guess you would want to allow for that:

"additionalMetadata: [
   {    "name": "hapiSemantics",
         "aboutURL": "http://ref.to.info/about/ways/tolabel/your/parameters"
         "content": {
              [ // needs to be a list since you could have more than one
                "magneticField": {
                     "vector": "mag_GSE",
                     "uncertVector": "mag_GSE_uncert",
                      "magQualityFlagString": "mag_qual" }
               ]
          }
   }
]
jvandegriff commented 1 year ago

About including the base URI in the info response so it does not need to be repeated (something Bob mentions in point 1 above): I suggest we don't add complications to support this. Jeremy points out that the effect of this kind of "waste" (repeated characters in the data stream) is mitigated by web servers and clients automagically compressing things in transport, so I vote that we don't worry about keeping URIs short.

jvandegriff commented 1 year ago

I like the idea of having the stringConstraints be either a string value, such as uri, or an object with that value as the key and other options specified.

Do we do this anywhere else?