Display all locations that contain "air_temperature" on a map

rsignell-usgs commented 6 years ago

From this ERDDAP endpoint: http://erddap.sensors.ioos.us/erddap/index.html

Return the minimum lon,lat for all datasets using the ERDDAP allDatasets API: http://erddap.sensors.ioos.us/erddap/tabledap/allDatasets.json?datasetID%2CminLongitude%2CminLatitude returning a master lon/lat station list.
Find all the datasets with cdm_data_type=timeseries that contain a variable with standard_name=air_temperature and that have data within the last 7 days. Return the Dataset ID for datasets that match these criteria using the advanced search API:

http://erddap.sensors.ioos.us/erddap/search/advanced.json?page=1&itemsPerPage=1000&searchFor=&protocol=tabledap&cdm_data_type=timeseries&institution=(ANY)&ioos_category=(ANY)&keywords=(ANY)&long_name=(ANY)&standard_name=air_temperature&variableName=(ANY)&maxLat=90&minLon=-180&maxLon=-45&minLat=10&minTime=now-7days&maxTime=now

Use the returned Dataset ID list to extract the lat/lon from the master lat/lon station list.
Display these points on a map (using Leaflet, OpenLayers3 or equivalent).
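
The steps above can be sketched roughly as follows. This is a minimal Python sketch, not the viewer's actual code. It assumes both responses use ERDDAP's `.json` table layout (`table.columnNames` plus `table.rows`) and that the search result has a `Dataset ID` column. The function names and the sample rows are made up for illustration. The output is a GeoJSON FeatureCollection, which Leaflet or OpenLayers can draw directly.

```python
import json

def table_to_records(erddap_json):
    """Turn ERDDAP's {"table": {"columnNames": [...], "rows": [...]}} into dicts."""
    table = erddap_json["table"]
    names = table["columnNames"]
    return [dict(zip(names, row)) for row in table["rows"]]

def stations_to_geojson(all_datasets_json, search_json):
    """Join the search hits against the master lon/lat station list."""
    # Master list from allDatasets, keyed by datasetID.
    master = {
        rec["datasetID"]: (rec["minLongitude"], rec["minLatitude"])
        for rec in table_to_records(all_datasets_json)
    }
    # Dataset IDs returned by the advanced search.
    hits = [rec["Dataset ID"] for rec in table_to_records(search_json)]
    features = []
    for dataset_id in hits:
        lon, lat = master.get(dataset_id, (None, None))
        if lon is None or lat is None:
            continue  # position unknown or dataset missing from the master list
        features.append({
            "type": "Feature",
            # GeoJSON orders coordinates as [longitude, latitude].
            "geometry": {"type": "Point", "coordinates": [lon, lat]},
            "properties": {"datasetID": dataset_id},
        })
    return {"type": "FeatureCollection", "features": features}

# Made-up sample responses, trimmed to the columns used here.
all_datasets = {"table": {
    "columnNames": ["datasetID", "minLongitude", "minLatitude"],
    "rows": [["station_a", -70.5, 42.3], ["station_b", None, None],
             ["station_c", -122.4, 37.8]],
}}
search = {"table": {
    "columnNames": ["Title", "Dataset ID"],
    "rows": [["Station A", "station_a"], ["Station B", "station_b"]],
}}
geojson = stations_to_geojson(all_datasets, search)
print(json.dumps(geojson, indent=2))
```

In this sample only `station_a` ends up on the map: `station_b` matched the search but has no known position, and `station_c` did not match the search.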

rsignell-usgs commented 6 years ago

@BobSimons, is this the most efficient approach to finding the lon/lat of all timeseries datasets that have a specific standard_name?

BobSimons commented 6 years ago

By specifying "protocol=tabledap", you are limiting the response and not getting data "for all datasets". E.g., you won't get gridded/model data.

By specifying "cdm_data_type=timeseries", you are limiting the response and not getting data "for all datasets". E.g., you won't get trajectories (e.g., data from ships/cruises).

By specifying those Lon limits, you are limiting the response and not getting data "for all datasets". Beyond the obvious limit (maxLon=-45), some datasets may have lon in the range of 0 - 360.
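
One possible way to cope with the 0 - 360 case on the client side is to fold every longitude into the -180 to 180 range before filtering or plotting, instead of relying only on minLon/maxLon in the search. A minimal Python sketch, where the function name is made up for illustration:

```python
def normalize_lon(lon):
    """Map a longitude in degrees east onto the half-open range [-180, 180)."""
    return ((lon + 180.0) % 360.0) - 180.0

for lon in (-75.0, 285.0, 350.0, 180.0):
    print(lon, "->", normalize_lon(lon))
```

Note that with this convention 180 maps to -180, which is the same meridian.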

By specifying itemsPerPage=1000, you are limiting the response. That is the default, but you can and should change it to more than the number of datasets on that ERDDAP (36000?!), e.g., 50000.
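
As a hedged illustration, the search URL from the original post could be assembled like this, with only itemsPerPage changed. This is a Python sketch. The helper name is made up, and the other parameters are copied from the URL at the top of the thread.

```python
from urllib.parse import urlencode

BASE = "http://erddap.sensors.ioos.us/erddap"

def advanced_search_url(standard_name, items_per_page=50000, base=BASE):
    """Build the advanced search URL; itemsPerPage should exceed the dataset count."""
    params = {
        "page": 1,
        "itemsPerPage": items_per_page,
        "searchFor": "",
        "protocol": "tabledap",
        "cdm_data_type": "timeseries",
        "institution": "(ANY)",
        "ioos_category": "(ANY)",
        "keywords": "(ANY)",
        "long_name": "(ANY)",
        "standard_name": standard_name,
        "variableName": "(ANY)",
        "maxLat": 90, "minLon": -180, "maxLon": -45, "minLat": 10,
        "minTime": "now-7days",
        "maxTime": "now",
    }
    # urlencode percent-encodes the parentheses in "(ANY)".
    return f"{base}/search/advanced.json?{urlencode(params)}"

url = advanced_search_url("air_temperature")
print(url)
```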

Some datasets (notably data from relational databases) don't necessarily have known min and max time values. So the time constraints may limit the response.

Beyond that, the basic approach is correct: get a list of datasetIDs that meet some criteria, then get the data from each.

One of the reasons ERDDAP doesn't have something like this built in is that I am always really, really uncomfortable with this kind of effort.

This effort mixes data from different altitudes.
This effort mixes data from different times: different days and even different times of day.
This effort ignores the different ways that different datasets calculated the air_temperature (average over some period vs instantaneous, direct calculation vs derived, exposure of the sensor, etc).
This effort ignores the quality of the data (does a given dataset have quality flags that mark some values as "Bad").

I understand the appeal of this kind of effort, but I remain unconvinced that it is the right thing to do. I think that either:

you need to set up a system with a ton more information (e.g., quality flags, something more specific than standard_name (see Roy Lowry's efforts for a start)) so that you can confidently combine data from different datasets
or, a human needs to be involved to say that combining the data from this dataset and that dataset is legitimate because the datasets are so similar.

sasignell commented 6 years ago

I was under the impression that this effort was mostly intended to connect folks with recent data from a particular station, not as a tool allowing one to compare data between stations/sensors. So I'm thinking that mixing of the data is not a big concern here, e.g. they click on one station and get the data as presented and collected, and when they click on another they get the data as collected from that station. It would be incumbent on the user to interpret the different stations' data properly of course, but nowhere would we be presenting site data side by side as if they were equal. Rich?

rsignell-usgs commented 6 years ago

Yes @sasignell , that is correct. The main thing is to create some non-trivial JS code that someone else could build on, adding QC flag options, elevation restrictions, etc.

So my takeaway from @BobSimons's response is: we should increase itemsPerPage to 50000, but otherwise, we've got the right approach! :smile_cat:

rsignell-usgs commented 6 years ago

@BobSimons , I should have mentioned that we are actually only after all the single point/sensor data, hence the restriction to "protocol=tabledap" and "cdm_data_type=timeseries".

BobSimons commented 6 years ago

Yesterday, I just looked at the URL and answered the question in a general way. Since then I have looked at that ERDDAP and the datasets. Let me offer an additional answer...

Currently that ERDDAP has one ERDDAP dataset per sensor. That's fine. Keep that. But you might consider also creating an aggregated ERDDAP dataset for each type of sensor. Thus one dataset might have data from 1000's of sensors, all of one type. There are a couple of ways to do this in ERDDAP, so the method to use depends on how you have things set up on the back end. The most efficient is to give the new dataset access to all of the files for all of the related sensors. Presumably, each current dataset/data file for a given type of sensor has exactly the same variable names as the others. And fortunately, the datasets seem to all include a sensor ID. So it is probably straightforward to add aggregated datasets. And since each sensor of a given type has very similar data, collected and processed in the same way, it is probably valid to make an aggregated dataset. Thus, all my concerns yesterday about combining data from different datasets go away.

Then, your question that started this thread has a better answer: A user can just query the aggregated dataset and get the relevant data (e.g., for a given lat/lon/time range) from 1000's of sensors. That is vastly (as in, three orders of magnitude!) faster, easier and more efficient than the approach initially suggested.
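
To make that concrete, a single tabledap request against such an aggregated dataset might be built as below. This is a Python sketch under assumptions: the dataset ID `air_temperature_aggregate` and the variable name `air_temperature` are hypothetical, since no aggregated dataset exists on that ERDDAP yet, and the helper name is made up.

```python
from urllib.parse import quote

def tabledap_url(base, dataset_id, variables, constraints, file_type="json"):
    """Build a tabledap query: comma-separated variables, then &constraint pairs."""
    query = ",".join(variables) + "".join("&" + c for c in constraints)
    # Percent-encode characters such as > and <, but keep , & = readable.
    return f"{base}/tabledap/{dataset_id}.{file_type}?{quote(query, safe=',&=')}"

url = tabledap_url(
    "http://erddap.sensors.ioos.us/erddap",
    "air_temperature_aggregate",  # hypothetical aggregated dataset ID
    ["time", "latitude", "longitude", "air_temperature"],
    ["time>=now-7days", "latitude>=10", "latitude<=90",
     "longitude>=-180", "longitude<=-45"],
)
print(url)
```

One request like this would replace the search plus the per-dataset lookups, which is where the efficiency gain comes from.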

rsignell-usgs commented 6 years ago

@BobSimons, according to @kwilcox, this specific ERDDAP endpoint connects on the backend to a custom service that Axiom developed, so I don't know what's possible there.

I wasn't thinking of this as the most efficient viewer for this specific dataset, more of a generic viewer that folks could modify slightly and use with their own ERDDAP endpoint.

Although it may be way faster to aggregate, it seems plenty fast without the aggregation.

BobSimons commented 6 years ago

You should still be able to make an EDDTableAggregateRows dataset which has 1000's of EDDTableFromErddap child datasets.
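
For anyone who wants to try this, the datasets.xml chunk could be generated rather than typed by hand. The Python sketch below is an assumption-laden illustration: the element and attribute names (`dataset`, `type`, `datasetID`, `active`, `sourceUrl`) follow ERDDAP's datasets.xml conventions as far as I know, but the `_child` suffix, the helper name, and the sample IDs are made up, and a real aggregate dataset will likely need further elements (e.g., reload settings and attributes), so check the ERDDAP documentation before using it.

```python
import xml.etree.ElementTree as ET

def aggregate_rows_xml(aggregate_id, source_erddap, child_ids):
    """Emit one EDDTableAggregateRows dataset wrapping EDDTableFromErddap children."""
    parent = ET.Element("dataset", type="EDDTableAggregateRows",
                        datasetID=aggregate_id, active="true")
    for child_id in child_ids:
        # Assumption: give each child its own ID so it cannot clash with the
        # existing per-sensor dataset of the same name.
        child = ET.SubElement(parent, "dataset", type="EDDTableFromErddap",
                              datasetID=f"{child_id}_child", active="true")
        # The source URL is the remote tabledap dataset, without a file extension.
        ET.SubElement(child, "sourceUrl").text = f"{source_erddap}/tabledap/{child_id}"
    return ET.tostring(parent, encoding="unicode")

xml_chunk = aggregate_rows_xml(
    "air_temperature_aggregate",             # hypothetical aggregate ID
    "http://erddap.sensors.ioos.us/erddap",
    ["sensor_0001", "sensor_0002"],          # hypothetical per-sensor dataset IDs
)
print(xml_chunk)
```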

rsignell-usgs commented 6 years ago

@BobSimons , interesting -- I didn't know about that one, but I suppose there are reasons that Axiom didn't want to do that. Perhaps they want to win with the largest number of datasets! :-)

I think it will be interesting to just try with individual datasets first anyway.

BobSimons commented 6 years ago

I'm encouraging you to do both individual datasets and the aggregated dataset. In fact, if you build an EDDTableAggregateRows dataset from 1000's of EDDTableFromErddap child datasets, you have to do both. And this will get them the most datasets, e.g., 36000 individual sensor datasets + e.g., 10 aggregated datasets. :-)

rsignell-usgs commented 6 years ago

@BobSimons , ah, now I get it.
Okay, @kwilcox, what do you say? :smile_cat:

rsignell-usgs / erddap-tsviewer
