mwengren opened 4 years ago
I've spent time examining this issue and how it relates to #804. I am close to finding a solution which would enable the checking of an ERDDAP dataset through only its metadata by parsing the .ncCFHeader
file, which is similar to an ncdump -h
output. For example, the WQB-04.ncCFHeader
:
netcdf WQB-04_3a1c_1e61_386e.nc {
  dimensions:
    timeseries = 1;
    obs = 293;
  variables:
    float latitude(timeseries=1);
      :_CoordinateAxisType = "Lat";
      :actual_range = 19.7341f, 19.7341f; // float
      :axis = "Y";
      :comment = "instrument is in fixed location";
      :ioos_category = "Location";
      :long_name = "Latitude";
      :short_name = "lat";
      :standard_name = "latitude";
      :units = "degrees_north";
      :valid_range = 19.7341f, 19.7341f; // float
Create a netCDF4.Dataset object in write mode using only the information given in the .ncCFHeader file. No IOOS checks examine the array data (correct me if I'm wrong, @benjwadams), and thus a metadata-only Dataset would provide the necessary interface for the Compliance Checker to do its work.
Many variables require numeric attributes, such as valid_min, valid_range, _FillValue, etc. In the .ncCFHeader file, these attributes are encoded as characters:

:valid_range = 19.7341f, 19.7341f; // float
Humans reading this understand it to be an array of floating-point values, but a machine only reads it as characters encoded into a text file.
Assuming that these numeric attributes are numeric arrays could be a large misstep. If the attribute is actually encoded in the file as a character array, it should be marked as an incorrect datatype. Thus, the question:
"Is it safe to assume comma-separated strings of numeric-only characters (or ones suffixed by f for float, d for double, etc.) are encoded into an ERDDAP .ncCF file as numeric arrays?"
Because ERDDAP is so strict with certain aspects of its typing, this may be the case.
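Under that assumption, the suffix convention itself is mechanical to parse. A minimal sketch (the suffix-to-type mapping follows the CDL conventions shown in the header above; treat it as illustrative, not as the checker's actual parser):

```python
import re

# A CDL-style numeric token: optional sign, digits, optional exponent,
# optional one-letter type suffix (f=float, d=double, L=long, s=short, b=byte).
_NUM = re.compile(r"^([+-]?(?:\d+\.?\d*|\.\d+)(?:[eE][+-]?\d+)?)([fdLlsb]?)$")

def parse_attr_value(text):
    """Return a list of Python numbers, or the raw string when not numeric."""
    values = []
    for part in (p.strip() for p in text.split(",")):
        m = _NUM.match(part)
        if not m:
            return text  # not numeric-looking: keep as a character attribute
        number, suffix = m.groups()
        if suffix in ("f", "d") or "." in number or "e" in number.lower():
            values.append(float(number))
        else:
            values.append(int(number))
    return values

parse_attr_value("19.7341f, 19.7341f")               # -> [19.7341, 19.7341]
parse_attr_value("instrument is in fixed location")  # -> unchanged string
```

Note this is exactly where the misstep risk lives: a genuinely character-typed attribute that happens to look numeric would be silently promoted.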
Tagging @ocefpaf in here just in case he wants to drop some more knowledge on us ;)
@daltonkell Another consideration is that requesting the generic .ncCFHeader output type in ERDDAP, without any filtering, can cause serious load and/or wait time on the ERDDAP server, depending on the dataset size. The file download size won't be excessive since it's just metadata (as it might be for the equivalent .ncCF format), but the server may still need to read a lot of source files to generate the response.
I talked about looking for workarounds, like using a generic time query filter value, in https://github.com/ioos/compliance-checker/issues/804#issue-614304533.
For the WQB-04 dataset, the obs dimension response varies depending on what filter criteria are passed for time in these two requests:
https://pae-paha.pacioos.hawaii.edu/erddap/tabledap/WQB-04.ncCFHeader?&time%3E=2020-07-06T16:00:00Z
https://pae-paha.pacioos.hawaii.edu/erddap/tabledap/WQB-04.ncCFHeader?&time%3E=2020-06-06T16:00:00Z
For the IOOS 1.2 Platform check, we need to get this dimension information, but I don't think we need to read any of the array/table data itself, at least, so maybe there's a solution here. Just wanted to note these issues though.
Not sure on your data type questions, sorry!
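Building such a time-constrained header request only needs the ERDDAP tabledap URL pattern plus percent-encoding of the constraint operator. A small sketch (server and dataset id are the WQB-04 example above; the cutoff timestamp is arbitrary):

```python
from urllib.parse import quote

def nccf_header_url(server, dataset_id, time_ge):
    """URL for a .ncCFHeader request constrained to time >= time_ge."""
    # Encode '>' as %3E but leave '=' and ':' literal, matching the
    # constraint style ERDDAP documents for tabledap requests.
    constraint = quote(f"time>={time_ge}", safe="=:")
    return f"{server}/tabledap/{dataset_id}.ncCFHeader?&{constraint}"

url = nccf_header_url("https://pae-paha.pacioos.hawaii.edu/erddap",
                      "WQB-04", "2020-07-06T16:00:00Z")
```

A recent cutoff limits how many source files ERDDAP has to scan, at the cost of the reported obs size reflecting only the filtered window.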
The file download size won't be excessive since it's just metadata (like it might be for the equivalent .ncCF format), but the server may still need to read a lot of source files to generate the response.
You are right. I never thought about that, but the creation of that info on a file slice request can be demanding on the server side!
Maybe we could use the dataset_id info response? It probably has a smaller server-side footprint (Bob can probably say more about that). There are no ncHeader-like responses for it, though. There is a nc response, but that is just the same table as CSV or JSON, which are lighter downloads.
I've been playing with constructing a nc-like object from that response: https://nbviewer.jupyter.org/gist/ocefpaf/ae0d650af68c0670e5f09d35c887129c
It is probably a long way from what compliance-checker needs, though. And again, no data tests would be run; only metadata tests would work.
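The core of that approach is folding ERDDAP's /info/&lt;dataset_id&gt;/index.csv rows into an attribute mapping. A sketch, assuming the standard info-response columns (Row Type, Variable Name, Attribute Name, Data Type, Value); the `sample` text below is an abbreviated stand-in for a real downloaded response, not actual WQB-04 output:

```python
import csv
import io
from collections import defaultdict

# Abbreviated, illustrative info-response CSV; a real one has many more rows.
sample = """\
Row Type,Variable Name,Attribute Name,Data Type,Value
attribute,NC_GLOBAL,cdm_data_type,String,TimeSeries
variable,latitude,,float,
attribute,latitude,standard_name,String,latitude
attribute,latitude,units,String,degrees_north
"""

def attrs_by_variable(csv_text):
    """Map variable name -> {attribute: (data_type, value)}.

    Global (NC_GLOBAL) attributes are stored under the "" key.
    """
    table = defaultdict(dict)
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["Row Type"] == "attribute":
            var = "" if row["Variable Name"] == "NC_GLOBAL" else row["Variable Name"]
            table[var][row["Attribute Name"]] = (row["Data Type"], row["Value"])
    return dict(table)
```

As noted below, this still carries no dimension sizes, which is the piece the Platform/DSG checks need.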
@daltonkell @benjwadams Does the attribute dictionary response in @ocefpaf's notebook look useful for the IOOS 1.2 checker for ERDDAP datasets? We may still need dimension info in order to test the platform concept check, potentially.
Also, @benjwadams would this meet our needs in ioos/catalog-ckan#208 if it were built into erddapy directly?
@mwengren @ocefpaf's use of the https://geoport.usgs.esipfed.org/erddap/info/1051-A/index.html response is very resourceful. Without the dimension information, though, we are unable to test whether the dataset is CF-DSG-compliant, which, if I recall correctly, was a pretty critical step in the IOOS 1.2 spec. It also doesn't really answer the problem of assuming attribute encoding that I mentioned earlier.
I wonder if we should work upstream with ERDDAP developers to augment the info response with this metadata instead of working around it. What do you think?
@ocefpaf I'm 100% for collaboration. Perhaps we could reach out via the Google Group?
The ERDDAP Google Group would be a good place to begin. I haven't seen much issue traffic in the ERDDAP GitHub repo, probably partly because Bob is the primary/solo developer. He typically does respond on either, however.
Hi all, checking in on this issue as we're working on a data ingestion pipeline that will involve frequently running compliance-checker IOOS 1.2 checks against ERDDAP endpoints. Any thoughts on reducing the burden on remote ERDDAPs? Maybe just a non-ideal cchecker flag to skip any checks requiring a lot of data loading until the situation can be improved on the ERDDAP side?
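That flag idea could look something like the following. Note that `--skip-data-checks` is a hypothetical option name sketched for discussion, not an existing compliance-checker flag:

```python
import argparse

# Hypothetical CLI addition: let users opt out of checks that force the
# remote ERDDAP to read source data files (e.g. DSG dimension sizing).
parser = argparse.ArgumentParser(prog="cchecker")
parser.add_argument(
    "--skip-data-checks",
    action="store_true",
    help="skip checks that require loading array/table data from the "
         "remote server; metadata-only checks still run",
)

args = parser.parse_args(["--skip-data-checks"])
```

Individual checks could then consult `args.skip_data_checks` before issuing any data-reading requests.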
@daltonkell Can you look into options for resolving this dimension checking issue while you're also investigating the CF FeatureType issues in PR https://github.com/ioos/compliance-checker/pull/858?
They may not be related exactly, but this one has been lingering for a while, and a flag option as in @shane-axiom's suggestion would help make automated scans of ERDDAP servers for Metadata Profile 1.2 compliance a lot more performant.
If there isn't an ERDDAP-based fix for this in the works and we don't have a good workaround in CC, like time dimension filtering to reduce ERDDAP server CPU time when determining dataset DSG dimensionality, providing the option to skip those tests might be the best way to go.
This issue is to capture discussion in https://github.com/ioos/compliance-checker/pull/799#discussion_r411518186 so it doesn't get lost.
From those comments:
@mwengren said:
@daltonkell said:
This would be a more permanent solution to #804 in that it would hopefully reduce the file sizes requested from ERDDAP .ncCF output formats.