ioos / compliance-checker

Python tool to check your datasets against compliance standards
http://ioos.github.io/compliance-checker/
Apache License 2.0

Use .ncCFHeader ERDDAP format for Platform dimension tests and other IOOS 1.2 checks #805

Open mwengren opened 4 years ago

mwengren commented 4 years ago

This issue is to capture discussion in https://github.com/ioos/compliance-checker/pull/799#discussion_r411518186 so it doesn't get lost.

From those comments:

@mwengren said:

Can we implement the dimension check, and maybe most or all of the IOOS checks, by requesting the .ncCFHeader response instead of the full .ncCF output? For some ERDDAP datasets, if there aren't limits placed on the request, any of the .nc, .ncCF, .ncCFMA, or even .csv output types could end up requesting a lot of data that could contribute to poor performance.

That may be too big a change though. If so, let's go with .ncCF and see what results are for next RC.

@daltonkell said:

I think including this in the next RC is appropriate, but we'll have to leave the .ncCFHeader request out for another edition. It's a great idea and perhaps we can get some good contributions for it, but it requires finding a way to "switch off" the checks which examine data (which would obviously fail).

I'm currently working on merging concepts from this PR and Ben's latest, #800, because his implements a useful abstraction for handling any remote netCDF resource.

This would be a more permanent solution to #804 in that it would hopefully reduce the file sizes requested from ERDDAP .ncCF output formats.

daltonkell commented 4 years ago

@mwengren

I've spent time examining this issue and how it relates to #804. I am close to finding a solution which would enable the checking of an ERDDAP dataset through only its metadata by parsing the .ncCFHeader file, which is similar to an ncdump -h output. For example, the WQB-04.ncCFHeader:

    netcdf WQB-04_3a1c_1e61_386e.nc {
      dimensions:
        timeseries = 1;
        obs = 293;
      variables:
        float latitude(timeseries=1);
          :_CoordinateAxisType = "Lat";
          :actual_range = 19.7341f, 19.7341f; // float
          :axis = "Y";
          :comment = "instrument is in fixed location";
          :ioos_category = "Location";
          :long_name = "Latitude";
          :short_name = "lat";
          :standard_name = "latitude";
          :units = "degrees_north";
          :valid_range = 19.7341f, 19.7341f; // float

Concept

Create a netCDF4.Dataset object in write mode using only the information given in the .ncCFHeader file. No IOOS checks examine the array data (correct me if I'm wrong, @benjwadams), and thus a metadata-only Dataset would provide the necessary interface for the Compliance Checker to do its work.
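As a rough illustration of the first step, the dimension and variable declarations can be pulled out of the header text with the standard library alone. This is only a sketch, not the Compliance Checker's implementation: `parse_header_structure` and both regular expressions are hypothetical names, the header excerpt is trimmed from the WQB-04 example above, and the patterns assume every variable is declared with a parenthesized dimension list.

```python
import re

# Trimmed excerpt of the WQB-04 .ncCFHeader shown earlier in this thread.
HEADER = """\
netcdf WQB-04_3a1c_1e61_386e.nc {
  dimensions:
    timeseries = 1;
    obs = 293;
  variables:
    float latitude(timeseries=1);
      :axis = "Y";
      :units = "degrees_north";
"""

# Hypothetical patterns: "name = size;" and "type name(dim=size, ...);".
DIM_RE = re.compile(r"^\s*(\w+)\s*=\s*(\d+);\s*$")
VAR_RE = re.compile(r"^\s*(\w+)\s+(\w+)\((.*)\);\s*$")


def parse_header_structure(text):
    """Return (dimensions, variables) parsed from header text."""
    dims, variables = {}, {}
    section = None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped in ("dimensions:", "variables:"):
            section = stripped[:-1]
            continue
        if section == "dimensions":
            m = DIM_RE.match(line)
            if m:
                dims[m.group(1)] = int(m.group(2))
        elif section == "variables":
            # Attribute lines start with ":" and therefore do not match.
            m = VAR_RE.match(line)
            if m:
                dtype, name, dim_spec = m.groups()
                dim_names = tuple(
                    d.split("=")[0].strip()
                    for d in dim_spec.split(",")
                    if d.strip()
                )
                variables[name] = {"dtype": dtype, "dimensions": dim_names}
    return dims, variables


dims, variables = parse_header_structure(HEADER)
print(dims)
print(variables)
```

The resulting names and sizes are what would then be passed to `createDimension` and `createVariable` on a `netCDF4.Dataset` opened in write mode; that step is left out here so the sketch has no third-party dependency.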

Challenges

  1. Interpreting the encoding of numeric metadata

Many variables require numeric attributes, such as valid_min, valid_range, _FillValue, etc. In the .ncCFHeader file, these attributes are encoded as characters:

:valid_range = 19.7341f, 19.7341f; // float

Humans reading this understand it to be an array of floating point values, but a machine only reads it as characters encoded into a text file.

Assuming that these numeric attributes are numeric arrays could be a large misstep. If the attribute is actually encoded in the file as a character array, it should be marked as an incorrect datatype. Thus, the question:

"Is it safe to assume comma-separated strings of numeric-only (or suffixed by f for float, d for double, etc) characters are encoded into an ERDDAP .ncCF file as numeric arrays?"

Because ERDDAP is so strict with certain aspects of its typing, this may be the case.
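If that assumption does hold, the interpretation could look something like the sketch below. Everything here is hypothetical: `parse_attribute`, the regular expressions, and the suffix table (limited to the `f` and `d` suffixes mentioned above) are illustrative only, and the trailing `// float` comment is trusted as the type when present. Values that do not look numeric are returned untouched and flagged as `unknown` rather than guessed at.

```python
import re

# Hypothetical patterns for ":name = value; // type" lines.
ATTR_RE = re.compile(r"^\s*:(\w+)\s*=\s*(.*?);\s*(?://\s*(\w+))?\s*$")
NUM_RE = re.compile(r"^([+-]?(?:\d+\.?\d*|\.\d+)(?:[eE][+-]?\d+)?)([a-zA-Z]?)$")

# Only the suffixes named in this thread; other suffixes are not assumed.
SUFFIX_TYPES = {"f": "float", "d": "double"}


def parse_attribute(line):
    """Return (name, type_name, value) for one header attribute line."""
    m = ATTR_RE.match(line)
    if m is None:
        raise ValueError(f"not an attribute line: {line!r}")
    name, raw, commented_type = m.groups()
    raw = raw.strip()

    # Quoted values are character data.
    if len(raw) >= 2 and raw.startswith('"') and raw.endswith('"'):
        return name, "string", raw[1:-1]

    values, suffixes = [], set()
    for token in raw.split(","):
        nm = NUM_RE.match(token.strip())
        if nm is None:
            # Not clearly numeric: do not guess, hand back the raw text.
            return name, "unknown", raw
        number, suffix = nm.groups()
        suffix = suffix.lower()
        is_real = suffix in SUFFIX_TYPES or any(c in number for c in ".eE")
        values.append(float(number) if is_real else int(number))
        suffixes.add(suffix)

    if commented_type:
        type_name = commented_type
    elif len(suffixes) == 1:
        type_name = SUFFIX_TYPES.get(next(iter(suffixes)), "unknown")
    else:
        type_name = "unknown"
    return name, type_name, values


valid_range = parse_attribute(":valid_range = 19.7341f, 19.7341f; // float")
units = parse_attribute(':units = "degrees_north";')
print(valid_range)
print(units)
```

Returning `unknown` instead of coercing keeps the open question visible: a checker built on this could report such attributes as unverifiable rather than silently passing or failing them.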

Tagging @ocefpaf in here just in case he wants to drop some more knowledge on us ;)

mwengren commented 4 years ago

@daltonkell Another consideration is that requesting the generic .ncCFHeader output type in ERDDAP, without any filtering, can cause serious load and/or wait time on the ERDDAP server, depending on the dataset size. The file download size won't be excessive since it's just metadata (like it might be for the equivalent .ncCF format), but the server may still need to read a lot of source files to generate the response.

I talked about looking for workarounds like using a generic time query filter value in: https://github.com/ioos/compliance-checker/issues/804#issue-614304533.

For the WQB-04 dataset, the obs dimension response varies depending on what filter criteria are passed for time in these two requests:

https://pae-paha.pacioos.hawaii.edu/erddap/tabledap/WQB-04.ncCFHeader?&time%3E=2020-07-06T16:00:00Z
https://pae-paha.pacioos.hawaii.edu/erddap/tabledap/WQB-04.ncCFHeader?&time%3E=2020-06-06T16:00:00Z

For the IOOS 1.2 Platform check, we need to get this dimension information, but I don't think we need to read any of the array/table data itself, at least, so maybe there's a solution here. Just wanted to note these issues though.
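For reference, a time-filtered request like the two above could be assembled as in the sketch below. The helper names `recent_since` and `nccf_header_url` are hypothetical, and the choice of a 7-day window is only an assumption for illustration; the URL layout simply mirrors the WQB-04 links in this comment.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import quote


def recent_since(days, now=None):
    """Return an ISO 8601 UTC timestamp `days` before `now` (default: current time)."""
    now = now or datetime.now(timezone.utc)
    return (now - timedelta(days=days)).strftime("%Y-%m-%dT%H:%M:%SZ")


def nccf_header_url(server, dataset_id, since):
    """Build a tabledap .ncCFHeader URL limited to records at or after `since`."""
    # Percent-encode ">" but leave "=" and ":" readable, as in the URLs above.
    constraint = quote(f"time>={since}", safe="=:")
    return f"{server.rstrip('/')}/tabledap/{dataset_id}.ncCFHeader?&{constraint}"


url = nccf_header_url(
    "https://pae-paha.pacioos.hawaii.edu/erddap",
    "WQB-04",
    "2020-07-06T16:00:00Z",
)
print(url)

# A "generic" filter: only the last 7 days relative to a fixed reference time.
reference = datetime(2020, 7, 13, 16, 0, 0, tzinfo=timezone.utc)
print(nccf_header_url("https://pae-paha.pacioos.hawaii.edu/erddap", "WQB-04",
                      recent_since(7, now=reference)))
```

Note that a narrow window changes the reported size of the obs dimension, as described above, so any check relying on that size would need to tolerate it.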

Not sure on your data type questions, sorry!

ocefpaf commented 4 years ago

The file download size won't be excessive since it's just metadata (like it might be for the equivalent .ncCF format), but the server may still need to read a lot of source files to generate the response.

You are right. I never thought about that, but generating that info for a file slice request can be demanding on the server side!

Maybe we could use the dataset_id info response? It probably has a smaller server-side footprint (Bob can probably say more about that). There are no ncHeader-like responses for it, though. There is an nc response, but that is just the same table as the CSV or JSON responses, which are lighter downloads.

I've been playing with constructing a nc-like object from that response: https://nbviewer.jupyter.org/gist/ocefpaf/ae0d650af68c0670e5f09d35c887129c

It is probably a long way from what compliance-checker needs though. And again, no data tests would be run; only metadata tests would work.
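A stripped-down version of the idea, separate from the notebook linked above, might look like the sketch below. It assumes the CSV form of the info response with the columns Row Type, Variable Name, Attribute Name, Data Type, and Value; the sample rows are illustrative only (they reuse latitude attributes from the WQB-04 header earlier in this thread, not an actual server response), and `attributes_from_info_csv` is a hypothetical helper name.

```python
import csv
import io
from collections import defaultdict

# Illustrative rows only, shaped like an ERDDAP info CSV response.
SAMPLE_INFO_CSV = """\
Row Type,Variable Name,Attribute Name,Data Type,Value
variable,latitude,,float,
attribute,latitude,axis,String,Y
attribute,latitude,units,String,degrees_north
attribute,latitude,valid_range,float,"19.7341, 19.7341"
"""


def attributes_from_info_csv(text):
    """Group attribute rows by variable as {variable: {attribute: (type, value)}}."""
    attrs = defaultdict(dict)
    for row in csv.DictReader(io.StringIO(text)):
        if row["Row Type"] != "attribute":
            continue  # skip the "variable" rows themselves
        attrs[row["Variable Name"]][row["Attribute Name"]] = (
            row["Data Type"],
            row["Value"],
        )
    return dict(attrs)


info_attrs = attributes_from_info_csv(SAMPLE_INFO_CSV)
print(info_attrs["latitude"])
```

The values are kept as the strings the table provides, alongside the declared data type, so nothing is coerced here. The structure carries no dimension sizes, which is the limitation for any dimension-based test.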

mwengren commented 4 years ago

@daltonkell @benjwadams Does the attribute dictionary response in @ocefpaf's notebook look useful for the IOOS 1.2 checker for ERDDAP datasets? We may still need dimension info in order to test the platform concept check, potentially.

Also, @benjwadams would this meet our needs in ioos/catalog-ckan#208 if it were built into erddapy directly?

daltonkell commented 4 years ago

@mwengren @ocefpaf's use of https://geoport.usgs.esipfed.org/erddap/info/1051-A/index.html is very resourceful. Without the dimension information, though, we are unable to test whether the dataset is CF-DSG-compliant, which, if I recall correctly, was a pretty critical step in the IOOS-1.2 spec. It also doesn't really resolve the problem of assuming attribute encoding that I mentioned earlier.

ocefpaf commented 4 years ago

I wonder if we should work upstream with ERDDAP developers to augment the info response with this metadata instead of working around it. What do you think?

daltonkell commented 4 years ago

@ocefpaf I'm 100% for collaboration. Perhaps we could reach out via the Google Group?

mwengren commented 4 years ago

The ERDDAP Google Group would be a good place to begin. I haven't seen much issue traffic in the ERDDAP GitHub repo, probably partly because Bob is the primary/solo developer. He does typically respond on either, however.

srstsavage commented 3 years ago

Hi all, checking in on this issue as we're working on a data ingestion pipeline that will involve frequently running compliance-checker IOOS 1.2 checks against ERDDAP endpoints. Any thoughts on reducing the burden on remote ERDDAPs? Maybe just a non-ideal cchecker flag to skip any checks requiring a lot of data loading until the situation can be improved on the ERDDAP side?

mwengren commented 3 years ago

@daltonkell Can you look into options for resolving this dimension checking issue while you're also investigating the CF FeatureType issues in this PR? https://github.com/ioos/compliance-checker/pull/858

They may not be related exactly, but this one has been lingering for a while, and a flag option as in @shane-axiom's suggestion would help make automated scans of ERDDAP servers for Metadata Profile 1.2 compliance a lot more performant.

If there isn't an ERDDAP-based fix for this in the works, and we don't have a good workaround in CC (like time dimension filtering) to reduce the ERDDAP server CPU time needed to determine dataset DSG dimensionality, providing the option to skip those tests might be the best way to go.