ioos / ioos_qc

:ballot_box_with_check: :ocean: IOOS QARTOD and other Quality Control tests implemented in Python
https://ioos.github.io/ioos_qc/
Apache License 2.0
46 stars 27 forks source link

WIP: New configs #33

Closed jessicaaustin closed 3 years ago

jessicaaustin commented 4 years ago

Major refactor of how test configurations are specified. Allows for storage of generic configuration objects that specify test configs based on variable, region, and time window. This is useful for applying test configs to moving platforms, such as gliders, animal tracks, etc.

The previous config object, QcConfig, still exists and functions the same, but is now known as StreamConfig.

TODO what are the major changes around storage of results?

kwilcox commented 4 years ago

@jessicaaustinaxiom The PR is in decent shape for a usability review. Please take a look!

jessicaaustin commented 4 years ago

@kwilcox awesome! I will take a look next week and please yell at me if I don't.

kwilcox commented 3 years ago

:scream_cat:

jessicaaustin commented 3 years ago

@kwilcox can you please rebase from master? Looks like this is 10 commits behind and I got conflicts when I tried.

jessicaaustin commented 3 years ago

@kwilcox some general thoughts on the configs:

For example, say you have the following config, based on variable names:

    config = """
        streams:
            atmp:
                qartod:
                    gross_range_test:
                        suspect_span: [-30, 30]
                        fail_span: [-80, 140]
            sal:
                qartod:
                    gross_range_test:
                        suspect_span: [0, 30]
                        fail_span: [0, 40]
    """

Here's a version that's more generic:

    config = """
        streams:
            some_stream_id_that_doesnt_really_matter_1:
                attrs:
                    standard_name: air_temperature
                    units: degree_Celsius
                qartod:
                    gross_range_test:
                        suspect_span: [-30, 30]
                        fail_span: [-80, 140]
            some_stream_id_that_doesnt_really_matter_2:
                attrs:
                    standard_name: sea_water_practical_salinity
                    units: 1
                qartod:
                    gross_range_test:
                        suspect_span: [0, 30]
                        fail_span: [0, 40]
    """

This could be implemented by having the StreamConfig constructor look for a specific attrs key and store it separately from the test config. Then specific stream implementations, like NetcdfStream, could use the attrs to match a var, rather than the var name. A fallback could be if the attrs are not specified, then it uses the stream_id as the variable name just like you have already.

Thoughts?

kwilcox commented 3 years ago

oooo great points, this is exactly what we needed to discuss. Few counter points for discussion:

Here's a version that's more generic:

I see this as less generic because it is now specific to a stream format that understands what a standard_name is.

but it would be nice to specify other attributes that should "categorize" a config -- attributes that define what the variable is. for example, standard_name + units. or instrument make_model

I see that as a job for someone using/implementing the ioos_qc library. The QC functions don't know what the data represents. The keys for each stream in a ContextConfig should be used to locate/read the data from any Stream based object. That ends up being variable names in netCDF files, column names in dataframes, and potentially a topic name in a kafka stream.

I do like the idea of an optional attrs field that any Stream based object could take advantage of (and would be required to document any usage of the field) but it wouldn't be required and would fall back to using they lookup key. It also becomes easier if we just make streams a list, I never made the jump because time.

streams:
  - key: some_stream_id
    attrs:
      name: My Fancy Description
      summary: Some summary of the checks/variable/whatever 
      standard_name: air_temperature
      units: degree_Celsius
    checks:
      qartod:
          gross_range_test:
          suspect_span: [-30, 30]
          fail_span: [-80, 140]
  - key: some_stream_id
    attrs:
      standard_name: sea_water_practical_salinity
      units: 1
    checks:
      qartod:
        gross_range_test:
          suspect_span: [0, 30]
          fail_span: [0, 40]
jessicaaustin commented 3 years ago

I see that as a job for someone using/implementing the ioos_qc library. The QC functions don't know what the data represents. The keys for each stream in a ContextConfig should be used to locate/read the data from any Stream based object

Right, this is the main point that we need to clarify -- how much is this the job of ioos_qc vs the user.

In master right now, QcConfig handles defining a set of tests, but it's outside the scope to figure out which QcConfig to apply to a given stream of data. I saw this PR as an opportunity to expand what ioos_qc is capable of, so that you could pull a config from one place, the data from another place, and they would just work together. I think this is really powerful, and builds upon the community standards that promote self-describing datasets. For example, a PI or a manufacturer like Saildrone or an IOOS RA could publish a "default" config for their data that people could start with and build upon. I mean, right now ContextConfig allows you to specify configs based on spatial and temporal constraints, so why is it such a stretch to have it based on variable attributes as well?

Now this this is expanding the scope of this PR, which is primarily about making ioos_qc capable of running on moving platforms. So maybe we should table the conversation. BUT I would like to leave that option open for the future... so then the question is, are we painting ourselves into a corner? Specifically, do we make streams a list now? Or are we ok with adding a special attrs key someday, or implementing some other way of expanding upon which configs are applied where?

Some other points:

I do like the idea of an optional attrs field that any Stream based object could take advantage of (and would be required to document any usage of the field) but it wouldn't be required and would fall back to using they lookup key

I agree with this -- attrs wouldn't have to be required, and it would fall back to just using the variable name as the lookup key (which is the current behavior in this PR).

I see this as less generic because it is now specific to a stream format that understands what a standard_name is.

The stream format doesn't need to know what a standard_name is, it just needs to have a concept of attrs and match those up to the attrs in the config. So it could be easily implemented for NetcdfStream and for XarrayStream, but not for PandasStream or NumpyStream. (though I suppose we could modify those constructors to pass in a dict of attrs if we really wanted)

kwilcox commented 3 years ago

Let's add support for the attrs key now and leave everything else the same. I think there is lots of value in adding some additional context and seeing how useful when implementing the streams. It does solve the use case of having "generic" configs that can be applied to certain types of streams but doesn't prevent a QC config from being "just metadata" and not generic.

jessicaaustin commented 3 years ago

@kwilcox where do we stand with this?

I think this are the outstanding tasks:

kwilcox commented 3 years ago

Picked back up in https://github.com/ioos/ioos_qc/pull/39