Closed jessicaaustin closed 3 years ago
@jessicaaustinaxiom The PR is in decent shape for a usability review. Please take a look!
@kwilcox awesome! I will take a look next week and please yell at me if I don't.
:scream_cat:
@kwilcox can you please rebase from master? Looks like this is 10 commits behind and I got conflicts when I tried.
@kwilcox some general thoughts on the configs:
StreamConfig
(formely QcConfig
) -- this is what i'm using currently for sensor qc
ContextConfig
is meant to help with...ContextConfig
-- this defines StreamConfig's per variable
stream_id
is the variable namestandard_name
+ units
. or instrument make_model
For example, say you have the following config, based on variable names:
config = """
streams:
atmp:
qartod:
gross_range_test:
suspect_span: [-30, 30]
fail_span: [-80, 140]
sal:
qartod:
gross_range_test:
suspect_span: [0, 30]
fail_span: [0, 40]
"""
Here's a version that's more generic:
config = """
streams:
some_stream_id_that_doesnt_really_matter_1:
attrs:
standard_name: air_temperature
units: degree_Celsius
qartod:
gross_range_test:
suspect_span: [-30, 30]
fail_span: [-80, 140]
some_stream_id_that_doesnt_really_matter_2:
attrs:
standard_name: sea_water_practical_salinity
units: 1
qartod:
gross_range_test:
suspect_span: [0, 30]
fail_span: [0, 40]
"""
This could be implemented by having the StreamConfig
constructor look for a specific attrs
key and store it separately from the test config. Then specific stream implementations, like NetcdfStream
, could use the attrs
to match a var, rather than the var name. A fallback could be if the attrs
are not specified, then it uses the stream_id
as the variable name just like you have already.
Thoughts?
oooo great points, this is exactly what we needed to discuss. Few counter points for discussion:
Here's a version that's more generic:
I see this as less generic because it is now specific to a stream format that understands what a standard_name is.
but it would be nice to specify other attributes that should "categorize" a config -- attributes that define what the variable is. for example, standard_name + units. or instrument make_model
I see that as a job for someone using/implementing the ioos_qc
library. The QC functions don't know what the data represents. The keys for each stream in a ContextConfig
should be used to locate/read the data from any Stream
based object. That ends up being variable names in netCDF files, column names in dataframes, and potentially a topic name in a kafka stream.
I do like the idea of an optional attrs
field that any Stream
based object could take advantage of (and would be required to document any usage of the field) but it wouldn't be required and would fall back to using they lookup key. It also becomes easier if we just make streams
a list, I never made the jump because time.
streams:
- key: some_stream_id
attrs:
name: My Fancy Description
summary: Some summary of the checks/variable/whatever
standard_name: air_temperature
units: degree_Celsius
checks:
qartod:
gross_range_test:
suspect_span: [-30, 30]
fail_span: [-80, 140]
- key: some_stream_id
attrs:
standard_name: sea_water_practical_salinity
units: 1
checks:
qartod:
gross_range_test:
suspect_span: [0, 30]
fail_span: [0, 40]
I see that as a job for someone using/implementing the ioos_qc library. The QC functions don't know what the data represents. The keys for each stream in a ContextConfig should be used to locate/read the data from any Stream based object
Right, this is the main point that we need to clarify -- how much is this the job of ioos_qc vs the user.
In master
right now, QcConfig
handles defining a set of tests, but it's outside the scope to figure out which QcConfig to apply to a given stream of data. I saw this PR as an opportunity to expand what ioos_qc is capable of, so that you could pull a config from one place, the data from another place, and they would just work together. I think this is really powerful, and builds upon the community standards that promote self-describing datasets. For example, a PI or a manufacturer like Saildrone or an IOOS RA could publish a "default" config for their data that people could start with and build upon. I mean, right now ContextConfig
allows you to specify configs based on spatial and temporal constraints, so why is it such a stretch to have it based on variable attributes as well?
Now this this is expanding the scope of this PR, which is primarily about making ioos_qc capable of running on moving platforms. So maybe we should table the conversation. BUT I would like to leave that option open for the future... so then the question is, are we painting ourselves into a corner? Specifically, do we make streams
a list now? Or are we ok with adding a special attrs
key someday, or implementing some other way of expanding upon which configs are applied where?
Some other points:
I do like the idea of an optional attrs field that any Stream based object could take advantage of (and would be required to document any usage of the field) but it wouldn't be required and would fall back to using they lookup key
I agree with this -- attrs wouldn't have to be required, and it would fall back to just using the variable name as the lookup key (which is the current behavior in this PR).
I see this as less generic because it is now specific to a stream format that understands what a standard_name is.
The stream format doesn't need to know what a standard_name
is, it just needs to have a concept of attrs
and match those up to the attrs
in the config. So it could be easily implemented for NetcdfStream
and for XarrayStream
, but not for PandasStream
or NumpyStream
. (though I suppose we could modify those constructors to pass in a dict of attrs
if we really wanted)
Let's add support for the attrs
key now and leave everything else the same. I think there is lots of value in adding some additional context and seeing how useful when implementing the streams. It does solve the use case of having "generic" configs that can be applied to certain types of streams but doesn't prevent a QC config from being "just metadata" and not generic.
@kwilcox where do we stand with this?
I think this are the outstanding tasks:
QcConfig
etc in existing notebooksStores
yetattrs
-- this could be a subsequent PRPicked back up in https://github.com/ioos/ioos_qc/pull/39
Major refactor of how test configurations are specified. Allows for storage of generic configuration objects that specify test configs based on variable, region, and time window. This is useful for applying test configs to moving platforms, such as gliders, animal tracks, etc.
The previous config object,
QcConfig
, still exists and functions the same, but is now known asStreamConfig
.TODO what are the major changes around storage of results?