ioos / qartod

IOOS RA QARTOD Collaboration #1

robragsdale opened this issue 10 years ago

robragsdale commented 10 years ago

This started with GLOS' effort to document the DMAC aspects of certification and with the acknowledgment that integration of QARTOD processes and standards across the IOOS RA enterprise could be done more efficiently, reducing the overall resource burden on the IOOS RAs. This grew into an opportunity to collaborate on implementation.

Tad Slawecki (GLOS) posed two questions to gather more information before a Webinar on 17 September 2014 at 3 p.m. ET to discuss the issue.

  1. How would QARTOD best fit into your RA's workflow?

GLOS: TBD. Our buoy data is currently ingested into a separate "OBS" database, and then passed on to 52N. We are in the process of instituting NODC submittals built from the OBS database contents, so nicest for us would be to apply QARTOD before OBS; if we transition to 52N-based NODC submittals, we might drop the OBS database and apply QARTOD before 52N.

NOW: PI -> OBS -> 52N
ALT1: PI -> QARTOD -> OBS -> 52N
ALT2: PI -> QARTOD -> 52N

  2. What type of collaboration would you like to see?

GLOS: Very open to different ideas; the main thing is not to duplicate efforts across RAs. At a minimum, a shared resource that shows which RA is working on/has implemented a particular test? If we identify a common architecture for QARTOD application, perhaps implementation of different tests for different parameters can be undertaken by different RAs?

robragsdale commented 10 years ago

NERACOOS would definitely be interested in collaborating on this.

We ingest data primarily from non-federal buoys. We have long time series of hourly observations going back to 2001 and face the issue of implementing any new QA/QC processes on this historical data as well as on the continuing near-real-time data.

Another issue we face is that we have diverse data providers submitting data. It's the data providers, with varying technical abilities, who will need to implement QARTOD processes.

Eric Bridger

robragsdale commented 10 years ago

SECOORA

Glad to look at openly shared QARTOD-flagging-related scripts or projects on GitHub, etc. One reference point I'd guess would be using the existing/developing IOOS SOS web services for processing/flagging various observed types in combination with localized climatology-based lookups.

Jeremy Cothran

To add on to what Jeremy said, we're in the same boat as NERACOOS with regard to pulling in third-party, non-federal data. Some providers do QA/QC, others don't. Their QA/QC may be specific to their research needs and may not fall in line with QARTOD.

What I would like to see is a central wiki page on IOOS that details the QA/QC for the various parameters. Currently there are PDFs for specific observations; however, I'd like to see a one-stop shop focused on implementation.

I'm happy to chip in to provide sample code for doing QA/QC. Years ago I created the QARTOD Google group at one of the meetings and added a simple Python script that implemented a very basic range-checking QA/QC with the flags of the day. https://code.google.com/p/qartod/source/browse/trunk/qaqc/python/rangetests/rangeTests.py
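For reference, here is a minimal sketch of that kind of gross range test; the thresholds are made up, and the 1/3/4/9 flag values follow the usual QARTOD convention (pass/suspect/fail/missing). This is not the linked script itself.

```python
# Minimal sketch of a QARTOD-style gross range test (hypothetical thresholds).
# Flags follow the QARTOD convention: 1 = pass, 3 = suspect, 4 = fail, 9 = missing.

def gross_range_test(values, fail_min, fail_max, suspect_min, suspect_max,
                     missing_value=None):
    """Return one flag per value based on operator/suspect ranges."""
    flags = []
    for v in values:
        if v is None or v == missing_value:
            flags.append(9)
        elif v < fail_min or v > fail_max:
            flags.append(4)
        elif v < suspect_min or v > suspect_max:
            flags.append(3)
        else:
            flags.append(1)
    return flags

# Example: water temperature (deg C) with made-up limits.
temps = [12.4, 13.1, 45.0, None, 14.2]
print(gross_range_test(temps, fail_min=-2, fail_max=35,
                       suspect_min=0, suspect_max=30))
# -> [1, 1, 4, 9, 1]
```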

Dan Ramage

On 1., our workflow generally is:

PI -> OBS -> QARTOD -> netCDF (ncSOS/NODC compatible)

My 2 cents: for a simple/modular QARTOD approach, just, say, GitHub-supported (insert preferred language here) implementations of the existing pseudo-code references which support a handful of the usual-suspect QARTOD filter time-series sources (SOS/XML feed, JSON, CSV, etc.) and targets (XML, JSON, CSV, etc.).

Language choice: SOS/XML or JSON or CSV or ... -> QARTOD script/filter -> SOS/XML or JSON or CSV

On a side note, with regard to SOS filters, it would be nice to have, say, daily/monthly avg/min/max summary filters for developing products that cache/use summary-level overview data. So maybe an initial server pass against platform SOS sources saves summary-level info as JSON files to repeatedly pass back to a client browser application.
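To illustrate that summary-filter side note, here is a rough sketch of rolling a CSV time series up to daily min/max/avg and caching it as JSON; the column names, timestamp format, and file paths are all assumptions.

```python
# Rough sketch: compute daily min/max/avg from a CSV time series and cache as JSON.
# Column names ("time", "value") and the ISO timestamp format are assumptions.
import csv
import json
from collections import defaultdict

def daily_summary(csv_path, json_path):
    buckets = defaultdict(list)
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            day = row["time"][:10]          # "YYYY-MM-DD" prefix of an ISO timestamp
            buckets[day].append(float(row["value"]))
    summary = {
        day: {"min": min(v), "max": max(v), "avg": sum(v) / len(v)}
        for day, v in sorted(buckets.items())
    }
    with open(json_path, "w") as f:
        json.dump(summary, f, indent=2)

# daily_summary("platform_obs.csv", "platform_obs_daily.json")
```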

Jeremy Cothran

robragsdale commented 10 years ago

PacIOOS

Thanks for initiating this. I have some more practical (trivial?) concerns. These center around how we identify Q/C'd data in the file. I suppose one way is just to add something in the metadata that says "these data have been processed following the QARTOD guidelines". More realistically, it's probably best to include flags (variables) in the data files. Ideally we'd all use the same vocabulary and process for this. Is it worth trying to "standardize" this?

Related: if a particular datum is flagged for some reason, do we adjust/delete/preserve it? Depending on the answer to this question, we then potentially have yet another variable to consider. As an example, we do delayed-mode Q/C on data coming off moored buoys. In the data file we include three variables for each measured quantity at each time step: the raw measurement (e.g., temp_raw), the adjusted/final value (e.g., temp), and the Q/C flag (e.g., temp_qd). This essentially triples the size of our files. The addition of similar real-time Q/C procedures will likewise increase file size. An added complexity here is that the CF standard names will be the same for both "temp_raw" and "temp", so only a query on long_name will reveal the difference.
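For what it's worth, a rough netCDF4-python sketch of that kind of raw/adjusted/flag triplet; the variable names, attributes, and flag values are illustrative assumptions, not a settled convention.

```python
# Rough sketch of a raw/adjusted/flag triplet in netCDF4-python.
# Variable names, attributes, and flag values here are illustrative only.
import numpy as np
from netCDF4 import Dataset

nc = Dataset("buoy_timeseries.nc", "w")
nc.createDimension("time", None)

time = nc.createVariable("time", "f8", ("time",))
time.units = "seconds since 1970-01-01T00:00:00Z"

temp_raw = nc.createVariable("temp_raw", "f4", ("time",))
temp_raw.standard_name = "sea_water_temperature"
temp_raw.long_name = "sea water temperature, raw measurement"
temp_raw.units = "degree_Celsius"

temp = nc.createVariable("temp", "f4", ("time",))
temp.standard_name = "sea_water_temperature"   # same standard_name as temp_raw
temp.long_name = "sea water temperature, adjusted/final value"
temp.units = "degree_Celsius"
temp.ancillary_variables = "temp_qc"           # CF link to the flag variable

temp_qc = nc.createVariable("temp_qc", "i1", ("time",))
temp_qc.long_name = "sea water temperature quality flag"
temp_qc.flag_values = np.array([1, 2, 3, 4, 9], dtype="i1")  # QARTOD-style flags (assumed)
temp_qc.flag_meanings = "pass not_evaluated suspect fail missing"

nc.close()
```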

So, issues that we are considering here:

  1. What is the best-practice for variable names and dimensions? Presumably we'd have a flag/variable indicating whether real-time Q/C has been done (not sure if this needs to be per variable, per time-step), and a flag/variable for each test (e.g., timing_flag=1, syntax_flag=2, location_flag=2, etc.) or a single variable with bits assigned for each test (e.g., qartod_flag = 122). The variables in this second option are likely to be per variable, per time-step.
  2. Do we keep flagged datum as is or adjust it? For example, a temperature reading of 1000 could be replaced with a missing value or kept at 1000 with a flag of "4" indicating "fail" (as a note, we like to use the valid_range attribute, since some of our display tools do auto-scale for figures).

Jim

I think there are answers to most of your questions. Maybe not all. It would take a bit of reading and comparing references, but I'd guess that most of our needs will be met by the existing QARTOD flagging scheme (and references therein) and the CF conventions. I think that for workflows based on CF netCDF files from fairly early on, recording the lineage is fairly straightforward. For systems based on database management systems and ad hoc or bespoke formats, things might be trickier.

I hear the concern about files increasing in size, but I'm not sure how much of a problem this really is. I hope it's manageable, because I feel we should be more concerned with accurately tracking the lineage of data processing than with minimizing file sizes through clever encoding. That said, I'd opt for multiple QC variables, associated with geophysical variables through correct usage of CF ancillary or auxiliary variables: one QC variable per test, and then one summary variable that indicates the PI's or data center's subjective summary of the N tests applied. This is the scheme suggested in the reference above and in the IODE/IOC document on which it was based.

As for your specific questions, I think CF has the capability of recording all of this, and it's just up to us to decide as part of the software development effort.

One thing you might want to consider is that it may make more sense for a single group to develop the software, with the requirements coming from this group. What sort of package would be most useful?

Jeremy, can you clarify your workflow? What is PI -> OBS -> QARTOD -> netCDF (ncSOS/NODC compatible)? Is OBS the Xenia RDBMS that you've used for most of your ingestion?

Jim, I assume that your version of this workflow would include encoding the data into CF netCDF a little earlier in the workflow. True?

I always think of this problem as being addressed through a Unix pipe pattern similar to NCO or GMT, where CF DSG netCDF files are the input and output of each step in the chain, e.g.:

raw2nc in.txt -o raw.nc
spike_check.py raw.nc -o spike.nc
range_check.py spike.nc -o spike_range.nc
...
summary.py last_step.nc -o final_data.nc
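A skeleton of what one step in that pipe-style chain could look like in Python; the variable name, thresholds, and output flag-variable name are placeholders, and attribute/scalar-variable copying is omitted for brevity.

```python
# Skeleton of a single chainable QC step: read a netCDF file, add a flag
# variable, write a new file. Variable names and thresholds are placeholders;
# copying of global/variable attributes is omitted for brevity.
import argparse
import numpy as np
from netCDF4 import Dataset

def run_range_check(in_path, out_path, var="temp", vmin=-2.0, vmax=35.0):
    src = Dataset(in_path)
    dst = Dataset(out_path, "w")
    # Copy dimensions and variable data verbatim.
    for name, dim in src.dimensions.items():
        dst.createDimension(name, None if dim.isunlimited() else len(dim))
    for name, v in src.variables.items():
        out = dst.createVariable(name, v.dtype, v.dimensions)
        out[:] = v[:]
    # Add a QARTOD-style flag variable for the chosen geophysical variable.
    data = src.variables[var][:]
    flags = np.where((data < vmin) | (data > vmax), 4, 1).astype("i1")
    qc = dst.createVariable(f"{var}_range_qc", "i1", src.variables[var].dimensions)
    qc[:] = flags
    src.close()
    dst.close()

if __name__ == "__main__":
    p = argparse.ArgumentParser(description="Gross range check (sketch)")
    p.add_argument("infile")
    p.add_argument("-o", "--outfile", required=True)
    args = p.parse_args()
    run_range_check(args.infile, args.outfile)
```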

Thoughts? -Derrick

Hi Derrick,

Thanks for the QARTOD flagging scheme document. It does seem to address many of my questions. As for the work flow, we typically get raw ASCII or binary off the instrument. This then gets converted to netCDF via some method. I was thinking of adding the Q/C checks to the convert-to-netCDF routine, but I think your suggestion below of having a separate set of utilities to operate on, and create new, netCDF makes a lot of sense.

Jim

robragsdale commented 10 years ago

NERACOOS

Very interesting discussion so far. The link to the QARTOD Flag document was very useful. Somehow I hadn't looked at it very closely before. I think it covers all the possible QA/QC needs.

Re: NERACOOS. We are very much in the middle of implementing a new data framework centered on CF 1.6 DSG NetCDF files for time series observations. Historically we've ingested CF 1.0 NetCDF, various OOSTethys SOS's, various CSV, FTP, etc.

NERACOOS, while partially funding numerous sub-regional observing systems, has always acted as a Data Aggregation Center and does not do any QA/QC processing itself. I believe that this should continue to be performed by data providers. NERACOOS does take QC flags into account when sending data to our users.

I also think that, as the document explains, all data should be retained; i.e., suspect, bad, and NaN values should not be deleted.

Various scripts in various languages implementing QARTOD tests would be a big help.

The basic data flow with our major data provider, Univ. of Maine, is:

PI -> QARTOD -> NetCDF 1.6 -> PostGIS -> TDS -> ERDDAP -> Web products

We anticipate all our major data providers adopting this model some time this year, e.g. Univ. of Conn.:

PI -> QARTOD -> MySQL -> NetCDF 1.6 -> TDS -> NERACOOS DAC

Some issues we face:

We will always ingest data from various providers via text, CSV, Excel, etc. Some of this has had no QA/QC done; tagging it as having had no QA/QC done will work.

Data providers have historical files with existing qc flags and values, described in their NetCDF headers, which include valid ranges, etc. If and how to reprocess these will be a major hurdle.

Also, all our data providers do post-deployment recalibration, updating, etc. By moving to a more NetCDF-centric approach, one hope is to overcome the need to update an OBS database and instead use the NetCDF historical files as the "gold standard".

QARTOD has only addressed certain data types so far, so what to do with the other data types remains an open question.

Here are the Postgres quality values we are currently using:

 id | handle   | description
----+----------+---------------------------------------------------
  0 | NONE     | Quality unknown or not yet determined
  1 | BAD      | Data is bad
  2 | SUSPECT  | Quality might be bad: needs review before release
  3 | GOOD     | Data is acceptable for distribution
  4 | BEST     | Data has passed rigorous checks
 -9 | MISSING  | The data has been reported as missing
 -8 | REPLACED | Data of a higher quality control level
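If it helps, here is a hedged sketch of how these local codes might be translated to the QARTOD flag scheme (1 pass, 2 not evaluated, 3 suspect, 4 fail, 9 missing); the mapping choices are only a suggestion, not an agreed convention.

```python
# Suggested (not authoritative) translation from the Postgres codes above
# to QARTOD flags: 1 = pass, 2 = not evaluated, 3 = suspect, 4 = fail, 9 = missing.
NERACOOS_TO_QARTOD = {
     0: 2,   # NONE     -> not evaluated
     1: 4,   # BAD      -> fail
     2: 3,   # SUSPECT  -> suspect
     3: 1,   # GOOD     -> pass
     4: 1,   # BEST     -> pass
    -9: 9,   # MISSING  -> missing
    -8: 2,   # REPLACED -> not evaluated (judgment call; could also be pass)
}

def to_qartod(local_code):
    return NERACOOS_TO_QARTOD.get(local_code, 2)  # default: not evaluated

print([to_qartod(c) for c in (3, 1, -9)])  # -> [1, 4, 9]
```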

Eric

robragsdale commented 10 years ago

NANOOS

Thanks for pushing this forward, Tad.

  1. How would QARTOD best fit into your RA's workflow?

NANOOS: TBD. We ingest data from many sources into a separate OBS relational database; data from that database is then passed on (copied) to 52N. For some RA-supported data we're also slowly developing netcdf-based archives available through THREDDS.

As Eric described for NERACOOS, NANOOS "while partially funding numerous sub-regional observing systems has always acted as a Data Aggregation Center and does not do any QA/QC processing itself." We haven't decided how this will evolve, but we plan to make decisions and take initial steps over the next 12 months. A lot of what Eric said about the kind of data NERACOOS gets from other providers also applies to us, including the bits about ingesting data in all types of formats and with a wide range of QA/QC, including none; and also, many of "our data providers do post deployment recalibration, updating, etc." Some of the recalibration may happen while sensors are still deployed.

NOW: PI "QA/QC" -> OBS (MySQL) -> 52N ALT1: PI QARTOD (1) --> OBS (RDBMS) -> 52N (2) --> NetCDF DSG -> TDS/ncSOS ALT2: PI -> QARTOD (1) --> OBS (RDBMS) -> 52N (2) --> NetCDF DSG -> TDS/ncSOS ALT3: Some combination of ALT1 & ALT2

  2. What type of collaboration would you like to see?

NANOOS is open to different ideas. I'll mostly echo Tad's statements. It may be that a subset of RAs shares enough of a similar development/software preference (e.g., Python) that some common tools or algorithm implementations could be developed. These should probably be lean, self-contained modules that could be integrated into our workflows without too much pain.

BTW, IOOS DMAC has embraced GitHub over the last 10-12 months (https://github.com/ioos). But I see there's no https://github.com/ioos/qartod yet. That could be a common area to share code, documentation, and resources.

On a related note, Google gave me these two interesting, small repos when searching on "github" and "qartod":

https://github.com/thinkobscure/DO-QAQC
https://github.com/USF-COT/adcp_qartod_qaqc

Both are Python-based, and both are under a year old. Then there's the older stuff from SECOORA's Dan Ramage, which he mentioned in Tad's previous QARTOD email thread on ioostech: https://code.google.com/p/qartod/source/browse/trunk/qaqc/python/rangetests/rangeTests.py

Derrick, thanks for pointing out the IOOS QARTOD Flagging Scheme document. I can't remember if I had seen it, but it looks like good stuff.

Cheers, -Emilio