ices-taf / conjoin

Contaminants Joint Assessment for OSPAR, HELCOM and AMAP

I want to filter a subset of input data from an ICES system in a simple way #10

Open · neil-ices-dk opened 2 years ago

neil-ices-dk commented 2 years ago

web service

HansMJ commented 2 years ago

What are the needed sub-setting parameters? Country, datetime, spatial, contaminant...

neil-ices-dk commented 2 years ago

> What are the needed sub-setting parameters? Country, datetime, spatial, contaminant...

To be defined, I would say.
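As a starting point, a filtered request might look something like the sketch below, assuming a query-style web service. The endpoint URL and parameter names are illustrative placeholders, not an existing ICES API:

```python
import requests

# Hypothetical endpoint and parameter names -- illustrative only,
# not an existing ICES web service.
BASE_URL = "https://example.ices.dk/api/contaminants"

params = {
    "country": "NO",                   # reporting country
    "yearFrom": 2000,                  # datetime window
    "yearTo": 2020,
    "minLat": 54.0, "maxLat": 60.0,    # spatial bounding box
    "minLon": 3.0, "maxLon": 12.0,
    "determinand": "CD",               # contaminant (determinand code)
}

resp = requests.get(BASE_URL, params=params, timeout=60)
resp.raise_for_status()
records = resp.json()
print(f"{len(records)} records returned")
```

Whatever the final parameter set, exposing the filters as query parameters would keep each extraction reproducible: the same request always describes the same subset.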

RobFryer commented 2 years ago

There are lots of strands to this, and it applies to all organisations; I think it needs to be split into several topics.

First, there are filters such as a defined list of contaminants or species. At a more subtle level, there are filters by combinations of these things: for example, we are not interested in PAH concentrations in fish. Or we are only interested in particular tissues for each species type.
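For what it's worth, these combination rules could live in a small table rather than in code. A minimal sketch in Python/pandas, where the column names, group labels, and example data are all assumptions:

```python
import pandas as pd

# Hypothetical exclusion rules: each row names a combination of
# determinand group and species type to filter out. Column names
# and labels are illustrative assumptions.
exclusions = pd.DataFrame({
    "determinand_group": ["PAH"],
    "species_type": ["fish"],
})

def apply_combination_filters(data: pd.DataFrame,
                              rules: pd.DataFrame) -> pd.DataFrame:
    """Drop rows whose combination of key columns appears in rules."""
    keys = list(rules.columns)
    flagged = data.merge(rules.assign(_drop=True), on=keys, how="left")
    return flagged[flagged["_drop"].isna()].drop(columns="_drop")

data = pd.DataFrame({
    "determinand_group": ["PAH", "PCB", "PAH"],
    "species_type": ["fish", "fish", "shellfish"],
    "value": [1.2, 0.4, 2.8],
})
print(apply_combination_filters(data, exclusions))
# -> the PAH-in-fish row is dropped; PAH in shellfish is kept
```

Keeping the rules as data would mean a new exclusion (say, a tissue-per-species whitelist) is a one-row change rather than a code change.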

Second, there are the ad-hoc edits (which are a filtering of sorts), which I can't see elsewhere in the list of activities. These are needed for several reasons: to correct known errors, to delete (filter) incorrect records, or to filter out data to make time series more consistent. For example, the UK had a change of method for bile metabolites around 2006, which resulted in a shift in measurements. The earlier values aren't 'wrong' as such, but including them in time series leads to spurious trends which are due to changes in analytical methods rather than changes in the environment. Another example: there was a huge spike in PCB concentrations at some Norwegian stations in one year due to power washing of some industrial buildings. Measurements for that year are correct, but are hugely inconsistent with the rest of the time series. A third example: some countries report metal concentrations consistently in one matrix and sporadically in another matrix, and we filter the data to leave the consistent time series.
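A hedged sketch of how these ad-hoc edits could be recorded as an exclusion table with an audit trail; the station code, determinand labels, years, and column names below are invented for illustration:

```python
import pandas as pd

# Hypothetical ad-hoc exclusion table. Station codes, determinand
# labels, years, and column names are invented for illustration.
adhoc_exclusions = pd.DataFrame(
    [
        ("UK", "BILE_MET", None, None, 2005,
         "method change ~2006 shifted bile metabolite measurements"),
        ("NO", "PCB", "ST_X", 2010, 2010,
         "one-year spike from power washing of industrial buildings"),
    ],
    columns=["country", "determinand", "station",
             "year_from", "year_to", "reason"],
)

def is_excluded(row: pd.Series, rules: pd.DataFrame) -> bool:
    """True if a data row matches any rule; None means 'matches anything'."""
    for _, r in rules.iterrows():
        if ((r["country"] in (None, row["country"])) and
                (r["determinand"] in (None, row["determinand"])) and
                (r["station"] in (None, row["station"])) and
                (r["year_from"] is None or row["year"] >= r["year_from"]) and
                (r["year_to"] is None or row["year"] <= r["year_to"])):
            return True
    return False

sample = pd.Series({"country": "UK", "determinand": "BILE_MET",
                    "station": "ST_Y", "year": 2003})
print(is_excluded(sample, adhoc_exclusions))  # True: pre-2006 UK bile data
```

A reason column makes each deletion auditable, so the 'filtering of sorts' is documented rather than buried in code.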

The third issue is all the data cleaning that my code does. This is part of the pre-processing code which I spent nearly two sessions going through with Colin, and making it more maintainable would have huge benefits. For example:

- checking that species / matrices are valid
- converting data to required bases and units
- getting rid of data with inadmissible or bonkers values (zero concentrations, uncertainties of over 100%)
- merging supporting (auxiliary) measurements with responses
- dealing with determinands which have been submitted in several ways (for example, CHRTR is treated as CHR; NAPC1 is either submitted as this or as two separate determinands NAP1M and NAP2M which must be summed)

and the list goes on.
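To make a couple of these concrete, here is a compressed sketch of the determinand handling and value checks described above, assuming a hypothetical extraction with columns sample_id, determinand, value, and uncertainty:

```python
import pandas as pd

def clean_determinands(df: pd.DataFrame) -> pd.DataFrame:
    """Sketch of two cleaning steps: relabel equivalent determinands,
    sum split determinands, then drop inadmissible values."""
    out = df.copy()

    # CHRTR is treated as CHR: a straight relabel.
    out["determinand"] = out["determinand"].replace({"CHRTR": "CHR"})

    # NAPC1 may arrive as two determinands (NAP1M, NAP2M) that must
    # be summed within each sample to give NAPC1. Uncertainty for
    # the summed rows is left missing in this sketch.
    split = out["determinand"].isin(["NAP1M", "NAP2M"])
    summed = (out[split]
              .groupby("sample_id", as_index=False)["value"]
              .sum()
              .assign(determinand="NAPC1"))
    out = pd.concat([out[~split], summed], ignore_index=True)

    # Drop inadmissible values: zero (or negative) concentrations
    # and relative uncertainties over 100%.
    out = out[out["value"] > 0]
    out = out[~(out["uncertainty"] > out["value"])]
    return out
```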

neil-ices-dk commented 2 years ago

Related to #15.