smerckel / dbdreader

A reader for binary data files created by Slocum ocean gliders (AUVs)
GNU General Public License v3.0
16 stars 14 forks

Add option for inclusion of DBD source column to MultiDBD.get #14

Closed MCazaly closed 1 year ago

MCazaly commented 1 year ago

This enables MultiDBD.get to return a third column with the underlying DBD objects that each data point is sourced from, which can be used for integrity checks.

smerckel commented 1 year ago

Hi MCazaly,

Could you supply me with some example code that shows how you want to use this? Will your .get() method from MultiDBD always be called with the flag set to True to return the DBD source column, or just occasionally, or perhaps only on demand?

Thanks

MCazaly commented 1 year ago

Hi @smerckel ,

Here's an example from our Slocum processing chain, abbreviated for clarity:

import logging

data = {}
for category in dbd.parameterNames.keys():  # Category is the type of data as in the DBD files, e.g. sci or eng
    # Filter parameters to what has been specified
    parameters = {parameter for parameter in dbd.parameterNames[category] if parameter in whitelist}
    if parameters:  # At least one parameter remains after filtering
        # Load data for required parameters and add to data dictionary
        data[category] = {
            parameter: dbd.get(
                parameter,
                discardBadLatLon=False,  # We want NaNs rather than omitted cycles
                include_source=True,  # Required for filref matching
                return_nans=True,
            )
            for parameter in parameters
        }

    else:  # No parameters remain after filtering
        logging.info("No %s parameters to load.", category)
return data

Note that "filref" refers to a unique identifier for a single "raw" source file of any type, including SBDs and TBDs.

The resulting dict is subsequently used to generate a file in a NetCDF4-based internal holding format, where each sensor on a glider is represented as a group. Each group contains a Filref channel, attributing every recorded value to a source file, which our data managers use for auditing and QA purposes.

In our case, we always have this flag set to True.

EDIT: We hold a list of all our received source files by deployment and filename, and each file is assigned a unique filref as soon as we receive it, which is then looked up using the filename from the returned DBD object during processing.
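One way the filename-to-filref lookup described above might be sketched (the registry contents, filenames, and function name here are hypothetical, not part of dbdreader or the actual processing chain):

```python
# Hypothetical filref registry: each received source file (SBD, TBD, etc.)
# is assigned a unique filref on receipt, keyed by filename.
filref_registry = {
    "unit_123-2023-001-0-0.sbd": 42,
    "unit_123-2023-001-0-0.tbd": 43,
}

def filref_for(dbd_filename):
    """Look up the filref assigned to a source file when it was received."""
    return filref_registry[dbd_filename]

# During processing, the filename taken from the returned DBD object
# is used to recover the filref for that data point's source file.
assert filref_for("unit_123-2023-001-0-0.sbd") == 42
```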

smerckel commented 1 year ago

Hi @MCazaly,

I returned to look at this issue again. Somehow I dislike the way it is now: get() returns a 3xN matrix of type object when include_sources is set to True. I have been pondering quite a bit over how to solve this properly while keeping your workflow intact as much as possible. Would it be acceptable if get(..., include_sources=True) returned, for each parameter requested, a 2xN matrix of type float with time and parameter values, plus a tuple of sources? Your code example above would then remain intact, except that somewhere down the line you would probably need to change one line. I suppose you currently do

t,v,srcs = data[parameter]

which would then turn into

(t,v), srcs = data[parameter]

If this is ok for you, then let me know, and I will incorporate this feature in the main branch.
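The proposed change in return shape can be sketched with stand-in functions and made-up data (the parameter name, values, and source identifiers below are purely illustrative, not the real MultiDBD API):

```python
def get_old(parameter):
    # Stand-in for MultiDBD.get(..., include_sources=True), old behaviour:
    # a single 3xN object result unpacked as (t, v, srcs).
    t = [0.0, 1.0, 2.0]
    v = [10.0, 11.0, 12.0]
    srcs = ("fileA.sbd", "fileA.sbd", "fileB.sbd")  # placeholder source ids
    return t, v, srcs

def get_new(parameter):
    # Stand-in for the proposed behaviour: a 2xN float part (t, v),
    # plus the sources as a separate tuple.
    t, v, srcs = get_old(parameter)
    return (t, v), srcs

# Old unpacking:
t, v, srcs = get_old("sci_water_temp")

# New unpacking -- the one downstream line that changes:
(t2, v2), srcs2 = get_new("sci_water_temp")

assert t == t2 and v == v2 and srcs == srcs2
```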

Thanks.

MCazaly commented 1 year ago

Hi @smerckel,

Thanks for getting back to me. I think that could work - in this case would srcs still have the same indices? i.e. would srcs[5] give me the DBD that value v[5] came from?

smerckel commented 1 year ago

Hi @MCazaly ,

Exactly, srcs is a list of DBD objects with the same length as the time and value arrays returned. I think it stores references only, so the memory penalty is not so large; I need to investigate that. Alternatively, srcs could be a function that computes which DBD to return for a given index. My guess is that if, for all parameters you process, you also want the source, then using a function becomes inefficient. If you need to call the function only in a few cases, then it will save memory.

So the question is basically: for every MultiDBD.get() call, how often would you consult the srcs list?

If the answer is that srcs is accessed for every (time, value) data pair, then the current list approach is probably best. If it is only accessed once in a while, when there is a problem, then a function might be better.

Then either

srcs[5] gives you a DBD instance, using the list implementation

or

srcs(5) gives you a DBD instance using the function implementation.

Up to you :-)
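The two alternatives being weighed can be sketched like this (the stand-in DBD names and the boundary-based function are illustrative assumptions, not dbdreader internals):

```python
# Stand-ins for DBD instances, one per record.
dbd_per_record = ["dbd_A", "dbd_A", "dbd_B"]

# List implementation: O(1) lookup via srcs[i]; one reference per record.
srcs_list = dbd_per_record

# Function implementation: store only one entry per source file and
# compute the DBD for an index on demand, e.g. from per-file record counts.
boundaries = [(2, "dbd_A"), (3, "dbd_B")]  # (end_index_exclusive, dbd)

def srcs_fn(i):
    for end, dbd in boundaries:
        if i < end:
            return dbd
    raise IndexError(i)

# Both give the same answer; the list trades memory for lookup speed.
assert srcs_list[1] == srcs_fn(1) == "dbd_A"
assert srcs_list[2] == srcs_fn(2) == "dbd_B"
```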

MCazaly commented 1 year ago

Hi @smerckel,

I do believe the list would only contain references to the DBD object, so any increase in memory usage should be minimal.

As for your question, we would be interrogating the entire srcs list essentially every time, so a list is preferable over a function for us.

Thanks for your time on this.

smerckel commented 1 year ago

The feature has been incorporated in the upcoming release 0.4.14.

MCazaly commented 1 year ago

Excellent. Thanks @smerckel!