Closed MCazaly closed 1 year ago
Hi MCazaly,
Could you supply me with some example code that shows how you want to use this? Will your .get() method from MultiDBD always be called with the flag set to True to return the DBD source column, or just occasionally, or perhaps only on demand?
Thanks
Hi @smerckel ,
Here's an example from our Slocum processing chain, abbreviated for clarity:
```python
data = {}
for category in dbd.parameterNames.keys():  # Category is the type of data as in the DBD files, e.g. sci or eng
    # Filter parameters to what has been specified
    parameters = {parameter for parameter in dbd.parameterNames[category] if parameter in whitelist}
    if parameters:  # At least one parameter remains after filtering
        # Load data for required parameters and add to data dictionary
        data[category] = {
            parameter: dbd.get(
                parameter,
                discardBadLatLon=False,  # We want NaNs rather than omitted cycles
                include_source=True,  # Required for filref matching
                return_nans=True,
            )
            for parameter in parameters
        }
    else:  # No parameters remain after filtering
        logging.info("No %s parameters to load.", category)
return data
```
Note that "filref" refers to a unique identifier for a single "raw" source file of any type, including SBDs and TBDs.
The resulting dict is subsequently used to generate a file in a NetCDF4-based internal holding format, where each sensor on a glider is represented as a group. Each group contains a Filref channel, thereby attributing every recorded value to a source file, which our data managers use for auditing and QA purposes.
In our case, we always have this flag set to True.
EDIT: We hold a list of all our received source files by deployment and filename, and each file is assigned a unique filref as soon as we receive it, which is then looked up using the filename from the returned DBD object during processing.
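The filename-to-filref lookup described above could be sketched roughly as follows. Note that the table contents, the `FakeDBD` class, and the `filename` attribute are all hypothetical stand-ins for illustration, not dbdreader's actual API:

```python
# Hypothetical filref table: maps (deployment, filename) to the unique
# identifier assigned when each raw source file was received.
filref_table = {
    ("deployment_01", "unit_123-2023-001-0-0.sbd"): 1001,
    ("deployment_01", "unit_123-2023-001-0-0.tbd"): 1002,
}

class FakeDBD:
    """Stand-in for a DBD object; only the source filename matters here."""
    def __init__(self, filename):
        self.filename = filename

def lookup_filref(deployment, dbd):
    # Resolve a DBD object returned by .get() back to its filref
    # via its filename.
    return filref_table[(deployment, dbd.filename)]

src = FakeDBD("unit_123-2023-001-0-0.sbd")
print(lookup_filref("deployment_01", src))  # -> 1001
```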
Hi @MCazaly,
I returned to look at this issue again. Somehow I dislike the way it is now: get() returns a 3xN matrix of type object when include_sources is set to True. I have been pondering quite a bit how to solve this properly while keeping your workflow intact as much as possible. Would it be acceptable if get(..., include_sources=True) returned, for each parameter requested, a 2xN matrix of type float with time and parameter values, plus a tuple of sources? Your code example above would then remain intact, except that somewhere down the line you would probably need to change one line. I suppose you currently do
```python
t, v, srcs = data[parameter]
```
which would then turn into
```python
(t, v), srcs = data[parameter]
```
If this is ok for you, then let me know, and I will incorporate this feature in the main branch.
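To make the proposed change concrete, here is a minimal sketch with invented dummy data (the arrays, parameter name, and filenames are stand-ins; only the unpacking pattern reflects the proposal above):

```python
import numpy as np

# Dummy stand-ins for what MultiDBD.get() might return.
t = np.array([0.0, 1.0, 2.0])
v = np.array([10.0, 11.0, 12.0])
srcs = ("fileA.sbd", "fileA.sbd", "fileB.sbd")

# Current shape: a 3xN object array unpacked as three values.
data_old = {"m_depth": (t, v, srcs)}
t0, v0, s0 = data_old["m_depth"]

# Proposed shape: a 2xN float matrix plus a separate sources tuple.
data_new = {"m_depth": ((t, v), srcs)}
(t1, v1), s1 = data_new["m_depth"]

# Both shapes carry the same information.
assert (t0 == t1).all() and (v0 == v1).all() and s0 == s1
```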
Thanks.
Hi @smerckel,
Thanks for getting back to me. I think that could work. In this case, would `srcs` still have the same indices, i.e. would `srcs[5]` give me the DBD that value `v[5]` came from?
Hi @MCazaly ,
Exactly. srcs is a list of DBD objects, of the same length as the time and value arrays returned. I think it stores references only, so the memory penalty is not so large; I need to investigate that. Alternatively, srcs could be a function that computes which DBD to return for a given index. My guess is that if, for every parameter you process, you also want to get the source, then using a function becomes inefficient. If you need to call the function in just a few cases, then it will save memory.
So the question is basically: for every MultiDBD.get() call, how often would you consult the srcs list?
If the answer is that srcs is accessed for each time/value data pair, then how it is now is probably best. If it is only accessed once in a while, when there is a problem, then a function might be better.
Then either `srcs[5]` gives you a DBD instance (the list implementation), or `srcs(5)` gives you a DBD instance (the function implementation).
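The two representations being weighed here can be sketched as follows. This is illustrative only: strings stand in for DBD objects, and the per-file sample counts are an assumed implementation detail of the function variant, not dbdreader code:

```python
# Stand-ins for DBD objects; in dbdreader these would be object references.
sources = ["fileA.sbd", "fileA.sbd", "fileB.sbd"]

# List implementation: one reference per sample, O(1) lookup,
# memory proportional to the number of samples.
srcs_list = sources  # srcs_list[i] -> source of sample i

# Function implementation: compute the source on demand from the index,
# here from the number of samples each file contributed, in order.
# Pays CPU per call instead of memory per sample.
counts = [2, 1]
files = ["fileA.sbd", "fileB.sbd"]

def srcs_fn(i):
    for n, f in zip(counts, files):
        if i < n:
            return f
        i -= n
    raise IndexError(i)

assert srcs_list[0] == srcs_fn(0) == "fileA.sbd"
assert srcs_list[2] == srcs_fn(2) == "fileB.sbd"
```

The trade-off is exactly as stated above: if srcs is consulted for every sample, the list wins; if it is consulted rarely, the function saves memory.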
Up to you :-)
Hi @smerckel,
I do believe the list would only contain references to the DBD objects, so any increase in memory usage should be minimal.
As for your question: we would be interrogating essentially the entire srcs list every time, so a list is preferable to a function for us.
Thanks for your time on this.
The feature has been incorporated in the upcoming release 0.4.14.
Excellent. Thanks @smerckel!
This enables MultiDBD.get to return a third column with the underlying DBD objects that each data point is sourced from, which can be used for integrity checks.