Open lukecampbell opened 10 years ago
I would actually argue that function authors are responsible for managing fill values. That's why we have an external interface to retrieve a parameter's fill value.
Coverage model should be responsible for storing and regurgitating data, not making assumptions on what the user wants to do with it. The example listed above actually demonstrates why I believe that to be crucial. Observations 1, 5, 9, 10, 14, 15, and 20 each have a fill for either temperature or pressure but not both.
A function writer may want to do calculations if only temperature is valid, but not pressure. Another function writer may want to do calculations only if both temperature and pressure are valid. By leaving it up to the function writer to decide what to do with the data, we allow the function writers the greatest flexibility.
I think it's unreasonable to ask them that.
I would agree it's probably the user's responsibility which is me. And I will absolutely deal with it if it's necessary but, I'm just going to end up using the same handler and code in every place I use the coverage model, I would prefer to just put that code in the coverage model to reduce code duplication.
However, in the interest of not supporting a monolithic data model, I will agree that in the purest interest of the coverage model as something that stores and reads data, it doesn't go there. Unfortunately, it has been overloaded with the responsibility to also manage data processing.
So how do I manage that? Remove every index that has fill values? It the case of your array, all values (time, temperature, and pressure) would have to be masked at every index where any of the parameters have fill values.
Valid data is masked because a function parameter might not process it correctly. Are you suggesting yet another argument flag for the get_parameter_values method?
On Jun 2, 2014, at 10:49 AM, Luke Campbell notifications@github.com wrote:
I think it's unreasonable to ask them that.
The functions have no idea indication what valid values are, or rather what consists of a fill value, it changes from one parameter to the next. These functions are just math functions, they have no real connection to OOI, they can be used for anything in general ocean sciences and are not specific to the parameter functions I would agree it's probably the user's responsibility which is me. And I will absolutely deal with it if it's necessary but, I'm just going to end up using the same handler and code in every place I use the coverage model, I would prefer to just put that code in the coverage model to reduce code duplication.
However, in the interest of not supporting a monolithic data model, I will agree that in the purest interest of the coverage model as something that stores and reads data, it doesn't go there. Unfortunately, it has been overloaded with the responsibility to also manage data processing.
— Reply to this email directly or view it on GitHub.
Something like this maybe
def get_wrapper(parameter, time_bounds, fill_value_mask=True):
ptype = self.get_parameter_context(parameter).param_type
if not isinstance(ptype, ParameterFunctionType):
return
args, mask = build_arg_map(parameter, time_bounds, fill_value_mask)
# Get the base data type and shape
value_encoding = ptype.value_encoding # I think it's here
shape = self.get_values('time', time_bounds).shape
retval = np.empty(shape, dtype=value_encoding)
if fill_value_mask:
retval[mask] = ptype.function._callable(*args)
else:
retval[:] = ptype.function._callable(*args)
def build_arg_map(parameter, time_bounds, fill_value_mask=True):
ptype = self.get_parameter_context(parameter).param_type
arg_list = ptype.function.arg_list
arg_map = ptype.function.param_map
array_map = {}
mask = None
for k,v in arg_map.iteritems():
context = self.get_parameter_context(k)
array_value = self.get_values(k, time_bounds)
if mask is None:
mask = ~np.isclose(array_value, context.fill_value, equal_nan=True)
else:
mask = mask & ~np.isclose(array_value, context.fill_value, equal_nan=True)
array_map[k] = array_value
if fill_value_mask:
for k,v in array_map.iteritems():
array_map[k] = v[mask]
return array_map, mask
Unless I'm missing something, the above appears to be just how to do the masking. My question/statement is more along these lines: get_parameter_values is currently THE interface for retrieving data. If a parameter function is requested in get_parameter_values, in order to keep alignment and eliminate bad data for the parameter function, I will have to throw away all indexes for which a fill value exists for ANY parameter - because I don't have insight into which parameters a parameter function might use. This could result in valid or important data being dropped from the array returned from get_parameter_values. Are we okay with that, or is there another solution?
With what I've pasted, it doesn't throw away indices. It creates an array in-memory initially of all fill values. Using the mask it fills in the array.
The key here is that we never call the function with an array that contains missing values. It doesn't treat those missing values in any special way
in a simple scenario we have a parameter function sum:
def sum(x,y):
return x + y
In our coverage, if x is [-9999, 1, 2, 3] and y is [10, 11, 12, 13] the output of sum should be [-9999, 1, 2, 3] not [-9989, 12, 13, 14]
so the full return dictionary would be
{
'time' : [t0, t1, t2, t3],
'x' : array([-9999., 1., 2., 3.], dtype=float32),
'y' : array([10., 11., 12., 13.], dtype=float32),
'sum' : array([-9999., 12., 13., 14.], dtype=float32)
}
Behind the scenes, sum only ever got called like this:
sum(x=np.array([1., 2., 3.]), y=np.array([11., 12., 13.]))
Okay. I see, now, what you are requesting. Really, this is just - if any of the args have a fill-value at an index, insert the function parameter's fill-value at that index for the function parameter array.
I'm still not a parameter function expert. Can you create some tests that exercise the faulty functionality for both mathematical expressions and python function types so I can figure out the best approach? It doesn't need to be right away. I won't get to this for a few days.
Yeah, I'll make some use-case tests.
I will add a new method, get_valid_indexes(), to NumpyDictParameterData - the object returned from Coverage.get_parameter_data(). NumpyDictParameterData.get_data() will continue to return a dictionary of data with fill values. get_fill_index_dict() will return a dictionary of numpy boolean arrays used to indicate if the value at an index of the data arrays is valid or if it fill.
It should be up to the parameter function to use this information to perform calculations at valid indexes only. The array returned from the parameter function should be the the same size as the array returned from the referenced parameter array and determine where fill values should be placed.
Example usage:
ndpd = cov.get_parameter_values(['cond'], fill_indexes=True, ...)
data = ndpd.get_data()['cond']
valid_indexes = ndpd.get_valid_indexes()['cond']
calculated_cond_array = np.empty(data.size, dtype=data.dtype)
calculated_cond_array[:] = fill_value
calculated_cond_array[valid_indexes] = data[valid_indexes]*2+1
This looks good, could you push something like this so I can test it out?
In order to be precise in the fill value mapping, I need to know which indexes returned from a function parameter evaluation are fills. This means that the interface needs to be changed all the way through so the actual function being called can create and return the mapping. @lukecampbell, What are your preferences on this?
I was under the impression that we would filter out the indexes that are fills before we execute the function(s). Parameter functions shouldn't produce fill values on their own, I think.
I’m talking about the abstract case where a parameter function receives non-fill data but determines that an index should be a fill. Are you saying that in all real cases, a parameter function will never return fill?
That's correct. To my knowledge. A fill value in it's purest sense represents a lack of information from a sensor. Whereas in the parameter function's case, it always has information about a domain OR in the case where it doesn't we use NaN, but not as a fill value but in the absence of real value, such as a division by zero or where the mapping of the domain does not result in a real result (finite domain problems like tangent).
Before evaluating a parameter function, you'd like all fill values removed. I interpret this as: Remove ALL observations where ANY of the parameters in the parameter function's param map has a fill.
Is that your expectation?
I believe that's accurate, yes.
Hiding data is a restriction on a parameter function's ability to make decisions. I believe you're okay with this, but I want to make sure we are all on the same page.
Just to clarify, this behavior should be optional I think.
I can make it optional to hide data. In the non-optional case, should we pass fill masks to the parameter function, or should we force the function to call get_parameter_values to get the fill mask?
I don't think either are an option. Most parameter functions, that I'm aware of, can't or don't accept callbacks to the coverage model and they don't explicitly accept masked arrays or another parameter to identify the fill value. Some might but most don't, I think. If the option to mask the fill values is not enabled, I would say we pass the arrays as-is into the parameter functions.
I can't think of another option, but I'm open to suggestions.
I think maybe the use case got lost somewhere in this thread.
If we have a sensor that is observing u and v vectors and we have a function that determines wind direction given the u and v vectors, the function is not meant to understand the data model or even the concept of flagged values. It's only intended purpose, architecturally, is to compute wind direction given u and v.
def wind_direction(u,v):
return 180. / np.pi * np.atan2(u,v)
Now if we have maybe the u-component of the velocity sensor on the instrument fail or go offline for whatever, the u vector will start publishing fill_values (or nothing at all which we then represent as fill_values).
Essentially my goal is this:
u = np.array([ 1, 0, 2, -9999, -9999], dtype=np.float32)
v = np.array([0.5, 1, 2, 1, 0], dtype=np.float32)
wind_direction_var = np.ones(u.shape[0]) * -9999
not_fill = (u != -9999) & (v != -9999)
wind_direction_var[not_fill] = wind_direction(u[not_fill], v[not_fill])
It's not my goal to hide data, I'm just not trying to run a function with missing information which would result in a resultant array with improper information. Instead of the result being the intended
array([63.43495178, 0., 45., -9999., -9999.])
without dealing with the fill values the result would be
array([63.43495178, 0., 45., -89.99427032, -90.])
Which are real values, and it would not be desired, I think.
The intent hasn't been lost. I understand your use case. I am writing a library that should be able to support multiple use cases. I am just trying to define the library's restrictions when it comes to parameter functions.
For now, parameter functions are pretty simple, they can't interface with the coverage model and they don't deal with fill values directly (the only exceptions I'm aware of are QC, but that's because I'm the author and they're not used through the coverage model).
The functions and or the function authors aren't responsible for managing fill values, so if fill values are used as inputs incorrect outputs will be produced. The coverage clients aren't responsible either.
It would be good if we can add in correct fill value support within the coverage model. http://nbviewer.ipython.org/github/lukecampbell/notebooks/blob/master/Masked%20Arrays.ipynb