ooici / coverage-model

BSD 2-Clause "Simplified" License
2 stars 9 forks source link

Managing fill values within the parameter function framework #177

Open lukecampbell opened 10 years ago

lukecampbell commented 10 years ago

The functions and or the function authors aren't responsible for managing fill values, so if fill values are used as inputs incorrect outputs will be produced. The coverage clients aren't responsible either.

It would be good if we can add in correct fill value support within the coverage model. http://nbviewer.ipython.org/github/lukecampbell/notebooks/blob/master/Masked%20Arrays.ipynb

caseybryant commented 10 years ago

I would actually argue that function authors are responsible for managing fill values. That's why we have an external interface to retrieve a parameter's fill value.

Coverage model should be responsible for storing and regurgitating data, not making assumptions on what the user wants to do with it. The example listed above actually demonstrates why I believe that to be crucial. Observations 1, 5, 9, 10, 14, 15, and 20 each have a fill for either temperature or pressure but not both.
A function writer may want to do calculations if only temperature is valid, but not pressure. Another function writer may want to do calculations only if both temperature and pressure are valid. By leaving it up to the function writer to decide what to do with the data, we allow the function writers the greatest flexibility.

lukecampbell commented 10 years ago

I think it's unreasonable to ask them that.

  1. The functions have no idea indication what valid values are, or rather what consists of a fill value, it changes from one parameter to the next.
  2. These functions are just math functions, they have no real connection to OOI, they can be used for anything in general ocean sciences and are not specific to the parameter functions

I would agree it's probably the user's responsibility which is me. And I will absolutely deal with it if it's necessary but, I'm just going to end up using the same handler and code in every place I use the coverage model, I would prefer to just put that code in the coverage model to reduce code duplication.

However, in the interest of not supporting a monolithic data model, I will agree that in the purest interest of the coverage model as something that stores and reads data, it doesn't go there. Unfortunately, it has been overloaded with the responsibility to also manage data processing.

caseybryant commented 10 years ago

So how do I manage that? Remove every index that has fill values? It the case of your array, all values (time, temperature, and pressure) would have to be masked at every index where any of the parameters have fill values.

Valid data is masked because a function parameter might not process it correctly. Are you suggesting yet another argument flag for the get_parameter_values method?

On Jun 2, 2014, at 10:49 AM, Luke Campbell notifications@github.com wrote:

I think it's unreasonable to ask them that.

The functions have no idea indication what valid values are, or rather what consists of a fill value, it changes from one parameter to the next. These functions are just math functions, they have no real connection to OOI, they can be used for anything in general ocean sciences and are not specific to the parameter functions I would agree it's probably the user's responsibility which is me. And I will absolutely deal with it if it's necessary but, I'm just going to end up using the same handler and code in every place I use the coverage model, I would prefer to just put that code in the coverage model to reduce code duplication.

However, in the interest of not supporting a monolithic data model, I will agree that in the purest interest of the coverage model as something that stores and reads data, it doesn't go there. Unfortunately, it has been overloaded with the responsibility to also manage data processing.

— Reply to this email directly or view it on GitHub.

lukecampbell commented 10 years ago

Something like this maybe

def get_wrapper(parameter, time_bounds, fill_value_mask=True):
    ptype = self.get_parameter_context(parameter).param_type
    if not isinstance(ptype, ParameterFunctionType):
        return

    args, mask = build_arg_map(parameter, time_bounds, fill_value_mask)

    # Get the base data type and shape
    value_encoding = ptype.value_encoding # I think it's here
    shape = self.get_values('time', time_bounds).shape

    retval = np.empty(shape, dtype=value_encoding)
    if fill_value_mask:
        retval[mask] = ptype.function._callable(*args)
    else:
        retval[:] = ptype.function._callable(*args)

def build_arg_map(parameter, time_bounds, fill_value_mask=True):
    ptype = self.get_parameter_context(parameter).param_type

    arg_list = ptype.function.arg_list
    arg_map = ptype.function.param_map

    array_map = {}
    mask = None
    for k,v in arg_map.iteritems():
        context = self.get_parameter_context(k)
        array_value = self.get_values(k, time_bounds)
        if mask is None:
            mask = ~np.isclose(array_value, context.fill_value, equal_nan=True)
        else:
            mask = mask & ~np.isclose(array_value, context.fill_value, equal_nan=True)

        array_map[k] = array_value

    if fill_value_mask:
        for k,v in array_map.iteritems():
            array_map[k] = v[mask]

    return array_map, mask
caseybryant commented 10 years ago

Unless I'm missing something, the above appears to be just how to do the masking. My question/statement is more along these lines: get_parameter_values is currently THE interface for retrieving data. If a parameter function is requested in get_parameter_values, in order to keep alignment and eliminate bad data for the parameter function, I will have to throw away all indexes for which a fill value exists for ANY parameter - because I don't have insight into which parameters a parameter function might use. This could result in valid or important data being dropped from the array returned from get_parameter_values. Are we okay with that, or is there another solution?

lukecampbell commented 10 years ago

With what I've pasted, it doesn't throw away indices. It creates an array in-memory initially of all fill values. Using the mask it fills in the array.

The key here is that we never call the function with an array that contains missing values. It doesn't treat those missing values in any special way

in a simple scenario we have a parameter function sum:

def sum(x,y):
    return x + y

In our coverage, if x is [-9999, 1, 2, 3] and y is [10, 11, 12, 13] the output of sum should be [-9999, 1, 2, 3] not [-9989, 12, 13, 14]

so the full return dictionary would be

{
  'time' : [t0, t1, t2, t3],
  'x' : array([-9999., 1., 2., 3.], dtype=float32),
  'y' : array([10., 11., 12., 13.], dtype=float32),
  'sum' : array([-9999., 12., 13., 14.], dtype=float32)
}

Behind the scenes, sum only ever got called like this:

sum(x=np.array([1., 2., 3.]), y=np.array([11., 12., 13.]))
caseybryant commented 10 years ago

Okay. I see, now, what you are requesting. Really, this is just - if any of the args have a fill-value at an index, insert the function parameter's fill-value at that index for the function parameter array.

I'm still not a parameter function expert. Can you create some tests that exercise the faulty functionality for both mathematical expressions and python function types so I can figure out the best approach? It doesn't need to be right away. I won't get to this for a few days.

lukecampbell commented 10 years ago

Yeah, I'll make some use-case tests.

caseybryant commented 10 years ago

I will add a new method, get_valid_indexes(), to NumpyDictParameterData - the object returned from Coverage.get_parameter_data(). NumpyDictParameterData.get_data() will continue to return a dictionary of data with fill values. get_fill_index_dict() will return a dictionary of numpy boolean arrays used to indicate if the value at an index of the data arrays is valid or if it fill.

It should be up to the parameter function to use this information to perform calculations at valid indexes only. The array returned from the parameter function should be the the same size as the array returned from the referenced parameter array and determine where fill values should be placed.

Example usage:

ndpd = cov.get_parameter_values(['cond'], fill_indexes=True, ...)
data = ndpd.get_data()['cond']
valid_indexes = ndpd.get_valid_indexes()['cond']
calculated_cond_array = np.empty(data.size, dtype=data.dtype)
calculated_cond_array[:] = fill_value
calculated_cond_array[valid_indexes] = data[valid_indexes]*2+1
lukecampbell commented 10 years ago

This looks good, could you push something like this so I can test it out?

caseybryant commented 10 years ago

In order to be precise in the fill value mapping, I need to know which indexes returned from a function parameter evaluation are fills. This means that the interface needs to be changed all the way through so the actual function being called can create and return the mapping. @lukecampbell, What are your preferences on this?

lukecampbell commented 10 years ago

I was under the impression that we would filter out the indexes that are fills before we execute the function(s). Parameter functions shouldn't produce fill values on their own, I think.

caseybryant commented 10 years ago

I’m talking about the abstract case where a parameter function receives non-fill data but determines that an index should be a fill. Are you saying that in all real cases, a parameter function will never return fill?

lukecampbell commented 10 years ago

That's correct. To my knowledge. A fill value in it's purest sense represents a lack of information from a sensor. Whereas in the parameter function's case, it always has information about a domain OR in the case where it doesn't we use NaN, but not as a fill value but in the absence of real value, such as a division by zero or where the mapping of the domain does not result in a real result (finite domain problems like tangent).

caseybryant commented 10 years ago

Before evaluating a parameter function, you'd like all fill values removed. I interpret this as: Remove ALL observations where ANY of the parameters in the parameter function's param map has a fill.

Is that your expectation?

lukecampbell commented 10 years ago

I believe that's accurate, yes.

caseybryant commented 10 years ago

Hiding data is a restriction on a parameter function's ability to make decisions. I believe you're okay with this, but I want to make sure we are all on the same page.

lukecampbell commented 10 years ago

Just to clarify, this behavior should be optional I think.

caseybryant commented 10 years ago

I can make it optional to hide data. In the non-optional case, should we pass fill masks to the parameter function, or should we force the function to call get_parameter_values to get the fill mask?

lukecampbell commented 10 years ago

I don't think either are an option. Most parameter functions, that I'm aware of, can't or don't accept callbacks to the coverage model and they don't explicitly accept masked arrays or another parameter to identify the fill value. Some might but most don't, I think. If the option to mask the fill values is not enabled, I would say we pass the arrays as-is into the parameter functions.

I can't think of another option, but I'm open to suggestions.

lukecampbell commented 10 years ago

I think maybe the use case got lost somewhere in this thread.

If we have a sensor that is observing u and v vectors and we have a function that determines wind direction given the u and v vectors, the function is not meant to understand the data model or even the concept of flagged values. It's only intended purpose, architecturally, is to compute wind direction given u and v.

def wind_direction(u,v):
    return 180. / np.pi * np.atan2(u,v)

Now if we have maybe the u-component of the velocity sensor on the instrument fail or go offline for whatever, the u vector will start publishing fill_values (or nothing at all which we then represent as fill_values).

Essentially my goal is this:

u = np.array([  1, 0, 2, -9999, -9999], dtype=np.float32)
v = np.array([0.5, 1, 2,     1,     0], dtype=np.float32)
wind_direction_var = np.ones(u.shape[0]) * -9999
not_fill = (u != -9999) & (v != -9999) 
wind_direction_var[not_fill] = wind_direction(u[not_fill], v[not_fill])

It's not my goal to hide data, I'm just not trying to run a function with missing information which would result in a resultant array with improper information. Instead of the result being the intended

array([63.43495178, 0., 45., -9999., -9999.])

without dealing with the fill values the result would be

array([63.43495178, 0., 45., -89.99427032, -90.])

Which are real values, and it would not be desired, I think.

caseybryant commented 10 years ago

The intent hasn't been lost. I understand your use case. I am writing a library that should be able to support multiple use cases. I am just trying to define the library's restrictions when it comes to parameter functions.

lukecampbell commented 10 years ago

For now, parameter functions are pretty simple, they can't interface with the coverage model and they don't deal with fill values directly (the only exceptions I'm aware of are QC, but that's because I'm the author and they're not used through the coverage model).