Feature: helper functions to handle missing observations

thouska / spotpy

A Statistical Parameter Optimization Tool

https://spotpy.readthedocs.io/en/latest/

MIT License


Feature: helper functions to handle missing observations #79

Open jmp75 opened 6 years ago

jmp75 commented 6 years ago

It seems the objective functions under spotpy.objectivefunctions do not handle missing values (NaN) in observations out of the box. In effect, this currently leaves algorithms spinning their wheels on nonsense fitness values.

1. There should at least be a helper function to censor modelled and corresponding observed data points out of the numpy arrays.
2. Existing objective functions could be made to censor missing data points by default,
3. or there should be facilities to pipeline array preprocessing into objective functions. This could get wider in scope.

Item 1 is a given; items 2 and 3 are for discussion. I started drafting something in a fork, but before investing substantial time on 2 and 3 I would like a discussion.
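A minimal sketch of what the helper in item 1 could look like, assuming plain numpy arrays as input. The name `drop_missing` is hypothetical and not part of spotpy; it drops every position where either the observation or the simulation is NaN, so the two returned arrays stay paired.

```python
import numpy as np


def drop_missing(evaluation, simulation):
    """Hypothetical helper (not a spotpy function): remove paired NaN entries."""
    evaluation = np.asarray(evaluation, dtype=float)
    simulation = np.asarray(simulation, dtype=float)
    if evaluation.shape != simulation.shape:
        raise ValueError("evaluation and simulation must have the same shape")
    # Keep only positions where both values are present.
    keep = ~(np.isnan(evaluation) | np.isnan(simulation))
    return evaluation[keep], simulation[keep]


obs = [1.0, np.nan, 3.0, 4.0]
sim = [1.1, 2.0, np.nan, 3.9]
obs_clean, sim_clean = drop_missing(obs, sim)
```

The cleaned arrays could then be passed to any objective function that expects two arrays of equal length.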

kbstn commented 6 years ago

Hi, I had the same issue here. My idea was to make the objective functions able to take pandas DataFrames. Until now they require numpy arrays.

If they could take a pd.DataFrame, we could take advantage of having an index.

With an index we could:

bring the simulation and evaluation lists to the same length and the same index
drop indices where the evaluation contains NaN while keeping the index ( df_sim = dfsim[dfsim.index.isin(dfev.index)])
access them like arrays (df_sim.values) and use them in the objective functions

I don't have time to work on this issue this month; I just wanted to share my ideas.
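The index-based steps above can be sketched roughly as follows. This is only an illustration assuming pandas Series with a shared date index; `align_on_index` is a hypothetical name, not an existing spotpy or pandas function.

```python
import numpy as np
import pandas as pd


def align_on_index(sim, ev):
    """Hypothetical helper: keep only time steps with a non-NaN evaluation."""
    ev = ev.dropna()                           # drop missing observations
    common = sim.index.intersection(ev.index)  # time steps present in both
    return sim.loc[common], ev.loc[common]


ev = pd.Series([1.0, np.nan, 3.0, 4.0],
               index=pd.date_range("2000-01-01", periods=4, freq="D"))
sim = pd.Series([1.1, 2.1, 2.9, 4.2, 5.0],
                index=pd.date_range("2000-01-01", periods=5, freq="D"))

sim_aligned, ev_aligned = align_on_index(sim, ev)
# Plain numpy arrays, ready for an array-based objective function.
sim_values, ev_values = sim_aligned.values, ev_aligned.values
```

Here the second day is dropped because its observation is NaN, and the fifth day is dropped because it has no observation at all.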

thouska commented 6 years ago

Thanks for your ideas. I think it is a very good idea to have a helper function in spotpy.objectivefunctions. I like how the different NaNs are masked and removed in the fork of @jmp75. I think that, if we build on this, we could enable point 2 and exclude missing observation data points by default. This could be activated if the given simulation and evaluation lists do not have the same length (this is checked for every objective function in line 17). It would be good if we found a way that does not rely on pandas, in order to keep the dependencies as low as possible. However, pandas support would be nice to have.
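One numpy-only way to get exclusion by default might be a decorator that censors missing observations before the wrapped objective function runs. This is a sketch under assumptions, not the approach taken in the fork: `skip_missing_observations` and the local `rmse` below are hypothetical stand-ins, not spotpy code.

```python
import functools

import numpy as np


def skip_missing_observations(objective):
    """Hypothetical decorator: drop pairs whose observation is NaN."""
    @functools.wraps(objective)
    def wrapper(evaluation, simulation, *args, **kwargs):
        evaluation = np.asarray(evaluation, dtype=float)
        simulation = np.asarray(simulation, dtype=float)
        keep = ~np.isnan(evaluation)  # True where an observation exists
        return objective(evaluation[keep], simulation[keep], *args, **kwargs)
    return wrapper


@skip_missing_observations
def rmse(evaluation, simulation):
    # Stand-in objective function for illustration only.
    return float(np.sqrt(np.mean((evaluation - simulation) ** 2)))


score = rmse([1.0, np.nan, 3.0], [1.0, 5.0, 5.0])
```

The same wrapper could in principle carry other preprocessing steps, which is closer to the pipelining idea in point 3.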

philippkraft commented 6 years ago

I've used another way to handle this in the cmf 1d example. This approach needs numpy arrays but not pandas, which is a pain to keep as a dependency.

juancastilla commented 6 years ago

I faced this issue while calibrating a groundwater model (MODFLOW) that may or may not converge depending on the parameters that are sampled by Spotpy. Whenever the model does not converge for a specific parameter set, I've added a simple if statement (if simulation == NaN) that returns "9999" or anything that produces a ridiculously low likelihood. This has solved the NaN issue for me, and I assume it has the added benefit of telling the sampler to steer away from regions in the parameter space where the model does not converge.
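A rough sketch of this penalty approach, assuming an RMSE-style objective where larger values are worse. Note that in Python and numpy a literal `x == NaN` comparison is always False, so the check is written with `np.isnan` here. The function name and the default penalty are illustrative, and whether 9999 counts as a bad score depends on whether the sampler minimizes or maximizes.

```python
import numpy as np


def rmse_or_penalty(evaluation, simulation, penalty=9999.0):
    """Hypothetical objective: return a penalty if the model produced NaN."""
    evaluation = np.asarray(evaluation, dtype=float)
    simulation = np.asarray(simulation, dtype=float)
    # NaN never compares equal to anything, so test with np.isnan.
    if np.isnan(simulation).any():
        return penalty
    return float(np.sqrt(np.mean((evaluation - simulation) ** 2)))


failed_run = rmse_or_penalty([1.0, 2.0], [1.0, np.nan])
good_run = rmse_or_penalty([1.0, 2.0], [1.0, 2.0])
```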

Pandas support would certainly make these issues easier to deal with and provide flexibility with plotting and managing the massive output files :)

huard commented 5 years ago

+1 for automated censoring of nans.