Add support for sum of models and custom models

rhugonnet commented 1 year ago

Early draft! Stopped here to discuss the way forward, I prefer having your point of view before adjusting too many things in Variogram! :slightly_smiling_face:

For now:

Added another test to check custom model definition (there were some already): test passed immediately,
Added a test to check custom model fitting: test didn't pass, required the adjustment below,
Added a boolean class attribute Variogram._is_model_custom defined during set_model(). This allows to separate the definition of bounds and p0 in Variogram.fit(), defined based the length of args from inspect.

However, doing it the current way, bounds and p0 will be quite poorly defined (at this point the script cannot tell which parameter matches which dimension). The user would likely have to pass them manually to reach a good fitting performance. Additionally, the current implementation will raise issues in Variogram.describe() which then passes on to Variogram.parameters.

One way we could move forward would be to divide in 2 cases:

Add specific support for models that are a combination of models existing in skgstat/models, which would allow ._get_bounds(), .describe(), .parameters to function normally and make it easy for the user;
Add basic support for custom models, with a warning to the user that he needs to pass his own bounds and p0 to reach good fitting performance (in case those aren't passed).

To implement 1., however, I see several questions:

How to conveniently pass the combination of existing models to the model argument? I'm not sure of the best option. A dictionary such as model={"op": np.sum, "models": ["spherical", "gaussian", "matern"]} could do the job? But maybe this will get overly complex. Thinking about it, for something more complex than a sum, I'm not sure we'll be able to benefit much from knowing the nature of the models to infer "bounds" and "p0" efficiently anyways. We could simply leave it to case number 2, and add support just the sum for convenience, for which we can easily pass a list ["spherical", "gaussian", "matern"]?
How to return .parameters? Right now it's a list, should we simply concatenate into a longer one keeping always the same parameter order? Or switch to a dictionary?
For describe(), I guess we can simply detect that it is a combination of models and return a more detailed list of parameters.

To implement 2.:

We have to allow bounds and p0 to be passed by the user. I guess this should be both in Variogram.fit() and __init__()? We could add fit_bounds and fit_p0 in Variogram.__init__. But there's already a lot of arguments, so could also be in kwargs. What do you think?

That's all I noticed for now, but there might be a few additional things down the line! :sweat_smile:

Resolves #159

codecov[bot] commented 1 year ago

Codecov Report

Attention: 2 lines in your changes are missing coverage. Please review.

Comparison is base (c2d63d1) 89.75% compared to head (9690d39) 90.74%.

Additional details and impacted files

```diff @@ Coverage Diff @@ ## main #160 +/- ## ========================================== + Coverage 89.75% 90.74% +0.99% ========================================== Files 23 23 Lines 2274 2475 +201 ========================================== + Hits 2041 2246 +205 + Misses 233 229 -4 ``` | [Files](https://app.codecov.io/gh/mmaelicke/scikit-gstat/pull/160?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Mirko+M%C3%A4licke) | Coverage Δ | | |---|---|---| | [skgstat/DirectionalVariogram.py](https://app.codecov.io/gh/mmaelicke/scikit-gstat/pull/160?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Mirko+M%C3%A4licke#diff-c2tnc3RhdC9EaXJlY3Rpb25hbFZhcmlvZ3JhbS5weQ==) | `90.32% <100.00%> (+0.52%)` | :arrow_up: | | [skgstat/Variogram.py](https://app.codecov.io/gh/mmaelicke/scikit-gstat/pull/160?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Mirko+M%C3%A4licke#diff-c2tnc3RhdC9WYXJpb2dyYW0ucHk=) | `96.36% <98.30%> (+0.73%)` | :arrow_up: | ... and [7 files with indirect coverage changes](https://app.codecov.io/gh/mmaelicke/scikit-gstat/pull/160/indirect-changes?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Mirko+M%C3%A4licke)

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

mmaelicke commented 1 year ago

Hi @rhugonnet,

Nice, thanks for all the effort! I am currently on vacation, but will be back next Tuesday. Is that early enough for you, if I only add a detailed response then?

A few quick thoughts: passing p0 and bounds might be a viable extension for the user in any case. As you can also manually fit a variogram model using fit_range on init, or range on fit, I think we can add these parameters like that. I would use the kwargs on init for that.

Maybe an option to define a sum-combination of pre-defined functions would be to add something like: model='spherical+gaussian'. That would be convenient and does not break the current semantics, as the parameter accepts strings for pre-defined models anyway. It would need to internal refactoring, as the Variogram.model property returns a function wrapped with models.variogram, ie. to set the right __name__ attribute on the function object (to be used in describe() and plotting). We would need some wrapper here, that can construct the actual model, and fill the __name__ property with something meaningful.

I would like to think about the parameters for a moment longer, and give you some thoughts on how to handle the parameters then. I think it would be problematic to change the structure of Variogram.parameters as a number of interfaces ie. to gstools.CovModel or gstatsim.Interpolate rely on that list. But I am sure, we can come up with something useful here. Have to think about that after my vacation.... :)

Thanks again and you will hear from me soon, Mirko

rhugonnet commented 1 year ago

Sounds good, I like all ideas!

No problem, priority to holiday and I'm busy on other things next week anyway, will continue the week after next :slightly_smiling_face: Enjoy your vacation @mmaelicke! :wink:

rhugonnet commented 1 year ago

Sorry, was still working on your comments! Just finalized the logic to support model names such as "spherical+gaussian" across the Variogram methods that used _model.__name__. Now I need to add more tests + doc, for now all existing tests pass. I'll do that tomorrow!

rhugonnet commented 1 year ago

Alright, I think this PR is now at a good stage! :smile:

In short: Variogram now fully supports model sums such as "spherical+gaussian", as well as custom models from callable!

To make this work for custom models: You've already seen 80% of the changes in the first draft, I just had to add a couple other exceptions in describe() and parameters, now it all works!

To make this work for sums: I introduced a _model_name class attribute, and added two non-public functions: _get_argpos_sum_models to get argument positions and _build_sum_models to create the summed model. Those are then used by set_model(), describe(), fit() and parameters to properly set/get the arguments of each model in the sum. By default, describe() returns a dict with effective_range_1, sill_1, nugget_1, effective_range_2, and so on... to make it "human-readable". And parameters returns a list of those in order given by the model name ("spherical+gaussian" = spherical first, gaussian second).

Wrote a lot of tests, and added descriptions in the function docstrings!

Points left to finalize:

How to interpret use_nugget for a sum of models? The user might only want to use nuggets for a subset of them. Should we then allow a boolean iterable to be passed for use_nugget? (right now untouched; all nugget or none)
Some tests are failing for only specific versions of Python (3.6 has a 4th decimal different in an array comparison for the mean residual test: https://github.com/mmaelicke/scikit-gstat/actions/runs/5933246824/job/16088390540, Python 3.9 has slightly different outputs of scipy.curve_fit: https://github.com/mmaelicke/scikit-gstat/actions/runs/5933108779/job/16088046636). Any idea how to fix those?
There's a couple new warnings in test_variogram when calling the stable model (divide by zero). Will investigate this...
Should I write a little example in the documentation? Maybe something here on sum of models + custom model: https://scikit-gstat.readthedocs.io/en/latest/userguide/variogram.html#variogram-models, and something here on passing custom bounds/p0 for the fitting https://scikit-gstat.readthedocs.io/en/latest/technical/fitting.html?

mmaelicke commented 1 year ago

Hi,

I really like your changes! I think it would be helpful if you could add an example, or maybe even a little tutorial notebook to the docs for others.

Concerning all the other points: I will think them through and give you my opinion on them. Py 3.9 and 3.10 failing is happening for all branches right now, I will investigate what is going on there. Maybe that is related to bugfixes in sicpy, which might change the parameters estimated by curve_fit. Maybe we should run unittests way more fuzzy, as curve_fit is not really SciKit-GStat's business. Not sure yet.

I'll come back to this soon

rhugonnet commented 1 year ago

Thanks for the feedback, will start on the tutorial and wait for the other points!

For the curve_fit error on Python>3.9, they're fixing it in SciPy: https://github.com/mmaelicke/scikit-gstat/issues/161#issuecomment-1689154412. I guess it's good if dependent packages like SciKit-GStat have tests relying on it, so we can report to them when this happens (which hopefully should be quite rare... they really need to improve their test suite, that was quite a big error!).
For the error on mean_residual in Python 3.6: it looks like a floating precision error due to an old NumPy adjustment that happened between 1.19 (Python 3.6) and 1.21 (Python 3.7). I couldn't track what change exactly. We could simply change the assertAlmostEqual precision to 3 decimals here and for the 2 other statistical tests: https://github.com/mmaelicke/scikit-gstat/blob/main/skgstat/tests/test_variogram.py#L989?

rhugonnet commented 1 year ago

Hi @mmaelicke! Hope you had a great summer.

I finally finished this PR today:

Added an example for sum of models and custom model in the User Guide, all building normally!
Fixed the Python 3.6 CI error by changing the floating precision of the test,
Fixed the logic for a nugget in a sum: I realized that I was a bit stupid, a sum of nuggets or a single nugget is the same! So the use_nugget parameter can stay a boolean, and the sum of models function can ignore all nuggets except the one of the last model to keep the logic in fit() working! :smile:

Added some tests to ensure this works as intended. Everything is passing on SciPy 1.11.1 except this fix: #162, still waiting for them to release 1.11.3 to resolve the other failures. :sweat_smile:

I'm thinking I could also add a tutorial/technical note on fitting multiple range using both RasterMetricSpace + sum of models. But maybe in a different PR where we also add RasterMetricSpace to the API?

mmaelicke commented 1 year ago

I will add my review as soon as possible.

I'm thinking I could also add a tutorial/technical note on fitting multiple range using both RasterMetricSpace + sum of models. But maybe in a different PR where we also add RasterMetricSpace to the API?

That sounds great! And yes, I would also do that in a separate PR.

rhugonnet commented 1 year ago

Just did some tests in this PR changing the requirements.txt, and it looks like the tests that fail for the stable_entropy binning come from updates in NumPy 1.25. Additionally, it looks like the fix in SciPy 1.11.3 did not fix the issue for some of our cases... We still need 1.11.1 to pass. They're discussing some additional issues introduced in 1.11.2 here: https://github.com/scipy/scipy/issues/19309. I'll investigate and report a reproducible example of our issue in SciPy if I can get one!

rhugonnet commented 1 year ago

Otherwise all passing with NumPy 1.24 and SciPy 1.11.1! :sweat_smile:

rhugonnet commented 1 year ago

For the SciPy issue persisting with 1.11.3, I maanged to make a reproducible example and opened an issue on SciPy directly, see: https://github.com/mmaelicke/scikit-gstat/issues/161#issuecomment-1749490016

Investigating the NumPy one...

rhugonnet commented 1 year ago

It looks like the NumPy issue started with 1.25, but I couldn't reproduce it locally or track down why exactly... Maybe we could simply reduce the floating precision of the stable_entropy bin estimation to 0 decimals? It's not very far off

mmaelicke commented 1 year ago

It looks like the NumPy issue started with 1.25, but I couldn't reproduce it locally or track down why exactly... Maybe we could simply reduce the floating precision of the stable_entropy bin estimation to 0 decimals? It's not very far off

I was also not successful in finding the exact problem. I guess we go for less precision, but open an issue to remind ourselves to test again with newer numpy versions?

rhugonnet commented 1 year ago

Yes, I agree! I removed the NumPy version pinning, opened #166 and changed the test precision to 0 decimals.

For SciPy: they are going to revert the changes to how it was in 1.11.1 and before, and release a 1.11.4, but it might take some time. In the meantime, should we merge this PR with the version pinning in requirements.txt to <=1.11.1 and open an issue to unpin once they make the release? It would allow to test the other opened PRs without having this issue in the CI!

mmaelicke commented 1 year ago

I agree. Thanks for all the input, I am really glad that you are aware of all the changes happening in scipy.

mmaelicke commented 1 year ago

Hey @rhugonnet,

The pre-commit is now merged into main with #167. I just resolved the merge conflicts. There are some minor fixes, pre-commit complains about. I can't push on your fork, so could you fix the pre-commit stuff? Then we can finally merge and have a working repo again.

Thanks for all your effort

rhugonnet commented 1 year ago

@mmaelicke Perfect, thanks! All done, we can merge :slightly_smiling_face: I'll open an issue to remind ourselves to unpin SciPy when their bug is fixed.

If you want to work directly on the PRs I opened, I think normally you can push on someone's fork (maybe depends on repo params? but usually works for me):

git remote add rhugonnet https://github.com/rhugonnet/scikit-gstat.git
git fetch rhugonnet --prune
git checkout rhugonnet/combine_models
# Will go in "detached HEAD" mode
# Make the changes and commit them
git push rhugonnet HEAD:combine_models

mmaelicke commented 1 year ago

If you want to work directly on the PRs I opened, I think normally you can push on someone's fork (maybe depends on repo params? but usually works for me):

Ah cool, I was unaware of that. Nice to see all the checks green again. I'll merge and go home :)

mmaelicke / scikit-gstat

Add support for sum of models and custom models #160

Codecov Report