mmaelicke / scikit-gstat

Geostatistical variogram estimation expansion in the scipy style
https://mmaelicke.github.io/scikit-gstat/
MIT License
225 stars 55 forks source link

Add support for sum of models and custom models #160

Closed rhugonnet closed 1 year ago

rhugonnet commented 1 year ago

Early draft! Stopped here to discuss the way forward, I prefer having your point of view before adjusting too many things in Variogram! :slightly_smiling_face:

For now:

However, doing it the current way, bounds and p0 will be quite poorly defined (at this point the script cannot tell which parameter matches which dimension). The user would likely have to pass them manually to reach a good fitting performance. Additionally, the current implementation will raise issues in Variogram.describe() which then passes on to Variogram.parameters.

One way we could move forward would be to divide in 2 cases:

  1. Add specific support for models that are a combination of models existing in skgstat/models, which would allow ._get_bounds(), .describe(), .parameters to function normally and make it easy for the user;
  2. Add basic support for custom models, with a warning to the user that he needs to pass his own bounds and p0 to reach good fitting performance (in case those aren't passed).

To implement 1., however, I see several questions:

To implement 2.:

That's all I noticed for now, but there might be a few additional things down the line! :sweat_smile:

Resolves #159

codecov[bot] commented 1 year ago

Codecov Report

Attention: 2 lines in your changes are missing coverage. Please review.

Comparison is base (c2d63d1) 89.75% compared to head (9690d39) 90.74%.

Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #160 +/- ## ========================================== + Coverage 89.75% 90.74% +0.99% ========================================== Files 23 23 Lines 2274 2475 +201 ========================================== + Hits 2041 2246 +205 + Misses 233 229 -4 ``` | [Files](https://app.codecov.io/gh/mmaelicke/scikit-gstat/pull/160?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Mirko+M%C3%A4licke) | Coverage Δ | | |---|---|---| | [skgstat/DirectionalVariogram.py](https://app.codecov.io/gh/mmaelicke/scikit-gstat/pull/160?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Mirko+M%C3%A4licke#diff-c2tnc3RhdC9EaXJlY3Rpb25hbFZhcmlvZ3JhbS5weQ==) | `90.32% <100.00%> (+0.52%)` | :arrow_up: | | [skgstat/Variogram.py](https://app.codecov.io/gh/mmaelicke/scikit-gstat/pull/160?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Mirko+M%C3%A4licke#diff-c2tnc3RhdC9WYXJpb2dyYW0ucHk=) | `96.36% <98.30%> (+0.73%)` | :arrow_up: | ... and [7 files with indirect coverage changes](https://app.codecov.io/gh/mmaelicke/scikit-gstat/pull/160/indirect-changes?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Mirko+M%C3%A4licke)

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

mmaelicke commented 1 year ago

Hi @rhugonnet,

Nice, thanks for all the effort! I am currently on vacation, but will be back next Tuesday. Is that early enough for you, if I only add a detailed response then?

A few quick thoughts: passing p0 and bounds might be a viable extension for the user in any case. As you can also manually fit a variogram model using fit_range on init, or range on fit, I think we can add these parameters like that. I would use the kwargs on init for that.

Maybe an option to define a sum-combination of pre-defined functions would be to add something like: model='spherical+gaussian'. That would be convenient and does not break the current semantics, as the parameter accepts strings for pre-defined models anyway. It would need to internal refactoring, as the Variogram.model property returns a function wrapped with models.variogram, ie. to set the right __name__ attribute on the function object (to be used in describe() and plotting). We would need some wrapper here, that can construct the actual model, and fill the __name__ property with something meaningful.

I would like to think about the parameters for a moment longer, and give you some thoughts on how to handle the parameters then. I think it would be problematic to change the structure of Variogram.parameters as a number of interfaces ie. to gstools.CovModel or gstatsim.Interpolate rely on that list. But I am sure, we can come up with something useful here. Have to think about that after my vacation.... :)

Thanks again and you will hear from me soon, Mirko

rhugonnet commented 1 year ago

Sounds good, I like all ideas!

No problem, priority to holiday and I'm busy on other things next week anyway, will continue the week after next :slightly_smiling_face: Enjoy your vacation @mmaelicke! :wink:

rhugonnet commented 1 year ago

Sorry, was still working on your comments! Just finalized the logic to support model names such as "spherical+gaussian" across the Variogram methods that used _model.__name__. Now I need to add more tests + doc, for now all existing tests pass. I'll do that tomorrow!

rhugonnet commented 1 year ago

Alright, I think this PR is now at a good stage! :smile:

In short: Variogram now fully supports model sums such as "spherical+gaussian", as well as custom models from callable!

To make this work for custom models: You've already seen 80% of the changes in the first draft, I just had to add a couple other exceptions in describe() and parameters, now it all works!

To make this work for sums: I introduced a _model_name class attribute, and added two non-public functions: _get_argpos_sum_models to get argument positions and _build_sum_models to create the summed model. Those are then used by set_model(), describe(), fit() and parameters to properly set/get the arguments of each model in the sum. By default, describe() returns a dict with effective_range_1, sill_1, nugget_1, effective_range_2, and so on... to make it "human-readable". And parameters returns a list of those in order given by the model name ("spherical+gaussian" = spherical first, gaussian second).

Wrote a lot of tests, and added descriptions in the function docstrings!

Points left to finalize:

mmaelicke commented 1 year ago

Hi,

I really like your changes! I think it would be helpful if you could add an example, or maybe even a little tutorial notebook to the docs for others.

Concerning all the other points: I will think them through and give you my opinion on them. Py 3.9 and 3.10 failing is happening for all branches right now, I will investigate what is going on there. Maybe that is related to bugfixes in sicpy, which might change the parameters estimated by curve_fit. Maybe we should run unittests way more fuzzy, as curve_fit is not really SciKit-GStat's business. Not sure yet.

I'll come back to this soon

rhugonnet commented 1 year ago

Thanks for the feedback, will start on the tutorial and wait for the other points!

rhugonnet commented 1 year ago

Hi @mmaelicke! Hope you had a great summer.

I finally finished this PR today:

Added some tests to ensure this works as intended. Everything is passing on SciPy 1.11.1 except this fix: #162, still waiting for them to release 1.11.3 to resolve the other failures. :sweat_smile:

I'm thinking I could also add a tutorial/technical note on fitting multiple range using both RasterMetricSpace + sum of models. But maybe in a different PR where we also add RasterMetricSpace to the API?

mmaelicke commented 1 year ago

I will add my review as soon as possible.

I'm thinking I could also add a tutorial/technical note on fitting multiple range using both RasterMetricSpace + sum of models. But maybe in a different PR where we also add RasterMetricSpace to the API?

That sounds great! And yes, I would also do that in a separate PR.

rhugonnet commented 1 year ago

Just did some tests in this PR changing the requirements.txt, and it looks like the tests that fail for the stable_entropy binning come from updates in NumPy 1.25. Additionally, it looks like the fix in SciPy 1.11.3 did not fix the issue for some of our cases... We still need 1.11.1 to pass. They're discussing some additional issues introduced in 1.11.2 here: https://github.com/scipy/scipy/issues/19309. I'll investigate and report a reproducible example of our issue in SciPy if I can get one!

rhugonnet commented 1 year ago

Otherwise all passing with NumPy 1.24 and SciPy 1.11.1! :sweat_smile:

rhugonnet commented 1 year ago

For the SciPy issue persisting with 1.11.3, I maanged to make a reproducible example and opened an issue on SciPy directly, see: https://github.com/mmaelicke/scikit-gstat/issues/161#issuecomment-1749490016

Investigating the NumPy one...

rhugonnet commented 1 year ago

It looks like the NumPy issue started with 1.25, but I couldn't reproduce it locally or track down why exactly... Maybe we could simply reduce the floating precision of the stable_entropy bin estimation to 0 decimals? It's not very far off

mmaelicke commented 1 year ago

It looks like the NumPy issue started with 1.25, but I couldn't reproduce it locally or track down why exactly... Maybe we could simply reduce the floating precision of the stable_entropy bin estimation to 0 decimals? It's not very far off

I was also not successful in finding the exact problem. I guess we go for less precision, but open an issue to remind ourselves to test again with newer numpy versions?

rhugonnet commented 1 year ago

Yes, I agree! I removed the NumPy version pinning, opened #166 and changed the test precision to 0 decimals.

For SciPy: they are going to revert the changes to how it was in 1.11.1 and before, and release a 1.11.4, but it might take some time. In the meantime, should we merge this PR with the version pinning in requirements.txt to <=1.11.1 and open an issue to unpin once they make the release? It would allow to test the other opened PRs without having this issue in the CI!

mmaelicke commented 1 year ago

I agree. Thanks for all the input, I am really glad that you are aware of all the changes happening in scipy.

mmaelicke commented 1 year ago

Hey @rhugonnet,

The pre-commit is now merged into main with #167. I just resolved the merge conflicts. There are some minor fixes, pre-commit complains about. I can't push on your fork, so could you fix the pre-commit stuff? Then we can finally merge and have a working repo again.

Thanks for all your effort

rhugonnet commented 1 year ago

@mmaelicke Perfect, thanks! All done, we can merge :slightly_smiling_face: I'll open an issue to remind ourselves to unpin SciPy when their bug is fixed.

If you want to work directly on the PRs I opened, I think normally you can push on someone's fork (maybe depends on repo params? but usually works for me):

git remote add rhugonnet https://github.com/rhugonnet/scikit-gstat.git
git fetch rhugonnet --prune
git checkout rhugonnet/combine_models
# Will go in "detached HEAD" mode
# Make the changes and commit them
git push rhugonnet HEAD:combine_models
mmaelicke commented 1 year ago

If you want to work directly on the PRs I opened, I think normally you can push on someone's fork (maybe depends on repo params? but usually works for me):

Ah cool, I was unaware of that. Nice to see all the checks green again. I'll merge and go home :)