fkiraly opened 8 months ago
I would suggest fixing the `random_state` when calling sklearn's `train_test_split` to get a reproducible result. Best to use one which converges for our test case ;).
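A minimal sketch of the suggestion (toy data, not the actual test code): pinning `random_state` makes the train/test partition, and hence the convergence behaviour, the same on every run.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Same seed -> identical split on every call.
X_tr1, X_te1, y_tr1, y_te1 = train_test_split(X, y, test_size=0.3, random_state=42)
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X, y, test_size=0.3, random_state=42)

assert (X_tr1 == X_tr2).all() and (y_te1 == y_te2).all()
```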
that fixes the failing test, though it might be worth checking if there is a user facing problem here.
I don't think so. Usually, Cyclic Boosting is quite robust and convergence issues rare. The data set used in the test might not be the best choice for quantile regression.
> The data set used in the test might not be the best choice for quantile regression.
Should we then replace it with a better choice?
Maybe California housing?
Hm, there are also still timeouts appearing in the tests - in maybe 10% of cases or so, it runs forever. Example:
```
FAILED skpro/tests/test_all_estimators.py::TestAllEstimators::test_fit_does_not_overwrite_hyper_params[CyclicBoosting-ProbaRegressorBasic] - Failed: Timeout >600.0s
```
Would be great if you could diagnose what is going on and try to make sure the algorithm always terminates. You can run the tests locally by using `skpro.utils.check_estimator`.
Not a big issue, as we can manage with the tests. It is still odd, though, since the data sets used in testing are fairly small and similar to the toy data sets a typical user would start playing with (long runtimes there could scare users off, even if accidental).
Hm, that's weird, because we have a hard stop at 10 iterations (if I'm not missing anything).
Maybe they are just very long iterations then? 600 sec is 10min!
That should not happen with small data sets. Maybe something is messed up. Let me have a look these days.
> That should not happen with small data sets. Maybe something is messed up. Let me have a look these days.
I suppose it's good then that we have stringent tests. FYI, I think we also just found a bug in sklearn: https://github.com/sktime/skpro/pull/192
They do not seem to be testing their probabilistic interfaces systematically!
I couldn't get it to work locally in a hurry. Can you try to point me to the point in the Cyclic Boosting code at which it gets stuck? Then I might be able to get an idea what is happening.
> Can you try to point me to the point in the Cyclic Boosting code at which it gets stuck? Then I might be able to get an idea what is happening.
The CI unfortunately does not say: there is no traceback if it keeps running forever. I would suggest you manually run the code in a loop and kill it at the first loop run that takes too long; that will give you a traceback. E.g., print out the loop index and kill the process when the next number does not appear for 10 minutes.
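One way to automate the "kill it when it hangs" diagnosis (a sketch, not part of skpro; Unix only, since it uses `SIGALRM`): wrap the suspect call in an alarm, so a hang raises an exception whose traceback points at the stuck line.

```python
import signal

def _timeout_handler(signum, frame):
    # raised inside whatever code is currently running -> traceback shows where it hung
    raise TimeoutError("iteration exceeded time budget")

signal.signal(signal.SIGALRM, _timeout_handler)

def run_with_timeout(fn, seconds):
    """Run fn(); raise TimeoutError if it does not finish within `seconds`."""
    signal.alarm(seconds)
    try:
        return fn()
    finally:
        signal.alarm(0)  # cancel the pending alarm

# toy stand-in for a non-terminating boosting iteration
def slow_iteration():
    while True:
        pass

caught = None
try:
    run_with_timeout(slow_iteration, 1)
except TimeoutError as exc:
    caught = exc

print(type(caught).__name__)  # TimeoutError
```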
The culprits are different tests, for instance `test_cyclic_boosting_with_manual_paramaters` (sic - the name is spelled that way in the code base).
Can you try with multiplicative instead of additive mode in `test_cyclic_boosting_with_manual_paramaters`? The target range of this data set and the chosen QPD bound call for it.
sure: https://github.com/sktime/skpro/pull/200
in your estimation, is this more of a wild guess or a certain fix?
Hm, so far so good, I'll restart it a couple times for good measure.
The runtimes still seem very long, though:

```
407.99s call skpro/regression/tests/test_cyclic_boosting.py::test_cyclic_boosting_simple_use
292.31s call skpro/regression/tests/test_cyclic_boosting.py::test_cyclic_boosting_with_manual_paramaters
242.42s call skpro/regression/tests/test_all_regressors.py::TestAllRegressors::test_input_output_contract[CyclicBoosting]
213.67s call skpro/tests/test_all_estimators.py::TestAllEstimators::test_fit_returns_self[CyclicBoosting-ProbaRegressorSurvival]
212.68s call skpro/tests/test_all_estimators.py::TestAllEstimators::test_fit_updates_state[CyclicBoosting-ProbaRegressorBasic]
210.71s call skpro/tests/test_all_estimators.py::TestAllEstimators::test_fit_returns_self[CyclicBoosting-ProbaRegressorXcolMixIxYnp]
```
I think the main reason for this is that we have an individual QPD for each sample. That's a good thing, of course, but a bit expensive.
Another thing is that the quantile regression mode in Cyclic Boosting is not yet optimized in terms of runtime. (Its optimization is done differently from the older modes, which are very fast.) But I have opened an issue to speed it up. I'm sure it can be made a lot faster with a bit of tuning.
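For context, quantile regression modes like this one typically minimise the pinball (quantile) loss per quantile level, which is why each extra quantile adds training cost. A small illustrative implementation (not the Cyclic Boosting code):

```python
import numpy as np

def pinball_loss(y_true, y_pred, q):
    """Pinball loss at quantile level q (q=0.5 recovers half the MAE)."""
    diff = y_true - y_pred
    return np.mean(np.maximum(q * diff, (q - 1) * diff))

y_true = np.array([1.0, 2.0, 3.0, 4.0])
# Predicting a constant 2.5 at the median level (q=0.5):
print(pinball_loss(y_true, np.full(4, 2.5), 0.5))  # 0.5, i.e. half of mean |y - 2.5|
```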
> That's a good thing, of course, but a bit expensive.
What exactly is the bottleneck? Is it computations in the distribution, the fitting, prediction, etc? Perhaps I can help speed it up.
The Cyclic Boosting quantile regression and subsequent QPD "construction" are two independent steps. In fact, the QPD step can work with any quantile prediction, no matter how it was done (e.g., `HistGradientBoostingRegressor` in sklearn is a good alternative). We might even consider moving it into an independent package, if there is enough interest.
For Cyclic Boosting quantile regression, the main issue is the iterative optimization in training, which is not optimized for speed at all so far. I just made it work.
For QPD, I think most of the time is lost in the overhead of the for loop over the data set during inference (there is no training or fitting here). Maybe there is a way to "vectorize" this?
Your help is more than welcome!
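The vectorisation idea can be sketched as follows (hypothetical stand-in code, not the actual QPD implementation: `qpd_eval` is an invented per-sample computation): replace one Python-level call per row with a single array operation over all rows.

```python
import numpy as np

rng = np.random.default_rng(0)
# 1000 samples, each with a sorted (low, median, high) quantile triplet
low, med, high = np.sort(rng.normal(size=(1000, 3)), axis=1).T

def qpd_eval(lo, md, hi):
    # stand-in for the per-sample work done inside the inference loop
    return 0.25 * lo + 0.5 * md + 0.25 * hi

# loop version: one Python call per sample (the current bottleneck pattern)
looped = np.array([qpd_eval(l, m, h) for l, m, h in zip(low, med, high)])

# vectorised version: one call on whole arrays, no Python-level loop
vectorised = 0.25 * low + 0.5 * med + 0.25 * high

assert np.allclose(looped, vectorised)
```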
> In fact, the QPD step can work with any quantile prediction, no matter how it was done
Interesting - this could be a feature to add to the multiple quantile regressor. What is the abstract type of it? Converting a sequence of quantile predictions into a distribution?
> For QPD, I think most of the time is lost in the overhead of the for loop over the data set during inference (there is no training or fitting here). Maybe there is a way to "vectorize" this?
Can you link to it?
PS: sporadic test failures due to non-termination are still there, even with the `"multiplicative"` setting: https://github.com/sktime/skpro/actions/runs/7733735315/job/21087278609?pr=200
Yes: symmetric quantile triplet in, distribution out.
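A minimal sketch of that abstract type (the class name and the piecewise-linear interpolation are illustrative assumptions, not the actual QPD construction): a symmetric triplet at levels `(alpha, 0.5, 1 - alpha)` goes in, a distribution-like object with a `ppf` comes out.

```python
import numpy as np

class TripletDistribution:
    """Toy 'quantile triplet in, distribution out' object."""

    def __init__(self, q_low, q_med, q_high, alpha=0.1):
        self.levels = np.array([alpha, 0.5, 1 - alpha])
        self.values = np.array([q_low, q_med, q_high])

    def ppf(self, p):
        # piecewise-linear interpolation between the three anchors
        return np.interp(p, self.levels, self.values)

d = TripletDistribution(q_low=1.0, q_med=2.0, q_high=4.0, alpha=0.1)
print(d.ppf(0.5))  # 2.0 - the median anchor is reproduced exactly
```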
the suggested fix did not work, any suggestions?
Let me see if I can speed up the QPD loop. That should solve the issue.
@fkiraly I have found an upstream solution by allowing QPD to take full arrays rather than individual quantile values. I'm confident that will fix this here. Give me some days to work it out and build a new Cyclic Boosting release.
ok, thanks - then there's nothing to do in skpro, just wait for the new release, right?
I'll move towards the 2.2.0 release then.
Slight adaptations will be needed: getting rid of the loop over QPD calls. But we can do that when I'm done in Cyclic Boosting.
Ok, Cyclic Boosting 1.4.0 is there. @setoguchi-naoki will go ahead and make the relevant changes here (he already knows what to do). In short:
Nice!
Related question: I was thinking of implementing a more general quantile-parameterized distribution, i.e., not just with three anchors, but an arbitrary number. This is also done implicitly as part of `MultipleQuantileRegressor` - would this be of interest to you? This might be a better default for `predict_proba` if `predict_quantiles` is implemented than the current normal approximation.
Don't forget: The more quantiles you need, the more expensive it gets ... at least with the usual pinball loss approaches.
> Don't forget: The more quantiles you need, the more expensive it gets ... at least with the usual pinball loss approaches.
I'm considering a situation where you already have them - they do not necessarily need to be fitted independently.
The recently added Cyclic Boosting estimator sporadically fails tests due to failed convergence of the loss, e.g.:
FYI @setoguchi-naoki, @felixwick