Open DerWeh opened 1 month ago
A related set of questions has also come up recently in the AutoGluon project with regard to timeouts and callbacks https://github.com/autogluon/autogluon/pull/4480#issuecomment-2365060505, so this would be good for us to solve. My default thinking until recently has been that other improvements like speed, memory, and model quality were more immediately important, but considering how useful it would be to AutoGluon, I now think this is one of the more important things that could be added to EBMs. Is this something you'd be interested in looking into @DerWeh?
Whatever gets added will need to go somewhere in this loop: https://github.com/interpretml/interpret/blob/0ad00938c1aab4b06ffd400d358f912d78b54dca/python/interpret-core/interpret/glassbox/_ebm/_boost.py#L83
Allowing the user to pass a callback, which can also stop the iterations early, is in principle simple enough and quite general (as long as we define the stopping criteria beforehand). We could either define the important variables that are passed to the callback, or simply pass the locals, which might be more powerful but provides no stable API.
What complicates things is the multiprocessing that EBMs default to: the callback would need to be pickled and run in a different process...
Is there a particular reason you stick to multiprocessing instead of multi-threading by default (see `prefer="threads"` in `joblib.Parallel`)? Personally, I observe no big difference in performance. The intensive parts are done in the C++ library, so the GIL is no issue (as long as the library is thread-safe). For a callback, the difference between multi-threading and multiprocessing might be rather significant.
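For reference, a minimal standalone sketch of what running the per-bag jobs in threads instead of processes looks like with joblib (not the actual interpret code):

```python
from joblib import Parallel, delayed

# With prefer="threads" the workers are threads, so a callback (or any
# closure) does not need to be pickled to reach them.
results = Parallel(n_jobs=4, prefer="threads")(
    delayed(pow)(i, 2) for i in range(8)
)
print(results)  # [0, 1, 4, 9, 16, 25, 36, 49]
```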
On a side note, I also don't understand the benefit of the additional complexity of the `JobLibProvider` you use. Joblib already offers a mechanism for different backends; see the ray example and the custom backend API.
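For illustration, a sketch of joblib's own backend-selection mechanism (here using the built-in `"threading"` backend; a custom backend such as ray's, registered via `ray.util.joblib.register_ray()`, could be selected the same way by name):

```python
from joblib import Parallel, delayed, parallel_backend

# Select the joblib backend by name for everything inside the context manager.
with parallel_backend("threading", n_jobs=2):
    results = Parallel()(delayed(abs)(-i) for i in range(4))
print(results)  # [0, 1, 2, 3]
```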
Getting a meaningful timeout for fewer jobs than outer bags is non-trivial in the current form (as you mentioned in the issue). Canceling the loop after a certain time is straightforward, but if the bags run one after another, some bags might have finished, one aborted early, and some not run at all... One option would of course be giving each bag a fraction of the time and passing the responsibility to the user. One could use a construct like:
```python
import datetime

# the previous while loop, now driven by the callback
for step_idx in call_back(max_steps):
    ...

# definition of a callback
def timeout_callback_generator(maxtime):
    def timeout_callback(max_steps):
        start_time = datetime.datetime.now()
        for step_idx in range(max_steps):
            yield step_idx
            if datetime.datetime.now() - start_time > maxtime:
                print("Maximal time exceeded, stopping before convergence")
                break
    return timeout_callback
```
Just a general idea, a rough sketch. Of course, this would be a "soft timeout", as we check the time after every iteration instead of killing the job after a fixed time. A hard timeout would also be possible, but I doubt the benefit is worth the effort.
We use ctypes, which I thought didn't have an option to release the GIL, so I'm not sure why there isn't an impact when using `prefer="threads"`.
I agree the API might be unstable and the process state handling would change once we move threading into C. I'm thinking the cost of potentially making a breaking change in the future is worth the benefit today. Hopefully, callback usage would be niche, and therefore not break too many users. We might also be able to version the function by looking at the number of parameters.

In terms of the callback API, I was thinking that instead of looping and yielding inside the callback, the boost function would contain the loop, and we'd call the callback function on each loop iteration. That would allow us to pass things like the current validation metric, the loop iteration count, etc. The callback could return a bool to terminate boosting. The callback would have to somehow hold per-outer-bag state, which could be done via a global dictionary if we pass in the bag index.
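To make that concrete, a hypothetical sketch of the loop side (names like `boost_one_round` are placeholders, not the actual `_boost.py` code):

```python
def boost_bag(bag_index, max_steps, boost_one_round, callback=None):
    """Hypothetical per-bag boosting loop with a per-iteration callback hook."""
    for step_idx in range(max_steps):
        metric = boost_one_round()  # placeholder for one boosting step over all terms
        if callback is not None and not callback(bag_index, step_idx, metric):
            break  # callback returned False -> terminate this bag's boosting early
```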
We do still have the messy issues with determining which work to include in the final model. That seems like a pretty fundamental problem. Maybe a simple heuristic would work, like only including completed models if any model reaches completion, otherwise include all partly completed models. Not ideal, but sometimes messiness is required to get something practical.
I think JobLibProvider isn't required and could be simplified away. I didn't write that section, so maybe there's something I'm missing, but I think you're correct on that.
> We use ctypes, which I thought didn't have an option to release the GIL, so I'm not sure why there isn't an impact when using `prefer="threads"`.
I am by no means an expert on `ctypes`; I mostly use Cython to wrap code. But according to the documentation, `CDLL` indeed releases the GIL; this is hidden in the docstring of `PyDLL`:

> Instances of this class behave like CDLL instances, except that the Python GIL is not released during the function call, and after the function execution the Python error flag is checked.
FYI: NumPy provides `ctypeslib` with some convenience functions simplifying the usage of `ctypes`; see also https://numpy.org/doc/stable/user/c-info.python-as-glue.html#index-2
So multithreading is fine, as long as the library is thread-safe.
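As a tiny self-contained illustration of the distinction (using the system C math library on POSIX, not libebm): `ctypes.CDLL` releases the GIL around each foreign call, whereas `ctypes.PyDLL` keeps holding it.

```python
import ctypes
import ctypes.util

# CDLL releases the GIL during the foreign call, so other Python threads can
# run while the native code executes; PyDLL would keep the GIL held instead.
libm = ctypes.CDLL(ctypes.util.find_library("m"))
libm.cos.restype = ctypes.c_double
libm.cos.argtypes = [ctypes.c_double]
print(libm.cos(0.0))  # 1.0
```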
> In terms of the callback API, I was thinking that instead of looping and yielding inside the callback, the boost function would contain the loop, and we'd call the callback function on each loop iteration.
This was also my first thought. However, to use a timeout, I think we need state (the start time). A generator seems the most natural; `send` would allow providing values to the generator. But I really have to implement a prototype to see if this works out nicely or not. I'll probably try a few prototypes, which we can discuss.
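A rough sketch of what the generator/`send` variant could look like (purely illustrative, no interpret API):

```python
import datetime

def timeout_callback(maxtime):
    """Yield True while the time budget lasts, then False; state lives in the generator."""
    start_time = datetime.datetime.now()
    while datetime.datetime.now() - start_time < maxtime:
        metric = yield True  # the boosting loop would send() the current metric in here
    yield False              # time budget exhausted -> signal the loop to stop

# How the (hypothetical) boosting loop would drive it:
cb = timeout_callback(datetime.timedelta(minutes=5))
next(cb)  # prime the generator up to its first yield
# inside the loop: keep_boosting = cb.send(current_metric)
```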
> We do still have the messy issues with determining which work to include in the final model. That seems like a pretty fundamental problem. Maybe a simple heuristic would work, like only including completed models if any model reaches completion, otherwise include all partly completed models. Not ideal, but sometimes messiness is required to get something practical.
My approach would have been to provide every bag with an equal time limit. Of course, this places an additional burden on the user, who has to consider the number of bags and cores (assuming `n_jobs <= n_cores`).

For the simple case `n_jobs == n_bags`, the user provides `maxtime`. All bags are processed in parallel and aborted after `maxtime`; all should have roughly the same quality, so no special treatment is needed. For `n_jobs == 1`, the user provides `maxtime/n_bags` as the time limit. Bags are processed one after another, each finishing after `maxtime/n_bags` and the total fit after `maxtime`. Again, we don't need to worry about averaging. Of course, in this case the simple rule "more bags → more accurate model" doesn't hold anymore, as more bags might result in less converged bags.

Use cases with `n_jobs > 1` incommensurable with the number of bags are a bit tricky: `maxtime / ceil(n_bags / n_jobs)` per bag should keep the overall time limit but wastes resources in the last round.
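As a sketch of that budgeting rule (hypothetical helper, not part of interpret):

```python
import math

def per_bag_time(maxtime, n_bags, n_jobs):
    """Split a total time budget across bags that are processed n_jobs at a time."""
    n_rounds = math.ceil(n_bags / n_jobs)  # number of sequential batches of bags
    return maxtime / n_rounds

per_bag_time(3600, 8, 8)  # 3600.0 -> all bags run in parallel, full budget each
per_bag_time(3600, 8, 1)  # 450.0  -> bags run sequentially, budget split evenly
per_bag_time(3600, 8, 3)  # 1200.0 -> 3 rounds; the last round leaves a core idle
```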
What are your thoughts? Another issue is the iterative nature: we first fit the main effects and then add interactions. How do we handle this? First fit mains, and if time is left, spend it on interactions? Or reserve a fixed budget for mains and for interactions?
Last, but very important: how precisely do we have to follow the timeout? Is it enough to check the time every iteration? Can we neglect everything else and focus only on the boosting? To really handle a timeout, it would probably be better to periodically write results to disk and have an external timeout manager that kills the fitting after the timeout and triggers the recovery of the last result. This would also allow manually killing the process, or investigating intermediate steps to see if everything is working as intended (exploiting the nice glass-box nature of EBMs).
This could be realized by (asynchronously) writing out a file tree like:
```
progress/
progress/ebm_configuration.json
progress/bagN/mains_iterM.npz
progress/bagN/interactions_iterL.npz
```
and providing a helper that creates an EBM from the latest snapshots. This would again require some heuristic for which results to include. One idea would be to write out the boosted metric with every result and include everything that is not worse than the median metric + X%.
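A rough sketch of what such a snapshot writer and recovery lookup could look like (everything here, the file layout included, is just the idea from above, not existing interpret functionality):

```python
import glob
import os

import numpy as np

def save_snapshot(progress_dir, bag_idx, stage, iteration, term_scores, metric):
    """Dump one bag's current state, e.g. progress/bag0/mains_iter100.npz."""
    bag_dir = os.path.join(progress_dir, f"bag{bag_idx}")
    os.makedirs(bag_dir, exist_ok=True)
    np.savez(
        os.path.join(bag_dir, f"{stage}_iter{iteration}.npz"),
        metric=metric,
        **{f"term_{i}": scores for i, scores in enumerate(term_scores)},
    )

def latest_snapshots(progress_dir, stage="mains"):
    """Return the newest snapshot file per bag, to rebuild an EBM after a timeout or crash."""
    latest = {}
    for path in glob.glob(os.path.join(progress_dir, "bag*", f"{stage}_iter*.npz")):
        bag = os.path.basename(os.path.dirname(path))
        iteration = int(os.path.basename(path).rsplit("iter", 1)[1].split(".")[0])
        if bag not in latest or iteration > latest[bag][0]:
            latest[bag] = (iteration, path)
    return {bag: path for bag, (iteration, path) in latest.items()}
```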
Small addendum:
> My default thinking until recently has been that other improvements like speed, memory, and model quality were more immediately important
I fully agree; the root issue is that fitting EBMs takes too long for big datasets (whatever "too long" means), forcing us to work around it. But unless you have ways to speed up training by one or two orders of magnitude, we're stuck.
Ah, very interesting. I didn't know that about CDLL. Yes, the library is thread-safe, so we can switch to using threads which simplifies things at least a little bit.
I think, probably, if my goal was to build the best model possible in a given amount of time, and I had to choose between shallow boosting all the bags or deep boosting just a few, boosting just a few as deeply as possible would result in the better model more often. Since we can't currently use multiple cores to advance a single bag, I think if we had N available cores, the best strategy currently would be to boost N bags to completion and then move to the next N bags if time allows. The nice thing about this is that it aligns with our current processing order.

For pairs, we currently choose them after all the mains are done. I don't think we want to change that since the pairs are chosen universally across all bags, so all the bags need to be done before we can do that in a consistent way. In theory, perhaps having a fixed time budget for the mains to allow some pair boosting time would result in a better model within a given amount of time, but that feels like it's getting rather complicated in terms of how these things would be specified.
I agree, the state-holding methodology is non-ideal. It is possible, but it isn't clear to me how obvious this will be to our users. Something like this works:
```python
import datetime

def callback_generator(maxtime):
    start_time = datetime.datetime.now()
    per_bag = {}  # per-outer-bag state, keyed by bag index

    def callback(bag_index, step_index, metric):
        prev_metrics = per_bag.setdefault(bag_index, [])
        prev_metrics.append(metric)
        if datetime.datetime.now() - start_time > maxtime:
            return False  # stop boosting
        if SOME_OTHER_CRITERIA:
            return False
        return True  # continue boosting

    return callback
```
Perhaps it would be more obvious to our users if we added a callback_args parameter along with a callback parameter to the EBM constructors. Then the user could do something like this:
```python
def callback(bag_index, step_index, metric, **kwargs):
    if datetime.datetime.now() - kwargs["start_time"] > kwargs["maxtime"]:
        return False
    if SOME_OTHER_CRITERIA:
        return False
    return True
```
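For illustration, hypothetical usage of the two variants (the `callback`/`callback_args` constructor parameters are the proposal under discussion, not existing interpret API):

```python
import datetime
from interpret.glassbox import ExplainableBoostingClassifier

one_hour = datetime.timedelta(hours=1)

# Option 1: state lives in the closure returned by callback_generator
ebm1 = ExplainableBoostingClassifier(callback=callback_generator(one_hour))

# Option 2: state passed explicitly through callback_args
ebm2 = ExplainableBoostingClassifier(
    callback=callback,
    callback_args={"start_time": datetime.datetime.now(), "maxtime": one_hour},
)
```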
I somewhat lean toward the first option since it keeps the main interface simpler by only adding a single new parameter to the constructor. We could always include an example in our documentation to show users how to handle state.
I'm not really familiar with `send`, so I'll read up on it. It would be interesting to see how it changes the feel of the API if used. I'll also think about the question of writing to disk and add some thoughts to this thread later. It doesn't feel like we've hit the best API yet, so let's keep discussing.
What is the recommended way to track the progress of fitting EBMs? In version 0.5.0, the logger provided the current boosting round and the value of the metric every 10 rounds. This was removed in version 0.6.0. Currently, I can't find any progress information during boosting.
For large datasets, fitting an EBM can take several days, so any form of progress indication would be highly welcome. Ideally, we would also be able to save and resume intermediate results (in case of power outages, or because we realize the results are already good enough for our purpose, or so bad that there is no point in further boosting).
To me, a progress indicator is quite important. It is very hard to estimate the runtime of boosting, as ideally we rely on early stopping. Thus, the final runtime depends on the "difficulty" of the dataset, making it hard to extrapolate runtimes.