ibayer / fastFM

fastFM: A Library for Factorization Machines
http://ibayer.github.io/fastFM
Other
1.08k stars 204 forks source link

Fit callbacks #145

Open ghost opened 5 years ago

ghost commented 5 years ago

Pull request for Python and Cython changes for fit_callbacks.

ibayer commented 5 years ago

None inform C++ side somehow so no cycles are wasted invoking the callback.

I don't think invoking the callback will have any noticeable overhead if we do this ones per iterations (at least for any non super small toy examples) if I'm wrong we can fix it later. Let's not risk premature optimization here. If you are sure it's a problem then let's do an experiment first to proof it.

One way to check is to catch TypeError which is raised if a function is invoked with less or more than the required number of arguments.

I don't yet see when it's useful to pass data from cpp to python via call back function. However it could be useful if the return of the python function could be used to stop the solver (return continue true/false).

passing all parameters as dict and invoking the python callback via **kwargs for greater flexibility

I would just keep the function as simple as possible for now.

ibayer commented 5 years ago

I changed my mind a bit about passing parameter from cpp to python. Passing a string would be very flexible as this would allow us to serialize json. We already do this to parse parameter from python to cpp. It's a bit dirty, not type checking etc. but very flexible.

ghost commented 5 years ago

It's a bit dirty, not type checking etc. but very flexible.

The broad idea I had with such a cpp sends json string -> python parses json into dict is to have support for all the ideas you mentioned like MCMC traces, RMSE plots, stopping early, etc with a single fit callback instead of having, for example, two fit callbacks, one for ALS and for MCMC.

We can't type check at compile-time level anyways because everything is void* and object and it has to be deferred to runtime. During runtime, if types do not match an exception will be raised which brings me to my second thought.

callback corresponds to fit_callback_t. One way to check is to catch TypeError which is raised if a function is invoked with less or more than the required number of arguments.

In any possible setup the function can raise exceptions:

Once these exceptions happen, unless caught, cause the program to end abruptly. The possible issue here is memory management:

    # Callback raises exception and python crashes probably.
    # Exception raised when fit_callback_wrapper executes in C++ space.
    cpp_ffm.fit_with_callback(s, m, d, fit_callback_wrapper, (<void*> callback))

    # Never gets called => Memory leak.
    del d
    del d
    del m

Anyways, in essence, I'm saying we must sanitize input and deal with 3rd party errors on one hand and on another hand, we must be flexible so we don't duplicate code and effort in C++ side.

ibayer commented 5 years ago

The broad idea I had with such a cpp sends json string -> python parses json into dict is to have support ...

I agree that's pretty cool.

In any possible setup the function can raise exceptions:

Exceptions can cause all kind of trouble as you have pointed out. Do we really need them? We could simply return an error string if c++ can't deal with the python input.

For type errors or not having enough or too many arguments.

We only take a single serialized json string as input and send error sting back if json isn't valid or values can't be used.

For whatever reasons we can't control due to 3rd party users doing weird things in their callbacks.

I don't think there is much we can do if the user changes the data dimensions or similar things.

# Callback raises exception and python crashes probably.

If python crashes then the memory allocated by the extension should be realized from the OS as well, or?

ibayer commented 5 years ago

Okay, I think I got a bit confused let's summarized it again.

python: call_back(string message) -> bool

message can also be used by cpp to send error message, bool decides if iterations should be continued

cpp: call_back(string message) -> void

message can be used to request solver specific information

ghost commented 5 years ago

Exceptions can cause all kind of trouble as you have pointed out. Do we really need them? We could simply return an error string if c++ can't deal with the python input.

C++ is OK anyways the problem is the callback function implementation.

Let's say there are two possible callback types, one for MCMC which takes traces and one for ALS which takes the current iteration index and RMSE/accuracy.

Problem 1:

def my_callback(iteration, rmse):
     # do something which throws an exception, here indexing issues
     arr = []
     arr[5] = 3

We must ensure that the solver remains in a stable state and that the user is warned of the issues their callback has. The definition of a stable state is what I need. Should the solver stop because an error was found or continue pretending there is no issue in the callback while the error message is printed.

Problem 2:

def my_callback(iteration, rmse):
     # everything is ok here
     plot_data.append((iteration, rmse))
mcmc_model = mcmc.FMRegressor()
mcmc_model.fit(X, y, my_callback)

Problem is that for MCMC my_callback should have signature traces_last : Dict[str, float], iteration : int, rmse: float and when calling the callback in the wrapper a TypeError is raised.

If python crashes then the memory allocated by the extension should be realized from the OS as well, or?

I don't know, probably but IIRC the C standard (and we are dealing with Cython which is C) does not specify. I think Linux does release the memory, don't know for other OSs.

ibayer commented 5 years ago

Thanks for spelling out the details!

Problem 1

Should the solver stop because an error was found or continue pretending there is no issue in the callback while the error message is printed.

Keep solver going whenever possible (or return false and finish the cpp call properly). We catch all exceptions and turn them into warnings.

Problem 2

def my_callback(iteration, rmse):

def my_callback(message):
     # everything is ok here
     d = parse(message)
     plot_data.append((d['iteration'], d['rmse']))

Parsing a message string would avoid the need to specify the signature explicitly. I know the flexibility comes at a cost but it gives us the full flexibility to explore what can be done using callbacks. We can still refactor later and use strict typing.