nestordemeure / friedrich

A Rust implementation of Gaussian Process regression.
Apache License 2.0

Integrating friedrich into linfa #1

Open LukeMathWalker opened 4 years ago

LukeMathWalker commented 4 years ago

I have started to review friedrich with the aim of integrating it into linfa.

I imagine the integration using a "shim" crate in the linfa repository, named linfa-gaussian-processes, which re-exports all or part of friedrich's public API. This ensures that linfa is in control of naming conventions, interfaces and whatever is required to give the overall linfa ecosystem a homogeneous feeling, while allowing you to retain ownership of friedrich and evolve it as you see fit.
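The shim idea could be as small as a re-export module; a hypothetical sketch (the exact module paths are illustrative, not a statement of friedrich's actual public API):

```rust
// linfa-gaussian-processes/src/lib.rs -- hypothetical shim sketch.
// Re-export the parts of friedrich that linfa wants to expose,
// so linfa controls naming while friedrich keeps ownership of the code.
pub use friedrich::gaussian_process::GaussianProcess;
pub use friedrich::kernel::Kernel;
pub use friedrich::prior::Prior;
```

Downstream users would then depend only on linfa-gaussian-processes, and renames or interface adjustments can happen in the shim without touching friedrich itself.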

The alternative would be to move friedrich inside linfa as linfa-gaussian-processes, but that's a much more substantial step, hence I don't think that's what we want to go for (at least for now).

In terms of requirements, apart from some minor improvements here and there that I should be able to submit as PRs against friedrich over the Christmas holidays, I'd like to understand the following:

nestordemeure commented 4 years ago

I agree with the shim proposal (it would keep friedrich small for people not wanting to depend on the full linfa and make your interface homogeneous). If you provide me with a clear description of the naming / interface conventions or with a mock where all methods point to unimplemented!(), I can start working on it this Christmas.

I have actually tried both nalgebra and ndarray for my first prototypes, but I did not find some operations I needed in ndarray-linalg (I can't remember which ones, however). I believe that nalgebra is a better fit for my needs (and I have contributed some methods to nalgebra that should be available in the next version and make noticeable performance improvements available), but it is fairly easy to take ndarray types as inputs (as the nalgebra types are mostly an implementation detail that should be transparent to the user). I have not done it yet, but I plan to add the trait implementations to use ndarray types as inputs/outputs behind a feature flag (I can make it a priority, as it would be a blocker for integration in linfa).

Both kinds of benchmarks are high on my todo list; I will start that once I am fully happy with the internals and API (I am still wondering whether the current kernel trait and builder pattern can be improved). At the moment I have confirmed that it works as expected but not that it reproduces results from reference implementations.

I do think tests are needed but they are low priority for the moment (I want to confirm API stability before that).

(I should probably have a clean full and official TODO list somewhere)

LukeMathWalker commented 4 years ago

I am actually happy to work on the shim myself, to get familiar with the crate. I'll spend some time on it over the Christmas break and I'll pull you in for a review when I have something decent ready 👍

My main issue with nalgebra at the moment would be having to list both nalgebra and ndarray as dependencies in linfa - I don't consider it a deal breaker though, certainly not at this stage.

nestordemeure commented 4 years ago

Great!

I am currently working on adding ndarray's ArrayBase<S, Ix2> as an optional input type, which should give you all the building blocks needed (currently fighting with the Data trait...). By the way, is it preferable to use ArrayBase<S, Ix2> or Array2<f64> (much simpler but less general)?

I can understand that; nalgebra is a dependency that you clearly feel at compile time. Once I am fully happy with friedrich, and if it is integrated into linfa, I might retry ndarray-linalg, but it is clearly low on my priority list.

LukeMathWalker commented 4 years ago

> I am currently working on adding ndarray's ArrayBase<S, Ix2> as an optional input type, which should give you all the building blocks needed (currently fighting with the Data trait...). By the way, is it preferable to use ArrayBase<S, Ix2> or Array2<f64> (much simpler but less general)?

The first provides much better ergonomics to the user of the crate (they can pass in views, mutable views, owned arrays, etc.).
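That flexibility can be illustrated with a std-only analogy (these are not friedrich's actual signatures): a function generic over `AsRef<[f64]>` accepts owned vectors, borrowed slices, and fixed-size arrays alike, much as a function taking `ArrayBase<S, Ix2>` with `S: Data<Elem = f64>` accepts owned arrays, views, and mutable views.

```rust
// Std-only analogy for ndarray's Data-trait genericity:
// accept anything that can be viewed as a slice of f64.
fn mean<T: AsRef<[f64]>>(data: T) -> f64 {
    let s = data.as_ref();
    s.iter().sum::<f64>() / s.len() as f64
}

fn main() {
    let owned: Vec<f64> = vec![1.0, 2.0, 3.0];
    let fixed: [f64; 3] = [1.0, 2.0, 3.0];
    // Owned vector, borrowed slice, and fixed-size array all work
    // through the same generic signature.
    assert_eq!(mean(&owned), 2.0);
    assert_eq!(mean(&owned[..]), 2.0);
    assert_eq!(mean(fixed), 2.0);
    println!("ok");
}
```

A concrete `Array2<f64>` parameter would force callers to clone or take ownership; the generic bound avoids that.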

> I can understand that; nalgebra is a dependency that you clearly feel at compile time. Once I am fully happy with friedrich, and if it is integrated into linfa, I might retry ndarray-linalg, but it is clearly low on my priority list.

We are aligned on this, not top of the list for me either.

nestordemeure commented 4 years ago

I have added (but not tested) support for ndarray.

The next step on my list (which should be done within a week) is to reproduce the classical Mauna Loa example. As it has been implemented in scikit-learn, it might give you a reference point to compare both implementations.

LukeMathWalker commented 4 years ago

Ok, I have started to write the shim, which means I looked closer at the whole crate and I am struggling to understand some design choices.

Let's start from my biggest source of confusion right now - what exactly do you mean by Prior? Let's restrict the discussion to regression, for simplicity. In all libraries dealing with GPs and most theoretical treatments, the term prior is used for the kernel function itself. The kernel function you choose embeds the assumptions you have made about the family of functions we believe the unknown function belongs to. What do you mean by Prior in friedrich? It's also confusing that you can fit a prior. Why did you feel the need to separate Prior and Kernel?

LukeMathWalker commented 4 years ago

Is there any reference (book/article) I can look at for the training algorithm you are using?

nestordemeure commented 4 years ago

This article is quite good at explaining what a gaussian process is and where the prior comes into play: gaussian processes are not so fancy (see "The Bayesian prior away from data" section).

In short, the kernel function gives a similarity metric between elements of the input space, while the prior gives a default value that is returned in the absence of information (in particular, with stationary kernels the prediction falls back to zero when the similarity to the training samples falls to 0, which is probably not what the user would expect).

The prior can be any regression model and, while having a good prior (close to the target function) makes the learning faster (it is useful for things such as transfer learning), I have seen some papers omit it from the definition of the gaussian process (since it is trivial to add it on top of the equations).

Something like a linear prior can be folded into the kernel, but (to the best of my knowledge) that is not true for all types of priors, while it is easy to include an explicit prior term in the equations used.
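Concretely, the explicit prior-mean term described above enters the standard GP predictive-mean equation like this (standard GP regression notation, not friedrich-specific):

```latex
% Predictive mean at a test point x_* with an explicit prior mean m(x):
% the model predicts the prior plus a data-driven correction.
\mu(x_*) = m(x_*) + k(x_*, X)\,\bigl[K(X, X) + \sigma_n^2 I\bigr]^{-1}\,\bigl(y - m(X)\bigr)
```

Far from the training data, $k(x_*, X) \to 0$ for stationary kernels, so the prediction falls back to $m(x_*)$ - exactly the "default value in the absence of information" role described above.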

(don't hesitate to tell me if I am not clear)

nestordemeure commented 4 years ago

> Is there any reference (book/article) I can look at for the training algorithm you are using?

The training of the prior or of the kernel?

For the prior, the training is up to each particular prior implementation (linear regression for a linear prior, etc.).

For the kernel, I believe I put some links in the sources. In short, I use the ADAM optimizer (a form of gradient descent) to maximize the log-likelihood of the model given the training data (which has some nice regularising properties). If the kernel function has a scaling parameter, I use ideas from a paper (I don't have the link on hand, but it is in the sources; I can post it here if needed) to find the optimal scaling and perform the gradient descent on the corresponding problem.
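For reference, the ADAM *ascent* step being described can be sketched generically (this is not friedrich's actual optimizer code; a toy concave objective stands in for the log-likelihood):

```rust
// One-parameter ADAM ascent: gradient ascent on an objective to maximize.
// Hyper-parameters are the usual ADAM defaults (beta1, beta2, epsilon).
fn adam_maximize(grad: impl Fn(f64) -> f64, mut theta: f64, steps: usize) -> f64 {
    let (lr, b1, b2, eps) = (0.05, 0.9, 0.999, 1e-8);
    let (mut m, mut v) = (0.0_f64, 0.0_f64);
    for t in 1..=steps {
        let g = grad(theta);
        m = b1 * m + (1.0 - b1) * g;               // first-moment estimate
        v = b2 * v + (1.0 - b2) * g * g;           // second-moment estimate
        let m_hat = m / (1.0 - b1.powi(t as i32)); // bias correction
        let v_hat = v / (1.0 - b2.powi(t as i32));
        theta += lr * m_hat / (v_hat.sqrt() + eps); // ascent, not descent
    }
    theta
}

fn main() {
    // Toy stand-in for a log-likelihood: f(x) = -(x - 3)^2, maximized at x = 3.
    let grad = |x: f64| -2.0 * (x - 3.0);
    let theta = adam_maximize(grad, 0.0, 2000);
    assert!((theta - 3.0).abs() < 0.2);
    println!("theta ~ {:.3}", theta);
}
```

In friedrich's case the gradient would be that of the log-likelihood with respect to the kernel hyper-parameters rather than this toy function.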

Does that answer your question?

nestordemeure commented 4 years ago

Reading the documentation for scikit-learn's implementation, it seems that they set the prior to either 0 or the mean of the data but do not allow for alternatives:

> The prior mean is assumed to be constant and zero (for normalize_y=False) or the training data’s mean (for normalize_y=True). The prior’s covariance is specified by passing a kernel object.

(note that when I say prior, I always mean the prior mean)

LukeMathWalker commented 4 years ago

Ok, let me rephrase then. You are saying that most projects work under the assumption that m(x) in

f(x) ~ GP(m(x), k(x, x'))

is 0, simplifying the GP definition to

f(x) ~ GP(0, k(x, x'))

while friedrich lets you specify m(x). Now, both m(x) and k(x, x') are actually m(x; a) and k(x, x'; b), where a and b are their respective hyper-parameters. friedrich lets you choose whether or not to fit those hyper-parameters on the training data. Have I got it right?

nestordemeure commented 4 years ago

If m(x) is the prior mean and not the final distribution output by the trained model, then yes, it's a perfect description.

LukeMathWalker commented 4 years ago

Perfect. I have started to draft the shim - you can follow my progress here: https://github.com/LukeMathWalker/linfa/blob/gp/linfa-gaussian-processes/src/hyperparameters.rs

nestordemeure commented 4 years ago

Great! I took a quick look at your code; here are two remarks:

I see that you put a ??? next to convergence_fraction (I should probably find a better name for this parameter). It corresponds to the current stopping criterion for the gradient descent: given a variable x and its gradient gx, the algorithm stops if, for all variables, we have |gx| < convergence_fraction * |x|.

You set noise to 0.1 by default (with a comment saying that it corresponds to 10% of the std). However, friedrich takes the noise as an absolute value, not a relative one (0.1 is not converted to 10% of the std). The source of the confusion might be that, if the user does not give a value for the noise, it defaults to 10% of the std of the outputs.
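The stopping criterion from the first remark can be written down directly; a sketch with a hypothetical helper (not friedrich's actual code):

```rust
// Stop the gradient descent when every gradient component is small
// relative to its parameter: |g_i| < fraction * |x_i| for all i.
fn has_converged(params: &[f64], grads: &[f64], convergence_fraction: f64) -> bool {
    params
        .iter()
        .zip(grads)
        .all(|(x, g)| g.abs() < convergence_fraction * x.abs())
}

fn main() {
    // Both gradients are below 5% of their parameter's magnitude: converged.
    assert!(has_converged(&[10.0, -4.0], &[0.3, 0.1], 0.05));
    // One gradient (0.5) exceeds 5% of |-4.0| = 0.2: not converged.
    assert!(!has_converged(&[10.0, -4.0], &[0.3, 0.5], 0.05));
    println!("ok");
}
```

A relative criterion like this is scale-free, which matters when hyper-parameters live on very different scales; one caveat is that it is strict near x = 0.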

(on my side, I did not have as much time as expected but the example should be coming shortly)

LukeMathWalker commented 4 years ago

> I see that you put a ??? next to convergence_fraction (I should probably find a better name for this parameter). It corresponds to the current stopping criterion for the gradient descent: given a variable x and its gradient gx, the algorithm stops if, for all variables, we have |gx| < convergence_fraction * |x|.

I figured that out later when doing the setters, but I didn't go back to edit the comment - I'll fix it up :+1:

> You set noise to 0.1 by default (with a comment saying that it corresponds to 10% of the std). However, friedrich takes the noise as an absolute value, not a relative one (0.1 is not converted to 10% of the std). The source of the confusion might be that, if the user does not give a value for the noise, it defaults to 10% of the std of the outputs.

I did it on purpose and I know it's broken right now :grin: I don't like the way the default works because it's not tunable by the user - e.g. it's cumbersome for them to say that noise should be 20% of the std, because the value they pass in is used as an absolute value. I have to figure out a way to make it work as a percentage of the std by default; still working on it - a PR against friedrich might come out of it.

nestordemeure commented 4 years ago

Making it a % of the std is not too hard (computing the std is cheap compared to the training), but I believe that using the std is only good for getting an idea of the magnitude of the noise; when you have an exact number given by an expert, it's usually an absolute value (though it might be a moot point, as the fit will change the amplitude of the noise anyway).
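For reference, the "fraction of the std" default being discussed is indeed a cheap computation; a sketch with a hypothetical helper (not friedrich's code):

```rust
// Default noise level as a fraction of the standard deviation of the outputs
// (population std, i.e. dividing by n).
fn default_noise(outputs: &[f64], fraction: f64) -> f64 {
    let n = outputs.len() as f64;
    let mean = outputs.iter().sum::<f64>() / n;
    let var = outputs.iter().map(|y| (y - mean).powi(2)).sum::<f64>() / n;
    fraction * var.sqrt()
}

fn main() {
    let outputs = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0];
    // This sample has std = 2.0, so the default noise is 0.1 * 2.0 = 0.2.
    let noise = default_noise(&outputs, 0.1);
    assert!((noise - 0.2).abs() < 1e-12);
    println!("noise = {}", noise);
}
```

This is a single pass over the outputs plus a square root, which is negligible next to the O(n^3) cost of fitting the GP itself.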

I am curious to see your PR nevertheless.

LukeMathWalker commented 4 years ago

Yeah, when I say make it work I mean from an API point of view :+1:

wegreenall commented 2 years ago

Have been following this with great interest. What is the current state of the integration of GPs into linfa? I was thinking of building a Gaussian process library in Rust but would rather contribute to something bigger, if this is still happening.

nestordemeure commented 2 years ago

Hi @wegreenall ! Currently there is a shim in the Linfa repository but the work has not been integrated in the main branch and, to my knowledge, might be incomplete.

Your best bet is to check with @LukeMathWalker to determine what is missing for integration in Linfa. Don't hesitate to ask me any question to help with the integration work.

Longer term, Friedrich uses nalgebra while linfa focuses on ndarray, both of which are large dependencies, so it might be interesting to start working on a new version of Friedrich that does not depend on nalgebra (however, the integration is quite deep, so that would represent a lot of work).