I agree with the shim proposition (as it would keep `friedrich` small for people not wanting to depend on the full `linfa`, and it would make your interface homogeneous).
If you provide me with a clear description of the naming / interface convention, or with a mock where all methods point to `unimplemented!()`, I can start working on it this Christmas.
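To make the "mock" idea concrete, here is a minimal sketch of what such a stub could look like; the names (`GaussianProcess`, `fit`, `predict`) are hypothetical placeholders, not an agreed interface:

```rust
// Hypothetical shim skeleton: every method is stubbed with `unimplemented!()`
// so that naming and signatures can be reviewed before any real wiring is done.
pub struct GaussianProcess;

impl GaussianProcess {
    /// Fit the model on training inputs and targets (names are placeholders).
    pub fn fit(_inputs: &[Vec<f64>], _targets: &[f64]) -> Self {
        unimplemented!()
    }

    /// Predict the mean for new inputs.
    pub fn predict(&self, _inputs: &[Vec<f64>]) -> Vec<f64> {
        unimplemented!()
    }
}
```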
I have actually tried both `nalgebra` and `ndarray` for my first prototypes, but I did not find some operations I needed in `ndarray-linalg` (I can't remember which ones, however).
I believe that `nalgebra` is a better fit for my needs (and I have contributed some methods to `nalgebra` that should be available in the next version and bring noticeable performance improvements), but it is fairly easy to take `ndarray` types as inputs (as the `nalgebra` types are mostly an implementation detail that should be transparent to the user).
I have not done it yet, but I plan to add a trait implementation to use `ndarray` types as inputs/outputs behind a feature flag (I can make it a priority, as it would be a blocker for integration in `linfa`).
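As a rough illustration of the feature-gated approach (the feature name `ndarray-support`, the `InternalMatrix` newtype, and the conversion below are assumptions for the sketch, not friedrich's actual code):

```rust
/// Hypothetical newtype around the nalgebra matrix used internally.
#[cfg(feature = "ndarray-support")]
pub struct InternalMatrix(pub nalgebra::DMatrix<f64>);

// Hypothetical feature-gated adapter: accept an ndarray matrix and convert it
// to the nalgebra representation, compiled only when the (assumed)
// `ndarray-support` feature is enabled.
#[cfg(feature = "ndarray-support")]
impl From<ndarray::Array2<f64>> for InternalMatrix {
    fn from(a: ndarray::Array2<f64>) -> Self {
        let (rows, cols) = a.dim();
        // ndarray iterates in logical row-major order, which matches
        // nalgebra's `from_row_iterator`.
        InternalMatrix(nalgebra::DMatrix::from_row_iterator(
            rows,
            cols,
            a.iter().cloned(),
        ))
    }
}
```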
Both kinds of benchmarks are high on my todo list; I will start on them once I am fully happy with the internals and API (I am still wondering whether the current kernel trait and builder pattern can be improved). At the moment I have confirmed that it works as expected, but not that it reproduces results from reference implementations.
I do think tests are needed, but they are low priority for the moment (I want to confirm API stability before that).
(I should probably have a clean, full and official TODO list somewhere.)
I am actually happy to work on the shim myself, to get familiar with the crate. I'll spend some time on it over the Christmas break and I'll pull you in for a review when I have something decent ready 👍
My main issue with `nalgebra` at the moment would be having to list both `nalgebra` and `ndarray` as dependencies in `linfa` - I don't consider it a deal breaker though, certainly not at this stage.
Great!
I am currently working on adding `ndarray`'s `ArrayBase<f64, Ix2>` as an optional input type, which should give you all the building blocks needed (currently fighting with the `Data` trait...).
By the way, is it preferable to use `ArrayBase<f64, Ix2>` or `Array2<f64>` (much simpler but less general)?
I can understand that; `nalgebra` is a dependency that you clearly feel at compile time. Once I am fully happy with `friedrich`, and if it is integrated into `linfa`, I might retry `ndarray-linalg`, but it is clearly low on my priority list.
> I am currently working on adding `ndarray`'s `ArrayBase<f64, Ix2>` as an optional input type, which should give you all the building blocks needed (currently fighting with the `Data` trait...). By the way, is it preferable to use `ArrayBase<f64, Ix2>` or `Array2<f64>` (much simpler but less general)?
The first provides much better ergonomics to the user of the crate (they can pass in views, mutable views, owned arrays, etc.).
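For illustration, a minimal sketch of the generic signature being discussed (the function name `fit` and its body are hypothetical). Note that in `ndarray` the first parameter of `ArrayBase` is the storage type, so the fully general form is `ArrayBase<S, Ix2>` with `S: Data<Elem = f64>`; being generic over the storage lets callers pass owned arrays, views, and mutable views alike:

```rust
use ndarray::{ArrayBase, Data, Ix2};

// Hypothetical function generic over ndarray storage: it accepts
// Array2<f64>, ArrayView2<f64>, ArrayViewMut2<f64>, etc.
fn fit<S: Data<Elem = f64>>(inputs: &ArrayBase<S, Ix2>) {
    println!("received a {} x {} matrix", inputs.nrows(), inputs.ncols());
}

fn main() {
    let owned = ndarray::Array2::<f64>::zeros((3, 2));
    fit(&owned);        // owned array
    fit(&owned.view()); // borrowed view
}
```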
> I can understand that; `nalgebra` is a dependency that you clearly feel at compile time. Once I am fully happy with `friedrich`, and if it is integrated into `linfa`, I might retry `ndarray-linalg`, but it is clearly low on my priority list.
We are aligned on this, not top of the list for me either.
I have added (but not tested) support for `ndarray`.
The next step on my list (which should be done within a week) is to reproduce the classical Mauna Loa example. As it has been implemented in scikit-learn, it might give you a reference point to compare both implementations.
Ok, I have started to write the shim, which means I looked closer at the whole crate, and I am struggling to understand some design choices.
Let's start with my biggest source of confusion right now - what exactly do you mean by `Prior`?
Let's restrict the discussion to regression, for simplicity.
In all libraries dealing with GPs and most theoretical treatments, the term prior is used for the kernel function itself. The kernel function you choose embeds the assumptions about the family of functions we believe the unknown function belongs to.
What do you mean by `Prior` in `friedrich`? It's also confusing that you can `fit` a prior.
Why did you feel the need to separate `Prior` and `Kernel`?
Is there any reference (book/article) I can look at for the training algorithm you are using?
This article is quite good at explaining what a Gaussian process is and where the prior comes into play: gaussian processes are not so fancy (see the "The Bayesian prior away from data" section).
In short, the kernel function gives a similarity metric between elements in the input space, while the prior gives a default value that will be returned in the absence of information (in particular, stationary kernels return zero when the similarity to the training samples falls to 0, which is probably not what the user would expect).
The prior can be any regression model and, while having a good prior (close to the target function) makes the learning faster (it is useful for things such as transfer learning), I have seen some papers omitting it from the definition of the Gaussian process (since it is trivial to add it on top of the equation).
Something like a linear prior can be included in the kernel, but (to the best of my knowledge) that is not true for all types of priors, while it is easy to include an explicit prior term in the equations used.
(Don't hesitate to tell me if I am not clear.)
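To make the role of the prior mean explicit, here is the textbook GP posterior-mean formula with an explicit prior term (the standard equation, not a quote from friedrich's source):

$$\mu(x_*) = m(x_*) + k(x_*, X)\,\bigl[K + \sigma^2 I\bigr]^{-1}\,\bigl(y - m(X)\bigr)$$

Far from the data, $k(x_*, X)$ of a stationary kernel vanishes, so the prediction falls back to the prior mean $m(x_*)$ instead of zero.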
> Is there any reference (book/article) I can look at for the training algorithm you are using?

The training of the prior or of the kernel?
For the prior, the training is up to each particular prior implementation (linear regression for a linear prior, etc.).
For the kernel, I believe I put some links in the sources. In short, I use the ADAM optimizer (a form of gradient descent) to maximize the log-likelihood of the model given the training data (which has some nice regularising properties). If the kernel function has a scaling parameter, I use ideas from a paper (I don't have the link on hand, but it is in the source; I can put it here if needed) to find the optimal scaling and perform the gradient descent on the corresponding problem.
Does that answer your question?
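For reference, the quantity being maximized is the standard log marginal likelihood (again the textbook expression, not friedrich's exact code):

$$\log p(y \mid X) = -\tfrac{1}{2}\, y^\top \bigl[K + \sigma^2 I\bigr]^{-1} y \;-\; \tfrac{1}{2} \log \bigl|K + \sigma^2 I\bigr| \;-\; \tfrac{n}{2} \log 2\pi$$

With a non-zero prior mean, $y$ is replaced by $y - m(X)$.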
Reading the documentation for scikit-learn's implementation, it seems that they set the prior to either 0 or the mean of the data, but do not allow for alternatives:

> The prior mean is assumed to be constant and zero (for `normalize_y=False`) or the training data's mean (for `normalize_y=True`). The prior's covariance is specified by passing a kernel object.

(Note that when I say prior, I always mean the prior mean.)
Ok, let me rephrase then.
You are saying that most projects work under the assumption that `m(x)` in

$$f(x) \sim \mathcal{GP}\bigl(m(x),\, k(x, x')\bigr)$$

is `0`, simplifying the GP definition to

$$f(x) \sim \mathcal{GP}\bigl(0,\, k(x, x')\bigr)$$

while `friedrich` lets you specify `m(x)`.
Now, both `m(x)` and `k(x, x')` are actually `m(x, a)` and `k(x, x', b)`, where `a` and `b` are their respective hyper-parameters.
`friedrich` lets you choose whether you want to fit those hyper-parameters on the training data or not.
Have I got it right?
If `p(x)` is the prior and not the final distribution output by the trained model, then yes, it's a perfect description.
Perfect. I have started to draft the shim - you can follow my progress here: https://github.com/LukeMathWalker/linfa/blob/gp/linfa-gaussian-processes/src/hyperparameters.rs
Great! I gave a quick look at your code; here are two remarks:

I see that you put a `???` next to `convergence_fraction` (I should probably find a better name for this parameter).
It corresponds to the current stopping criterion for the gradient descent: given a variable `x` and its gradient `gx`, the algorithm stops if, for all variables, we have `|gx| < convergence_fraction * |x|`.
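A minimal sketch of that check, assuming parameters and gradients are plain slices (the names are illustrative, not friedrich's internals):

```rust
/// Hypothetical convergence check: stop when every parameter's gradient is
/// small relative to the parameter's own magnitude.
fn has_converged(params: &[f64], grads: &[f64], convergence_fraction: f64) -> bool {
    params
        .iter()
        .zip(grads)
        .all(|(x, gx)| gx.abs() < convergence_fraction * x.abs())
}
```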
You set noise to `0.1` by default (with a comment saying that it corresponds to 10% of the std).
However, friedrich takes absolute values for the noise (and not relative values such that `0.1` would be converted to 10% of the std).
The source of the confusion might be that, if the user does not give a value for the noise, it will default to 10% of the std of the outputs.
(On my side, I did not have as much time as expected, but the example should be coming shortly.)
> I see that you put a `???` next to `convergence_fraction` (I should probably find a better name for this parameter). It corresponds to the current stopping criterion for the gradient descent: given a variable `x` and its gradient `gx`, the algorithm stops if, for all variables, we have `|gx| < convergence_fraction * |x|`.
I figured that out later when doing the setters, but I didn't go back to edit the comment - I'll fix it up :+1:
> You set noise to `0.1` by default (with a comment saying that it corresponds to 10% of the std). However, friedrich takes absolute values for the noise (and not relative values such that `0.1` would be converted to 10% of the std). The source of the confusion might be that, if the user does not give a value for the noise, it will default to 10% of the std of the outputs.
I did it on purpose and I know it's broken right now :grin:
I don't like the way the default works because it's not tunable by the user - e.g. it's cumbersome for them to say that `noise` should be 20% of the std, because the value they pass in is used as an absolute value.
I have to figure out a way to make it work as a percentage of the std by default; still working on it - a PR against `friedrich` might come out of it.
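One way to expose both behaviours in an API would be an explicit enum; this is only a hypothetical design sketch, not the PR being discussed:

```rust
/// Hypothetical noise specification: either an absolute value,
/// or a fraction of the standard deviation of the training outputs.
pub enum NoiseSpec {
    Absolute(f64),
    FractionOfStd(f64),
}

impl NoiseSpec {
    /// Resolve to an absolute noise level given the outputs' std.
    pub fn resolve(&self, outputs_std: f64) -> f64 {
        match self {
            NoiseSpec::Absolute(v) => *v,
            NoiseSpec::FractionOfStd(f) => f * outputs_std,
        }
    }
}
```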
Making it a % of the std is not too hard (computing the std is cheap compared to the training), but I believe that using the std is only good for getting an idea of the magnitude of the noise; when you have an exact number given by an expert, it is usually an absolute value (but it might be a moot point, as the fit will change the amplitude of the noise anyway).
I am curious about your PR nevertheless.
Yeah, when I say *make it work* I mean from an API point of view :+1:
Have been following this with great interest. What is the current state of the integration of GPs into linfa? I was thinking of building a Gaussian process library in Rust but would rather contribute to something bigger, if this is still happening.
Hi @wegreenall! Currently there is a shim in the Linfa repository, but the work has not been integrated into the main branch and, to my knowledge, might be incomplete.
Your best bet is to check with @LukeMathWalker to determine what is missing for integration into Linfa. Don't hesitate to ask me any questions to help with the integration work.
Longer term, Friedrich uses nalgebra while linfa focuses on ndarray, both of which are large dependencies, so it might be interesting to start working on a new version of Friedrich that does not depend on nalgebra (however, the integration is quite deep, so that would represent a lot of work).
I have started to review `friedrich` with the aim of integrating it into `linfa`.

I imagine the integration using a "shim" crate in the `linfa` repository, named `linfa-gaussian-processes`, which re-exports all or part of `friedrich`'s public API. This ensures that `linfa` is in control of naming conventions, interfaces and whatever is required to give the overall `linfa` ecosystem a homogeneous feeling, as well as allowing you to retain ownership of `friedrich` and evolve it as you see fit.

The alternative would be to move `friedrich` inside `linfa` as `linfa-gaussian-processes`, but that's a much more substantial step, hence I don't think that's what we want to go for (at least for now).

In terms of requirements, apart from some minor improvements here and there that I should be able to submit as PRs against `friedrich` over the Christmas holidays, I'd like to understand the following:

- Why `nalgebra` over `ndarray`? I'd like to standardize on `ndarray`, at least in the initial phase, to avoid having overly generic interfaces (e.g. generic parameters for inputs and outputs). Is it because you needed `ndarray-linalg` and that requires a non-Rust dependency?
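To picture the shim idea concretely, a `linfa-gaussian-processes` crate could start as little more than re-exports; the module paths below are assumptions about friedrich's layout, not verified imports:

```rust
// Hypothetical contents of linfa-gaussian-processes/src/lib.rs:
// re-export friedrich's public API so linfa controls naming and packaging
// while friedrich keeps its own repository and release cycle.
// The paths below are assumed, not checked against friedrich's actual crate.
pub use friedrich::gaussian_process::GaussianProcess;
pub use friedrich::kernel;
pub use friedrich::prior;
```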