recommenders-team / recommenders

Best Practices on Recommendation Systems
https://recommenders-team.github.io/recommenders/intro.html
MIT License

Add biases etc. to Spark ALS #355

Open anargyri opened 5 years ago

anargyri commented 5 years ago

With some manipulation of the variables and input data we can add more functionality to ALS. Specifically, we can incorporate the bias terms and one extra regularization parameter. This can be done as a wrapper around Spark's ALS() method.

yueguoguo commented 5 years ago

Good feature to have. This may be helpful but it is in Scala.

anargyri commented 5 years ago

> Good feature to have. This may be helpful but it is in Scala.

I am not sure what he does; it seems he reimplements ALS, but there is still only one regularization parameter. I would rather base any addition on MLlib's ALS, since it is tested, and since we can get bias terms and two regularization parameters that way.

An alternative is to implement a complete algorithm ourselves in Scala (or as an iterative wrapper around ALS) that solves the same problem as Surprise SVD, i.e. an algorithm that learns the factors, the bias terms and the mean, and involves 4 regularization parameters.
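For reference, the Surprise SVD problem mentioned here minimizes, over the set $K$ of observed ratings, an objective of roughly this form (the four $\lambda$'s correspond to Surprise's `reg_bu`, `reg_bi`, `reg_pu`, `reg_qi` parameters; exact details are in the Surprise docs):

$$
\min_{p_u,\, q_i,\, b_u,\, b_i} \sum_{(u,i) \in K} \Big[ \big( r_{ui} - \mu - b_u - b_i - q_i^\top p_u \big)^2 + \lambda_{b_u} b_u^2 + \lambda_{b_i} b_i^2 + \lambda_{p} \lVert p_u \rVert^2 + \lambda_{q} \lVert q_i \rVert^2 \Big]
$$

where $\mu$ is the global mean rating.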

anargyri commented 5 years ago

What I suggested above is that there are the following options:

  1. The low-hanging fruit is to use ALS() without any real coding, by noting that adding the bias terms and two regularization parameters (one for users, one for items) reduces to the simple formulation that Spark ALS solves, after a change of variables and inputs. That is, augment pu as [pu bu 1] and qi as [qi 1 bi] and do some renormalization of pu or qi to allow for different lambda_u, lambda_i.

  2. A complete formulation can be solved with a bit more coding (still only in pyspark) by emulating the complete ALS as a wrapper around Spark's ALS run for one iteration at a time. That is, we use Spark ALS to solve for the pu and qi (adjusting the computation for the presence of the other terms), then use our own code for the mean and biases (the mean is easy to compute as an average of other quantities, and the biases come from a least squares problem with two variables), and iterate like that.

  3. The third approach is to rewrite everything in Scala so that it allows for bias terms, mean and 4 lambdas. This should not be difficult given the decomposition described in 2.
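The change of variables in option 1 can be sanity-checked with a quick NumPy calculation (the names below are illustrative, and the global mean is ignored here; it can be subtracted from the ratings up front):

```python
import numpy as np

# Option 1's augmentation: extend each user factor to [p_u, b_u, 1] and
# each item factor to [q_i, 1, b_i]. Their inner product then equals the
# biased prediction p_u . q_i + b_u + b_i, so a plain inner-product model
# like Spark ALS can express the biased model with two extra dimensions.
k = 3
rng = np.random.default_rng(42)
p_u, q_i = rng.normal(size=k), rng.normal(size=k)
b_u, b_i = 0.7, -0.3

p_aug = np.concatenate([p_u, [b_u, 1.0]])
q_aug = np.concatenate([q_i, [1.0, b_i]])

biased_prediction = p_u @ q_i + b_u + b_i
assert np.isclose(p_aug @ q_aug, biased_prediction)
```

The caveat, as noted above, is that Spark applies a single lambda to the whole augmented vector, so getting distinct lambda_u and lambda_i requires rescaling pu or qi.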

So, 2 and 3 are the same algorithm but 2 is implemented in pyspark calling Spark's ALS whereas 3 is a reimplementation of Spark ALS in Scala.
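The alternating scheme of option 2 can be sketched in plain NumPy on a dense toy matrix (all names and the single lambda are illustrative; in the actual proposal the factor step would be one iteration of Spark's ALS on the residuals, not a closed-form dense solve):

```python
import numpy as np

# Toy dense ratings matrix; every entry is treated as observed.
rng = np.random.default_rng(0)
R = rng.uniform(1, 5, size=(6, 5))
k, lam = 2, 0.1

P = rng.normal(scale=0.1, size=(6, k))   # user factors
Q = rng.normal(scale=0.1, size=(5, k))   # item factors
bu, bi = np.zeros(6), np.zeros(5)        # user / item biases
mu = R.mean()                            # global mean

def ridge_rows(X, resid, lam):
    # Row-wise regularized least squares: argmin_W ||resid - W X^T||^2 + lam ||W||^2.
    G = X.T @ X + lam * np.eye(X.shape[1])
    return np.linalg.solve(G, X.T @ resid.T).T

for _ in range(20):
    # Factor step (stand-in for one Spark ALS iteration on the residuals
    # r_ui - mu - b_u - b_i).
    resid = R - mu - bu[:, None] - bi[None, :]
    P = ridge_rows(Q, resid, lam)
    Q = ridge_rows(P, resid.T, lam)
    # Mean / bias step: closed-form updates given the factors.
    E = R - P @ Q.T
    mu = (E - bu[:, None] - bi[None, :]).mean()
    bu = (E - mu - bi[None, :]).mean(axis=1) * (R.shape[1] / (R.shape[1] + lam))
    bi = (E - mu - bu[:, None]).mean(axis=0) * (R.shape[0] / (R.shape[0] + lam))

pred = mu + bu[:, None] + bi[None, :] + P @ Q.T
rmse = np.sqrt(((R - pred) ** 2).mean())
```

Each step is a coordinate-descent update, so the training error decreases monotonically; option 3 would perform the same alternation, but with both steps inside the Scala implementation.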