
QMF - a matrix factorization library


Introduction

QMF is a fast and scalable C++ library for implicit-feedback matrix factorization models. The current implementation supports two main algorithms:

- Weighted alternating least squares (WALS) [1]
- Bayesian personalized ranking (BPR) [2], trained with lock-free Hogwild! SGD [3]

For evaluation, QMF supports various ranking-based metrics that are computed per-user on test data, in addition to training or test objective values.

For more information, see our blog post about QMF here: https://engineering.quora.com/Open-sourcing-QMF-for-matrix-factorization.

Building QMF

QMF requires gcc 5.0+, as it uses the C++14 standard, and CMake version 2.8+. It also depends on the glog, gflags, and lapack libraries.
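
You can verify your toolchain versions before building:

gcc --version    # should report 5.0 or later
cmake --version  # should report 2.8 or later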

Ubuntu

To install the library dependencies:

sudo apt-get install libgoogle-glog-dev libgflags-dev liblapack-dev

To build the binaries:

cmake .
make
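
If you prefer to keep generated files out of the source tree, the usual CMake out-of-source pattern should also work (this is standard CMake usage, not something QMF documents explicitly):

mkdir build && cd build
cmake ..
make

In that case the binaries should land under the build directory rather than the source tree.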

To run tests:

make test

Output binaries will be under the bin/ folder.

Usage

Here's a basic example of usage:

# to train a WALS model
./wals \
    --train_dataset=<train_dataset> \
    --test_dataset=<test_dataset> \
    --user_factors=<user_factors_file> \
    --item_factors=<item_factors_file> \
    --regularization_lambda=0.05 \
    --confidence_weight=40 \
    --nepochs=10 \
    --nfactors=30 \
    --nthreads=4

# to train a BPR model
./bpr \
    --train_dataset=<train_dataset> \
    --test_dataset=<test_dataset> \
    --user_factors=<user_factors_file> \
    --item_factors=<item_factors_file> \
    --nepochs=10 \
    --nfactors=30 \
    --num_hogwild_threads=4 \
    --nthreads=4

The input dataset files should adhere to the following format:

<user_id1> <item_id1> <weight1>
<user_id2> <item_id2> <weight2>
...

where the weight is always 1 for BPR, but can be any integer for WALS (the r_ui values in the paper [1]).
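
For example, a small WALS training file might look like this (all ids and weights here are made up):

3 101 1
3 205 4
7 101 2
9 350 1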

The output files will be in the following format:

<{user|item}_id> [<bias>] <factor_0> <factor_1> ... <factor_k-1>
...

where the bias term will only be present for BPR item factors when the --use_biases option is specified.
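
For instance, with --nfactors=3 and --use_biases, a line of the BPR item factors file might look like this (values hypothetical):

205 0.031 0.412 -0.273 0.158

Here 0.031 is the item bias and the remaining three numbers are the factors.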

In order to compute test ranking metrics (averaged per-user), pass the corresponding evaluation flags to either binary.

In the case of BPR, a set of (user, positive item, negative item) triplets is sampled during initialization for both training and test sets (with a fixed seed, or as given by --eval_seed), and is used to compute an estimate of the loss after each epoch. This has no effect on training or on the computation of ranking metrics.
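
For illustration, here is a minimal sketch (hypothetical types and names, not QMF's actual code) of the idea described above: sample a fixed set of (user, positive item, negative item) triplets once with a seeded RNG, then reuse them to estimate the BPR loss, -log sigmoid(x_ui - x_uj), after each epoch.

#include <cmath>
#include <cstdint>
#include <iterator>
#include <random>
#include <unordered_set>
#include <vector>

struct Triplet { size_t user, posItem, negItem; };

// userItems[u] holds the items that user u interacted with (the positives).
std::vector<Triplet> sampleTriplets(
    const std::vector<std::unordered_set<size_t>>& userItems,
    size_t nitems, size_t nsamples, uint64_t seed) {
  std::mt19937_64 rng(seed);  // fixed seed => comparable estimates across epochs
  std::uniform_int_distribution<size_t> userDist(0, userItems.size() - 1);
  std::uniform_int_distribution<size_t> itemDist(0, nitems - 1);
  std::vector<Triplet> triplets;
  while (triplets.size() < nsamples) {
    const size_t u = userDist(rng);
    if (userItems[u].empty()) {
      continue;
    }
    // pick one of u's positive items uniformly at random
    const size_t k =
        std::uniform_int_distribution<size_t>(0, userItems[u].size() - 1)(rng);
    auto it = userItems[u].begin();
    std::advance(it, k);
    const size_t i = *it;
    const size_t j = itemDist(rng);  // rejection-sample a negative item
    if (userItems[u].count(j)) {
      continue;
    }
    triplets.push_back({u, i, j});
  }
  return triplets;
}

// Average BPR loss over the sampled triplets, given user factors P and
// item factors Q; log1p(exp(-x)) == -log(sigmoid(x)).
double estimateBprLoss(const std::vector<Triplet>& triplets,
                       const std::vector<std::vector<double>>& P,
                       const std::vector<std::vector<double>>& Q) {
  double loss = 0.0;
  for (const auto& t : triplets) {
    double x = 0.0;
    for (size_t f = 0; f < P[t.user].size(); ++f) {
      x += P[t.user][f] * (Q[t.posItem][f] - Q[t.negItem][f]);
    }
    loss += std::log1p(std::exp(-x));
  }
  return loss / triplets.size();
}

Sampling the triplets once up front is what makes the per-epoch loss estimates comparable to each other.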

For the full set of command-line options, including the WALS- and BPR-specific ones, see the flag definitions in wals.cpp and bpr.cpp.

Credits

This library was built at Quora by Denis Yarats and Alberto Bietti.

License

QMF is released under the Apache 2.0 License.

References

[1] Hu, Koren and Volinsky. Collaborative Filtering for Implicit Feedback Datasets. In ICDM 2008.

[2] Rendle, Freudenthaler, Gantner and Schmidt-Thieme. BPR: Bayesian Personalized Ranking from Implicit Feedback. In UAI 2009.

[3] Niu, Recht, Ré and Wright. Hogwild!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent. In NIPS 2011.