srendle / libfm

Library for factorization machines
GNU General Public License v3.0

Train and test performance seem to be calculated differently. #13

Closed breuderink closed 8 years ago

breuderink commented 8 years ago

I was testing libFM, and one of my tests involved running libFM with the same train and test dataset:

libFM -task c -train train.libfm -test train.libfm

This seems to work, but the intermediate performance values differ between the train and test set, even though the data comes from the same file:

...
#Iter= 97   Train=0.530437  Test=0.530998   Test(ll)=0.299652
#Iter= 98   Train=0.528048  Test=0.530657   Test(ll)=0.299651
#Iter= 99   Train=0.52756   Test=0.530803   Test(ll)=0.299649

I would expect that the train and test performance are exactly the same. Is this an indication of a bug? Or do I misunderstand what is being logged here?

ibayer commented 8 years ago

libFM uses a time-dependent seed for the random initialization by default ("seed", "integer value, default=None", see https://github.com/srendle/libfm/blob/master/src/libfm/libfm.cpp#L93). I think the results between runs should match if you set a seed.
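For example, assuming the option is passed like the other single-dash flags (the value 123 is arbitrary):

libFM -task c -train train.libfm -test train.libfm -seed 123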

breuderink commented 8 years ago

Using the same seed indeed prevents differences between runs. But what I am trying to report here is that the per-iteration training-set and test-set performance differ, even though I supplied the same data for both sets. I.e. in the snippet above, the train performance for iteration 99 is 0.52756, while the test performance on the same data is 0.530803. If I understand correctly, these numbers should be equal, since the input data is identical.

This is based on my assumption that both numbers are produced by computing some performance metric (such as the fraction correctly classified) on the predictions of the model (with the parameters from that iteration), using either the training set or the validation set as input. But that assumption might be wrong.
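For concreteness, a minimal sketch of the kind of metric meant here (fraction correctly classified); this is only an illustration of the assumption, not libFM's actual code:

# Sketch of the assumed metric: the fraction of predictions that match the labels.
def fraction_correct(labels, predictions):
    return sum(1 for y, p in zip(labels, predictions) if y == p) / float(len(labels))

print(fraction_correct([0, 1, 0, 1], [0, 1, 1, 1]))  # prints 0.75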

ibayer commented 8 years ago

Can you check if this is also true with the option -method als?

breuderink commented 8 years ago

Yes. With libFM -task c -train train.libfm -test train.libfm -method als there still is a small difference between the train and test scores.

ibayer commented 8 years ago

How small is the difference compared to the difference with MCMC? Is it plausible that it is just a small numerical error? Which error is correct (train or test)? You can take the last reported error and compare it against what you get when calculating the error yourself.

breuderink commented 8 years ago

I generated some artificial data with this Python script:

import random

with open('train.libfm', 'w') as f:
    for i in range(1000):
        # Write class.
        if i % 2 == 0:
            f.write('0')
        else:
            f.write('1')

        for j in range(100):
            f.write(' %d:%f' % (j, random.normalvariate(0, 1)))
        f.write('\n')

It generates alternating target labels, with 100 dense random features. The output looks like this:

...
#Iter= 97   Train=0.925 Test=0.997  Test(ll)=0.0801822
#Iter= 98   Train=0.913 Test=0.997  Test(ll)=0.0798717
#Iter= 99   Train=0.919 Test=0.997  Test(ll)=0.079558

It seems that it is overfitting, because the features are not informative. The difference is now relatively big. I have saved the output with the --out flag, and the results reported for Test= correspond to the accuracy calculated manually. So that part seems right. What could have caused the Train= score to deviate so much?
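For reference, a minimal sketch of such a manual check (assumptions: the predictions were written with the --out flag to a file named predictions.txt here, one predicted probability for the positive class per line, in the same order as the rows of train.libfm):

# Manual accuracy check (sketch). Assumes predictions.txt was written via the
# --out flag and holds one probability for the positive class per line.
with open('train.libfm') as f:
    labels = [int(line.split(' ', 1)[0]) for line in f if line.strip()]

with open('predictions.txt') as f:
    preds = [1 if float(line) >= 0.5 else 0 for line in f if line.strip()]

correct = sum(1 for y, p in zip(labels, preds) if y == p)
print('accuracy: %f' % (correct / float(len(labels))))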

breuderink commented 8 years ago

I think that the test score is calculated here: https://github.com/srendle/libfm/blob/master/src/libfm/src/fm_learn_mcmc_simultaneous.h#L243, while the train score is mainly calculated here: https://github.com/srendle/libfm/blob/master/src/libfm/src/fm_learn_mcmc_simultaneous.h#L170-L172. The code paths are indeed different. So what happens in the code path that computes the accuracy for the training set?

ibayer commented 8 years ago

libFM uses a few tricks, like clipping predictions to the highest/lowest values. Maybe one of these tricks is only applied to the test predictions.

srendle commented 8 years ago

The printed train accuracy is calculated from a single MCMC draw, whereas the test accuracy is calculated over all draws (i.e., it is an average). I agree that this is misleading; both measures should report either the average or a single draw.
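A toy illustration of why these two numbers can differ on the same data (this is only a sketch of the averaging effect, not libFM code): the accuracy computed from one noisy draw can differ noticeably from the accuracy computed from the predictions averaged over many draws.

import random

# Toy illustration (not libFM code): accuracy from one draw vs. accuracy
# from the average over all draws, evaluated on the same labels.
random.seed(0)
labels = [i % 2 for i in range(1000)]

# Pretend each draw produces a noisy probability for the positive class.
draws = [[min(max(y + random.gauss(0, 0.6), 0.0), 1.0) for y in labels]
         for _ in range(100)]

def accuracy(probs):
    correct = sum(1 for y, p in zip(labels, probs) if (p >= 0.5) == (y == 1))
    return correct / float(len(labels))

print('one draw: %.3f' % accuracy(draws[-1]))  # roughly the Train= situation
print('averaged: %.3f' % accuracy([sum(c) / float(len(c)) for c in zip(*draws)]))  # roughly the Test= situation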

In general, I would recommend looking at the log file rather than stdout. The log file is more verbose and reports all test values: one draw, all draws, all but 5 draws. It contains the log-likelihood and accuracy for these measures.

breuderink commented 8 years ago

Thanks for the elaboration. I'll take a look at the log file to see if I understand it.

ChenKevin0123 commented 8 years ago

Where can I download the train and test data? I can only find the movie, rating, user, and tags data on MovieLens.