vasinov / jruby_mahout

JRuby Mahout is a gem that unleashes the power of Apache Mahout in the world of JRuby.
MIT License

Don't get any recommendations #3

Open 23tux opened 11 years ago

23tux commented 11 years ago

Hi,

I'm not sure whether this is a bug or I'm just doing something wrong. I tried out your example and played around with it a bit. By the way, really cool work ;)

The problem is that I don't get any recommendations from the engine. I'm using Mahout 0.7 with JRuby 1.7.3 and the following script:

require 'rubygems'
require 'ruby-debug'
require 'jruby_mahout'

recommender = JrubyMahout::Recommender.new("PearsonCorrelationSimilarity", 2, "GenericUserBasedRecommender", false)
recommender.data_model = JrubyMahout::DataModel.new("file", { :file_path => "tmp/test.csv" }).data_model
puts recommender.estimate_preference(1,1)
# -> NaN

The test.csv is fairly simple and looks like this (user 1 hasn't rated item 1):

1,2,5.0
1,3,5.0
1,4,5.0
1,5,5.0
1,6,5.0
1,7,5.0
1,8,5.0
1,9,5.0
2,1,5.0
2,2,5.0
2,3,5.0
2,4,5.0
2,5,5.0
2,6,5.0
2,7,5.0
2,8,5.0
2,9,5.0
3,1,5.0
3,2,5.0
3,3,5.0
3,4,5.0
3,5,5.0
3,6,5.0
3,7,5.0
3,8,5.0
3,9,5.0
4,1,5.0
4,2,5.0
4,3,5.0
4,4,5.0
4,5,5.0
4,6,5.0
4,7,5.0
4,8,5.0
4,9,5.0
5,1,5.0
5,2,5.0
5,3,5.0
5,4,5.0
5,5,5.0
5,6,5.0
5,7,5.0
5,8,5.0
5,9,5.0

When I run puts recommender.estimate_preference(1,1) I always get NaN, which means that the recommender isn't able to generate a rating for that user-item tuple. But my neighborhood size is only 2, and every other user's ratings match user 1's profile "perfectly". What am I doing wrong? Do I have to calculate the similarities on my own?

Further, calling recommender.recommend(1, 1, nil) to get a list of 1 item for user 1 returns an empty array [].

I also tried it with the MovieLens dataset, split into 80% training data and 20% test data, with the same results. In that case the recommender also throws an exception at the recommender.estimate_preference(user,item) call:

GenericDataModel.java:213:in `getPreferencesFromUser': org.apache.mahout.cf.taste.common.NoSuchUserException: 318
    from GenericDataModel.java:245:in `getPreferenceValue'
    from FileDataModel.java:654:in `getPreferenceValue'
    from GenericUserBasedRecommender.java:107:in `estimatePreference'
    from NativeMethodAccessorImpl.java:-2:in `invoke0'
    from NativeMethodAccessorImpl.java:39:in `invoke'
    from DelegatingMethodAccessorImpl.java:25:in `invoke'
    from Method.java:597:in `invoke'
    from JavaMethod.java:470:in `invokeDirectWithExceptionHandling'
...

I hope you can help me. I would love to use your project for my master's thesis experiments ;) (and of course cite your work)

vasinov commented 11 years ago

Looking into it.

23tux commented 11 years ago

Have you already had time to take a look at it?

vasinov commented 11 years ago

So, I ran your example and got the same result. "NaN" is returned by Mahout itself, which means that it wasn't able to calculate a preference estimate from your input (it could be because the sample size is too small, or because all of your values are 5.0, or because the combination of recommender, similarity metric and neighborhood couldn't come up with anything). My hunch is that it's caused by the dataset and the similarity metric that were chosen.
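
To illustrate the "all of your values are 5.0" case: Pearson correlation divides the covariance of two rating vectors by the product of their standard deviations, so two constant vectors give 0/0. The plain-Ruby sketch below only illustrates the textbook formula. It is not Mahout's code, and the helper name pearson is made up for this example.

```ruby
# Textbook Pearson correlation of two equally long rating vectors.
# Hypothetical helper for illustration only, not part of jruby_mahout or Mahout.
def pearson(xs, ys)
  n = xs.size.to_f
  mean_x = xs.reduce(0.0, :+) / n
  mean_y = ys.reduce(0.0, :+) / n

  # Covariance term and the two variance terms (all unnormalized).
  cov   = xs.zip(ys).map { |x, y| (x - mean_x) * (y - mean_y) }.reduce(0.0, :+)
  var_x = xs.map { |x| (x - mean_x)**2 }.reduce(0.0, :+)
  var_y = ys.map { |y| (y - mean_y)**2 }.reduce(0.0, :+)

  # With zero variance this is 0.0 / 0.0, which Ruby evaluates to NaN.
  cov / Math.sqrt(var_x * var_y)
end

constant = [5.0] * 8                                  # like the rows in test.csv
puts pearson(constant, constant)                      # => NaN
puts pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])        # => 1.0
```

If every similarity to user 1 comes out as NaN, there is nobody to put into the neighborhood, which would be consistent with both the NaN estimate and the empty recommendation list.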

I tried creating a Slope One recommender and it was able to generate an estimate for me:

recommender = JrubyMahout::Recommender.new(nil, nil, "SlopeOneRecommender", false)

resulted in: 5.0.

I think the same applies to recommend.
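
For comparison, here is a rough sketch of why a Slope One style estimate can still produce 5.0 on this data: it works with average rating differences between items rather than correlations, so constant ratings are not a problem. This is a simplified, unweighted variant written in plain Ruby for illustration. It is not Mahout's SlopeOneRecommender, and the method name slope_one_estimate is an assumption of this example.

```ruby
# Rebuild test.csv in memory: user 1 rated items 2..9, users 2..5 rated items 1..9,
# and every rating is 5.0.
ratings = {}
(1..5).each do |user|
  items = user == 1 ? (2..9) : (1..9)
  ratings[user] = Hash[items.map { |item| [item, 5.0] }]
end

# Simplified, unweighted Slope One estimate (hypothetical helper, illustration only).
def slope_one_estimate(ratings, user, target)
  estimates = []
  ratings[user].each do |item, rating|
    # Average difference (target - item) over all users who rated both items.
    diffs = ratings.values
                   .select { |prefs| prefs.key?(item) && prefs.key?(target) }
                   .map { |prefs| prefs[target] - prefs[item] }
    next if diffs.empty?
    avg_diff = diffs.reduce(0.0, :+) / diffs.size
    estimates << rating + avg_diff
  end
  return Float::NAN if estimates.empty?
  estimates.reduce(0.0, :+) / estimates.size
end

puts slope_one_estimate(ratings, 1, 1) # => 5.0
```

All item-to-item differences are 0.0 here, so each of user 1's own ratings predicts 5.0 for item 1.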

As for the last issue, can you provide a small sample of the data from the MovieLens dataset? A couple of rows should be sufficient. I want to make sure you are formatting it properly before I look into the exception.

23tux commented 11 years ago

Thanks for your answer! Sorry for the delay, my master's thesis is keeping me busy. I tried it out with the MovieLens dataset (100,000 ratings). I split it into 50 rows for the test set and the rest for training. I now get some estimates, but with a coverage of only 68%. I also still get the exception mentioned above. I would expect that, with a dataset of 100,000 ratings, the Pearson correlation should be able to produce more usable recommendations. Or am I wrong?

I have the following code, into which I pasted the test set. The training set (without the rows of the test set) can be downloaded here: http://sketchit.de/movielens_without_testset.csv

require 'rubygems'
require 'ruby-debug'
require 'jruby_mahout'
require 'csv'

csv = "196  242 3   881250949
186 302 3   891717742
22  377 1   878887116
244 51  2   880606923
166 346 1   886397596
298 474 4   884182806
115 265 2   881171488
253 465 5   891628467
305 451 3   886324817
6   86  3   883603013
62  257 2   879372434
286 1014    5   879781125
200 222 5   876042340
210 40  3   891035994
224 29  3   888104457
303 785 3   879485318
122 387 5   879270459
194 274 2   879539794
291 1042    4   874834944
234 1184    2   892079237
119 392 4   886176814
167 486 4   892738452
299 144 4   877881320
291 118 2   874833878
308 1   4   887736532
95  546 2   879196566
38  95  5   892430094
102 768 2   883748450
63  277 4   875747401
160 234 5   876861185
50  246 3   877052329
301 98  4   882075827
225 193 4   879539727
290 88  4   880731963
97  194 3   884238860
157 274 4   886890835
181 1081    1   878962623
278 603 5   891295330
276 796 1   874791932
7   32  4   891350932
10  16  4   877888877
284 304 4   885329322
201 979 2   884114233
276 564 3   874791805
287 327 5   875333916
246 201 5   884921594
242 1137    5   879741196
249 241 5   879641194
99  4   5   886519097
178 332 3   882823437
"

@arr = CSV.parse(csv, col_sep: "\t")

def rec(neighborhood_size, is_weighted)
  puts "neighborhood: #{neighborhood_size}, is_weighted: #{is_weighted}"
  recommender = JrubyMahout::Recommender.new("PearsonCorrelationSimilarity", neighborhood_size, "GenericUserBasedRecommender", is_weighted)
  recommender.data_model = JrubyMahout::DataModel.new("file", { :file_path => "movielens_without_testset.csv" }).data_model

  # Number of test tuples for which the estimate came back as NaN
  fallout = 0

  @arr.each do |a|
    user = a[1].to_i
    item = a[0].to_i
    begin
      r = recommender.estimate_preference(user, item)
      fallout += 1 if r.nan?
    rescue Exception => e
      # Mahout raised an exception for this tuple; print its message
      puts "Exception: #{e.message}"
    end
  end
  puts "Tuples: #{@arr.count}"
  # Round the percentage itself (the original 100.round(3) only rounded the literal 100)
  puts "Fallout #{fallout}  ->  #{(fallout / @arr.count.to_f * 100).round(3)}%"
  puts "-----------------"
end

rec 5, false

I tried it with different neighborhood sizes, but the result only varies by about 5%. This is the output that is produced:

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/Users/23tux/projects/mahout/mahout-distribution-0.7/lib/slf4j-jcl-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/Users/23tux/projects/mahout/mahout-distribution-0.7/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
missing class or uppercase package name (`org.postgresql.ds.PGPoolingDataSource')
log4j:WARN No appenders could be found for logger (org.apache.mahout.cf.taste.impl.model.file.FileDataModel).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Exception: 1014
Exception: 1042
Exception: 1184
Exception: 1081
Exception: 979
Exception: 1137
Tuples: 50
Fallout 34  ->  68.0%

Could it have something to do with the warnings that I get? I hope we can fix this problem ;)
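
One thing that might be worth checking first is the column order. The IDs printed in the "Exception:" lines all appear in the second column of the test rows above, and the script reads that column as the user. Assuming the file follows the standard MovieLens 100k layout (user id, item id, rating, timestamp, with 943 users), those values would be item IDs that don't exist as users. The sketch below only cross-checks the numbers already shown in this thread. The variable names and the 943 figure are assumptions of the example, not something taken from jruby_mahout.

```ruby
# (column 0, column 1) pairs copied from some of the test rows above.
rows = [[196, 242], [286, 1014], [291, 1042], [234, 1184],
        [181, 1081], [201, 979], [242, 1137]]

# IDs printed in the "Exception: ..." lines of the output.
failed_ids = [1014, 1042, 1184, 1081, 979, 1137]

# Assumption: the standard MovieLens 100k dataset has user IDs 1..943.
max_user_id = 943

col0 = rows.map(&:first)
col1 = rows.map(&:last)

puts "failed IDs found in column 0: #{(failed_ids & col0).inspect}"
puts "failed IDs found in column 1: #{(failed_ids & col1).inspect}"
puts "all failed IDs above max user ID: #{failed_ids.all? { |id| id > max_user_id }}"
```

If that assumption holds, swapping the two assignments (user from a[0], item from a[1]) would be the first thing to try before looking at the SLF4J and log4j warnings.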