jadianes / spark-movie-lens

An on-line movie recommender using Spark, Python Flask, and the MovieLens dataset

Update engine.py #10

Closed: curato-research closed this 8 years ago

curato-research commented 8 years ago

Fixes an issue that was reported several times but not fully resolved in issues #5 and #6 (and mentioned in issue #9 as a remaining bug).

The faulty logic reported in issue #5 was a valid finding, but an error remained that caused the same items/movies to be recommended multiple times. The operations in line 80 resulted in a non-unique RDD, meaning that the same movie is present multiple times. This is solved by adding the .distinct() operation, which removes the duplicate entries.

Step-by-step:

  1. self.ratings_RDD contains all user ratings
  2. .filter(lambda rating: not rating[0] == user_id) drops every rating made by the specified user, where rating[0] refers to the user_id column
  3. .map(lambda x: (user_id, x[1])) maps each remaining rating to a (user_id, movie_id) pair, where x[1] is the movie_id (this is where a movie can appear multiple times, once for each other user who rated it)
  4. .distinct() removes all duplicate entries
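
The steps above can be sketched without a Spark cluster. The snippet below is only a plain-Python stand-in for the RDD chain, not the code from engine.py: the toy ratings, the helper name `unrated_pairs`, and the `dedupe` flag are all assumptions made for illustration. It shows why the map step yields repeated pairs and what the final distinct step changes.

```python
# Plain-Python analogue of the filter -> map -> distinct chain (no Spark needed).
# Toy (user_id, movie_id, rating) rows, invented for illustration only.
ratings = [
    (1, 10, 4.0),
    (2, 10, 5.0),
    (2, 20, 3.0),
    (3, 10, 2.0),
    (3, 30, 4.5),
]


def unrated_pairs(ratings, user_id, dedupe=True):
    """Hypothetical helper mirroring the four steps on a list of rows."""
    # filter: drop the rating rows written by user_id
    others = [r for r in ratings if not r[0] == user_id]
    # map: re-key every remaining row as (user_id, movie_id)
    pairs = [(user_id, r[1]) for r in others]
    # distinct: remove repeated pairs (sorted only to make the output stable)
    return sorted(set(pairs)) if dedupe else pairs


without_distinct = unrated_pairs(ratings, user_id=1, dedupe=False)
with_distinct = unrated_pairs(ratings, user_id=1)

print(without_distinct)  # movie 10 appears twice
print(with_distinct)     # each movie appears once
```

In this toy data, movie 10 shows up twice before deduplication because two other users rated it. It also remains in the result even though user 1 rated it, since the filter acts on rating rows rather than on movies; whether that matters for the real engine depends on code outside this description.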