practical-recommender-systems / moviegeek

A django website used in the book Practical Recommender Systems to illustrate how recommender algorithms can be implemented.
MIT License
901 stars 361 forks

Well, I mean, how many bugs have you implemented till now? I want to check them out. #24

Closed Heermosi closed 3 years ago

Heermosi commented 5 years ago

No need for the exact locations; I just want a count for each directory.

brylie commented 5 years ago

@Heermosi have you found a specific bug to report?

Heermosi commented 5 years ago

Oh yeah, our team worked on this branch for weeks to make it a working demo, and we discovered a lot of bugs, especially this one from `evaluation_runner_test.py`:

        # figure out what to do with result ;)
        self.assertLess(result['mae'], decimal.Decimal(1.7))
        self.assertLess(result['pak'], decimal.Decimal(0.7))
        self.assertLess(result['rak'], decimal.Decimal(0.7))

I'm almost certain you injected such "bugs" manually... That's alright, but do you have a count of how many such bugs exist?

As requested, I'll list part of them below:

  1. templates/moviegeek/index.html & templates/moviegeek/detail.html

                            a.setAttribute('onclick', "add_impression({{user_id}}, 'more_details', " +
                                                      movie_id + ", '{{ session_id }}','{{ csrf_token }}')")

    This collects an incorrect movie id (the leading zeros are stripped). It would not be a bug if the movie_id column were converted to an integer type, but then that would be yet another bug.
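The leading-zero loss is easy to reproduce; a minimal sketch in plain Python, under the assumption that movie ids are stored as zero-padded strings:

```python
# Movie ids in the dataset are zero-padded strings, e.g. '0114709'
# (an illustrative value, not taken from the repo).
movie_id = '0114709'

# Passing the id through an integer context, as the template's onclick
# handler effectively does, strips the leading zero:
as_int = int(movie_id)
print(as_int)  # 114709, which no longer matches the stored id

# The integer form and the stored id diverge; re-padding recovers it.
assert str(as_int) != movie_id
assert '%07d' % as_int == movie_id
```

Keeping the id quoted as a string in the template avoids the conversion altogether.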

  2. moviegeeks/views.py

        def genre(request, genre_id):
            if genre_id:
                selected = Genre.objects.filter(name=genre_id)[0]
                movies = selected.movies.all().order_by('-year')
            else:
                movies = Movie.objects.all().order_by('-year')
    This causes duplicated items when browsing a specific genre if the dataset has nulls in the year column. And yes, the populated dataset does have some nulls in year (and some 0-star explicit ratings...?).
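One way to make the ordering deterministic is to add a unique tiebreaker, e.g. something like `order_by('-year', 'movie_id')` in the view (the exact field name is an assumption about the schema). The idea, sketched in plain Python rather than the Django ORM:

```python
# Illustrative records; in the real app these come from the database.
movies = [
    {'movie_id': 'a', 'year': None},
    {'movie_id': 'b', 'year': 2015},
    {'movie_id': 'c', 'year': None},
]

# Sort newest first, push null years to the end, and use movie_id as a
# unique tiebreaker so the order is stable across repeated queries.
ordered = sorted(
    movies,
    key=lambda m: (m['year'] is None, -(m['year'] or 0), m['movie_id']),
)
print([m['movie_id'] for m in ordered])  # ['b', 'a', 'c']
```

Without the tiebreaker, rows that share a sort key (here, a null year) can come back in a different order on each query, which is how duplicates appear when results are paged.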

3. The populated dataset has a lot of incorrect movie ids.
My team member is working on a fix for this...
I'm sure these can only have been injected manually.

4. moviegeeks/views.py line 119
    context_dict = {
        'genres': genres,
        'movies': mov.values(),
        'api_key': api_key,
    }
The search page cannot display the recommended items correctly because the context lacks some data (the user id).
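A minimal sketch of the kind of fix meant here, with the helper name and arguments purely hypothetical, assuming the view already resolves the visitor's id the way other views do:

```python
def build_search_context(genres, movies, api_key, user_id):
    # Include the user id so the template can render recommendations;
    # without it the recommendation widget has nothing to key on.
    return {
        'genres': genres,
        'movies': movies,
        'api_key': api_key,
        'user_id': user_id,
    }

context = build_search_context([], [], 'some-key', user_id=42)
print('user_id' in context)  # True
```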

5. builder/matrix_factorization_calculator.py
    def predict(self, user, item):

        pq = np.dot(self.item_factors[item], self.user_factors[user].T)
        b_ui = self.all_movies_mean + self.user_bias[user] + self.item_bias[item]
        prediction = b_ui + pq

        if prediction > 10:
            prediction = 10
        elif prediction < 1:
            prediction = 10
        return prediction

Hmmm, what kind of bug is this..... predictions below 1 get clamped up to 10???
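The lower branch clamps to the wrong bound; presumably it should read something like this (a standalone sketch of just the clamping step):

```python
def clamp_prediction(prediction, low=1, high=10):
    # Clamp predicted ratings into the valid range. The quoted code set
    # predictions below 1 to 10, turning the worst scores into the best
    # possible ones.
    if prediction > high:
        return high
    if prediction < low:
        return low
    return prediction

print(clamp_prediction(0.3))   # 1, not 10
print(clamp_prediction(12.4))  # 10
print(clamp_prediction(5.5))   # 5.5
```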

This is only part of the bugs detected so far. I don't want to make this post too long.
kimfalk commented 5 years ago

Thank you for looking at the code. I'm sure the code is not perfect, so it is great that you are using it and helping to find bugs. I would recommend making a branch of the repo and committing your fixes to it, which can then be merged with a PR.

Having your fixes in a branch would also make it easier to understand where you believe the bugs are. I'm not sure I understand all of your message above.

Heermosi commented 5 years ago

Well, if you really want this, I'll make a full list of the known bugs we fixed... I originally thought this was meant as a challenge for beginners.

As agreed, I'll put up the rest of the list of bugs...

  1. `builder/item_similarity_calculator.py` line 106 and `builder/lda_model_calculator.py` line 195

        cur.execute('truncate table similarity')

    This causes the id sequence to run out of range after several rounds of model building; use `TRUNCATE ... RESTART IDENTITY` instead.
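A sketch of the suggested fix (PostgreSQL syntax; the recording cursor below is a stand-in for the project's DB-API cursor, used only to make the example self-contained):

```python
# RESTART IDENTITY resets the table's id sequence along with the rows,
# so repeated model-building rounds cannot run the sequence out of range.
TRUNCATE_SQL = 'TRUNCATE TABLE similarity RESTART IDENTITY'

class RecordingCursor:
    """Stand-in cursor that just records the SQL it is given."""
    def __init__(self):
        self.executed = []

    def execute(self, sql):
        self.executed.append(sql)

cur = RecordingCursor()
cur.execute(TRUNCATE_SQL)
print(cur.executed[0])
```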

  2. `builder/lda_model_calculator.py` line 3

    Line 3 does `import tqdm`, while line 94 does

        for d in tqdm(data):

    so it calls the module object. It should in fact use `tqdm.tqdm` (or `from tqdm import tqdm`) to perform the action.
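The failure mode is the classic module-vs-function confusion: a bare `import tqdm` binds the module object, and module objects are not callable. Illustrated here with a stdlib module to keep the example dependency-free:

```python
import json  # any module object behaves the same way

try:
    json({'a': 1})  # analogous to tqdm(data) after a bare `import tqdm`
    raised = False
except TypeError as exc:
    # TypeError: 'module' object is not callable
    raised = True
    print(type(exc).__name__)

# The fix is `from tqdm import tqdm`, which binds the callable itself.
```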

  3. `builder/lda_model_calculator.py` lines 114-117

        index = similarities.MatrixSimilarity(corpus)

        self.save_lda_model(lda_model, corpus, dictionary, index)
        self.save_similarities(index, docs)

    This means the LDA model is not really in use... I don't know whether it is a bug or not; is calculating similarity on the word vectors intended? In any case, the coo_matrix conversion is not recommended, because it can use up to 64 GB of RAM; you could always fetch one row at a time. And as for the result generation: the result table is much too large; it can reach up to 130 GB when the LDA calculation is used. Either cut it down, or wait 6-12 hours.
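The row-at-a-time idea can be sketched with plain NumPy (illustrative only, not the project's gensim code): compute one row of cosine similarities at a time and keep just the entries worth storing, so neither the dense n x n matrix nor a giant result table is ever materialised at once.

```python
import numpy as np

def topk_similarities_per_row(vectors, k=2, threshold=0.1):
    """Cosine similarity computed one row at a time, so peak memory per
    step is O(n) instead of the full n x n similarity matrix."""
    norms = np.linalg.norm(vectors, axis=1)
    results = {}
    for i, row in enumerate(vectors):
        sims = (vectors @ row) / (norms * norms[i] + 1e-12)
        sims[i] = 0.0  # ignore self-similarity
        top = np.argsort(sims)[::-1][:k]
        # Store only the strongest neighbours, keeping the result small.
        results[i] = [(int(j), float(sims[j])) for j in top
                      if sims[j] > threshold]
    return results

vecs = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
print(topk_similarities_per_row(vecs, k=1))
```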

  4. recs/fwls_recommender.py lacks an automatic model-loading step. By default the model is not loaded; loading is only triggered by set_save_path, which is only called from the evaluator. To make the recommender available in the web pages (analytics), the model should be loaded automatically.
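One conventional fix is to load lazily on first use; a hypothetical sketch (the class name is invented, and the real recommender would deserialise its trained coefficients from `save_path` instead of returning a placeholder):

```python
class LazyLoadingRecommender:
    """Loads its model on first prediction instead of waiting for an
    explicit set_save_path call from the evaluator."""

    def __init__(self, save_path='./models/fwls/'):
        self.save_path = save_path
        self._model = None

    def _load_model(self, path):
        # Stand-in for reading the persisted model from disk.
        return {'loaded_from': path}

    def predict_score(self, user_id, item_id):
        if self._model is None:  # load lazily, exactly once
            self._model = self._load_model(self.save_path)
        return 0.0  # placeholder; the real blended score goes here

rec = LazyLoadingRecommender()
print(rec._model is None)       # True until first use
rec.predict_score(1, 2)
print(rec._model is not None)   # True afterwards
```

This keeps the evaluator path working unchanged while also making the web views usable without extra setup.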
kimfalk commented 5 years ago

You do realize it's demo code for a book, right?

undead747 commented 4 years ago

You do realize it's demo code for a book, right?

Sorry sir, but I'm very worried: is it true that I need 130 GB to save the LDA results?

kimfalk commented 4 years ago

"Sorry sir, but I'm very worried: is it true that I need 130 GB to save the LDA results?"

No, you should not.

"Oh yeah, our team worked on this branch for weeks to make it a working demo, and we discovered a lot of bugs, especially this one from `evaluation_runner_test.py`:"

        # figure out what to do with result ;)
        self.assertLess(result['mae'], decimal.Decimal(1.7))
        self.assertLess(result['pak'], decimal.Decimal(0.7))
        self.assertLess(result['rak'], decimal.Decimal(0.7))

"I'm almost certain you injected such "bugs" manually..."

It's a test case to check that everything is working: I create a small dataset, run it through the system, and verify that the results look good.

faisalnazik commented 3 years ago

It is not unusual to have bugs, but of these types...?? By the way, thanks for the code.

brylie commented 3 years ago

I'm not sure this issue is productive. Should it be closed?