Currently the books recommender looks at the Libgen database dump, finds matches, and saves the top-k matches to the database. If a user thumbs a book, it gets permanently saved to the database; otherwise, the next recommender run wipes the previous matches and re-saves new matches.
Libgen can have many duplicate entries for a single book, because there are different formats, editions, etc. I'm using book_id as the primary key, which is Libgen-specific, not book-specific. Instead we should use ISBN or another book-level identifier as the primary key to prevent duplicates from being saved on recommendation.
[ ] Find a good unique ID for Libgen books. Likely ISBN, but those might be null, or maybe there's a better ID
[ ] Re-do current books table to use that as primary_key
[ ] Write migration to clean up currently-saved books, removing duplicates and keeping the entry with the most interactions (thumbs/etc.)
[ ] Also consider a simple Levenshtein-distance check on book_title + book_author in case the ISBNs differ but the books are duplicates. Sometimes the same book is uploaded twice with a single-character change in the title
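For the first item, one workable canonical ID is the ISBN normalized to 13 digits (ISBN-10 records converted via the standard 978-prefix + recomputed check digit), falling back to None when the field is empty or malformed so those rows can be handled separately. A minimal sketch; the function names are mine, not from the existing codebase:

```python
from typing import Optional

def isbn10_to_isbn13(isbn10: str) -> str:
    """Convert a bare 10-digit ISBN to ISBN-13: prefix 978, drop the old
    check digit, and append the ISBN-13 check digit (alternating 1/3 weights)."""
    core = "978" + isbn10[:9]
    check = (10 - sum(int(d) * (1 if i % 2 == 0 else 3)
                      for i, d in enumerate(core)) % 10) % 10
    return core + str(check)

def canonical_isbn(raw: Optional[str]) -> Optional[str]:
    """Normalize a raw ISBN field to a 13-digit key, or None if unusable."""
    if not raw:
        return None
    s = raw.replace("-", "").replace(" ", "").upper()
    if len(s) == 13 and s.isdigit():
        return s
    if len(s) == 10 and s[:9].isdigit():  # last char may be 'X' in ISBN-10
        return isbn10_to_isbn13(s)
    return None

print(canonical_isbn("0-306-40615-2"))  # → 9780306406157
```

Rows where `canonical_isbn` returns None would still need some fallback key (e.g. the Libgen book_id, or a title+author hash), since a nullable primary key won't work.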
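The dedup migration could be a single DELETE using a window function: partition by the canonical ISBN, rank by interaction count, and drop everything but rank 1. Sketch below against a hypothetical `books(book_id, isbn, title, interactions)` schema in SQLite; the real table and column names will differ:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE books (book_id INTEGER PRIMARY KEY, isbn TEXT, title TEXT, interactions INTEGER);
INSERT INTO books VALUES
  (1, '9780141182636', 'Ulysses', 3),
  (2, '9780141182636', 'Ulysses', 7),
  (3, '9780061120084', 'To Kill a Mockingbird', 1);
""")

# Per ISBN, keep only the row with the most interactions
# (ties broken by lowest book_id); delete the rest.
conn.execute("""
DELETE FROM books
WHERE book_id NOT IN (
  SELECT book_id FROM (
    SELECT book_id,
           ROW_NUMBER() OVER (PARTITION BY isbn
                              ORDER BY interactions DESC, book_id) AS rn
    FROM books
  ) WHERE rn = 1
)
""")

remaining = [r[0] for r in conn.execute("SELECT book_id FROM books ORDER BY book_id")]
print(remaining)  # → [2, 3]
```

Window functions need SQLite ≥ 3.25; if the deployed version is older, the same "keep max interactions per ISBN" logic can be done with a correlated subquery instead.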
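The Levenshtein check from the last item could look like this: build a normalized title+author string per book and flag pairs within a small edit distance. A sketch with a plain two-row DP implementation (avoids pulling in an external Levenshtein package); the threshold of 2 and the helper names are arbitrary choices, not anything from the codebase:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via the classic DP, keeping only two rows of the table."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,        # deletion
                            curr[j - 1] + 1,    # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def looks_like_duplicate(title1, author1, title2, author2, max_dist=2):
    """Flag two books as likely duplicates if their normalized
    title+author strings are within max_dist edits of each other."""
    key1 = f"{title1.strip().lower()} {author1.strip().lower()}"
    key2 = f"{title2.strip().lower()} {author2.strip().lower()}"
    return levenshtein(key1, key2) <= max_dist

print(looks_like_duplicate("The Hobbit", "Tolkien", "The Hobbbit", "Tolkien"))  # → True
```

Since this is O(n·m) per pair, it's best run only within groups that already share some blocking key (first word of the title, same author, etc.) rather than across all pairs.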