Analysing Books and Reviews, Maybe Social Media

ZaneReda commented 1 year ago

The current plan is to establish associations between the content of books and the community surrounding a given book (its reviews). We need to decide whether our data collection should also cover the social platform on Goodreads. Like any other social media platform, Goodreads has a follower and like system that connects readers, which has opened the door for influencers and authors to communicate directly with their following. We should discuss whether that has value for our goals.

This could be a proof of concept for our own potential application. Furthermore, there are parameters for our recommendation system that could be drawn from such data to improve the overall recommendation: if you are following an author or reader who has recently read a book that aligns with your genres of interest, we can assign a higher affinity to that book. This would just be the beginning, as we could build groups of individuals based on who they follow and recommend books they could share common interests over.

The biggest drawback is our time frame. Right now the pipeline follows the steps outlined below. They show a general structure but do not need to flow in exactly this order, meaning we can start exploring user-to-user connections without having all the data for the books.

1. Finish Scrapers - We can currently gather all information on books and push it to our MongoDB. We are now looking at gathering the reviews to build user-to-user connections. These connections are not currently planned to incorporate the user's social platform, but they could.
2. Gathering Meta Data from NLP - Using the descriptions of the books, we may be able to find correlations between user preferences; using a user's reviews, we may be able to gather insights into the book and the user. This will hopefully allow us to recommend books by their plot, not just their categories.
3. Model Exploration - The current plan is to start with XGBoost, since it tends to perform well across a wide range of tasks. From there we will explore our other options.
4. Potential Use of Social Media Parameters - Could be incorporated into the main recommender if we have time, or added as a next steps section.
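
To make the follow-based affinity idea concrete, here is a minimal sketch of how such a boost could work. Everything in it is an assumption for illustration: the `Reader` fields, the `social_affinity` name, the 90-day window, and the 0.25 boost are placeholders, not anything we have built or tuned.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import date, timedelta


@dataclass
class Reader:
    # Hypothetical user record; field names are illustrative, not our schema.
    user_id: str
    genres: set[str] = field(default_factory=set)
    following: set[str] = field(default_factory=set)


def social_affinity(reader, book_genres, recent_reads, today,
                    base=1.0, boost=0.25, window_days=90):
    """Return a base score, raised once per followed account that recently read the book.

    recent_reads maps a user_id to the date that user finished the book.
    The boost only applies when the book overlaps the reader's genres.
    """
    if not (reader.genres & book_genres):
        return base
    cutoff = today - timedelta(days=window_days)
    recent_followed = [
        uid for uid, read_on in recent_reads.items()
        if uid in reader.following and read_on >= cutoff
    ]
    return base + boost * len(recent_followed)
```

A reader who follows nobody simply gets the base score, so a feature like this would degrade gracefully for new users rather than break the recommendation.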

I will later attach a visual aid showing our current data layout as well as where we plan to go. Let's discuss here what we think of the inclusion of social media and whether or not it's feasible.

To get the conversation started: is this within the scope of our objective? How much value will be added with social media indicators? Is it realistic to use social media indicators outside of the social platform? New users who come to us looking for a recommendation will have no social media presence that we can track, so this may only be applicable if we are the platform. Unless y'all have other ideas? If I missed something in the steps above, feel free to point it out.

ismadoukkali commented 1 year ago

Hey hey, here's my input:

Point 1. - The review scraper has been finished. I tested it with a batch of 10 books and no errors were raised. Still, I say we stress test it, especially as reviews can come in many different formats and could potentially throw an error when appending them all to the dict.
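
One way to keep a single odd review from killing a whole batch is to validate each one before it goes into the dict and set the failures aside for inspection. This is only a sketch under assumptions: `collect_reviews` and the raw keys (`review_id`, `rating`, `text`, `date`) are hypothetical names, not the scraper's actual code, and it treats a missing rating as malformed, which we may want to relax.

```python
def collect_reviews(raw_reviews):
    """Append well-formed reviews to a dict keyed by review id; set aside the rest."""
    reviews, rejected = {}, []
    for raw in raw_reviews:
        try:
            review_id = raw["review_id"]
            rating = int(raw["rating"])
            if not 1 <= rating <= 5:
                raise ValueError(f"rating out of range: {rating}")
            reviews[review_id] = {
                "rating": rating,
                "text": (raw.get("text") or "").strip(),
                "date": raw.get("date"),
            }
        except (KeyError, TypeError, ValueError) as exc:
            # Keep the offending record and the reason so we can inspect it later.
            rejected.append((raw, repr(exc)))
    return reviews, rejected
```

Running the stress test then comes down to feeding in a large batch and looking at what ends up in `rejected`.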

Here's the data the scraper collects:

Mapped to Book_ID:

Total number of ratings
Total number of reviews
% 5 stars
% 4 stars
% 3 stars
% 2 stars
% 1 star
n 5 stars
n 4 stars
n 3 stars
n 2 stars
n 1 star
List of Review_IDs

Mapped to User:

Rating
Text
Date
Book_ID (href that Zane will provide)
Review_ID (href from the review, which the scraper will provide)
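
For discussion, here is a rough sketch of how these two record types could look in code. The class and field names are assumptions that mirror the lists above, not the scraper's actual output. It also assumes the total number of ratings equals the sum of the per-star counts, in which case the percentages can be derived instead of stored.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class BookRatings:
    # Hypothetical per-book summary, keyed by Book_ID.
    book_id: str
    total_reviews: int
    star_counts: dict[int, int]  # stars (1-5) -> number of ratings
    review_ids: list[str] = field(default_factory=list)

    @property
    def total_ratings(self) -> int:
        # Assumes every rating falls into exactly one star bucket.
        return sum(self.star_counts.values())

    def star_percentages(self) -> dict[int, float]:
        """Derive the % per star level from the raw counts."""
        total = self.total_ratings
        if total == 0:
            return {stars: 0.0 for stars in self.star_counts}
        return {stars: round(100 * n / total, 1) for stars, n in self.star_counts.items()}


@dataclass
class Review:
    # Hypothetical per-review record, linked to its book through book_id.
    review_id: str
    book_id: str
    rating: int
    text: str
    date: str
```

If the scraped percentages ever disagree with the derived ones, that is a cheap signal that a page was parsed incorrectly.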

ismadoukkali commented 1 year ago

For the rest of the points, I believe the NLP Meta Data idea can be very insightful. We would need to check the review distribution and see if there are enough tokens to make inferences. My intuition tells me that there are, as many book reviews can be over 500 words long... I believe we have to investigate that first to set a scope on this, even before we do the Social Media Integration.

I propose the following mini-steps.

1. Scrape NLP Meta-Data (if it's not already scraped).
2. Do an EDA of the reviews over a significant number of data points, roughly 15-25k.
3. Then evaluate.
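
As a starting point for the EDA step, a first pass could just summarise how long the reviews are. This is a minimal sketch that uses whitespace splitting as a crude stand-in for real tokenisation; `review_length_summary` and the 50-token threshold are placeholders I made up, not agreed values.

```python
import statistics


def review_length_summary(texts, min_tokens=50):
    """Summarise whitespace-token counts for the non-empty review texts."""
    counts = [len(text.split()) for text in texts if text and text.strip()]
    if len(counts) < 2:
        raise ValueError("need at least two non-empty reviews to summarise")
    quartiles = statistics.quantiles(counts, n=4)
    return {
        "n_reviews": len(counts),
        "mean": statistics.mean(counts),
        "median": statistics.median(counts),
        "q1": quartiles[0],
        "q3": quartiles[2],
        "max": max(counts),
        # Share of reviews long enough to be worth running NLP on.
        "share_at_least_min": sum(c >= min_tokens for c in counts) / len(counts),
    }
```

If most reviews fall below whatever threshold we settle on, that would be a reason to narrow the scope of the NLP step.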

What do you think?

thehitchhikersguideto / bookworms

Analysing Books and Reviews, Maybe Social Media #28