
sdl60660 / letterboxd_recommendations

Scraping publicly-accessible Letterboxd data and creating a movie recommendation model with it that can generate recommendations when provided with a Letterboxd username

https://bit.ly/movie-recs-letterboxd

GNU General Public License v3.0

263 stars, 18 forks

Dataset snapshots #1

Closed by mlaugharn 3 years ago

mlaugharn commented 3 years ago

Would it be possible to redistribute snapshots of the dataset so that people wouldn't need to duplicate the scraping? For example, via automated torrents or something similar.

sdl60660 commented 3 years ago

Yep, exported CSVs of the movie/user collections live in data_processing/data and could be used to start to populate a local Mongo database. The reviews collection is enormous, so I couldn't include the export in this remote repo, but there are definitely a few better ways to do this.

Let me get back to you on this after the next data update (I've been updating the data for the live site's model monthly), and I'll add something to the README, too.
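For anyone who wants to try the CSV route mentioned above in the meantime, here is a rough sketch of loading one exported CSV into a local Mongo collection in batches. It is not the repo's own loader. The function names, the batch size, and the file, database, and collection names in the usage comment are all placeholders I made up, and it assumes the CSV has a header row. The only thing it needs from the collection object is an `insert_many` method, which PyMongo collections provide.

```python
import csv
from itertools import islice


def iter_batches(csv_path, batch_size=1000):
    """Yield lists of row dicts (one dict per CSV row), at most batch_size rows each."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)  # uses the header row as dict keys
        while True:
            batch = list(islice(reader, batch_size))
            if not batch:
                return
            yield batch


def load_csv_into_collection(csv_path, collection, batch_size=1000):
    """Insert every row of csv_path into collection; return the number of rows inserted."""
    inserted = 0
    for batch in iter_batches(csv_path, batch_size):
        collection.insert_many(batch)  # PyMongo's Collection.insert_many accepts a list of dicts
        inserted += len(batch)
    return inserted


# Hypothetical usage (requires `pip install pymongo` and a running local mongod;
# the file, database, and collection names below are placeholders):
#
#   from pymongo import MongoClient
#   movies = MongoClient("mongodb://localhost:27017")["letterboxd"]["movies"]
#   load_csv_into_collection("data_processing/data/your_export.csv", movies)
```

Every value comes out of the CSV as a string, so numeric fields would need converting before or after insertion. Batching keeps memory use bounded, which matters more for a large export than for the movie/user files.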

mlaugharn commented 3 years ago

awesome thank u :)

sdl60660 commented 3 years ago

All set! You can find data up to the latest crawl here: https://www.kaggle.com/samlearner/letterboxd-movie-ratings-data

I've also added some instructions to the README for using this data and running the rest of the code on your own, though obviously you're free to do whatever you want with the data.