Open prust opened 6 years ago
@stacytao: I hope you don't mind, but I ported your `clean_genres()` function to JavaScript and included it in the wikipedia-movie-data project. I pulled `invalid_genres` and `genre_replacements` into separate JSON files so that users of the library can reference (and possibly edit) them more easily.
I also did more spot-checking and found a number of issues in the data. I ended up omitting the `director` and `notes` properties, since those will take a bit more work to get accurate, and turned `cast` and `genres` into arrays, so they're more useful. I made the genre-cleaning case-insensitive, so it would pick up & fix more issues, and added some more obvious fixes to it. The unique set of genres for years 2000-2018 isn't 100% clean yet, but it's much better.
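For anyone curious what the array conversion and case-insensitive cleaning might look like, here is a rough sketch. It is not the code from the project: `toArray`, `cleanGenres`, and the entries in `SAMPLE_REPLACEMENTS` are hypothetical names and values chosen for illustration, and the comma-separated input format is an assumption.

```javascript
// Hypothetical replacement table; the real genre_replacements JSON file may differ.
const SAMPLE_REPLACEMENTS = {
  'sci-fi': 'Science Fiction',
  'rom-com': 'Romantic Comedy',
};

// Split a comma-separated string (assumed input format) into a trimmed array,
// dropping empty entries.
function toArray(value) {
  return String(value || '')
    .split(',')
    .map((part) => part.trim())
    .filter((part) => part.length > 0);
}

// Look up each genre case-insensitively and apply a replacement if one exists;
// genres without a replacement pass through unchanged.
function cleanGenres(rawGenres, replacements = SAMPLE_REPLACEMENTS) {
  const lookup = new Map(
    Object.entries(replacements).map(([from, to]) => [from.toLowerCase(), to])
  );
  return toArray(rawGenres).map(
    (genre) => lookup.get(genre.toLowerCase()) || genre
  );
}
```

With this sketch, `cleanGenres('SCI-FI, Drama')` would yield `['Science Fiction', 'Drama']`, because the lookup key is lowercased before matching.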
Update: I switched from a blacklist approach to a whitelist approach and added automatic genre-splitting on spaces & dashes, which made the genre-replacement metadata much shorter & more manageable. I also added more quality checks for all the data (1900-2018) and fixed a few more parsing issues. I ended up deriving the genre whitelist from your original cleanup routine and adding four more genres to it (some of which aren't genres but may be useful for tagging/characterizing movies): Independent, Mystery, Noir, Short.
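Roughly, the whitelist-plus-splitting idea could be sketched like this. The names `SAMPLE_WHITELIST`, `TOKEN_REPLACEMENTS`, and `whitelistGenres` are hypothetical, the whitelist shown is only a small illustrative subset, and also splitting on commas is my assumption rather than something the project necessarily does.

```javascript
// Illustrative subset only; the actual whitelist in the project is longer.
const SAMPLE_WHITELIST = [
  'Action', 'Comedy', 'Drama', 'Independent',
  'Mystery', 'Noir', 'Romance', 'Short',
];

// Hypothetical per-token replacements, keyed by lowercase token.
const TOKEN_REPLACEMENTS = new Map([['romantic', 'Romance']]);

// Split the raw genre text on spaces and dashes (and, as an assumption, commas),
// map each token through the replacements, and keep only whitelisted genres.
function whitelistGenres(
  raw,
  whitelist = SAMPLE_WHITELIST,
  replacements = TOKEN_REPLACEMENTS
) {
  // Map lowercase name -> canonical spelling for case-insensitive matching.
  const allowed = new Map(whitelist.map((g) => [g.toLowerCase(), g]));
  const result = [];
  for (const token of String(raw || '').split(/[\s,-]+/)) {
    const key = token.toLowerCase();
    if (!key) continue; // skip empty tokens from leading/trailing separators
    const mapped = (replacements.get(key) || key).toLowerCase();
    const canonical = allowed.get(mapped);
    if (canonical && !result.includes(canonical)) result.push(canonical);
  }
  return result;
}
```

Because every token is checked against the whitelist, unknown words such as "film" in "film noir" simply drop out, which is why the replacement table can stay short. One limitation of this sketch is that multi-word genres would be split apart and would need their own handling.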
@stacytao: Thanks for using my wikipedia-movie-data. I just updated it to include 2018 movies in https://github.com/prust/wikipedia-movie-data/commit/f7b29fdd (I had to pull these from "2018 in film", since apparently they don't post the "List of American films of 2018" until after the end of the year).
To update, you should be able to just drop in the latest movies.json file (includes 1930-2018) or re-run the program if you want to pull movie data for specific years.
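As a quick illustration of working with the dropped-in file, here is a small sketch that filters records by year and tallies genres. It assumes each record has a numeric `year` and a `genres` array (the `title` and `year` field names are my assumption here); `moviesBetween`, `genreCounts`, and the sample records are hypothetical and made up for the example.

```javascript
// Made-up sample records in the assumed shape of movies.json entries.
const sampleMovies = [
  { title: 'Example A', year: 2017, cast: ['Actor One'], genres: ['Drama'] },
  { title: 'Example B', year: 2018, cast: [], genres: ['Drama', 'Mystery'] },
  { title: 'Example C', year: 1999, cast: [], genres: ['Comedy'] },
];

// Return the movies whose year falls within [fromYear, toYear], inclusive.
function moviesBetween(movies, fromYear, toYear) {
  return movies.filter((m) => m.year >= fromYear && m.year <= toYear);
}

// Count how many movies carry each genre; returns a Map of genre -> count.
function genreCounts(movies) {
  const counts = new Map();
  for (const movie of movies) {
    for (const genre of movie.genres || []) {
      counts.set(genre, (counts.get(genre) || 0) + 1);
    }
  }
  return counts;
}
```

In Node.js the real data could be loaded with something like `JSON.parse(fs.readFileSync('movies.json', 'utf8'))` and passed in place of `sampleMovies`.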
Please let me know if there's anything I can do to make the data more useful or easier to use in your project (sqlite format, SQL text format, etc).