nandhp / python-imdb

Python interface to IMDb plain-text data files
BSD 2-Clause "Simplified" License

Data Format #3

Open snsiox opened 9 years ago

snsiox commented 9 years ago

I'm wondering what format the imdb.zip data file is in, ideally so that I can read it myself. I intend to do some processing that involves iterating through movies, and I can't find a way to do anything besides search with this. As far as I can tell, this program does a fine job of cleaning and organizing the raw files from the IMDb, but it offers no way to iterate. Is this possible, and/or is the data format documented (or standard) in a way that would let me iterate myself?

nandhp commented 9 years ago

The imdb.zip file contains roughly the same content as the *.list.gz files, with only minimal processing -- primarily the removal of data for video games and TV episodes (in order to save space). Each data file is packed into the ZIP file in chunks, which allows relatively efficient seeking to arbitrary positions. To facilitate lookups by title, some data files encode the title of the first entry in the name of each chunk, while other files use a separate index. This format is implemented by chunkedfile.py, which includes a command-line interface for viewing the contents of a file in imdb.zip.
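To illustrate the general idea of chunking with the first entry encoded in each chunk name, here is a minimal, self-contained sketch. It is not the code from chunkedfile.py: the function names build_chunked_zip and find_line, the naming scheme (each ZIP member named exactly after its first line), and the sample titles are all assumptions made for this example, and the real on-disk layout may well differ.

```python
import bisect
import io
import zipfile


def build_chunked_zip(lines, chunk_size):
    """Pack already-sorted lines into an in-memory ZIP, one member per chunk.

    Hypothetical naming scheme for this sketch: each member is named after
    the first line it contains.
    """
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for start in range(0, len(lines), chunk_size):
            chunk = lines[start:start + chunk_size]
            zf.writestr(chunk[0], "\n".join(chunk))
    return buf


def find_line(zip_buf, title):
    """Return the line equal to title, decompressing only a single chunk."""
    with zipfile.ZipFile(zip_buf) as zf:
        names = sorted(zf.namelist())
        # The only chunk that can hold `title` is the last one whose
        # first entry sorts at or before it.
        pos = bisect.bisect_right(names, title) - 1
        if pos < 0:
            return None
        for line in zf.read(names[pos]).decode("utf-8").split("\n"):
            if line == title:
                return line
    return None
```

Because the chunk names are sorted, a binary search over the member names picks out the one chunk worth decompressing, which is what makes seeking cheap compared with scanning a single large gzip stream.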

For enumerating the data, the imdb.zip file is not actually necessary. You should be able to iterate over the parsed contents of a data file using expressions like these:

import imdb.parsers
imdb.parsers.IMDbMoviesParser(dbdir='imdb').search()
imdb.parsers.IMDbRatingParser(dbfile='imdb.zip').search()

Here, dbfile is the path to an imdb.zip file and dbdir is the path to a directory containing the needed *.list.gz file (or files). Note that imdb.zip does not contain movies.list.gz or aka-titles.list.gz, so you'll need to use dbdir for them. Also, some search methods return a dictionary, while others return a list or an iterator -- see the _make_result and search methods of each parser.
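Since the return type varies between parsers, one option is a small wrapper that lets you loop over any of them the same way. This is only a sketch: iter_results is a hypothetical helper, not part of python-imdb, and it assumes nothing about the parsers beyond the dictionary/list/iterator distinction described above. It uses `yield from`, so it needs Python 3.

```python
from collections.abc import Mapping


def iter_results(result):
    """Yield (key, value) pairs from whatever a search() call returned.

    Dictionaries yield their own (key, value) pairs; lists and iterators
    have no keys, so each item is paired with None instead.
    """
    if isinstance(result, Mapping):
        yield from result.items()
    else:
        for item in result:
            yield None, item
```

With that in place, a loop such as `for key, value in iter_results(parser.search()):` should work regardless of which parser produced the result, where parser is one of the parser objects constructed as shown earlier.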

I hope this helps. Let me know if anything is unclear!