jojje / imdb-sqlite

GNU General Public License v2.0

Select specific TSV files for import #7

Open ItsMeMarc opened 4 days ago

ItsMeMarc commented 4 days ago

Hello,

As you already mentioned in your description, the database becomes very large. To keep the database smaller, it would be nice to be able to select which TSV files should be imported. In my case, for example, only the files title.basics.tsv.gz, title.akas.tsv.gz, and title.episode.tsv.gz are of interest (at least for now). Perhaps there is the possibility to implement this as a parameter. Thank you.

Best regards, Marc

jojje commented 4 days ago

That's a good idea. I can see it being useful when running on small rented VMs in the cloud or similar.

The tricky bit is how to surface that in the CLI in a self-explanatory way. The challenge is not technical. It is about users and how much they can be assumed to already understand about the Amazon/IMDb datasets: the implicit relations between the different files, and which of those files hold the specific data (projections) they want to extract.

For instance, can we assume:

  1. All users are already familiar with the TSV files, and know what each contains?
  2. They know the implicit relations between the partial bits of information those various files contain?
  3. They are able to figure out which specific files they need for their specific task?

Just loading all those files, as is currently done, sidesteps those problems completely, because it offers a consistent dataset with everything a user could possibly want to extract. Cherry-picking opens a can of worms from a user's perspective.

If we at least assume the user has read the readme for this project, then they have a mental model of how things relate, and they should be able to figure out from the diagram which tables they need in order to get the data they're after. It follows that surfacing an option for specifying a subset of table names would be the preferable approach. The program would then fetch only the corresponding TSV files and create the subset of relations that this data allows for.
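To make the idea concrete, here is a rough sketch of what a table-subset option could look like. It is not the project's actual code: the `--only` flag, the `TABLE_FILES` mapping, the table names, and both function names are assumptions made up for illustration. Only the TSV file names come from the public IMDb dataset.

```python
import argparse

# Hypothetical mapping from table name to IMDb TSV file. The table names are
# assumptions for illustration; the file names are those of the IMDb dataset.
TABLE_FILES = {
    "people": "name.basics.tsv.gz",
    "titles": "title.basics.tsv.gz",
    "akas": "title.akas.tsv.gz",
    "crew": "title.principals.tsv.gz",
    "episodes": "title.episode.tsv.gz",
    "ratings": "title.ratings.tsv.gz",
}


def files_for(tables):
    """Return the TSV files for the requested tables, in mapping order."""
    unknown = sorted(set(tables) - set(TABLE_FILES))
    if unknown:
        raise ValueError("unknown table(s): " + ", ".join(unknown))
    return [TABLE_FILES[t] for t in TABLE_FILES if t in tables]


def parse_args(argv=None):
    """Parse a hypothetical --only option; the default selects every table."""
    parser = argparse.ArgumentParser(description="table-subset sketch")
    parser.add_argument(
        "--only",
        metavar="TABLES",
        default=",".join(TABLE_FILES),
        help="comma-separated tables to import (default: all)",
    )
    return parser.parse_args(argv)


# Example: select just the titles and episodes tables.
args = parse_args(["--only", "titles,episodes"])
print(files_for(args.only.split(",")))
```

Rejecting unknown table names up front keeps a typo from silently producing a smaller database than the user expected.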

What are your thoughts on a solution along that line?

ItsMeMarc commented 3 days ago

I also think that this should be done as intuitively as possible. In my opinion, importing only the title.basics.tsv.gz file makes no sense, as the links between series and episodes are then missing. Therefore, the title.episode.tsv.gz file should also always be imported. In addition, it also makes sense to take the title.akas.tsv.gz file into account so that you really have all the titles in the database. This could be the minimal option.

And then I see two additional options: Ratings and Crew/People

When I think about it, I would suggest a total of four options for the import process:

  1. Complete (default)
  2. Titles only (see above)
  3. Titles with ratings
  4. Titles with crew/people

If you give users these four options to choose from, they don't even need to know the dependencies. What do you think?
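The four options could be expressed as named presets, roughly like the sketch below. The preset names, the `PRESETS` table, and `files_for_preset` are hypothetical, and which files belong to "crew/people" and to "complete" is an assumption on my part. Only the TSV file names are taken from the IMDb dataset.

```python
# Minimal set: titles, their alternative titles, and series/episode links.
TITLE_FILES = [
    "title.basics.tsv.gz",
    "title.akas.tsv.gz",
    "title.episode.tsv.gz",
]

# Assumed file groupings for the two optional additions.
RATING_FILES = ["title.ratings.tsv.gz"]
PEOPLE_FILES = ["title.principals.tsv.gz", "name.basics.tsv.gz"]

# Hypothetical preset names for the four proposed import options.
PRESETS = {
    "complete": TITLE_FILES + RATING_FILES + PEOPLE_FILES,
    "titles": TITLE_FILES,
    "titles-ratings": TITLE_FILES + RATING_FILES,
    "titles-people": TITLE_FILES + PEOPLE_FILES,
}


def files_for_preset(name="complete"):
    """Return a copy of the file list for a preset, defaulting to everything."""
    if name not in PRESETS:
        raise ValueError("unknown preset: " + name)
    return list(PRESETS[name])


# Example: the "titles with ratings" option.
print(files_for_preset("titles-ratings"))
```

Because every preset starts from the same title files, the series/episode links are always present and the user never has to think about dependencies between files.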