veekun / pokedex

more than you ever wanted to know about Pokémon
MIT License
1.44k stars 637 forks source link

Refactor pokemon_moves #237

Open magical opened 6 years ago

magical commented 6 years ago

The pokemon_moves table is the largest table in the pokedex by far, both in terms of number of rows (currently 465817!) and size of the csv file. It's also growing at a quadratic rate. It would be nice to shrink it down a bit.

The best idea I've come up with is to introduce the concept of a moveset. A moveset is a set of moves. A pokemon's move pool is the union of all its movesets. This is motivated by the fact that move pools tend to change slowly over time, and different move methods change at different rates. For example, tutored moves usually change significantly in every version, whereas egg moves tend to be relatively static - only only adding moves now and then and rarely removing them. TM moves tend to change a lot between generations but not so much between versions, and so on.

Specifically pokemon_moves is currently defined as

pokemon_moves
    pokemon_id -> pokemon.id
    version_group_id -> version_groups.id
    move_id -> moves.id
    pokemon_move_method_id -> pokemon_move_method.id
    level integer nullable
    order integer nullable

I propose we remove it and add three new tables.

pokemon_moveset_map
    pokemon_form_id -> pokemon_form.id
    version_group_id -> version_groups.id
    moveset_id -> pokemon_moveset.id

pokemon_movesets
    id integer primary key
    pokemon_species_id -> pokemon_species.id
    pokemon_move_method_id -> pokemon_move_method.id

pokemon_moveset_moves
    moveset_id -> pokemon_movesets.id
    move_id -> moves.id
    level integer nullable
    order integer nullable

The pokemon_moveset_moves table, which is roughly equivalent to the old pokemon_moves table, is one column smaller. It should also have fewer rows because we can now reuse movesets.

Note that this proposal eschews the pokemon table in favor of linking directly to the pokemon_forms table. We can do that because forms can share movesets, meaning that we won't have to duplicate thousands of rows in pokemon_moves for aesthetic-only forms. The pokemon_movesets.pokemon_species_id column is mostly for documentation purposes, but i also don't think it would be very useful to allow sharing movesets across species.

ISTR doing a rough analysis a while ago of how much space this would save, but I'll have to find it again.

jpirnat commented 6 years ago

I'm not a fan of this proposal.

The pokemon_moves table is currently keyed to pokemon_id, not pokemon_form_id, so there is currently no duplication of moveset data across aesthetic-only forms. This proposal would create that kind of duplication (albeit in a compressed form).

The suggested schema does make sense for the stated goal of data compression. I took a completely non-scientific look at a random Pokémon's moveset, and it looks like the suggested schema would mostly de-duplicate data within a single generation (e.g., one "moveset" for Eevee's level-up moves across all games in gen 4, one "moveset" for Eevee's egg moves across all games in gen 5, etc). But I question whether this optimization really gets us anything other than disk space.

Will this schema make it easier or harder to query movesets than the current pokemon_moves schema? Will existing queries on the new tables be as performant as queries on the old table? (I'm thinking about the queries that power veekun's Awesome Search.) Will the new schema make other queries simpler/better/faster that were previously complex or slow?

As a data buff I'm all for fully normalized and optimized databases, but in practice, I don't think this change would be worth it. Heck, if growing csv file sizes is the underlying issue, maybe the solution is to be able to split pokemon_moves.csv (and other big files for big tables) into pokemon_moves1.csv, pokemon_moves2.csv, etc.

magical commented 6 years ago

Thanks for your comments!

The pokemon_moves table is currently keyed to pokemon_id, not pokemon_form_id, so there is currently no duplication of moveset data across aesthetic-only forms. This proposal would create that kind of duplication (albeit in a compressed form).

This is a separate discussion, but the idea of dropping the pokemon table, leaving only pokemon_species and pokemon_forms, has been bouncing around IRC for quite some time now. One problem with this idea is that it would bloat the move table quite a bit—something that this proposal addresses. But again, that's a separate discussion and I'll open an issue for it when I'm ready.

While that's one of the motivations behind this proposal, it is not the only motivation; nor is it a prerequisite. If it helps, pretend that I didn't mention it, and imagine that pokemon_movesets is keyed on pokemon_id instead of pokemon_form_id.

Will this schema make it easier or harder to query movesets than the current pokemon_moves schema? Will existing queries on the new tables be as performant as queries on the old table? (I'm thinking about the queries that power veekun's Awesome Search.) Will the new schema make other queries simpler/better/faster that were previously complex or slow?

Queries will require an extra couple of joins. One could probably create a pokemon_moves view which simulated the old table if necessary.

I'm not sure how it would affect performance. Generally, PostgreSQL seems to be able to handle whatever we throw at it, so i'm not too worried.

[...] But I question whether this optimization really gets us anything other than disk space. [...] Heck, if growing csv file sizes is the underlying issue, maybe the solution is to be able to split pokemon_moves.csv (and other big files for big tables) into pokemon_moves1.csv, pokemon_moves2.csv, etc.

Good point. I'm not actually concerned about disk space, although my first post makes it sound like I am. My main concern is actually the time it takes to load the database from the CSV files (which i assume is directly proportional to their size). It takes almost a minute to load pokemon_moves on my machine, which is far longer than any other table. I tend to load&dump tables a lot while working on the pokedex, so anything we can do to speed up that process is a win in my eyes.