hummingbird-me / kitsu-tools

:hammer: The tools we use to build Kitsu, the coolest platform for anime and manga
https://kitsu.io
Apache License 2.0
2.09k stars 265 forks source link

Fix SQL dump used for database seeding #784

Closed DNA closed 7 years ago

DNA commented 7 years ago

The current SQL dump used on bin/seed has a few problems that need to be addressed:

After tinkering with the file trying to make it work, I've fixed it with the following command:

gzcat original-file.sql.gz | sed -e 's/SET row_security = off;//g' | sed -e 's/show_type/subtype/g'  | sed -e 's/manga_type/subtype/g' | sed 1,3970d | gzip > fixed-file.sql.gz

It's possible that this dump would also have some problems with the current data generated by rake db:seed, but I can confirm it right now, if someone could also check it would be nice :)

NuckChorris commented 7 years ago
  1. Postgres version doesn't affect SQL dumps I think 2/3. This was very much intentional: the dump contains the structure and all the migrations that had run at the time of dump, so you can migrate up to the latest. The tooling isn't quite fixed for it though
DNA commented 7 years ago

Version 9.5 generate a SET row_security = off; that 9.2 can't understand. Fortunately it's only this, could have been worse :P

And I have to disagree with the dump current intention. Rails already take care of all the migrations and database structure, having it on the dump is completely redundant, and with our current setup, it's only purpose is to overload stderr (i've saved the error log on a text file, is more than 110000 lines of error its generating)

Rails also has a seed feature, but since it's a lot of data, I agree that this external sql is a good thing, but it must be just it: data to feed a clean database already created by Rails.

NuckChorris commented 7 years ago

Rails doesn't handle this. You can't say "rails make me the database as it was on January 20th when we dumped it"

If you have rails handle the initial schema the dump will break any time we remove a column.

If we handle the schema as part of the dump, a snapshot of the schema is stored and van be safely migrated to present structure.

The only time it'll overload stderr is when the schema doesn't match the data, literally the problem I solved by including the schema.

It should feed a clean database created by rails: no tables, nothing.

DNA commented 7 years ago

Of course Rails can't go back in time. The main point is that the dump must be regenerated within each migration. It must follow the schema.rb that rails generate each time a migration is run. It doesn't matter if it's a sql dump, a csv file or any other means to store data.

If you wan to keep storing schema on that dump, the whole workflow for working with the database must be changed to prevent Rails from creating tables and seeding. bin/build and bin/seed are two examples of that.

Also, the whole dump must be recreated, because it isn't destroying a table if already exists, so it won't work if rails already created the structure of if you messed something and want to rebuild everything from scratch. And even this dump would be problematic, because a lot of "drop if exists" statements won't be supported on our current version of Postgres.

NuckChorris commented 7 years ago

It doesn't need to be recreated with this dumping approach. This approach provides a full snapshot in time of both schema and data (and the migrations which have been run on said schema and data), which can be used to create an exact replica of the production media database.

It can lag behind schema.rb because it includes the schema in it. You just run rake db:migrate and it'll be migrated to match the schema.rb. That means we can just take the dump weekly instead of after every migration, which is a hell of a lot easier.

Yes, the setup scripts need to be changed to match the new approach where we create an empty database without schema, load the data in, and then migrate. I've been meaning to do that for a while but I wanna say @cybrox said he'd try and work on that(?)

We don't need to make it use "drop if exists" because it should only ever be run on an empty database.

Also we should definitely upgrade the postgres version in development to 9.4 (or is it 9.5?) to match production. Dunno why it's on 9.2 still, that shit's old.

DNA commented 7 years ago

That's a lot of configuration over convention, but be it. I'll fix those scripts to work with that whole dump.

NuckChorris commented 7 years ago

Rails doesn't really have a convention for this. Believe me, if they did, I would use it. But db:seed is just meant to prep all environments (production, development, staging, test) with data. This data is just for one environment (development), so we're outside Rails conventions.