mezuro / kalibro_processor

Reimplementation of Kalibro processing
GNU Affero General Public License v3.0
2 stars 10 forks source link

Overhaul database schema #215

Open danielkza opened 8 years ago

danielkza commented 8 years ago

Overhaul database schema by adding indexes, foreign key constraints, cleaning up stale data, and fixing tests that relied on hardcoded/non-existing IDs.

The changes are a bit long but relatively straight-forward. I ended up changing the runner steps layout a lot because it was very hard to understand what was going on and how models were related, which proved necessary to remove the hardcoding issues.

The constraints also showed that running tests using the transaction DatabaseCleaner strategy was prone to failure - the processing jobs don't run in the same transaction as Cucumber and won't see much of the data created. Switching to the truncation strategy fixes it.

I tested the data cleanup migration quite extensively, using the database dump from production at mezuro.org. It would be wise to perform a backup anyway and be ready to restore it if anything goes wrong.

rafamanzo commented 8 years ago

Have you tested deletion instead of truncation? Which one is faster? What is the performance drop for leaving the transaction strategy? (https://github.com/DatabaseCleaner/database_cleaner#additional-activerecord-options-for-truncation)

Could you pull https://github.com/mezuro/kalibro_processor/pull/215/commits/4d0982ec1e0cacec5e5eab048eb3e402bbc7fc31 out into a new PR?

It'd be nice because I'm thinking about creating a new branch (v2.0.0 ?) where we can place these breaking changes and still hold the master following the v1 series until we finish such adjustments. What do you think?

rafamanzo commented 8 years ago

Looking at changes I have to ask you the same for those three changes:

rafamanzo commented 8 years ago

Does it fixes https://github.com/mezuro/kalibro_processor/issues/212?

danielkza commented 8 years ago

Have you tested deletion instead of truncation? Which one is faster? What is the performance drop for leaving the transaction strategy? (https://github.com/DatabaseCleaner/database_cleaner#additional-activerecord-options-for-truncation)

I don't think : transaction is even a choice, since it causes incorrect results. As I mentioned in the commit message, changes from the Cucumber transaction should not be visible to the processing job - the default isolation level in Postgres (READ COMMITED) does not allow it.

We can use a tag to mark the scenarios that require processing and set their cleaner strategy, but they are actually a majority, so I just switched the default.

I haven't tested :deletion at all, but I'd guess we don't create enough records for it to matter too much.

Could you pull 4d0982e out into a new PR?

Sure.

It'd be nice because I'm thinking about creating a new branch (v2.0.0 ?) where we can place these breaking changes and still hold the master following the v1 series until we finish such adjustments. What do you think?

AFAIK none of the changes should be breaking. I made sure that any invalid data is cleaned up in the migration, and that data couldn't be accessed anyway, either because it was improperly referenced, or because it contained duplicates where the code assumes only one record exists (and acts accordingly).

I would prefer to fork the current master into a v1 branch instead of making the dev branch the fork - otherwise we'll make outdated code the most prominent version.

danielkza commented 8 years ago

Does it fixes #212?

It does AFAIK, I actually had to fix a bunch of integrity violations, some that were dependent on the order used by

rafamanzo commented 8 years ago

I'm being conservative about calling those changes breaking ones, because of the side effect on current installations not the API. There is the small risk of data loss or updates that require human action.

I agree about leaving the up to date code on master and fork it to v1.

About #212 when testing the separate PRs I'll try to test it too to figure out at which point it got fixed.

Great work and suggestions. Thanks.

danielkza commented 8 years ago

About splitting this up more: the cleaning, foreign key and indexes changes are somewhat tied. The cleaning migration only exists to allow the foreign keys to be created. Some of the indexes are in fact unique, and others exist to provide the opposite side to some foreign key.

It will also be quite complicated to ensure the tests are reliable in each separate step: fixing them all was quite a bit of work in itself, and I never tested it with any schema other than the final version.

Also, considering we will need to do tests with the production DB backups, doing it piecemeal seems like more work than it's really worth.

rafamanzo commented 8 years ago

OK, makes sense mostly.

At least the timestamp removal can we extract?

As one final try, is it possible to extract the new indexes regarding performance issues? I'd like to test them separately.

rafamanzo commented 8 years ago

Also, giving a second look to the changes it came to me if you'd also be gentle and make the tree walking methods modifications into a separate PR as well? Sorry about the annoyance.

rafamanzo commented 8 years ago

Finally this branch conflicts with master now.

danielkza commented 8 years ago

At least the timestamp removal can we extract?

Should be fine.

As one final try, is it possible to extract the new indexes regarding performance issues? I'd like to test them separately.

I think the easiest way to do that is probably to just edit the migrations and comment out the foreign keys. The TreeMetricResult unique index can also be commented, as considering only performance, it should be covered by the individual indexes in the columns.

Also, giving a second look to the changes it came to me if you'd also be gentle and make the tree walking methods modifications into a separate PR as well? Sorry about the annoyance.

These actually were not meant to be there, I think I accidentally added them by being careless with git-stash. Very observant, thanks! :eyes:

Finally this branch conflicts with master now.

I'll rebase and make the other changes in one go.

danielkza commented 8 years ago

Rebased to master, updated with split of timestamp changes (#219) and removal of the accidental module result change.

I tested the : deletion strategy, and unfortunately it isn't reliable: it disables constraints before cleaning, but that will not work if the user is not a super-user in PostgreSQL, and the cleaning will fail with integrity violations. :truncation works because tables are cleaned in a cascading manner if references exist.

rafamanzo commented 8 years ago

I have cherry-picked b95d15f into master.

The transaction time is: rake cucumber 75.08s user 6.12s system 46% cpu 2:52.88 total The truncation time is: rake cucumber 95.90s user 6.67s system 37% cpu 4:36.68 total

I'm still unsure if that is significant given how often we run acceptance tests and that KalibroProcessor does not seem to be growing much in the short/midterm.

How hard would be to make this truncation strategy into an tag used only where needed? That would ensure minimum impact on test run time and that the application can grow, if necessary, without this being an issue. And probably make that another PR :/

What do you think?

danielkza commented 8 years ago

Using a tag to select the strategy seems like a good idea, but I think splitting up the strategy change may not be necessary - it doesn't matter until the integrity constraints are added.

danielkza commented 8 years ago

I implemented the selective usage of the truncation strategy. Any further comments @mezuro/core?

diegoamc commented 8 years ago

I don't understand why you closed this. There is no need for a consensus to accept a PR. As I said, if I'm the only one unhappy, please accept it!

danielkza commented 8 years ago

I guess sometimes we need a little pushback to spur creativity. I changed the migration to take backups of all the deleted data. Tell me what you think @diegoamc.

danielkza commented 8 years ago

Forgot to rebase to master, done now.

rafamanzo commented 8 years ago

I've tried to add just the indexes for the db_indexes branch. There my aggregation perfomance test results have increased from ~35s to ~58s (process times with 8 metrics). It is small enough so we can live with this extra time, but we still need to investigate further to show with other tests that there are real performance improvements when reading data probably by creating new tests (https://github.com/mezuro/kalibro_processor/issues/229).

I think we should hold this until we get the proper discussion related to data consistency restrictions (https://github.com/mezuro/kalibro_processor/issues/228) and performance gain of indexes (https://github.com/mezuro/kalibro_processor/issues/229)