populationgenomics / hail

Scalable genomic data analysis.
https://hail.is
MIT License

Add more detailed batch_database logging #319

Closed jmarshall closed 9 months ago

jmarshall commented 9 months ago

We have made two almost-successful attempts at a production Hail deployment today, and both of them happened to be running the batch_database job simultaneously. This job applies database migration scripts, but it has neither logging that clearly indicates which migrations are actually being applied nor database locking that guarantees concurrent migration runs cannot interfere with each other.
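For concreteness, here is a minimal sketch (not the actual create_database.py code; the function names and data shapes are assumptions for illustration) of the kind of per-migration logging this issue is asking for: distinguishing migrations that are skipped because they are already recorded from migrations that are actually being applied.

import hashlib
import logging

log = logging.getLogger('batch_database')

def script_sha1(path: str) -> str:
    with open(path, 'rb') as f:
        return hashlib.sha1(f.read()).hexdigest()

def apply_migrations(migrations, applied_versions, run_script):
    # `migrations`: ordered (version, spec) pairs like those shown in the logs below;
    # `applied_versions`: set of versions already recorded in batch_migrations;
    # `run_script`: callable that executes one migration script and raises on failure.
    for version, spec in migrations:
        sha = script_sha1(spec['script'])
        if version in applied_versions:
            log.info('migration %d %s already recorded (script_sha1 %s), skipping',
                     version, spec['name'], sha)
            continue
        log.info('actually applying migration %d %s (script_sha1 %s)',
                 version, spec['name'], sha)
        run_script(spec['script'])
        log.info('migration %d %s applied and recorded', version, spec['name'])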

Our first attempt was batch 429683, which finished with the following:

applying migration 108 {'name': 'dedup-agg-batch-resources', 'script': '/io/sql/dedup_agg_batch_resources.py', 'online': True}
script_sha1 d0b6892aa0e0c26828ba14f916589413148626f9
applying migration 109 {'name': 'setup-job-groups', 'script': '/io/sql/setup-job-groups.sql', 'online': True}
script_sha1 345df98fdb81d942eb2a35d9d36249cfc2ff6c7a
applying migration 110 {'name': 'dedup-job-resources', 'script': '/io/sql/dedup_job_resources.py', 'online': True}
script_sha1 78e581a2d6194a703ae87d0460c51557a69df90d
applying migration 111 {'name': 'fix-job-groups-state-enum', 'script': '/io/sql/fix-job-groups-state-enum.sql', 'online': True}
script_sha1 113e720377d1974237b2528e28841dc4ac9c16d2
applying migration 112 {'name': 'populate-job-groups', 'script': '/io/sql/populate_job_groups.py', 'online': True}
script_sha1 f7815e65b5ebb91a1a927a6d2fb5ed77eeabebfd
Traceback (most recent call last):
  File "//create_database.py", line 272, in <module>
[…]
hailtop.utils.process.CalledProcessError: Command ['/bin/bash', '-c', 'python3 /io/sql/populate_job_groups.py'] returned non-zero exit status 1. […]

Our second attempt was batch 429692, which finished with the following:

applying migration 108 {'name': 'dedup-agg-batch-resources', 'script': '/io/sql/dedup_agg_batch_resources.py', 'online': True}
script_sha1 d0b6892aa0e0c26828ba14f916589413148626f9
applying migration 109 {'name': 'setup-job-groups', 'script': '/io/sql/setup-job-groups.sql', 'online': True}
script_sha1 345df98fdb81d942eb2a35d9d36249cfc2ff6c7a
applying migration 110 {'name': 'dedup-job-resources', 'script': '/io/sql/dedup_job_resources.py', 'online': True}
script_sha1 78e581a2d6194a703ae87d0460c51557a69df90d
pymysql.err.IntegrityError: (1062, "Duplicate entry '112' for key 'batch_migrations.PRIMARY'")

This duplicate entry 112 indicates that migration 110 (dedup-job-resources, the migration this second job was attempting) had already been recorded as applied in batch_migrations, presumably by the first job.
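In this situation the bare pymysql.err.IntegrityError is the only clue that another batch_database job got there first. Below is a hedged sketch of how the recording step could report that more clearly; record_migration, its column names, and the surrounding code are assumptions for illustration, since only the batch_migrations table and its PRIMARY key on the migration version are known from the error above.

import pymysql

def record_migration(cursor, version: int, name: str, sha1: str):
    # Column names are illustrative only; the real INSERT lives in create_database.py.
    cursor.execute(
        'INSERT INTO batch_migrations (version, name, script_sha1) VALUES (%s, %s, %s)',
        (version, name, sha1))

def record_or_report(cursor, version, name, sha1, log):
    try:
        record_migration(cursor, version, name, sha1)
    except pymysql.err.IntegrityError as e:
        if e.args[0] == 1062:  # ER_DUP_ENTRY: the row already exists
            log.warning('migration %d %s is already recorded in batch_migrations; '
                        'another batch_database job has probably applied it',
                        version, name)
        # Preserve the existing failure behaviour, just with a clearer message first.
        raise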

Given that our previous deployment included migrations up to 108 (dedup-agg-batch-resources), only the last four migrations (109 through 112) should have been candidates for actually being applied by today's jobs. (However, the second job spent ~2 hours at 94 (dedup-billing-project-users-by-date), suggesting that that migration had not in fact been previously applied. :shrug:)

I have examined the last four migration scripts and the state of the production database (roughly along the lines of the query sketch after this list), and tentatively conclude that:

  1. setup-job-groups is mostly idempotent and the database state reflects its successful application;
  2. dedup-job-resources is probably more or less idempotent; it affects 147,377,473 rows and might well have been successfully applied;
  3. fix-job-groups-state-enum has probably been successfully applied (though I can't see any triggers via the mysql client) and was not attempted by the second job;
  4. populate-job-groups returned nonzero in the first job and was not attempted by the second job;
  5. batch_migration_version.version is 113, indicating that 111 (fix-job-groups-state-enum) is the last migration that has been recorded as successfully applied.
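For reference, the sort of manual inspection described above can be done with a few queries against the batch database. This is only a hedged sketch: the connection parameters and database name are placeholders, and only batch_migration_version (with its version column) and batch_migrations are named in this issue.

import pymysql

# Placeholder connection parameters; substitute the production batch database credentials.
conn = pymysql.connect(host='HOST', user='USER', password='PASSWORD', database='DATABASE')
with conn.cursor() as cursor:
    cursor.execute('SELECT version FROM batch_migration_version')
    print('batch_migration_version.version =', cursor.fetchone()[0])

    cursor.execute('SELECT COUNT(*) FROM batch_migrations')
    print('rows recorded in batch_migrations:', cursor.fetchone()[0])

    # Look for any triggers that fix-job-groups-state-enum might have created.
    cursor.execute('SHOW TRIGGERS')
    for row in cursor.fetchall():
        print(row)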

Consequently the database is likely in a consistent, though only partly migrated, state, and we can safely submit another batch in the expectation that it may run to completion this time, modulo the unexplained non-zero exit status from populate-job-groups. But we should clarify the logging as per this PR first.

Until these migrations have locking that ensures only one batch_database job runs at a time, we should check manually, before starting a new production deploy batch, that no previous batch's batch_database job is still running.
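Not part of this PR, but for the record: one hedged sketch of such locking would be to take a MySQL advisory lock for the duration of the migration run, so that a second batch_database job either waits or refuses to start. The lock name and timeout below are arbitrary, and apply_migrations is the hypothetical runner sketched earlier.

import contextlib
import pymysql

@contextlib.contextmanager
def migration_lock(conn, name='batch_database_migrations', timeout_s=600):
    # GET_LOCK returns 1 if the named lock was acquired, 0 on timeout, NULL on error.
    with conn.cursor() as cursor:
        cursor.execute('SELECT GET_LOCK(%s, %s)', (name, timeout_s))
        (acquired,) = cursor.fetchone()
    if acquired != 1:
        raise RuntimeError(f'another batch_database job appears to hold {name!r}; refusing to run migrations')
    try:
        yield
    finally:
        with conn.cursor() as cursor:
            cursor.execute('SELECT RELEASE_LOCK(%s)', (name,))

# Usage sketch:
# with migration_lock(conn):
#     apply_migrations(migrations, applied_versions, run_script)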