varfish-org / varfish-server

VarFish: comprehensive DNA variant analysis for diagnostics and research
MIT License

Problem with UNLOGGED tables in prebuilt database tarball #161

Closed holtgrewe closed 3 years ago

holtgrewe commented 3 years ago

Describe the bug When postgres crashes (e.g., out of memory), it tries to recover from the write-ahead log (WAL) files that are part of the prebuilt database tarball that we provide. Somehow these files appear to have been corrupted when we built them. It's not a problem with VarFish or postgres but with our data tarball.

To Reproduce Steps to reproduce the behavior:

  1. Kill postgres (during a query or import for good measure)
  2. Look into today's logs, e.g.,
[root@varfish-demo varfish-docker-compose]# cat volumes/postgres/data/pg_log/postgresql-Fri.log 
< 2021-06-04 10:16:34.119 CEST > LOG:  server process (PID 7271) was terminated by signal 9: Killed
< 2021-06-04 10:16:34.119 CEST > DETAIL:  Failed process was running: SELECT variants_smallvariant.id, variants_smallvariant.release, variants_smallvariant.chromosome, variants_smallvariant.chromosome_no, variants_smallvariant.start, variants_smallvariant."end", variants_smallvariant.bin, variants_smallvariant.reference, variants_smallvariant.alternative, variants_smallvariant.ensembl_gene_id AS ensembl_gene_id, variants_smallvariant.refseq_gene_id AS entrez_id, variants_smallvariant.refseq_transcript_id, variants_smallvariant.refseq_hgvs_c, variants_smallvariant.exac_frequency, variants_smallvariant.gnomad_exomes_frequency, variants_smallvariant.gnomad_genomes_frequency, variants_smallvariant.thousand_genomes_frequency, variants_smallvariant.exac_homozygous, variants_smallvariant.gnomad_exomes_homozygous, variants_smallvariant.gnomad_genomes_homozygous, variants_smallvariant.thousand_genomes_homozygous, variants_smallvariant.exac_heterozygous, variants_smallvariant.gnomad_exomes_heterozygous, variants_smallvariant.gnomad_genomes_heterozygous, variants_smallvariant.thousand_ge
< 2021-06-04 10:16:34.160 CEST > LOG:  terminating any other active server processes
< 2021-06-04 10:16:34.193 CEST > WARNING:  terminating connection because of crash of another server process
< 2021-06-04 10:16:34.193 CEST > DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
< 2021-06-04 10:16:34.193 CEST > HINT:  In a moment you should be able to reconnect to the database and repeat your command.
< 2021-06-04 10:16:34.194 CEST > LOG:  all server processes terminated; reinitializing
< 2021-06-04 10:16:34.306 CEST > LOG:  database system was interrupted; last known up at 2021-06-04 08:11:22 CEST
< 2021-06-04 10:16:35.435 CEST > LOG:  database system was not properly shut down; automatic recovery in progress
< 2021-06-04 10:17:03.823 CEST > LOG:  redo starts at 60/5DF53D58
< 2021-06-04 10:17:03.826 CEST > LOG:  invalid record length at 60/5DF5F610: wanted 24, got 0
< 2021-06-04 10:17:03.826 CEST > LOG:  redo done at 60/5DF5F5D8
< 2021-06-04 10:17:17.355 CEST > LOG:  database system is ready to accept connections
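
Step 2 can be scripted by grepping the day's log for the crash and recovery markers shown above. The sketch below saves an excerpt of that log to a temp file only so the command is self-contained; in a real deployment, point the path at `volumes/postgres/data/pg_log/` instead.

```shell
# Crash/recovery markers to look for in the postgres log after a kill.
# For a self-contained demo, save an excerpt of the log shown above:
cat > /tmp/postgresql-excerpt.log <<'EOF'
< 2021-06-04 10:16:34.119 CEST > LOG:  server process (PID 7271) was terminated by signal 9: Killed
< 2021-06-04 10:17:03.823 CEST > LOG:  redo starts at 60/5DF53D58
< 2021-06-04 10:17:03.826 CEST > LOG:  invalid record length at 60/5DF5F610: wanted 24, got 0
< 2021-06-04 10:17:17.355 CEST > LOG:  database system is ready to accept connections
EOF

# Print only the lines that indicate a crash and the subsequent WAL replay.
grep -E 'terminated by signal|invalid record length|redo (starts|done)' \
    /tmp/postgresql-excerpt.log
```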

Expected behavior With intact WAL files, postgres should recover cleanly.

Additional context We've seen this at two sites already. It's not yet clear to me whether it would occur twice with the same data download. One site re-downloaded the data tarball and saw the issue a second time.

holtgrewe commented 3 years ago

The problem might be in the postgres configuration after all. I sent an email to the postgres mailing list:

https://www.postgresql.org/message-id/dd6486da33414ed48a9a314441716f3c%40bih-charite.de

holtgrewe commented 3 years ago

The helpful people on the postgres mailing list pointed me in the right direction. For some reason, the tarball contains an unlogged table, while the same table is logged in our internal deployment.
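
This matches documented postgres behavior: UNLOGGED tables bypass the WAL entirely and are truncated during crash recovery. A quick way to check a database for unlogged tables is the `relpersistence` column of `pg_class` (a diagnostic sketch; the database name `varfish` is an assumption, adjust to your deployment):

```shell
# List unlogged tables: relpersistence = 'u' marks UNLOGGED relations,
# relkind = 'r' restricts the catalog query to ordinary tables.
# Database name "varfish" is an assumption.
psql varfish -c "SELECT relname FROM pg_class WHERE relpersistence = 'u' AND relkind = 'r';"
```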

Now we know the root cause and can fix it. We also need to look into why our deployments diverged.
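
The fix itself is a one-liner per affected table (a sketch, not necessarily how the tarball will be rebuilt; which table is actually unlogged has to be determined first, and `variants_smallvariant` is only an example taken from the query in the log above):

```shell
# Convert an unlogged table back to a logged one; this writes the whole
# table into the WAL, so it can take a while on large tables.
psql varfish -c "ALTER TABLE variants_smallvariant SET LOGGED;"
```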