vmware-archive / cfops

This is simply an automation that is based on the supported way to back up Pivotal Cloud Foundry
http://www.cfops.io
Apache License 2.0

ERT Restore fails using Postgres DB #81

Open jtgammon opened 8 years ago

jtgammon commented 8 years ago

CFOPS ERT restore is failing when the backend is Postgres. We did a manual restore using the cfops backup data and it worked, so the issue appears to be with the cfops restore. We believe it is the lack of the "--clean" option in the pg_restore command. The cfops restore left the Cloud Controller (CC) unable to start: the 1st instance of CC showed starting, the 2nd instance of CC showed stopped, and the CC worker showed failing.

We later tried a manual restore and it worked. Below are the working steps that we used:

ERT Restore log: ert.log.gz

error:
2016/04/27 10:50:21 E0427 10:50:21.097533 6309 execute_list.go:15] Process exited with: 1. Reason was: ()
2016/04/27 10:50:19 D0427 10:50:19.678913 6309 execute_list.go:12] PGPASSWORD=270a0e10f2739c8f /var/vcap/packages/postgres-9.4.2/bin/pg_restore -h localhost -U vcap -x -p 2544 -c -d uaa /tmp/archive.backup
Note: No "--clean" option

Working manual steps:
bosh stop cloud_controller-partition-d46fdf5eca88a5c00fc7 0
bosh stop cloud_controller-partition-d46fdf5eca88a5c00fc7 1

ccdb vcap@10.47.104.7 030df4fe78359255

./pg_restore -U vcap -p 2544 -d ccdb --clean /tmp/ccdb.sql

bosh stop uaa-partition-d46fdf5eca88a5c00fc7 0
bosh stop uaa-partition-d46fdf5eca88a5c00fc7 1

uaadb vcap@10.47.104.8 270a0e10f2739c8f

./pg_restore -U vcap -p 2544 -d uaa --clean /tmp/uaadb.backup
./pg_restore -U vcap -p 2544 -v -d console --clean /tmp/consoledb.backup
Note: Includes "--clean" option

NFS Remote restore:
cat nfs_server.backup | ssh vcap@host "tar xvzf - -C /var/vcap/store"

mysql/0 10.47.104.18

mysql -h localhost -u root -p < /tmp/mysql.backup
mysql -u root -p -h localhost

flush privileges;

bosh start uaa-partition-d46fdf5eca88a5c00fc7 0
bosh start uaa-partition-d46fdf5eca88a5c00fc7 1

bosh start cloud_controller-partition-d46fdf5eca88a5c00fc7 0
bosh start cloud_controller-partition-d46fdf5eca88a5c00fc7 1
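The ordering of the Postgres part of these manual steps can be sketched as a small dry-run script. This is only an illustration under stated assumptions: `DRY_RUN`, `run`, and `restore_sequence` are hypothetical names, not part of cfops or bosh, and in the steps above each pg_restore was run on its own database VM rather than from one machine. The NFS and MySQL steps are left out.

```shell
#!/usr/bin/env bash
# Sketch only: prints the stop / restore / start order used in the manual steps.
# DRY_RUN, run() and restore_sequence() are hypothetical helpers for illustration.

DRY_RUN="${DRY_RUN:-1}"           # default: print commands, do not execute them
PARTITION="d46fdf5eca88a5c00fc7"  # partition suffix taken from the post

run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"            # show the command that would be run
  else
    "$@"                          # execute it for real
  fi
}

restore_sequence() {
  # Stop both Cloud Controller instances, then restore ccdb.
  run bosh stop "cloud_controller-partition-${PARTITION}" 0
  run bosh stop "cloud_controller-partition-${PARTITION}" 1
  run ./pg_restore -U vcap -p 2544 -d ccdb --clean /tmp/ccdb.sql

  # Stop both UAA instances, then restore the uaa and console databases.
  run bosh stop "uaa-partition-${PARTITION}" 0
  run bosh stop "uaa-partition-${PARTITION}" 1
  run ./pg_restore -U vcap -p 2544 -d uaa --clean /tmp/uaadb.backup
  run ./pg_restore -U vcap -p 2544 -v -d console --clean /tmp/consoledb.backup

  # Start UAA first, then the Cloud Controller, as in the post.
  run bosh start "uaa-partition-${PARTITION}" 0
  run bosh start "uaa-partition-${PARTITION}" 1
  run bosh start "cloud_controller-partition-${PARTITION}" 0
  run bosh start "cloud_controller-partition-${PARTITION}" 1
}

restore_sequence
```

With the default `DRY_RUN=1` the script only prints the eleven commands, which makes it easy to review the order before running anything against a real deployment.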

jtgammon commented 8 years ago

I built out a lab with Postgres and ran the backup and restore. I did not have any pg_restore errors in the log, but I am seeing the cloud controller fail to start.

+----------------------------------------------------------------+----------+--------------------------------------------------------------+--------------+
| Job/index                                                      | State    | Resource Pool                                                | IPs          |
+----------------------------------------------------------------+----------+--------------------------------------------------------------+--------------+
| ccdb-partition-3f2d3e1323bb74aa36a1/0                          | running  | ccdb-partition-3f2d3e1323bb74aa36a1                          | 10.65.187.82 |
| clock_global-partition-3f2d3e1323bb74aa36a1/0                  | running  | clock_global-partition-3f2d3e1323bb74aa36a1                  | 10.65.187.42 |
| cloud_controller-partition-3f2d3e1323bb74aa36a1/0              | starting | cloud_controller-partition-3f2d3e1323bb74aa36a1              | 10.65.187.41 |
| cloud_controller_worker-partition-3f2d3e1323bb74aa36a1/0       | failing  | cloud_controller_worker-partition-3f2d3e1323bb74aa36a1       | 10.65.187.43 |
| consoledb-partition-3f2d3e1323bb74aa36a1/0                     | running  | consoledb-partition-3f2d3e1323bb74aa36a1                     | 10.65.187.84 |
| consul_server-partition-3f2d3e1323bb74aa36a1/0                 | running  | consul_server-partition-3f2d3e1323bb74aa36a1                 | 10.65.187.33 |
| diego_brain-partition-3f2d3e1323bb74aa36a1/0                   | running  | diego_brain-partition-3f2d3e1323bb74aa36a1                   | 10.65.187.45 |
| diego_cell-partition-3f2d3e1323bb74aa36a1/0                    | running  | diego_cell-partition-3f2d3e1323bb74aa36a1                    | 10.65.187.46 |
| diego_cell-partition-3f2d3e1323bb74aa36a1/1                    | running  | diego_cell-partition-3f2d3e1323bb74aa36a1                    | 10.65.187.47 |
| diego_database-partition-3f2d3e1323bb74aa36a1/0                | running  | diego_database-partition-3f2d3e1323bb74aa36a1                | 10.65.187.36 |
| doppler-partition-3f2d3e1323bb74aa36a1/0                       | running  | doppler-partition-3f2d3e1323bb74aa36a1                       | 10.65.187.48 |
| etcd_server-partition-3f2d3e1323bb74aa36a1/0                   | running  | etcd_server-partition-3f2d3e1323bb74aa36a1                   | 10.65.187.35 |
| ha_proxy-partition-3f2d3e1323bb74aa36a1/0                      | running  | ha_proxy-partition-3f2d3e1323bb74aa36a1                      | 10.65.187.32 |
| loggregator_trafficcontroller-partition-3f2d3e1323bb74aa36a1/0 | running  | loggregator_trafficcontroller-partition-3f2d3e1323bb74aa36a1 | 10.65.187.49 |
| mysql-partition-3f2d3e1323bb74aa36a1/0                         | running  | mysql-partition-3f2d3e1323bb74aa36a1                         | 10.65.187.40 |
| mysql_proxy-partition-3f2d3e1323bb74aa36a1/0                   | running  | mysql_proxy-partition-3f2d3e1323bb74aa36a1                   | 10.65.187.39 |
| nats-partition-3f2d3e1323bb74aa36a1/0                          | running  | nats-partition-3f2d3e1323bb74aa36a1                          | 10.65.187.34 |
| nfs_server-partition-3f2d3e1323bb74aa36a1/0                    | running  | nfs_server-partition-3f2d3e1323bb74aa36a1                    | 10.65.187.37 |
| router-partition-3f2d3e1323bb74aa36a1/0                        | running  | router-partition-3f2d3e1323bb74aa36a1                        | 10.65.187.38 |
| uaa-partition-3f2d3e1323bb74aa36a1/0                           | running  | uaa-partition-3f2d3e1323bb74aa36a1                           | 10.65.187.44 |
| uaadb-partition-3f2d3e1323bb74aa36a1/0                         | running  | uaadb-partition-3f2d3e1323bb74aa36a1                         | 10.65.187.83 |
+----------------------------------------------------------------+----------+--------------------------------------------------------------+--------------+

cloud_controller-partition-3f2d3e1323bb74aa36a1-0-3430133b0907.zip

cloud_controller_worker-partition-3f2d3e1323bb74aa36a1-0-589f5057db19.zip

jtgammon commented 8 years ago

Saw this in logs:

cloud_controller_worker_ctl.err.log:[2016-05-13 15:31:40+0000] Sequel::DatabaseError: PG::UndefinedTable: ERROR: relation delayed_jobs does not exist
cloud_controller_worker_ctl.err.log:[2016-05-13 15:31:40+0000] PG::UndefinedTable: ERROR: relation delayed_jobs does not exist
cloud_controller_worker_ctl.err.log:[2016-05-13 15:31:40+0000] Delayed::FatalBackendError: Delayed::FatalBackendError
cloud_controller_worker_ctl.err.log:[2016-05-13 15:31:40+0000] Sequel::DatabaseError: PG::UndefinedTable: ERROR: relation delayed_jobs does not exist
cloud_controller_worker_ctl.err.log:[2016-05-13 15:31:40+0000] PG::UndefinedTable: ERROR: relation delayed_jobs does not exist
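One way to confirm the error above after a restore is to ask Postgres directly whether the `delayed_jobs` table exists. The sketch below is an assumption-laden illustration: `table_check_sql` and `table_exists` are hypothetical helper names, the connection flags are copied from the pg_restore calls earlier in this thread, and it assumes `psql` is on the PATH of the ccdb VM and that the table lives in the `ccdb` database.

```shell
# Sketch only: check whether a table is present after the restore.
# table_check_sql() and table_exists() are hypothetical helpers.

table_check_sql() {
  # Print a query that counts tables with the given name ($1).
  printf "SELECT count(*) FROM information_schema.tables WHERE table_name = '%s';" "$1"
}

table_exists() {
  # $1 = database name, $2 = table name.
  # -A (unaligned) and -t (tuples only) make psql print just the number.
  count="$(psql -h localhost -U vcap -p 2544 -d "$1" -At -c "$(table_check_sql "$2")")"
  [ "$count" -gt 0 ]
}

# Example, to be run on the ccdb VM:
#   table_exists ccdb delayed_jobs && echo "present" || echo "missing"
```

If the table is reported missing right after a cfops restore but present after the manual restore, that would narrow the problem to how cfops invokes pg_restore.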

jtgammon commented 8 years ago

Wondering if we can try pg_restore without the -x option in cfops? It is the only difference between the working manual method and cfops.

calebwashburn commented 8 years ago

Not sure what all is included in a typical pg_dump, but it looks like -x is there to prevent restoring roles, which seems like it should be there if restoring from scratch... Let me know if you have a place to test this, and I can add a draft release without this option.
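For reference when comparing the two invocations: in the pg_restore documentation `-c` is the short form of `--clean`, and `-x` is the short form of `--no-privileges` (skip restoring GRANT/REVOKE access privileges). The sketch below normalizes the short options to their long names and lists which options appear only in the cfops command. `normalize_flags` and `only_in_first` are hypothetical helpers written for this comparison, and only option names are compared, not their values.

```shell
# Sketch only: compare the option names of two pg_restore invocations.
# normalize_flags() and only_in_first() are hypothetical helpers.

normalize_flags() {
  # Print one option name per line, mapping known short forms to long forms.
  # Option values and file paths (anything not starting with "-") are skipped.
  for arg in "$@"; do
    case "$arg" in
      -c) printf '%s\n' "--clean" ;;
      -x) printf '%s\n' "--no-privileges" ;;
      -v) printf '%s\n' "--verbose" ;;
      -*) printf '%s\n' "$arg" ;;
    esac
  done
}

only_in_first() {
  # $1 and $2 are newline-separated option lists; print entries of $1 missing from $2.
  printf '%s\n' "$1" | while IFS= read -r flag; do
    printf '%s\n' "$2" | grep -qxF -- "$flag" || printf '%s\n' "$flag"
  done
}

# Arguments copied from the cfops log line and the manual uaa restore in this thread.
cfops_flags="$(normalize_flags -h localhost -U vcap -x -p 2544 -c -d uaa /tmp/archive.backup)"
manual_flags="$(normalize_flags -U vcap -p 2544 -d uaa --clean /tmp/uaadb.backup)"

only_in_first "$cfops_flags" "$manual_flags"
```

Under these assumptions the options unique to the cfops command are `-h` and `--no-privileges`, since both commands already request a clean restore.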

jtgammon commented 8 years ago

I have my lab until Monday, so can test over the weekend. If that doesn't work I can ask for a 1 day extension.