Unable to start cloud controller workers after restore

aboik commented 8 years ago

Using v2.0.53. From /var/vcap/data/sys/log/cloud_controller_worker_ctl.err.log on the cloud controller worker partitions:

[2016-01-30 02:36:54+0000] ------------ STARTING cloud_controller_worker_ctl at Sat Jan 30 02:36:54 UTC 2016 --------------
[2016-01-30 02:36:54+0000] chown: changing ownership of ‘/var/vcap/nfs/shared’: Operation not permitted
[2016-01-30 02:36:54+0000] chown: changing ownership of ‘/var/vcap/nfs/shared’: Operation not permitted
[2016-01-30 02:36:54+0000] chown: changing ownership of ‘/var/vcap/nfs/shared’: Operation not permitted
[2016-01-30 02:36:56+0000] rake aborted!
[2016-01-30 02:36:56+0000] Sequel::DatabaseError: PG::InvalidSchemaName: ERROR: no schema has been selected to create in
[2016-01-30 02:36:56+0000] PG::InvalidSchemaName: ERROR: no schema has been selected to create in
[2016-01-30 02:36:56+0000] Tasks: TOP => jobs:generic

Seeing the following error repeated in the postgres log for ccdb as well as a similar error for uaadb:

2016-01-30 02:05:02.413 GMT: STATEMENT: CREATE TABLE "schema_migrations" ("filename" text PRIMARY KEY)
2016-01-30 02:05:02.559 GMT: ERROR: relation "schema_migrations" does not exist at character 27
2016-01-30 02:05:02.559 GMT: STATEMENT: SELECT NULL AS "nil" FROM "schema_migrations" LIMIT 1
2016-01-30 02:05:02.560 GMT: ERROR: no schema has been selected to create in
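
A note on reading the postgres excerpt: with default logging settings, Postgres writes the failing SQL on a STATEMENT line right after the ERROR line it belongs to, so this excerpt appears to begin and end mid-pair. The following is a minimal Python sketch of that pairing, assuming the line shape seen in the excerpt. The names `pair_errors` and `LINE_RE` are hypothetical, chosen for illustration, and are not part of cfops or Postgres.

```python
import re

# Assumed line shape, taken from the excerpt: "<date> <time> GMT: <LEVEL>: <message>"
LINE_RE = re.compile(r"^(\S+ \S+) GMT: (ERROR|STATEMENT): (.*)$")


def pair_errors(lines):
    """Pair each ERROR with the STATEMENT line that follows it.

    Either side of a pair is None when the excerpt cuts it off.
    """
    pairs = []
    pending_error = None
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip anything that is not an ERROR/STATEMENT entry
        _, level, message = match.groups()
        if level == "ERROR":
            if pending_error is not None:
                # Previous error never got a statement line
                pairs.append((pending_error, None))
            pending_error = message
        else:
            # STATEMENT closes the pending error (or stands alone)
            pairs.append((pending_error, message))
            pending_error = None
    if pending_error is not None:
        pairs.append((pending_error, None))
    return pairs


excerpt = [
    '2016-01-30 02:05:02.413 GMT: STATEMENT: CREATE TABLE "schema_migrations" ("filename" text PRIMARY KEY)',
    '2016-01-30 02:05:02.559 GMT: ERROR: relation "schema_migrations" does not exist at character 27',
    '2016-01-30 02:05:02.559 GMT: STATEMENT: SELECT NULL AS "nil" FROM "schema_migrations" LIMIT 1',
    '2016-01-30 02:05:02.560 GMT: ERROR: no schema has been selected to create in',
]

for error, statement in pair_errors(excerpt):
    print(f"{error or '(error cut off)'} <- {statement or '(statement cut off)'}")
```

Under that assumption, the CREATE TABLE statement at the top belongs to an ERROR line that was cut off, and the final "no schema has been selected to create in" error is missing its statement at the bottom.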

xchapter7x commented 8 years ago

Can you send over some information so we can help dig a bit on this issue:

version of cfops
version of PivotalCF (both before and after restore)
steps and commands run as part of the restore process
ER persistence configuration (mysql, postgres)
what command yielded the above error?

thanks

aboik commented 8 years ago

CFOps version 2.0.53
PCF version: 1.6.4 (before and after restore)
Restore steps: I first deployed a new ops manager to an empty cluster (same ops man version as the one from the backed up platform) and ran the cfops restore command for the ops-manager tile. This completed without error, and following that I clicked "Apply Changes", which also completed successfully. Then I ran the cfops restore command for the elastic-runtime tile, which also showed no errors:
LOG_LEVEL=debug ER_VERSION=1.6 ./cfops restore --du "pcfadmin" --dp "***" --omu "ubuntu" --omp "***" -d ~/backup_gtdcdev002/ -t "elastic-runtime" --omh "x.x.x.x"
ER is configured to use postgres for ccdb, uaadb, and consoledb

The above errors appeared after trying to start the cloud controller/cc workers. The bosh start <cc_job> and bosh start <cc_worker_job> commands I ran failed after a timeout period, and I investigated by ssh'ing to the cc worker vms and noticed the first error in the error log repeated - it kept trying to start the cloud controller worker and failed each time. I ssh'ed to the ccdb vm and noticed the second error in the /var/vcap/sys/log/postgres/postgresql.log, and a similar error appeared in the postgresql log on the uaadb vm. The consoledb had no such error in the postgres logs.

xchapter7x commented 8 years ago

the cc jobs should be stopped/started by cfops. what is the context in which we need to run bosh start or interact directly with bosh?

just trying to connect all the dots so i can more reliably reproduce your environment. let me know, thanks.

aboik commented 8 years ago

Well, I noticed after running cfops and waiting a while the cc jobs were still in a failing/starting state. I tried to start them manually to see what was preventing them from starting.

aboik commented 8 years ago

Closing this issue; see https://github.com/pivotalservices/cfops/issues/55 for the root cause.

vmware-archive / cfops

Unable to start cloud controller workers after restore #53