PCF Dev 0.17.0 "hangs" on start and does not complete

mar1ged commented 8 years ago

When I run PCF Dev it stays at 9 services for a long time and after 50 are started it does not proceed. The screen looks like this:

cf dev start
Using existing image
Allocating 4096 MB out of 48125 MB total system memory (44454 MB free).
Importing VM...
Starting VM...
Provisioning VM...
Waiting for services to start...
9 out of 50 running
9 out of 50 running
9 out of 50 running
9 out of 50 running
43 out of 50 running
50 out of 50 running

I am able to ssh into the box and cf api works, so perhaps this is just a problem of the start command not returning which causes the missing "PCF Dev is now running". When I interrupt the cf command with ^C both ssh and cf api still work.

I tried cf dev destroy already which does not solve the problem, I even cleaned the .pcfdev directory (except of the token file) and downloaded everything again.

EDIT: after various retries and reboots it suddenly worked. But the startup still takes 28 minutes (on a dual socket quadcore Xeon with 48 GB RAM and 8*SAS RAID ...) and stays at "9 out of 50 ..." for a long time. Because of this I cf dev destroyed the whole thing again and did a new cf dev start. After more than 40 minutes it still is at stage Waiting for services to start ..., which is even worse than before ;-) While this is message is on screen I can ssh into the box, but don't know what to check further ...

This is what I got from /var/pcfdev/run.log ...

+ source /var/pcfdev/common
++ monit=/var/vcap/bosh/bin/monit
+ domain=local.pcfdev.io
+ public_ip=192.168.11.11
+ [[ -x /var/pcfdev/pre-run ]]
+ /var/pcfdev/pre-run local.pcfdev.io
++ total_services
++ monit_summary
++ grep -E '^(Process|File|System)'
++ wc -l
+++ /var/vcap/bosh/bin/monit summary
++ output='monit: no status available -- the monit daemon is not running'
++ [[ monit: no status available -- the monit daemon is not running = *\e\r\r\o\r\ \c\o\n\n\e\c\t\i\n\g\ \t\o\ \t\h\e\ \m\o\n\i\t\ \d\a\e\m\o\n* ]]
++ echo 'monit: no status available -- the monit daemon is not running'
+ total=0
+ [[ 0 != 0 ]]
+ status runsvdir

... and ...

Process 'cf-mysql-route-registrar'  not monitored
Process 'quota-enforcer'            not monitored
Process 'rabbitmq-broker'           not monitored
Process 'rabbitmq-broker-route-registrar' not monitored
Process 'rabbitmq-management-route-registrar' not monitored
Process 'rabbitmq-server'           not monitored
System 'system_localhost'           not monitored = *\e\r\r\o\r\ \c\o\n\n\e\c\t\i\n\g\ \t\o\ \t\h\e\ \m\o\n\i\t\ \d\a\e\m\o\n* ]]
+ echo 'The Monit daemon 5.2.5 uptime: 44m

Process '\''pcfdev-nginx'\''              running
Process '\''mariadb_ctrl'\''              running
Process '\''galera-healthcheck'\''        running
Process '\''gra-log-purger-executable'\'' running
Process '\''garden'\''                    running
Process '\''etcd'\''                      running
Process '\''auctioneer'\''                not monitored
Process '\''cc_uploader'\''               not monitored
Process '\''converger'\''                 not monitored
Process '\''file_server'\''               not monitored

EDIT 2: I tried once again, now I have the same effect as at the beginning: the output stopped after50 out of 50 running` and nothing else is displayed. I started this cf more than 5 hours ago.

sclevine commented 8 years ago

Hi @mar1ged,

Once the OVA is downloaded, PCF Dev should start very quickly (usually in less than 10 minutes) on hardware that is much less powerful than the hardware you describe.

Are you running cf dev start from within another VM?

Is there any chance the I/O performance of the disk that holds the PCF Dev vmdk file is severely degraded? We've seen these types of errors on very slow disks.

-Stephen

mar1ged commented 8 years ago

@sclevine: No, I run this on bare metal (Hardware mentioned as above, CentOS 7.2 x64)

I guess I/O performance should not be a problem:

/dev/sda:
 Timing cached reads:   10342 MB in  2.00 seconds = 5175.38 MB/sec
 Timing buffered disk reads: 1728 MB in  3.00 seconds = 575.07 MB/sec

sclevine commented 8 years ago

Hi @mar1ged,

Can you provide the last 200 lines of run.log, with any sensitive info sanitized?

Thanks, Stephen

mar1ged commented 8 years ago

@sclevine The last hundreds of lines are identical to this:

++ available_buildpacks
++ cf curl /v2/buildpacks
++ jq '.resources | map(select(.entity.filename | length > 0)) | length'
+ [[ 7 -lt 8 ]]
+ sleep 1
++ available_buildpacks
++ cf curl /v2/buildpacks
++ jq '.resources | map(select(.entity.filename | length > 0)) | length'
+ [[ 7 -lt 8 ]]
+ sleep 1
++ available_buildpacks
++ cf curl /v2/buildpacks
++ jq '.resources | map(select(.entity.filename | length > 0)) | length'
+ [[ 7 -lt 8 ]]
+ sleep 1
++ available_buildpacks
++ cf curl /v2/buildpacks
++ jq '.resources | map(select(.entity.filename | length > 0)) | length'

The available_buildpacks block exists 1301 times in the log

sclevine commented 8 years ago

This means that a single buildpack (if I had to guess, the Java buildpack) cannot be registered with Cloud Controller. If you cf curl /v2/buildpacks, can you provide the output here?

If you're positive that your VM isn't running out of space (check df -h from inside the VM), you could tar up the /var/vcap/sys/log directory and send it to me at slevine@pivotal.io. Be aware that those logs may contain sensitive information that you may want to remove, depending on how you use your PCF Dev VM.

mar1ged commented 8 years ago

The output is (I run this inside the VM, correct ?):

No API endpoint set. Use 'cf login' or 'cf api' to target an endpoint.
FAILED
Error creating request:
Error performing request: Get /v2/buildpacks: unsupported protocol scheme ""

When I try to cf login from inside the VM I get:

vagrant@agent-id-pcfdev-0:~$ cf login

API endpoint> admin
a
    Something unexpected happened. This is a bug in cf.

    Please re-run the command that caused this exception with the environment
    variable CF_TRACE set to true.

    Also, please update to the latest cli and try the command again:
    https://github.com/cloudfoundry/cli/releases

    Please create an issue at: https://github.com/cloudfoundry/cli/issues

Is this the time when I send you the logs ? ;-)

sclevine commented 8 years ago

Sure, let me know if you need more detailed instructions.

Can you try cf curl /v2/buildpacks as the vcap user instead of the vagrant user?

mar1ged commented 8 years ago

Unfortunately this fails:

vagrant@agent-id-pcfdev-0:~$ su vcap
Password:
vcap@agent-id-pcfdev-0:/home/vagrant$ cf curl /v2/buildpacks
FAILED
Error read/writing config:  open /home/vcap/.cf/config.json: permission denied

sclevine commented 8 years ago

Apologies, this is a known issue in 0.17.0 that's fixed in the latest internal RCs.

You can fix the cf CLI with: $ chown -R vcap:vcap /home/vcap/.cf

mar1ged commented 8 years ago

Thanks @sclevine , now I get proper JSON and after a cf dev stop and cf dev start the whole process is faster (real 11m49.627s, not famous but ok ;-)) and no longer stops and the call returns to the command line properly.

Just out of curiosity I checked the run.log and saw this, please tell me if it is relevant and if I should open a new issue:

+ cf start apps-manager
Starting app apps-manager in org system / space system as admin...
Downloading staticfile_buildpack...
Downloaded staticfile_buildpack (2.5M)
Creating container
Successfully created container
Downloading app package...
Downloaded app package (1.3M)
Staging...
-------> Buildpack version 1.3.6
Downloaded [file:///tmp/buildpacks/dcb75c29aed8dc561d63627253ca3d82/dependencies/https___pivotal-buildpacks.s3.amazonaws.com_concourse-binaries_nginx_nginx-1.9.                                                                                14-linux-x64.tgz]
grep: Staticfile: No such file or directory
-----> Using root folder
-----> Copying project files into public/
-----> Setting up nginx
grep: Staticfile: No such file or directory
grep: Staticfile: No such file or directory
Exit status 0
Staging complete
Uploading droplet, build artifacts cache...
Uploading build artifacts cache...
Uploading droplet...
Uploaded build artifacts cache (109B)
Uploaded droplet (3.9M)
Uploading complete

Could not fetch instance count: Server error, status code: 503, error code: 2200 02, message: Instances information unavailable: Net::ReadTimeout

sclevine commented 8 years ago

Hi @mar1ged,

Are you unable to use Apps Manager after cf dev starting and encountering that error in the run.log?

Also, can you try out 0.18.0 and let me know if it improves these issues?

mar1ged commented 8 years ago

@sclevine I tried 0.18.0 today (after removing everything from the previous version and updating cf cli) and it again hangs after the 50 out of 50 running :-(

I can ssh into the box (which has now 192.168.22.11 btw) and it looks like we have the same issue with the buildpacks again:

+ sleep 1
++ available_buildpacks
++ cf curl /v2/buildpacks
++ jq '.resources | map(select(.entity.filename | length > 0)) | length'

Even after I chown and destroyed the VM (stop / start didn't work because it was defunct) the whole thing loops again.

mar1ged commented 8 years ago

@sclevine I gave the whole thing another try and purged everything from disk and downloaded the VM again.

Now the cf dev start has been at 46 out of 50 for more than half an hour and this is what I can see from the logs (I interrupted this today after 10 hours):

++ [[ The Monit daemon 5.2.5 uptime: 40m

Process 'pcfdev-nginx'              running
Process 'mariadb_ctrl'              running
Process 'galera-healthcheck'        running
Process 'gra-log-purger-executable' running
Process 'garden'                    running
Process 'etcd'                      running
Process 'auctioneer'                running
Process 'cc_uploader'               running
Process 'converger'                 running
Process 'file_server'               running
Process 'nsync_listener'            running
Process 'nsync_bulker'              running
Process 'rep'                       running
Process 'route_emitter'             running
Process 'ssh_proxy'                 running
Process 'stager'                    running
Process 'tps_listener'              running
Process 'tps_watcher'               running
Process 'bbs'                       running
Process 'syslog_drain_binder'       running
Process 'consul_agent'              not monitored
Process 'blobstore_nginx'           running
Process 'blobstore_url_signer'      running
Process 'metron_agent'              running
Process 'nats'                      running
Process 'route_registrar'           running
Process 'cloud_controller_clock'    running
Process 'cloud_controller_ng'       running
Process 'cloud_controller_worker_local_1' running
Process 'cloud_controller_worker_local_2' running
Process 'nginx_cc'                  running
Process 'cloud_controller_migration' running
Process 'cloud_controller_worker_1' running
Process 'uaa'                       running
Process 'doppler'                   running
Process 'gorouter'                  running
Process 'loggregator_trafficcontroller' Execution failed
Process 'syslog-configurator'       running
Process 'process-watcher'           running
Process 'cf-redis-broker'           running
Process 'broker-nginx'              running
Process 'cf-redis-route-registrar'  running
Process 'cf-mysql-broker'           Execution failed
Process 'cf-mysql-route-registrar'  running
Process 'quota-enforcer'            initializing
Process 'rabbitmq-broker'           running
Process 'rabbitmq-broker-route-registrar' running
Process 'rabbitmq-management-route-registrar' running
Process 'rabbitmq-server'           running
System 'system_localhost'           running = *\e\r\r\o\r\ \c\o\n\n\e\c\t\i\n\g\ \t\o\ \t\h\e\ \m\o\n\i\t\ \d\a\e\m\o\n* ]]
++ echo 'The Monit daemon 5.2.5 uptime: 40m

Process '\''pcfdev-nginx'\''              running
Process '\''mariadb_ctrl'\''              running
Process '\''galera-healthcheck'\''        running
Process '\''gra-log-purger-executable'\'' running
Process '\''garden'\''                    running
Process '\''etcd'\''                      running
Process '\''auctioneer'\''                running
Process '\''cc_uploader'\''               running
Process '\''converger'\''                 running
Process '\''file_server'\''               running
Process '\''nsync_listener'\''            running
Process '\''nsync_bulker'\''              running
Process '\''rep'\''                       running
Process '\''route_emitter'\''             running
Process '\''ssh_proxy'\''                 running
Process '\''stager'\''                    running
Process '\''tps_listener'\''              running
Process '\''tps_watcher'\''               running
Process '\''bbs'\''                       running
Process '\''syslog_drain_binder'\''       running
Process '\''consul_agent'\''              not monitored
Process '\''blobstore_nginx'\''           running
Process '\''blobstore_url_signer'\''      running
Process '\''metron_agent'\''              running
Process '\''nats'\''                      running
Process '\''route_registrar'\''           running
Process '\''cloud_controller_clock'\''    running
Process '\''cloud_controller_ng'\''       running
Process '\''cloud_controller_worker_local_1'\'' running
Process '\''cloud_controller_worker_local_2'\'' running
Process '\''nginx_cc'\''                  running
Process '\''cloud_controller_migration'\'' running
Process '\''cloud_controller_worker_1'\'' running
Process '\''uaa'\''                       running
Process '\''doppler'\''                   running
Process '\''gorouter'\''                  running
Process '\''loggregator_trafficcontroller'\'' Execution failed
Process '\''syslog-configurator'\''       running
Process '\''process-watcher'\''           running
Process '\''cf-redis-broker'\''           running
Process '\''broker-nginx'\''              running
Process '\''cf-redis-route-registrar'\''  running
Process '\''cf-mysql-broker'\''           Execution failed
Process '\''cf-mysql-route-registrar'\''  running
Process '\''quota-enforcer'\''            initializing
Process '\''rabbitmq-broker'\''           running
Process '\''rabbitmq-broker-route-registrar'\'' running
Process '\''rabbitmq-management-route-registrar'\'' running
Process '\''rabbitmq-server'\''           running
System '\''system_localhost'\''           running'
+ started=46
+ [[ 46 -lt 50 ]]
+ counter=1893
+ [[ 33 = 0 ]]
+ sleep 1
vcap@agent-id-pcfdev-0:/var/pcfdev$

mar1ged commented 8 years ago

@sclevine I destroyed the VM and started it again, now the whole thing is running:

time cf dev start
Using existing image.
Allocating 4096 MB out of 48125 MB total system memory (37528 MB free).
Importing VM...
Starting VM...
Provisioning VM...
Waiting for services to start...
9 out of 50 running
9 out of 50 running
9 out of 50 running
47 out of 50 running
50 out of 50 running
 _______  _______  _______    ______   _______  __   __
|       ||       ||       |  |      | |       ||  | |  |
|    _  ||       ||    ___|  |  _    ||    ___||  |_|  |
|   |_| ||       ||   |___   | | |   ||   |___ |       |
|    ___||      _||    ___|  | |_|   ||    ___||       |
|   |    |     |_ |   |      |       ||   |___  |     |
|___|    |_______||___|      |______| |_______|  |___|
is now running.
To begin using PCF Dev, please run:
        cf login -a https://api.local2.pcfdev.io --skip-ssl-validation
Admin user => Email: admin / Password: admin
Regular user => Email: user / Password: pass

real    27m28.979s
user    0m12.720s
sys     0m8.976s

Somehow this seems to be random: sometimes it fails, sometimes it succeeds. I was not able to find out what leads to which result. Unfortunately I can not log into the API:

 cf login -a https://api.local2.pcfdev.io --skip-ssl-validation -v
API endpoint: https://api.local2.pcfdev.io

REQUEST: [2016-07-27T08:19:22+02:00]
GET /v2/info HTTP/1.1
Host: api.local2.pcfdev.io
Accept: application/json
Content-Type: application/json
User-Agent: go-cli 6.20.0+25b1961 / linux

(cf just sits and waits)

sclevine commented 8 years ago

Closing this due to inactivity. Feel free to re-open if you're still having issues with the latest PCF Dev.

vmware-archive / pcfdev

PCF Dev 0.17.0 "hangs" on start and does not complete #93