Closed bgandon closed 7 years ago
@jhunt said:
I'd be curious what types of messages are making it into your logs to contribute to the 27G/day. Can you characterize these messages, and possible supply some samples?
I have one single error pattern that is repeated at a rate of 3 or 4 per milli-second:
# tail postgres/postgresql.log
SELECT uuid FROM tasks
WHERE target_uuid = $1 AND status = $2 LIMIT 1
2016-11-14 15:37:38.283 GMT: ERROR: invalid input syntax for uuid: ""
2016-11-14 15:37:38.283 GMT: STATEMENT:
SELECT uuid FROM tasks
WHERE target_uuid = $1 AND status = $2 LIMIT 1
2016-11-14 15:37:38.283 GMT: ERROR: invalid input syntax for uuid: ""
2016-11-14 15:37:38.283 GMT: STATEMENT:
SELECT uuid FROM tasks
WHERE target_uuid = $1 AND status = $2 LIMIT 1
After making progress with #48, deleting the /var/vcap/store/postgres-9.4/pg_log/startup.log
and recreating the SHIELD VM with bosh deploy --recreate
doesn't solve this issue #50. The same error keep being repeated in /var/vcap/data/sys/log/postgres/postgresql.log
.
That definitely sounds like a SHIELD API bug.
Let's add configuration for managing the log level to the BOSH release, and I'll re-post the above errors to the SHIELD project proper so that we can get eyes on that. It sounds like it might be a purge task gone awry, but if so we should handle that case a little better.
Just to double-triple-check, what does shield -k status
output?
Here it is:
# ./shield -k status
Using http://10.244.10.2 (main) as SHIELD backend
Name: (none)
API Version: 0.7.2
Oh yes indeed, in my SHIELD dashboard, I have hundreds of pending purge tasks owned by the system. Is there some programatic way to cancel them all at once?
No, but if you get on-box, you can shut down the daemon, delete the tasks from the table, and start up the daemon.
I think this is solved with an update to 0.7.3. Verifying on my local deployment now, and if it fixes, will release 6.4.2 with these updates.
Cleared out the pending PURGE tasks, and stopped the log load.
@bgandon check out 6.4.2 and see if you see similar results to what I saw in the lab (purge tasks clear out at about 5/sec, log slows to ~0% growth)
I used this to command pipeline to ensure that the purge tasks were in fact clearing out:
watch 'shield -k ls tasks -a | grep pending | wc -l'
I didn't deleted the 2046 zombies tasks from the tables yet. I wonder wether it is related but the shield-daemon startup leads to a runtime panic.
2016-11-15 10:34:37.156653258 +0000 UTC shieldd: INFO: starting shield daemon
2016-11-15 10:34:37.178467203 +0000 UTC shieldd: INFO: initial run of blobstore_db_daily2amazon [1805f7d2-e365-47b7-99b5-f38692897ee3] which runs daily at 4:00 is at 2016-11-16 04:00:00 +0000 UTC
2016-11-15 10:34:37.17966568 +0000 UTC shieldd: INFO: initial run of cf_pg_dbs_weekly [18855e6e-b82f-4fad-ac96-200da7f53574] which runs daily at 4:00 is at 2016-11-16 04:00:00 +0000 UTC
2016-11-15 10:34:37.184711572 +0000 UTC shieldd: WARNING: Found task 19b8657a-be84-40fc-aaee-dbe4f7de7a09 in 'running' state at startup; setting to 'failed'
2016-11-15 10:34:37.187947738 +0000 UTC shieldd: WARNING: Found task 19ec2722-9fc5-44df-8cee-7c7828c8e408 in 'running' state at startup; setting to 'failed'
2016-11-15 10:34:37.192851992 +0000 UTC shieldd: WARNING: Found task 1a018edc-adfe-4b99-8a36-875f4d0bef16 in 'running' state at startup; setting to 'failed'
2016-11-15 10:34:37.195714926 +0000 UTC shieldd: WARNING: Found task 1a2624c8-a9de-4d04-9d47-7eeeda8a9323 in 'running' state at startup; setting to 'failed'
2016-11-15 10:34:37.197959066 +0000 UTC shieldd: WARNING: Found task 1a2c925f-2b67-4506-a7d3-a006a8e21056 in 'running' state at startup; setting to 'failed'
2016-11-15 10:34:37.228207045 +0000 UTC shieldd: WARNING: Found task 1a2e471e-44d2-4b74-9294-cce0b1f1032e in 'pending' state at startup; rescheduling
2016-11-15 10:34:37.228267667 +0000 UTC shieldd: INFO: schedule task 1a2e471e-44d2-4b74-9294-cce0b1f1032e with deadline {{63614846077 228250346 0x1158da0}}
... 2045 other tasks being rescheduled ...
2016-11-15 10:34:38.163233982 +0000 UTC shieldd: INFO: sent a task to a worker
2016-11-15 10:34:38.165199595 +0000 UTC shieldd: INFO: sent a task to a worker
2016-11-15 10:34:38.166557868 +0000 UTC shieldd: INFO: sent a task to a worker
2016-11-15 10:34:38.167507989 +0000 UTC shieldd: INFO: sent a task to a worker
2016-11-15 10:34:38.170438808 +0000 UTC shieldd: INFO: sent a task to a worker
2016-11-15 10:34:38.214733544 +0000 UTC shieldd: INFO: 1a2e471e-44d2-4b74-9294-cce0b1f1032e> Validating environment...
2016-11-15 10:34:38.220233964 +0000 UTC shieldd: INFO: 1a2e471e-44d2-4b74-9294-cce0b1f1032e> =========================
2016-11-15 10:34:38.222810269 +0000 UTC shieldd: INFO: 1a2e471e-44d2-4b74-9294-cce0b1f1032e> SHIELD_OP ... found
2016-11-15 10:34:38.224548479 +0000 UTC shieldd: INFO: 1a2e471e-44d2-4b74-9294-cce0b1f1032e> SHIELD_STORE_PLUGIN ... found
2016-11-15 10:34:38.225601685 +0000 UTC shieldd: INFO: 1a2e471e-44d2-4b74-9294-cce0b1f1032e> SHIELD_STORE_ENDPOINT ... found
2016-11-15 10:34:38.226621647 +0000 UTC shieldd: INFO: 1a2e471e-44d2-4b74-9294-cce0b1f1032e> SHIELD_RESTORE_KEY ... found
2016-11-15 10:34:38.230144162 +0000 UTC shieldd: INFO: 1a2e471e-44d2-4b74-9294-cce0b1f1032e> OK
2016-11-15 10:34:38.235910258 +0000 UTC shieldd: INFO: 1a2e471e-44d2-4b74-9294-cce0b1f1032e>
2016-11-15 10:34:38.237591756 +0000 UTC shieldd: INFO: 1a2e471e-44d2-4b74-9294-cce0b1f1032e> Validating STORE plugin `s3`...
2016-11-15 10:34:38.23865921 +0000 UTC shieldd: INFO: 1a2e471e-44d2-4b74-9294-cce0b1f1032e> ===============================
2016-11-15 10:34:38.244140075 +0000 UTC shieldd: INFO: 1a2e471e-44d2-4b74-9294-cce0b1f1032e> ✓ s3_host s3.amazonaws.com
2016-11-15 10:34:38.246930932 +0000 UTC shieldd: INFO: 1a2e471e-44d2-4b74-9294-cce0b1f1032e> ✓ access_key_id ************************
2016-11-15 10:34:38.249522602 +0000 UTC shieldd: INFO: 1a2e471e-44d2-4b74-9294-cce0b1f1032e> ✓ secret_access_key ************************************************
2016-11-15 10:34:38.250665485 +0000 UTC shieldd: INFO: 1a2e471e-44d2-4b74-9294-cce0b1f1032e> ✓ bucket ******-shield-backups
2016-11-15 10:34:38.251786298 +0000 UTC shieldd: INFO: 1a2e471e-44d2-4b74-9294-cce0b1f1032e> ✓ prefix main
2016-11-15 10:34:38.253030031 +0000 UTC shieldd: INFO: 1a2e471e-44d2-4b74-9294-cce0b1f1032e> ✓ signature_version 4
2016-11-15 10:34:38.253925117 +0000 UTC shieldd: INFO: 1a2e471e-44d2-4b74-9294-cce0b1f1032e> ✓ socks5_proxy (no proxy will be used)
2016-11-15 10:34:38.255369435 +0000 UTC shieldd: INFO: 1a2e471e-44d2-4b74-9294-cce0b1f1032e> ✓ skip_ssl_validation no, SSL WILL be validated
2016-11-15 10:34:38.257582006 +0000 UTC shieldd: INFO: 1a2e471e-44d2-4b74-9294-cce0b1f1032e> OK
2016-11-15 10:34:38.258491061 +0000 UTC shieldd: INFO: 1a2e471e-44d2-4b74-9294-cce0b1f1032e>
2016-11-15 10:34:38.260585181 +0000 UTC shieldd: INFO: 1a2e471e-44d2-4b74-9294-cce0b1f1032e> Running purge task
2016-11-15 10:34:38.262722604 +0000 UTC shieldd: INFO: 1a2e471e-44d2-4b74-9294-cce0b1f1032e> ==================
2016-11-15 10:34:38.268212687 +0000 UTC shieldd: INFO: 1a4c9b17-1c4c-4e3e-94ac-85503c9c62b6> Validating environment...
2016-11-15 10:34:38.270326643 +0000 UTC shieldd: INFO: 1a4c9b17-1c4c-4e3e-94ac-85503c9c62b6> =========================
2016-11-15 10:34:38.273456091 +0000 UTC shieldd: INFO: 1a4c9b17-1c4c-4e3e-94ac-85503c9c62b6> SHIELD_OP ... found
2016-11-15 10:34:38.274958796 +0000 UTC shieldd: INFO: 1a4c9b17-1c4c-4e3e-94ac-85503c9c62b6> SHIELD_STORE_PLUGIN ... found
2016-11-15 10:34:38.278265641 +0000 UTC shieldd: INFO: 1a4c9b17-1c4c-4e3e-94ac-85503c9c62b6> SHIELD_STORE_ENDPOINT ... found
2016-11-15 10:34:38.279855787 +0000 UTC shieldd: INFO: 1a4c9b17-1c4c-4e3e-94ac-85503c9c62b6> SHIELD_RESTORE_KEY ... found
2016-11-15 10:34:38.28242874 +0000 UTC shieldd: INFO: 1a4c9b17-1c4c-4e3e-94ac-85503c9c62b6> OK
2016-11-15 10:34:38.285142772 +0000 UTC shieldd: INFO: 1a4c9b17-1c4c-4e3e-94ac-85503c9c62b6>
2016-11-15 10:34:38.286467521 +0000 UTC shieldd: INFO: 1a4c9b17-1c4c-4e3e-94ac-85503c9c62b6> Validating STORE plugin `s3`...
2016-11-15 10:34:38.289595572 +0000 UTC shieldd: INFO: 1a4c9b17-1c4c-4e3e-94ac-85503c9c62b6> ===============================
2016-11-15 10:34:38.290351814 +0000 UTC shieldd: INFO: 1a4c9b17-1c4c-4e3e-94ac-85503c9c62b6> ✓ s3_host s3.amazonaws.com
2016-11-15 10:34:38.291372335 +0000 UTC shieldd: INFO: 1a4c9b17-1c4c-4e3e-94ac-85503c9c62b6> ✓ access_key_id ************************
2016-11-15 10:34:38.293137922 +0000 UTC shieldd: INFO: 1a4c9b17-1c4c-4e3e-94ac-85503c9c62b6> ✓ secret_access_key ************************************************
2016-11-15 10:34:38.293774316 +0000 UTC shieldd: INFO: 1a4c9b17-1c4c-4e3e-94ac-85503c9c62b6> ✓ bucket ******-shield-backups
2016-11-15 10:34:38.295672602 +0000 UTC shieldd: INFO: 1a4c9b17-1c4c-4e3e-94ac-85503c9c62b6> ✓ prefix main
2016-11-15 10:34:38.29734125 +0000 UTC shieldd: INFO: 1a4c9b17-1c4c-4e3e-94ac-85503c9c62b6> ✓ signature_version 4
2016-11-15 10:34:38.298167891 +0000 UTC shieldd: INFO: 1a4c9b17-1c4c-4e3e-94ac-85503c9c62b6> ✓ socks5_proxy (no proxy will be used)
2016-11-15 10:34:38.299058786 +0000 UTC shieldd: INFO: 1a4c9b17-1c4c-4e3e-94ac-85503c9c62b6> ✓ skip_ssl_validation no, SSL WILL be validated
2016-11-15 10:34:38.300927739 +0000 UTC shieldd: INFO: 1a4c9b17-1c4c-4e3e-94ac-85503c9c62b6> OK
2016-11-15 10:34:38.302014469 +0000 UTC shieldd: INFO: 1a4c9b17-1c4c-4e3e-94ac-85503c9c62b6>
2016-11-15 10:34:38.303040577 +0000 UTC shieldd: INFO: 1a4c9b17-1c4c-4e3e-94ac-85503c9c62b6> Running purge task
2016-11-15 10:34:38.304674025 +0000 UTC shieldd: INFO: 1a4c9b17-1c4c-4e3e-94ac-85503c9c62b6> ==================
2016-11-15 10:34:38.314888452 +0000 UTC shieldd: INFO: 1aadf10d-e02c-48db-aef0-dab51d4297d7> Validating environment...
2016-11-15 10:34:38.317494649 +0000 UTC shieldd: INFO: 1aadf10d-e02c-48db-aef0-dab51d4297d7> =========================
2016-11-15 10:34:38.31869117 +0000 UTC shieldd: INFO: 1aadf10d-e02c-48db-aef0-dab51d4297d7> SHIELD_OP ... found
2016-11-15 10:34:38.322298891 +0000 UTC shieldd: INFO: 1aadf10d-e02c-48db-aef0-dab51d4297d7> SHIELD_STORE_PLUGIN ... found
2016-11-15 10:34:38.324207793 +0000 UTC shieldd: INFO: 1aadf10d-e02c-48db-aef0-dab51d4297d7> SHIELD_STORE_ENDPOINT ... found
2016-11-15 10:34:38.325751844 +0000 UTC shieldd: INFO: 1aadf10d-e02c-48db-aef0-dab51d4297d7> SHIELD_RESTORE_KEY ... found
2016-11-15 10:34:38.327386968 +0000 UTC shieldd: INFO: 1aadf10d-e02c-48db-aef0-dab51d4297d7> OK
2016-11-15 10:34:38.328753063 +0000 UTC shieldd: INFO: 1aadf10d-e02c-48db-aef0-dab51d4297d7>
2016-11-15 10:34:38.329646473 +0000 UTC shieldd: INFO: 1aadf10d-e02c-48db-aef0-dab51d4297d7> Validating STORE plugin `s3`...
2016-11-15 10:34:38.330506359 +0000 UTC shieldd: INFO: 1aadf10d-e02c-48db-aef0-dab51d4297d7> ===============================
2016-11-15 10:34:38.332691832 +0000 UTC shieldd: INFO: 1aadf10d-e02c-48db-aef0-dab51d4297d7> ✓ s3_host s3.amazonaws.com
2016-11-15 10:34:38.334126092 +0000 UTC shieldd: INFO: 1aadf10d-e02c-48db-aef0-dab51d4297d7> ✓ access_key_id ************************
2016-11-15 10:34:38.334827299 +0000 UTC shieldd: INFO: 1aadf10d-e02c-48db-aef0-dab51d4297d7> ✓ secret_access_key ************************************************
2016-11-15 10:34:38.335642486 +0000 UTC shieldd: INFO: 1aadf10d-e02c-48db-aef0-dab51d4297d7> ✓ bucket ******-shield-backups
2016-11-15 10:34:38.337090715 +0000 UTC shieldd: INFO: 1aadf10d-e02c-48db-aef0-dab51d4297d7> ✓ prefix main
2016-11-15 10:34:38.33797295 +0000 UTC shieldd: INFO: 1aadf10d-e02c-48db-aef0-dab51d4297d7> ✓ signature_version 4
2016-11-15 10:34:38.338579452 +0000 UTC shieldd: INFO: 1aadf10d-e02c-48db-aef0-dab51d4297d7> ✓ socks5_proxy (no proxy will be used)
2016-11-15 10:34:38.340119592 +0000 UTC shieldd: INFO: 1aadf10d-e02c-48db-aef0-dab51d4297d7> ✓ skip_ssl_validation no, SSL WILL be validated
2016-11-15 10:34:38.342441395 +0000 UTC shieldd: INFO: 1aadf10d-e02c-48db-aef0-dab51d4297d7> OK
2016-11-15 10:34:38.344239668 +0000 UTC shieldd: INFO: 1aadf10d-e02c-48db-aef0-dab51d4297d7>
2016-11-15 10:34:38.347269662 +0000 UTC shieldd: INFO: 1aadf10d-e02c-48db-aef0-dab51d4297d7> Running purge task
2016-11-15 10:34:38.348842767 +0000 UTC shieldd: INFO: 1aadf10d-e02c-48db-aef0-dab51d4297d7> ==================
2016-11-15 10:34:38.370778516 +0000 UTC shieldd: INFO: 1ac9ba8d-e5c4-46df-a737-b2ef1360b8e9> Validating environment...
2016-11-15 10:34:38.372595227 +0000 UTC shieldd: INFO: 1ac9ba8d-e5c4-46df-a737-b2ef1360b8e9> =========================
2016-11-15 10:34:38.373959367 +0000 UTC shieldd: INFO: 1ac9ba8d-e5c4-46df-a737-b2ef1360b8e9> SHIELD_OP ... found
2016-11-15 10:34:38.374670072 +0000 UTC shieldd: INFO: 1ac9ba8d-e5c4-46df-a737-b2ef1360b8e9> SHIELD_STORE_PLUGIN ... found
2016-11-15 10:34:38.375551469 +0000 UTC shieldd: INFO: 1ac9ba8d-e5c4-46df-a737-b2ef1360b8e9> SHIELD_STORE_ENDPOINT ... found
2016-11-15 10:34:38.377291914 +0000 UTC shieldd: INFO: 1ac9ba8d-e5c4-46df-a737-b2ef1360b8e9> SHIELD_RESTORE_KEY ... found
2016-11-15 10:34:38.378264663 +0000 UTC shieldd: INFO: 1ac9ba8d-e5c4-46df-a737-b2ef1360b8e9> OK
2016-11-15 10:34:38.379026212 +0000 UTC shieldd: INFO: 1ac9ba8d-e5c4-46df-a737-b2ef1360b8e9>
2016-11-15 10:34:38.381498873 +0000 UTC shieldd: INFO: 1ac9ba8d-e5c4-46df-a737-b2ef1360b8e9> Validating STORE plugin `s3`...
2016-11-15 10:34:38.382997946 +0000 UTC shieldd: INFO: 1ac9ba8d-e5c4-46df-a737-b2ef1360b8e9> ===============================
2016-11-15 10:34:38.390293287 +0000 UTC shieldd: INFO: 1ac9ba8d-e5c4-46df-a737-b2ef1360b8e9> ✓ s3_host s3.amazonaws.com
2016-11-15 10:34:38.392498316 +0000 UTC shieldd: INFO: 1ac9ba8d-e5c4-46df-a737-b2ef1360b8e9> ✓ access_key_id ************************
2016-11-15 10:34:38.394092373 +0000 UTC shieldd: INFO: 1ac9ba8d-e5c4-46df-a737-b2ef1360b8e9> ✓ secret_access_key ************************************************
2016-11-15 10:34:38.395184691 +0000 UTC shieldd: INFO: 1ac9ba8d-e5c4-46df-a737-b2ef1360b8e9> ✓ bucket ******-shield-backups
2016-11-15 10:34:38.403321758 +0000 UTC shieldd: INFO: 1ac9ba8d-e5c4-46df-a737-b2ef1360b8e9> ✓ prefix main
2016-11-15 10:34:38.405575117 +0000 UTC shieldd: INFO: 1ac9ba8d-e5c4-46df-a737-b2ef1360b8e9> ✓ signature_version 4
2016-11-15 10:34:38.406397568 +0000 UTC shieldd: INFO: 1ac9ba8d-e5c4-46df-a737-b2ef1360b8e9> ✓ socks5_proxy (no proxy will be used)
2016-11-15 10:34:38.407114419 +0000 UTC shieldd: INFO: 1ac9ba8d-e5c4-46df-a737-b2ef1360b8e9> ✓ skip_ssl_validation no, SSL WILL be validated
2016-11-15 10:34:38.408719372 +0000 UTC shieldd: INFO: 1ac9ba8d-e5c4-46df-a737-b2ef1360b8e9> OK
2016-11-15 10:34:38.409554673 +0000 UTC shieldd: INFO: 1ac9ba8d-e5c4-46df-a737-b2ef1360b8e9>
2016-11-15 10:34:38.410567931 +0000 UTC shieldd: INFO: 1ac9ba8d-e5c4-46df-a737-b2ef1360b8e9> Running purge task
2016-11-15 10:34:38.411436477 +0000 UTC shieldd: INFO: 1ac9ba8d-e5c4-46df-a737-b2ef1360b8e9> ==================
2016-11-15 10:34:38.425916809 +0000 UTC shieldd: INFO: 1ac9eb3f-1a54-44e6-a783-059dd9b3fb2d> Validating environment...
2016-11-15 10:34:38.431492936 +0000 UTC shieldd: INFO: 1ac9eb3f-1a54-44e6-a783-059dd9b3fb2d> =========================
2016-11-15 10:34:38.438327325 +0000 UTC shieldd: INFO: 1ac9eb3f-1a54-44e6-a783-059dd9b3fb2d> SHIELD_OP ... found
2016-11-15 10:34:38.440518665 +0000 UTC shieldd: INFO: 1ac9eb3f-1a54-44e6-a783-059dd9b3fb2d> SHIELD_STORE_PLUGIN ... found
2016-11-15 10:34:38.442454106 +0000 UTC shieldd: INFO: 1ac9eb3f-1a54-44e6-a783-059dd9b3fb2d> SHIELD_STORE_ENDPOINT ... found
2016-11-15 10:34:38.443315668 +0000 UTC shieldd: INFO: 1ac9eb3f-1a54-44e6-a783-059dd9b3fb2d> SHIELD_RESTORE_KEY ... found
2016-11-15 10:34:38.446924786 +0000 UTC shieldd: INFO: 1ac9eb3f-1a54-44e6-a783-059dd9b3fb2d> OK
2016-11-15 10:34:38.451229802 +0000 UTC shieldd: INFO: 1ac9eb3f-1a54-44e6-a783-059dd9b3fb2d>
2016-11-15 10:34:38.453961434 +0000 UTC shieldd: INFO: 1ac9eb3f-1a54-44e6-a783-059dd9b3fb2d> Validating STORE plugin `s3`...
2016-11-15 10:34:38.455429778 +0000 UTC shieldd: INFO: 1ac9eb3f-1a54-44e6-a783-059dd9b3fb2d> ===============================
2016-11-15 10:34:38.458977714 +0000 UTC shieldd: INFO: 1ac9eb3f-1a54-44e6-a783-059dd9b3fb2d> ✓ s3_host s3.amazonaws.com
2016-11-15 10:34:38.461570502 +0000 UTC shieldd: INFO: 1ac9eb3f-1a54-44e6-a783-059dd9b3fb2d> ✓ access_key_id ************************
2016-11-15 10:34:38.464491823 +0000 UTC shieldd: INFO: 1ac9eb3f-1a54-44e6-a783-059dd9b3fb2d> ✓ secret_access_key ************************************************
2016-11-15 10:34:38.466374465 +0000 UTC shieldd: INFO: 1ac9eb3f-1a54-44e6-a783-059dd9b3fb2d> ✓ bucket ******-shield-backups
2016-11-15 10:34:38.468434224 +0000 UTC shieldd: INFO: 1ac9eb3f-1a54-44e6-a783-059dd9b3fb2d> ✓ prefix main
2016-11-15 10:34:38.471512268 +0000 UTC shieldd: INFO: 1ac9eb3f-1a54-44e6-a783-059dd9b3fb2d> ✓ signature_version 4
2016-11-15 10:34:38.473320878 +0000 UTC shieldd: INFO: 1ac9eb3f-1a54-44e6-a783-059dd9b3fb2d> ✓ socks5_proxy (no proxy will be used)
2016-11-15 10:34:38.474537793 +0000 UTC shieldd: INFO: 1ac9eb3f-1a54-44e6-a783-059dd9b3fb2d> ✓ skip_ssl_validation no, SSL WILL be validated
2016-11-15 10:34:38.477743228 +0000 UTC shieldd: INFO: 1ac9eb3f-1a54-44e6-a783-059dd9b3fb2d> OK
2016-11-15 10:34:38.479771698 +0000 UTC shieldd: INFO: 1ac9eb3f-1a54-44e6-a783-059dd9b3fb2d>
2016-11-15 10:34:38.486385108 +0000 UTC shieldd: INFO: 1ac9eb3f-1a54-44e6-a783-059dd9b3fb2d> Running purge task
2016-11-15 10:34:38.490178887 +0000 UTC shieldd: INFO: 1ac9eb3f-1a54-44e6-a783-059dd9b3fb2d> ==================
2016-11-15 10:34:39.009166914 +0000 UTC shieldd: INFO: 1a2e471e-44d2-4b74-9294-cce0b1f1032e>
2016-11-15 10:34:39.011679245 +0000 UTC shieldd: INFO: 1a2e471e-44d2-4b74-9294-cce0b1f1032e> EXITING 0
2016-11-15 10:34:39.01546101 +0000 UTC shieldd: INFO: 1a2e471e-44d2-4b74-9294-cce0b1f1032e: archive 17046b7d-2767-43dd-863a-51a109037866 purged from storage
panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xb code=0x1 addr=0x0 pc=0x705ae6]
goroutine 1 [running]:
panic(0xb0fd20, 0xc8200100a0)
/goroot/src/runtime/panic.go:481 +0x3e6
github.com/starkandwayne/shield/db.(*DB).PurgeArchive(0xc82006bd40, 0xc8202ebfa0, 0x10, 0x10, 0x0, 0x0)
/tmp/build/01f561ac/gopath/src/github.com/starkandwayne/shield/db/archives.go:227 +0x406
github.com/starkandwayne/shield/supervisor.(*Supervisor).Run(0xc8201180c0, 0x0, 0x0)
/tmp/build/01f561ac/gopath/src/github.com/starkandwayne/shield/supervisor/supervisor.go:362 +0x2828
main.main()
/tmp/build/01f561ac/gopath/src/github.com/starkandwayne/shield/cmd/shieldd/main.go:67 +0x51e
I first set all tasks statuses to canceled
with an UPDATE tasks SET status='canceled' WHERE status='pending' OR status='running';
but those cancelled tasks were still being displayed as running in the dashboard, so I followed your advice and just deleted them with DELETE FROM tasks where status='canceled';
Anyway, the shield-daemon now starts successfully. There is no more errors in thepostgresql.log
that is small now, and startup.log
is tiny.
But while running new backups, I'm having issues when the CF blobstore backup (≈1.7 GB) is written to Amazon S3. The backup job fails with a Client.Timeout exceeded while awaiting headers
error:
2016-11-15 11:19:53.219658621 +0000 UTC shieldd: INFO: aaaf10c4-020b-4aad-805c-e6a5f7b1828e> Put https://******-shield-backups.s3-eu-central-1.amazonaws.com/main/2016/11/15/2016-11-15-111658-f064d2e4-b0c9-4197-b341-96eaa769d4b9?partNumber=1&uploadId=MMBxCCG9uQMZ9..fI3tEB5m5wjaprg93ZuBAAr_hleJuC94oxKUbUYpeJTxNbZ.bcd5O3sqBKQWpikKff2AZcJCw2ku0r4H.DgNdqDbjHw4TJETfmg4ep6sidbAcT7Ou: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
2016-11-15 11:19:53.308476334 +0000 UTC shieldd: INFO: aaaf10c4-020b-4aad-805c-e6a5f7b1828e> bsdtar: Write error
2016-11-15 11:19:53.311802175 +0000 UTC shieldd: INFO: aaaf10c4-020b-4aad-805c-e6a5f7b1828e> Unable to exec '/var/vcap/packages/bsdtar/bin/bsdtar': exit status 1
2016-11-15 11:19:53.313477807 +0000 UTC shieldd: INFO: aaaf10c4-020b-4aad-805c-e6a5f7b1828e> TASK FAILED!! shield worker 5 failed to execute the command against the remote agent 10.244.0.130:5444 (Process exited with: 4. Reason was: ())
2016-11-15 11:19:53.314677122 +0000 UTC shieldd: INFO: aaaf10c4-020b-4aad-805c-e6a5f7b1828e>
2016-11-15 11:19:53.315581147 +0000 UTC shieldd: INFO: aaaf10c4-020b-4aad-805c-e6a5f7b1828e> EXITING 4
2016-11-15 11:19:53.316901706 +0000 UTC shieldd: INFO: aaaf10c4-020b-4aad-805c-e6a5f7b1828e> TASK FAILED!! No restore key detected in worker 5. Cowardly refusing to create an archive record
2016-11-15 11:19:53.317783941 +0000 UTC shieldd: WARNING: aaaf10c4-020b-4aad-805c-e6a5f7b1828e: task failed!
I wonder whether this is related to the SHIELD update or not. I'll try to rollback to the previous version, try again, and keep you updated.
So, rolling back to version 6.4.0, my scheduled backup jobs ran successfully during the night.
The purge tasks went awry like before, producing GB of errors in postgresql.log
, and no purging actually occurred. Once I deleted those zombie tasks (with the shield daemon stopped) the system stabilized (until a new purge is scheduled I guess).
Deploying back version 6.4.2, purge tasks run well. But scheduling the adhoc fs
-based blobstore backup keeps exiting with similar write: connection reset by peer
error (but not exactly the same as above, actually).
2016-11-16 07:33:43.353986061 +0000 UTC shieldd: INFO: 6d062a3b-dfe3-4270-88d5-9a79ca7294a9> Running backup task (using bzip2 compression)
2016-11-16 07:33:43.355337071 +0000 UTC shieldd: INFO: 6d062a3b-dfe3-4270-88d5-9a79ca7294a9> =============================================
2016-11-16 07:41:50.110411753 +0000 UTC shieldd: INFO: 6d062a3b-dfe3-4270-88d5-9a79ca7294a9> Put https://******-shield-backups.s3-eu-central-1.amazonaws.com/main/2016/11/16/2016-11-16-073343-72b709ea-22f4-4c8c-8235-782d13d0e726?partNumber=3&uploadId=RJ5CE5_Wams0kmJIzkrur1Aducii16YWx86tnKAUysITAVF3ebNuGh.jM4s8hbNwjSzDGCpGdw92qbUR94y40wljH2vFHC0y0RsSsxh9b43s4jHi3AkcKA6n17SsfCKB: write tcp 10.244.0.130:38135->54.231.194.54:443: write: connection reset by peer
2016-11-16 07:41:50.169451658 +0000 UTC shieldd: INFO: 6d062a3b-dfe3-4270-88d5-9a79ca7294a9> bsdtar: Write error
2016-11-16 07:41:50.173819811 +0000 UTC shieldd: INFO: 6d062a3b-dfe3-4270-88d5-9a79ca7294a9> Unable to exec '/var/vcap/packages/bsdtar/bin/bsdtar': exit status 1
2016-11-16 07:41:50.175665018 +0000 UTC shieldd: INFO: 6d062a3b-dfe3-4270-88d5-9a79ca7294a9>
2016-11-16 07:41:50.177224713 +0000 UTC shieldd: INFO: 6d062a3b-dfe3-4270-88d5-9a79ca7294a9> TASK FAILED!! shield worker 3 failed to execute the command against the remote agent 10.244.0.130:5444 (Process exited with: 4. Reason was: ())
2016-11-16 07:41:50.178619304 +0000 UTC shieldd: INFO: 6d062a3b-dfe3-4270-88d5-9a79ca7294a9> EXITING 4
2016-11-16 07:41:50.179628371 +0000 UTC shieldd: INFO: 6d062a3b-dfe3-4270-88d5-9a79ca7294a9> TASK FAILED!! No restore key detected in worker 3. Cowardly refusing to create an archive record
2016-11-16 07:41:50.181139736 +0000 UTC shieldd: WARNING: 6d062a3b-dfe3-4270-88d5-9a79ca7294a9: task failed!
Back to 6.4.0 again, scheduling the adhoc fs
-based blobstore backup works fine and exits with no errors.
Sounds like there may have been a regression between SHIELD 0.7.2 and SHIELD 0.7.3 insofar as the fs plugin is concerned?
What does this command output (run from the SHIELD agent that will be running the fs
plugin)?:
find -L /var/vcap/packages -name bsdtar 2>/dev/null
Here is the result:
# find -L /var/vcap/packages -name bsdtar -ls 2>/dev/null
9311004 4 drwxr-xr-x 6 root root 4096 Oct 19 06:11 /var/vcap/packages/bsdtar
9311008 2860 -rwxr-xr-x 1 root root 2924772 Oct 19 06:11 /var/vcap/packages/bsdtar/bin/bsdtar
# /var/vcap/packages/bsdtar/bin/bsdtar --version
bsdtar 3.1.2 - libarchive 3.1.2
You're right, and like you I was suspecting a issue there. But I had a look at the changes between releases 6.4.0 and 6.4.2 and between SHIELD 07.2 and 0.7.3 and I don't see any reason for a regression on bsdtar
. Plus, the bsdtar
blob hasn't changed.
This morning my nightly backups have run fine with the new version, and adhoc scheduled backups also run successfully. I must have been fooled by some flapping issues, related to S3 or the network in between, which happened to show up only when I was running the new version.
I'm closing this issue. Thanks a lot for the active support!
On the
shield/0
VM, the postgres job produces 27+ GB of logs each day. This fills my production disk very quickly.For example, I recreated the
shield/0
VM yesterday and here is what I have:I didn't find a way to control the verbosity of these logs in the
postgres-9.4
job spec. Is there any other way? Would it be possible for the BOSH release to implement some rolling policies on these log files? Or is it up to the deployer to implement some BOSH add-on that does the trick?(This issue was originally part of issue #48.)