shieldproject / shield-boshrelease

BOSH Release for shield
MIT License
11 stars 17 forks source link

v6.4.0: heavy postgres logs #50

Closed bgandon closed 7 years ago

bgandon commented 8 years ago

On the shield/0 VM, the postgres job produces 27+ GB of logs each day. This fills my production disk very quickly.

For example, I recreated the shield/0 VM yesterday and here is what I have:

# du -sh /var/vcap/data/sys/log/postgres/postgresql.log
27G /var/vcap/data/sys/log/postgres/postgresql.log

I didn't find a way to control the verbosity of these logs in the postgres-9.4 job spec. Is there any other way? Would it be possible for the BOSH release to implement some rolling policies on these log files? Or is it up to the deployer to implement some BOSH add-on that does the trick?

(This issue was originally part of issue #48.)

bgandon commented 8 years ago

@jhunt said:

I'd be curious what types of messages are making it into your logs to contribute to the 27G/day. Can you characterize these messages, and possible supply some samples?

I have one single error pattern that is repeated at a rate of 3 or 4 per milli-second:

# tail postgres/postgresql.log
            SELECT uuid FROM tasks
              WHERE target_uuid = $1 AND status = $2 LIMIT 1
2016-11-14 15:37:38.283 GMT: ERROR:  invalid input syntax for uuid: ""
2016-11-14 15:37:38.283 GMT: STATEMENT:  
            SELECT uuid FROM tasks
              WHERE target_uuid = $1 AND status = $2 LIMIT 1
2016-11-14 15:37:38.283 GMT: ERROR:  invalid input syntax for uuid: ""
2016-11-14 15:37:38.283 GMT: STATEMENT:  
            SELECT uuid FROM tasks
              WHERE target_uuid = $1 AND status = $2 LIMIT 1
bgandon commented 8 years ago

After making progress with #48, deleting the /var/vcap/store/postgres-9.4/pg_log/startup.log and recreating the SHIELD VM with bosh deploy --recreate doesn't solve this issue #50. The same error keep being repeated in /var/vcap/data/sys/log/postgres/postgresql.log.

jhunt commented 8 years ago

That definitely sounds like a SHIELD API bug.

Let's add configuration for managing the log level to the BOSH release, and I'll re-post the above errors to the SHIELD project proper so that we can get eyes on that. It sounds like it might be a purge task gone awry, but if so we should handle that case a little better.

Just to double-triple-check, what does shield -k status output?

bgandon commented 8 years ago

Here it is:

# ./shield -k status
Using http://10.244.10.2 (main) as SHIELD backend

Name:        (none)
API Version: 0.7.2
bgandon commented 8 years ago

Oh yes indeed, in my SHIELD dashboard, I have hundreds of pending purge tasks owned by the system. Is there some programatic way to cancel them all at once?

jhunt commented 8 years ago

No, but if you get on-box, you can shut down the daemon, delete the tasks from the table, and start up the daemon.

jhunt commented 8 years ago

I think this is solved with an update to 0.7.3. Verifying on my local deployment now, and if it fixes, will release 6.4.2 with these updates.

jhunt commented 8 years ago

Cleared out the pending PURGE tasks, and stopped the log load.

jhunt commented 8 years ago

@bgandon check out 6.4.2 and see if you see similar results to what I saw in the lab (purge tasks clear out at about 5/sec, log slows to ~0% growth)

I used this to command pipeline to ensure that the purge tasks were in fact clearing out:

watch 'shield -k ls tasks -a | grep pending | wc -l'
bgandon commented 8 years ago

I didn't deleted the 2046 zombies tasks from the tables yet. I wonder wether it is related but the shield-daemon startup leads to a runtime panic.

2016-11-15 10:34:37.156653258 +0000 UTC shieldd: INFO: starting shield daemon
2016-11-15 10:34:37.178467203 +0000 UTC shieldd: INFO: initial run of blobstore_db_daily2amazon [1805f7d2-e365-47b7-99b5-f38692897ee3] which runs daily at 4:00 is at 2016-11-16 04:00:00 +0000 UTC
2016-11-15 10:34:37.17966568 +0000 UTC shieldd: INFO: initial run of cf_pg_dbs_weekly [18855e6e-b82f-4fad-ac96-200da7f53574] which runs daily at 4:00 is at 2016-11-16 04:00:00 +0000 UTC
2016-11-15 10:34:37.184711572 +0000 UTC shieldd: WARNING: Found task 19b8657a-be84-40fc-aaee-dbe4f7de7a09 in 'running' state at startup; setting to 'failed'
2016-11-15 10:34:37.187947738 +0000 UTC shieldd: WARNING: Found task 19ec2722-9fc5-44df-8cee-7c7828c8e408 in 'running' state at startup; setting to 'failed'
2016-11-15 10:34:37.192851992 +0000 UTC shieldd: WARNING: Found task 1a018edc-adfe-4b99-8a36-875f4d0bef16 in 'running' state at startup; setting to 'failed'
2016-11-15 10:34:37.195714926 +0000 UTC shieldd: WARNING: Found task 1a2624c8-a9de-4d04-9d47-7eeeda8a9323 in 'running' state at startup; setting to 'failed'
2016-11-15 10:34:37.197959066 +0000 UTC shieldd: WARNING: Found task 1a2c925f-2b67-4506-a7d3-a006a8e21056 in 'running' state at startup; setting to 'failed'
2016-11-15 10:34:37.228207045 +0000 UTC shieldd: WARNING: Found task 1a2e471e-44d2-4b74-9294-cce0b1f1032e in 'pending' state at startup; rescheduling
2016-11-15 10:34:37.228267667 +0000 UTC shieldd: INFO: schedule task 1a2e471e-44d2-4b74-9294-cce0b1f1032e with deadline {{63614846077 228250346 0x1158da0}}

... 2045 other tasks being rescheduled ...

2016-11-15 10:34:38.163233982 +0000 UTC shieldd: INFO: sent a task to a worker
2016-11-15 10:34:38.165199595 +0000 UTC shieldd: INFO: sent a task to a worker
2016-11-15 10:34:38.166557868 +0000 UTC shieldd: INFO: sent a task to a worker
2016-11-15 10:34:38.167507989 +0000 UTC shieldd: INFO: sent a task to a worker
2016-11-15 10:34:38.170438808 +0000 UTC shieldd: INFO: sent a task to a worker
2016-11-15 10:34:38.214733544 +0000 UTC shieldd: INFO:   1a2e471e-44d2-4b74-9294-cce0b1f1032e> Validating environment...
2016-11-15 10:34:38.220233964 +0000 UTC shieldd: INFO:   1a2e471e-44d2-4b74-9294-cce0b1f1032e> =========================
2016-11-15 10:34:38.222810269 +0000 UTC shieldd: INFO:   1a2e471e-44d2-4b74-9294-cce0b1f1032e> SHIELD_OP ... found
2016-11-15 10:34:38.224548479 +0000 UTC shieldd: INFO:   1a2e471e-44d2-4b74-9294-cce0b1f1032e> SHIELD_STORE_PLUGIN ... found
2016-11-15 10:34:38.225601685 +0000 UTC shieldd: INFO:   1a2e471e-44d2-4b74-9294-cce0b1f1032e> SHIELD_STORE_ENDPOINT ... found
2016-11-15 10:34:38.226621647 +0000 UTC shieldd: INFO:   1a2e471e-44d2-4b74-9294-cce0b1f1032e> SHIELD_RESTORE_KEY ... found
2016-11-15 10:34:38.230144162 +0000 UTC shieldd: INFO:   1a2e471e-44d2-4b74-9294-cce0b1f1032e> OK
2016-11-15 10:34:38.235910258 +0000 UTC shieldd: INFO:   1a2e471e-44d2-4b74-9294-cce0b1f1032e> 
2016-11-15 10:34:38.237591756 +0000 UTC shieldd: INFO:   1a2e471e-44d2-4b74-9294-cce0b1f1032e> Validating STORE plugin `s3`...
2016-11-15 10:34:38.23865921 +0000 UTC shieldd: INFO:   1a2e471e-44d2-4b74-9294-cce0b1f1032e> ===============================
2016-11-15 10:34:38.244140075 +0000 UTC shieldd: INFO:   1a2e471e-44d2-4b74-9294-cce0b1f1032e> ✓ s3_host              s3.amazonaws.com
2016-11-15 10:34:38.246930932 +0000 UTC shieldd: INFO:   1a2e471e-44d2-4b74-9294-cce0b1f1032e> ✓ access_key_id        ************************
2016-11-15 10:34:38.249522602 +0000 UTC shieldd: INFO:   1a2e471e-44d2-4b74-9294-cce0b1f1032e> ✓ secret_access_key    ************************************************
2016-11-15 10:34:38.250665485 +0000 UTC shieldd: INFO:   1a2e471e-44d2-4b74-9294-cce0b1f1032e> ✓ bucket               ******-shield-backups
2016-11-15 10:34:38.251786298 +0000 UTC shieldd: INFO:   1a2e471e-44d2-4b74-9294-cce0b1f1032e> ✓ prefix               main
2016-11-15 10:34:38.253030031 +0000 UTC shieldd: INFO:   1a2e471e-44d2-4b74-9294-cce0b1f1032e> ✓ signature_version    4
2016-11-15 10:34:38.253925117 +0000 UTC shieldd: INFO:   1a2e471e-44d2-4b74-9294-cce0b1f1032e> ✓ socks5_proxy         (no proxy will be used)
2016-11-15 10:34:38.255369435 +0000 UTC shieldd: INFO:   1a2e471e-44d2-4b74-9294-cce0b1f1032e> ✓ skip_ssl_validation  no, SSL WILL be validated
2016-11-15 10:34:38.257582006 +0000 UTC shieldd: INFO:   1a2e471e-44d2-4b74-9294-cce0b1f1032e> OK
2016-11-15 10:34:38.258491061 +0000 UTC shieldd: INFO:   1a2e471e-44d2-4b74-9294-cce0b1f1032e> 
2016-11-15 10:34:38.260585181 +0000 UTC shieldd: INFO:   1a2e471e-44d2-4b74-9294-cce0b1f1032e> Running purge task
2016-11-15 10:34:38.262722604 +0000 UTC shieldd: INFO:   1a2e471e-44d2-4b74-9294-cce0b1f1032e> ==================
2016-11-15 10:34:38.268212687 +0000 UTC shieldd: INFO:   1a4c9b17-1c4c-4e3e-94ac-85503c9c62b6> Validating environment...
2016-11-15 10:34:38.270326643 +0000 UTC shieldd: INFO:   1a4c9b17-1c4c-4e3e-94ac-85503c9c62b6> =========================
2016-11-15 10:34:38.273456091 +0000 UTC shieldd: INFO:   1a4c9b17-1c4c-4e3e-94ac-85503c9c62b6> SHIELD_OP ... found
2016-11-15 10:34:38.274958796 +0000 UTC shieldd: INFO:   1a4c9b17-1c4c-4e3e-94ac-85503c9c62b6> SHIELD_STORE_PLUGIN ... found
2016-11-15 10:34:38.278265641 +0000 UTC shieldd: INFO:   1a4c9b17-1c4c-4e3e-94ac-85503c9c62b6> SHIELD_STORE_ENDPOINT ... found
2016-11-15 10:34:38.279855787 +0000 UTC shieldd: INFO:   1a4c9b17-1c4c-4e3e-94ac-85503c9c62b6> SHIELD_RESTORE_KEY ... found
2016-11-15 10:34:38.28242874 +0000 UTC shieldd: INFO:   1a4c9b17-1c4c-4e3e-94ac-85503c9c62b6> OK
2016-11-15 10:34:38.285142772 +0000 UTC shieldd: INFO:   1a4c9b17-1c4c-4e3e-94ac-85503c9c62b6> 
2016-11-15 10:34:38.286467521 +0000 UTC shieldd: INFO:   1a4c9b17-1c4c-4e3e-94ac-85503c9c62b6> Validating STORE plugin `s3`...
2016-11-15 10:34:38.289595572 +0000 UTC shieldd: INFO:   1a4c9b17-1c4c-4e3e-94ac-85503c9c62b6> ===============================
2016-11-15 10:34:38.290351814 +0000 UTC shieldd: INFO:   1a4c9b17-1c4c-4e3e-94ac-85503c9c62b6> ✓ s3_host              s3.amazonaws.com
2016-11-15 10:34:38.291372335 +0000 UTC shieldd: INFO:   1a4c9b17-1c4c-4e3e-94ac-85503c9c62b6> ✓ access_key_id        ************************
2016-11-15 10:34:38.293137922 +0000 UTC shieldd: INFO:   1a4c9b17-1c4c-4e3e-94ac-85503c9c62b6> ✓ secret_access_key    ************************************************
2016-11-15 10:34:38.293774316 +0000 UTC shieldd: INFO:   1a4c9b17-1c4c-4e3e-94ac-85503c9c62b6> ✓ bucket               ******-shield-backups
2016-11-15 10:34:38.295672602 +0000 UTC shieldd: INFO:   1a4c9b17-1c4c-4e3e-94ac-85503c9c62b6> ✓ prefix               main
2016-11-15 10:34:38.29734125 +0000 UTC shieldd: INFO:   1a4c9b17-1c4c-4e3e-94ac-85503c9c62b6> ✓ signature_version    4
2016-11-15 10:34:38.298167891 +0000 UTC shieldd: INFO:   1a4c9b17-1c4c-4e3e-94ac-85503c9c62b6> ✓ socks5_proxy         (no proxy will be used)
2016-11-15 10:34:38.299058786 +0000 UTC shieldd: INFO:   1a4c9b17-1c4c-4e3e-94ac-85503c9c62b6> ✓ skip_ssl_validation  no, SSL WILL be validated
2016-11-15 10:34:38.300927739 +0000 UTC shieldd: INFO:   1a4c9b17-1c4c-4e3e-94ac-85503c9c62b6> OK
2016-11-15 10:34:38.302014469 +0000 UTC shieldd: INFO:   1a4c9b17-1c4c-4e3e-94ac-85503c9c62b6> 
2016-11-15 10:34:38.303040577 +0000 UTC shieldd: INFO:   1a4c9b17-1c4c-4e3e-94ac-85503c9c62b6> Running purge task
2016-11-15 10:34:38.304674025 +0000 UTC shieldd: INFO:   1a4c9b17-1c4c-4e3e-94ac-85503c9c62b6> ==================
2016-11-15 10:34:38.314888452 +0000 UTC shieldd: INFO:   1aadf10d-e02c-48db-aef0-dab51d4297d7> Validating environment...
2016-11-15 10:34:38.317494649 +0000 UTC shieldd: INFO:   1aadf10d-e02c-48db-aef0-dab51d4297d7> =========================
2016-11-15 10:34:38.31869117 +0000 UTC shieldd: INFO:   1aadf10d-e02c-48db-aef0-dab51d4297d7> SHIELD_OP ... found
2016-11-15 10:34:38.322298891 +0000 UTC shieldd: INFO:   1aadf10d-e02c-48db-aef0-dab51d4297d7> SHIELD_STORE_PLUGIN ... found
2016-11-15 10:34:38.324207793 +0000 UTC shieldd: INFO:   1aadf10d-e02c-48db-aef0-dab51d4297d7> SHIELD_STORE_ENDPOINT ... found
2016-11-15 10:34:38.325751844 +0000 UTC shieldd: INFO:   1aadf10d-e02c-48db-aef0-dab51d4297d7> SHIELD_RESTORE_KEY ... found
2016-11-15 10:34:38.327386968 +0000 UTC shieldd: INFO:   1aadf10d-e02c-48db-aef0-dab51d4297d7> OK
2016-11-15 10:34:38.328753063 +0000 UTC shieldd: INFO:   1aadf10d-e02c-48db-aef0-dab51d4297d7> 
2016-11-15 10:34:38.329646473 +0000 UTC shieldd: INFO:   1aadf10d-e02c-48db-aef0-dab51d4297d7> Validating STORE plugin `s3`...
2016-11-15 10:34:38.330506359 +0000 UTC shieldd: INFO:   1aadf10d-e02c-48db-aef0-dab51d4297d7> ===============================
2016-11-15 10:34:38.332691832 +0000 UTC shieldd: INFO:   1aadf10d-e02c-48db-aef0-dab51d4297d7> ✓ s3_host              s3.amazonaws.com
2016-11-15 10:34:38.334126092 +0000 UTC shieldd: INFO:   1aadf10d-e02c-48db-aef0-dab51d4297d7> ✓ access_key_id        ************************
2016-11-15 10:34:38.334827299 +0000 UTC shieldd: INFO:   1aadf10d-e02c-48db-aef0-dab51d4297d7> ✓ secret_access_key    ************************************************
2016-11-15 10:34:38.335642486 +0000 UTC shieldd: INFO:   1aadf10d-e02c-48db-aef0-dab51d4297d7> ✓ bucket               ******-shield-backups
2016-11-15 10:34:38.337090715 +0000 UTC shieldd: INFO:   1aadf10d-e02c-48db-aef0-dab51d4297d7> ✓ prefix               main
2016-11-15 10:34:38.33797295 +0000 UTC shieldd: INFO:   1aadf10d-e02c-48db-aef0-dab51d4297d7> ✓ signature_version    4
2016-11-15 10:34:38.338579452 +0000 UTC shieldd: INFO:   1aadf10d-e02c-48db-aef0-dab51d4297d7> ✓ socks5_proxy         (no proxy will be used)
2016-11-15 10:34:38.340119592 +0000 UTC shieldd: INFO:   1aadf10d-e02c-48db-aef0-dab51d4297d7> ✓ skip_ssl_validation  no, SSL WILL be validated
2016-11-15 10:34:38.342441395 +0000 UTC shieldd: INFO:   1aadf10d-e02c-48db-aef0-dab51d4297d7> OK
2016-11-15 10:34:38.344239668 +0000 UTC shieldd: INFO:   1aadf10d-e02c-48db-aef0-dab51d4297d7> 
2016-11-15 10:34:38.347269662 +0000 UTC shieldd: INFO:   1aadf10d-e02c-48db-aef0-dab51d4297d7> Running purge task
2016-11-15 10:34:38.348842767 +0000 UTC shieldd: INFO:   1aadf10d-e02c-48db-aef0-dab51d4297d7> ==================
2016-11-15 10:34:38.370778516 +0000 UTC shieldd: INFO:   1ac9ba8d-e5c4-46df-a737-b2ef1360b8e9> Validating environment...
2016-11-15 10:34:38.372595227 +0000 UTC shieldd: INFO:   1ac9ba8d-e5c4-46df-a737-b2ef1360b8e9> =========================
2016-11-15 10:34:38.373959367 +0000 UTC shieldd: INFO:   1ac9ba8d-e5c4-46df-a737-b2ef1360b8e9> SHIELD_OP ... found
2016-11-15 10:34:38.374670072 +0000 UTC shieldd: INFO:   1ac9ba8d-e5c4-46df-a737-b2ef1360b8e9> SHIELD_STORE_PLUGIN ... found
2016-11-15 10:34:38.375551469 +0000 UTC shieldd: INFO:   1ac9ba8d-e5c4-46df-a737-b2ef1360b8e9> SHIELD_STORE_ENDPOINT ... found
2016-11-15 10:34:38.377291914 +0000 UTC shieldd: INFO:   1ac9ba8d-e5c4-46df-a737-b2ef1360b8e9> SHIELD_RESTORE_KEY ... found
2016-11-15 10:34:38.378264663 +0000 UTC shieldd: INFO:   1ac9ba8d-e5c4-46df-a737-b2ef1360b8e9> OK
2016-11-15 10:34:38.379026212 +0000 UTC shieldd: INFO:   1ac9ba8d-e5c4-46df-a737-b2ef1360b8e9> 
2016-11-15 10:34:38.381498873 +0000 UTC shieldd: INFO:   1ac9ba8d-e5c4-46df-a737-b2ef1360b8e9> Validating STORE plugin `s3`...
2016-11-15 10:34:38.382997946 +0000 UTC shieldd: INFO:   1ac9ba8d-e5c4-46df-a737-b2ef1360b8e9> ===============================
2016-11-15 10:34:38.390293287 +0000 UTC shieldd: INFO:   1ac9ba8d-e5c4-46df-a737-b2ef1360b8e9> ✓ s3_host              s3.amazonaws.com
2016-11-15 10:34:38.392498316 +0000 UTC shieldd: INFO:   1ac9ba8d-e5c4-46df-a737-b2ef1360b8e9> ✓ access_key_id        ************************
2016-11-15 10:34:38.394092373 +0000 UTC shieldd: INFO:   1ac9ba8d-e5c4-46df-a737-b2ef1360b8e9> ✓ secret_access_key    ************************************************
2016-11-15 10:34:38.395184691 +0000 UTC shieldd: INFO:   1ac9ba8d-e5c4-46df-a737-b2ef1360b8e9> ✓ bucket               ******-shield-backups
2016-11-15 10:34:38.403321758 +0000 UTC shieldd: INFO:   1ac9ba8d-e5c4-46df-a737-b2ef1360b8e9> ✓ prefix               main
2016-11-15 10:34:38.405575117 +0000 UTC shieldd: INFO:   1ac9ba8d-e5c4-46df-a737-b2ef1360b8e9> ✓ signature_version    4
2016-11-15 10:34:38.406397568 +0000 UTC shieldd: INFO:   1ac9ba8d-e5c4-46df-a737-b2ef1360b8e9> ✓ socks5_proxy         (no proxy will be used)
2016-11-15 10:34:38.407114419 +0000 UTC shieldd: INFO:   1ac9ba8d-e5c4-46df-a737-b2ef1360b8e9> ✓ skip_ssl_validation  no, SSL WILL be validated
2016-11-15 10:34:38.408719372 +0000 UTC shieldd: INFO:   1ac9ba8d-e5c4-46df-a737-b2ef1360b8e9> OK
2016-11-15 10:34:38.409554673 +0000 UTC shieldd: INFO:   1ac9ba8d-e5c4-46df-a737-b2ef1360b8e9> 
2016-11-15 10:34:38.410567931 +0000 UTC shieldd: INFO:   1ac9ba8d-e5c4-46df-a737-b2ef1360b8e9> Running purge task
2016-11-15 10:34:38.411436477 +0000 UTC shieldd: INFO:   1ac9ba8d-e5c4-46df-a737-b2ef1360b8e9> ==================
2016-11-15 10:34:38.425916809 +0000 UTC shieldd: INFO:   1ac9eb3f-1a54-44e6-a783-059dd9b3fb2d> Validating environment...
2016-11-15 10:34:38.431492936 +0000 UTC shieldd: INFO:   1ac9eb3f-1a54-44e6-a783-059dd9b3fb2d> =========================
2016-11-15 10:34:38.438327325 +0000 UTC shieldd: INFO:   1ac9eb3f-1a54-44e6-a783-059dd9b3fb2d> SHIELD_OP ... found
2016-11-15 10:34:38.440518665 +0000 UTC shieldd: INFO:   1ac9eb3f-1a54-44e6-a783-059dd9b3fb2d> SHIELD_STORE_PLUGIN ... found
2016-11-15 10:34:38.442454106 +0000 UTC shieldd: INFO:   1ac9eb3f-1a54-44e6-a783-059dd9b3fb2d> SHIELD_STORE_ENDPOINT ... found
2016-11-15 10:34:38.443315668 +0000 UTC shieldd: INFO:   1ac9eb3f-1a54-44e6-a783-059dd9b3fb2d> SHIELD_RESTORE_KEY ... found
2016-11-15 10:34:38.446924786 +0000 UTC shieldd: INFO:   1ac9eb3f-1a54-44e6-a783-059dd9b3fb2d> OK
2016-11-15 10:34:38.451229802 +0000 UTC shieldd: INFO:   1ac9eb3f-1a54-44e6-a783-059dd9b3fb2d> 
2016-11-15 10:34:38.453961434 +0000 UTC shieldd: INFO:   1ac9eb3f-1a54-44e6-a783-059dd9b3fb2d> Validating STORE plugin `s3`...
2016-11-15 10:34:38.455429778 +0000 UTC shieldd: INFO:   1ac9eb3f-1a54-44e6-a783-059dd9b3fb2d> ===============================
2016-11-15 10:34:38.458977714 +0000 UTC shieldd: INFO:   1ac9eb3f-1a54-44e6-a783-059dd9b3fb2d> ✓ s3_host              s3.amazonaws.com
2016-11-15 10:34:38.461570502 +0000 UTC shieldd: INFO:   1ac9eb3f-1a54-44e6-a783-059dd9b3fb2d> ✓ access_key_id        ************************
2016-11-15 10:34:38.464491823 +0000 UTC shieldd: INFO:   1ac9eb3f-1a54-44e6-a783-059dd9b3fb2d> ✓ secret_access_key    ************************************************
2016-11-15 10:34:38.466374465 +0000 UTC shieldd: INFO:   1ac9eb3f-1a54-44e6-a783-059dd9b3fb2d> ✓ bucket               ******-shield-backups
2016-11-15 10:34:38.468434224 +0000 UTC shieldd: INFO:   1ac9eb3f-1a54-44e6-a783-059dd9b3fb2d> ✓ prefix               main
2016-11-15 10:34:38.471512268 +0000 UTC shieldd: INFO:   1ac9eb3f-1a54-44e6-a783-059dd9b3fb2d> ✓ signature_version    4
2016-11-15 10:34:38.473320878 +0000 UTC shieldd: INFO:   1ac9eb3f-1a54-44e6-a783-059dd9b3fb2d> ✓ socks5_proxy         (no proxy will be used)
2016-11-15 10:34:38.474537793 +0000 UTC shieldd: INFO:   1ac9eb3f-1a54-44e6-a783-059dd9b3fb2d> ✓ skip_ssl_validation  no, SSL WILL be validated
2016-11-15 10:34:38.477743228 +0000 UTC shieldd: INFO:   1ac9eb3f-1a54-44e6-a783-059dd9b3fb2d> OK
2016-11-15 10:34:38.479771698 +0000 UTC shieldd: INFO:   1ac9eb3f-1a54-44e6-a783-059dd9b3fb2d> 
2016-11-15 10:34:38.486385108 +0000 UTC shieldd: INFO:   1ac9eb3f-1a54-44e6-a783-059dd9b3fb2d> Running purge task
2016-11-15 10:34:38.490178887 +0000 UTC shieldd: INFO:   1ac9eb3f-1a54-44e6-a783-059dd9b3fb2d> ==================
2016-11-15 10:34:39.009166914 +0000 UTC shieldd: INFO:   1a2e471e-44d2-4b74-9294-cce0b1f1032e> 
2016-11-15 10:34:39.011679245 +0000 UTC shieldd: INFO:   1a2e471e-44d2-4b74-9294-cce0b1f1032e> EXITING 0
2016-11-15 10:34:39.01546101 +0000 UTC shieldd: INFO:   1a2e471e-44d2-4b74-9294-cce0b1f1032e: archive 17046b7d-2767-43dd-863a-51a109037866 purged from storage
panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xb code=0x1 addr=0x0 pc=0x705ae6]

goroutine 1 [running]:
panic(0xb0fd20, 0xc8200100a0)
        /goroot/src/runtime/panic.go:481 +0x3e6
github.com/starkandwayne/shield/db.(*DB).PurgeArchive(0xc82006bd40, 0xc8202ebfa0, 0x10, 0x10, 0x0, 0x0)
        /tmp/build/01f561ac/gopath/src/github.com/starkandwayne/shield/db/archives.go:227 +0x406
github.com/starkandwayne/shield/supervisor.(*Supervisor).Run(0xc8201180c0, 0x0, 0x0)
        /tmp/build/01f561ac/gopath/src/github.com/starkandwayne/shield/supervisor/supervisor.go:362 +0x2828
main.main()
        /tmp/build/01f561ac/gopath/src/github.com/starkandwayne/shield/cmd/shieldd/main.go:67 +0x51e
bgandon commented 8 years ago

I first set all tasks statuses to canceled with an UPDATE tasks SET status='canceled' WHERE status='pending' OR status='running'; but those cancelled tasks were still being displayed as running in the dashboard, so I followed your advice and just deleted them with DELETE FROM tasks where status='canceled';

Anyway, the shield-daemon now starts successfully. There is no more errors in thepostgresql.log that is small now, and startup.log is tiny.

But while running new backups, I'm having issues when the CF blobstore backup (≈1.7 GB) is written to Amazon S3. The backup job fails with a Client.Timeout exceeded while awaiting headers error:

2016-11-15 11:19:53.219658621 +0000 UTC shieldd: INFO:   aaaf10c4-020b-4aad-805c-e6a5f7b1828e> Put https://******-shield-backups.s3-eu-central-1.amazonaws.com/main/2016/11/15/2016-11-15-111658-f064d2e4-b0c9-4197-b341-96eaa769d4b9?partNumber=1&uploadId=MMBxCCG9uQMZ9..fI3tEB5m5wjaprg93ZuBAAr_hleJuC94oxKUbUYpeJTxNbZ.bcd5O3sqBKQWpikKff2AZcJCw2ku0r4H.DgNdqDbjHw4TJETfmg4ep6sidbAcT7Ou: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
2016-11-15 11:19:53.308476334 +0000 UTC shieldd: INFO:   aaaf10c4-020b-4aad-805c-e6a5f7b1828e> bsdtar: Write error
2016-11-15 11:19:53.311802175 +0000 UTC shieldd: INFO:   aaaf10c4-020b-4aad-805c-e6a5f7b1828e> Unable to exec '/var/vcap/packages/bsdtar/bin/bsdtar': exit status 1
2016-11-15 11:19:53.313477807 +0000 UTC shieldd: INFO:   aaaf10c4-020b-4aad-805c-e6a5f7b1828e> TASK FAILED!!  shield worker 5 failed to execute the command against the remote agent 10.244.0.130:5444 (Process exited with: 4. Reason was:  ())
2016-11-15 11:19:53.314677122 +0000 UTC shieldd: INFO:   aaaf10c4-020b-4aad-805c-e6a5f7b1828e> 
2016-11-15 11:19:53.315581147 +0000 UTC shieldd: INFO:   aaaf10c4-020b-4aad-805c-e6a5f7b1828e> EXITING 4
2016-11-15 11:19:53.316901706 +0000 UTC shieldd: INFO:   aaaf10c4-020b-4aad-805c-e6a5f7b1828e> TASK FAILED!! No restore key detected in worker 5. Cowardly refusing to create an archive record
2016-11-15 11:19:53.317783941 +0000 UTC shieldd: WARNING:   aaaf10c4-020b-4aad-805c-e6a5f7b1828e: task failed!

I wonder whether this is related to the SHIELD update or not. I'll try to rollback to the previous version, try again, and keep you updated.

bgandon commented 8 years ago

So, rolling back to version 6.4.0, my scheduled backup jobs ran successfully during the night. The purge tasks went awry like before, producing GB of errors in postgresql.log, and no purging actually occurred. Once I deleted those zombie tasks (with the shield daemon stopped) the system stabilized (until a new purge is scheduled I guess).

Deploying back version 6.4.2, purge tasks run well. But scheduling the adhoc fs-based blobstore backup keeps exiting with similar write: connection reset by peer error (but not exactly the same as above, actually).

2016-11-16 07:33:43.353986061 +0000 UTC shieldd: INFO:   6d062a3b-dfe3-4270-88d5-9a79ca7294a9> Running backup task (using bzip2 compression)
2016-11-16 07:33:43.355337071 +0000 UTC shieldd: INFO:   6d062a3b-dfe3-4270-88d5-9a79ca7294a9> =============================================
2016-11-16 07:41:50.110411753 +0000 UTC shieldd: INFO:   6d062a3b-dfe3-4270-88d5-9a79ca7294a9> Put https://******-shield-backups.s3-eu-central-1.amazonaws.com/main/2016/11/16/2016-11-16-073343-72b709ea-22f4-4c8c-8235-782d13d0e726?partNumber=3&uploadId=RJ5CE5_Wams0kmJIzkrur1Aducii16YWx86tnKAUysITAVF3ebNuGh.jM4s8hbNwjSzDGCpGdw92qbUR94y40wljH2vFHC0y0RsSsxh9b43s4jHi3AkcKA6n17SsfCKB: write tcp 10.244.0.130:38135->54.231.194.54:443: write: connection reset by peer
2016-11-16 07:41:50.169451658 +0000 UTC shieldd: INFO:   6d062a3b-dfe3-4270-88d5-9a79ca7294a9> bsdtar: Write error
2016-11-16 07:41:50.173819811 +0000 UTC shieldd: INFO:   6d062a3b-dfe3-4270-88d5-9a79ca7294a9> Unable to exec '/var/vcap/packages/bsdtar/bin/bsdtar': exit status 1
2016-11-16 07:41:50.175665018 +0000 UTC shieldd: INFO:   6d062a3b-dfe3-4270-88d5-9a79ca7294a9> 
2016-11-16 07:41:50.177224713 +0000 UTC shieldd: INFO:   6d062a3b-dfe3-4270-88d5-9a79ca7294a9> TASK FAILED!!  shield worker 3 failed to execute the command against the remote agent 10.244.0.130:5444 (Process exited with: 4. Reason was:  ())
2016-11-16 07:41:50.178619304 +0000 UTC shieldd: INFO:   6d062a3b-dfe3-4270-88d5-9a79ca7294a9> EXITING 4
2016-11-16 07:41:50.179628371 +0000 UTC shieldd: INFO:   6d062a3b-dfe3-4270-88d5-9a79ca7294a9> TASK FAILED!! No restore key detected in worker 3. Cowardly refusing to create an archive record
2016-11-16 07:41:50.181139736 +0000 UTC shieldd: WARNING:   6d062a3b-dfe3-4270-88d5-9a79ca7294a9: task failed!

Back to 6.4.0 again, scheduling the adhoc fs-based blobstore backup works fine and exits with no errors.

jhunt commented 8 years ago

Sounds like there may have been a regression between SHIELD 0.7.2 and SHIELD 0.7.3 insofar as the fs plugin is concerned?

What does this command output (run from the SHIELD agent that will be running the fs plugin)?:

find -L /var/vcap/packages -name bsdtar 2>/dev/null
bgandon commented 7 years ago

Here is the result:

# find -L /var/vcap/packages -name bsdtar -ls 2>/dev/null
9311004    4 drwxr-xr-x   6 root     root         4096 Oct 19 06:11 /var/vcap/packages/bsdtar
9311008 2860 -rwxr-xr-x   1 root     root      2924772 Oct 19 06:11 /var/vcap/packages/bsdtar/bin/bsdtar

# /var/vcap/packages/bsdtar/bin/bsdtar --version
bsdtar 3.1.2 - libarchive 3.1.2

You're right, and like you I was suspecting a issue there. But I had a look at the changes between releases 6.4.0 and 6.4.2 and between SHIELD 07.2 and 0.7.3 and I don't see any reason for a regression on bsdtar. Plus, the bsdtarblob hasn't changed.

This morning my nightly backups have run fine with the new version, and adhoc scheduled backups also run successfully. I must have been fooled by some flapping issues, related to S3 or the network in between, which happened to show up only when I was running the new version.

I'm closing this issue. Thanks a lot for the active support!