shieldproject / shield

A standalone system that can perform backup and restore functions for a wide variety of pluggable data systems
MIT License

[BUG] SHIELD Agent 8.7.2 incompatible with BOSH 263.4.0 #696

Closed sogajeffrey closed 3 years ago

sogajeffrey commented 4 years ago

Describe the bug: Although the agent works fine backing up BOSH itself, it renders BOSH useless for deploying VMs. The SHIELD agent possibly (not yet confirmed) conflicts with a BOSH agent process, so a deployment never gets past creating the VM: the director is never able to connect to the VM and execute the rest of the deployment.

To Reproduce Steps to reproduce the behavior:

  1. Add SHIELD agent 8.7.2 to BOSH director with version 263.4.0
  2. Verify that SHIELD is able to back up BOSH successfully
  3. Attempt to re-create the VMs of an existing deployment, or deploy a completely new deployment, and see that BOSH is able to create the actual VM but unable to configure it.
  4. As recommended, upgraded the BOSH director to 263.5 and re-tested: same error.

Expected behavior BOSH will fail with:

Error: Timed out pinging to [GUID] after 600 seconds
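To confirm that a saved task log shows this same failure, the timeout line can be searched for mechanically. The following is only a minimal sketch: the function name `find_ping_timeouts` and the `SAMPLE` variable are hypothetical, and the sample lines are copied from the task output posted later in this thread, not from any BOSH API.

```python
import re
from typing import List, Tuple

# Matches the director's timeout message as it is quoted in this issue.
TIMEOUT_RE = re.compile(
    r"Error: Timed out pinging to (?P<agent_id>[0-9a-f-]{36}) "
    r"after (?P<seconds>\d+) seconds"
)


def find_ping_timeouts(task_output: str) -> List[Tuple[str, int]]:
    """Return the unique (agent_id, seconds) pairs found in the task output."""
    seen: List[Tuple[str, int]] = []
    for match in TIMEOUT_RE.finditer(task_output):
        entry = (match.group("agent_id"), int(match.group("seconds")))
        if entry not in seen:  # the same error is usually printed twice
            seen.append(entry)
    return seen


# Sample lines taken from the task output quoted in this thread.
SAMPLE = """\
21:33:28 | Updating instance shield: shield/d56b6b9d-8ea4-49d9-b5cd-bbd5e6fb5406 (0) (canary) (00:12:14)
            L Error: Timed out pinging to 8bf859d5-da3b-46a9-a746-780cc77c713b after 600 seconds
21:45:42 | Error: Timed out pinging to 8bf859d5-da3b-46a9-a746-780cc77c713b after 600 seconds
"""

if __name__ == "__main__":
    for agent_id, seconds in find_ping_timeouts(SAMPLE):
        print(f"agent {agent_id} did not answer pings within {seconds}s")
```

The regular expression assumes the message wording stays exactly as quoted here; other BOSH versions may phrase it differently.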

SHIELD versions (please complete the following information):

jhunt commented 4 years ago

Is this a specific regression related to BOSH v263.4.0?

cweibel commented 4 years ago

Yes. On newer bosh deployments the same agent code block works.

jhunt commented 4 years ago

Does 263.5.0 work? It seems that 263.4.0 https://github.com/cloudfoundry/bosh/releases/tag/v263.4.0 added vars interpolation to add-ons, and 263.5.0 https://github.com/cloudfoundry/bosh/releases/tag/v263.5.0 fixed a few edge cases (one of which may be what we're running into here)

sogajeffrey commented 4 years ago

I'll be testing this out in our sandbox env.

sogajeffrey commented 4 years ago

Upgraded BOSH to 263.5 and tested a redeploy. Still got the error below (testing with the SHIELD deployment):

bosh2 -d uswest2-sb-shield8 recreate
Using environment (openid, bosh.admin)

Using deployment 'uswest2-sb-shield8'

Continue? [yN]: y

Task 2112322

21:33:23 | Deprecation: Ignoring cloud config. Manifest contains 'networks' section.
21:33:23 | Preparing deployment: Preparing deployment (00:00:03)
21:33:28 | Preparing package compilation: Finding packages to compile (00:00:00)
21:33:28 | Updating instance shield: shield/d56b6b9d-8ea4-49d9-b5cd-bbd5e6fb5406 (0) (canary) (00:12:14)
            L Error: Timed out pinging to 8bf859d5-da3b-46a9-a746-780cc77c713b after 600 seconds

21:45:42 | Error: Timed out pinging to 8bf859d5-da3b-46a9-a746-780cc77c713b after 600 seconds

Started  Wed Jul 22 21:33:23 UTC 2020
Finished Wed Jul 22 21:45:42 UTC 2020
Duration 00:12:19

Task 2112322 error

norman-abramovitz commented 4 years ago

Hi James, is there more information you need from Jeff to help resolve this issue without getting onto Jeff's environment directly?

jhunt commented 4 years ago

I'm curious what happens to that deployment if all of the SHIELD BOSH release bits are removed from it, and it just provisions VMs. Does the bosh_agent respond to pings if the SHIELD software doesn't get loaded? Or is this a problem between the stemcell and BOSH itself?

If that doesn't shed any light on the situation, I'm going to need a minimum viable deployment to reproduce this in my lab, or on a replica of the same VPC/IaaS configuration.
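The first experiment above (remove every SHIELD release bit and see whether plain VMs come up) amounts to filtering one release out of the manifest. The sketch below illustrates that on a manifest already loaded as a Python dict. It is an assumption-laden illustration: `strip_release` and the trimmed `manifest` are hypothetical, and only the top-level `releases` list and `instance_groups[].jobs[].release` keys are handled, so add-ons defined in a runtime config would need separate treatment.

```python
import copy


def strip_release(manifest: dict, release_name: str) -> dict:
    """Return a copy of the manifest without the release or any job drawn from it."""
    stripped = copy.deepcopy(manifest)  # leave the caller's manifest untouched
    stripped["releases"] = [
        r for r in stripped.get("releases", []) if r.get("name") != release_name
    ]
    for group in stripped.get("instance_groups", []):
        # Dropping a job also drops the properties nested under it.
        group["jobs"] = [
            j for j in group.get("jobs", []) if j.get("release") != release_name
        ]
    return stripped


# Hypothetical, heavily trimmed manifest; names and layout are illustrative only.
manifest = {
    "name": "example-deployment",
    "releases": [
        {"name": "bosh", "version": "263.5.0"},
        {"name": "shield", "version": "8.7.2"},
    ],
    "instance_groups": [
        {
            "name": "bosh",
            "jobs": [
                {"name": "director", "release": "bosh"},
                {"name": "shield-agent", "release": "shield"},
            ],
        }
    ],
}

if __name__ == "__main__":
    without_shield = strip_release(manifest, "shield")
    print([r["name"] for r in without_shield["releases"]])
    print([j["name"] for j in without_shield["instance_groups"][0]["jobs"]])
```

Deploying the stripped and unstripped manifests side by side would show whether the ping timeout follows the SHIELD job or is present without it.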

sogajeffrey commented 4 years ago

@jhunt I ended up removing all SHIELD-related release/property bits from the BOSH director manifest in my prod envs, and everything works fine after that.

Do you mean leaving in the shield property configs but removing the release bits?

jhunt commented 4 years ago

That's exactly what I needed to know to assist in a differential diagnosis.

Which SHIELD BOSH release jobs are you trying to put on this particular deployment?

sogajeffrey commented 4 years ago

@jhunt

Here's the info:

jhunt commented 4 years ago

So this is just the shield-agent job?

sogajeffrey commented 4 years ago

Correct, just the shield agent and whatever the shield agent needs to run on the BOSH director. @jhunt

jhunt commented 4 years ago

What IaaS are you spinning this on, and what stemcell version?

sogajeffrey commented 4 years ago

AWS @jhunt

stemcell:
  sha1: ab6cc40471502ac46d296b446def0138d1a01742
  url: https://bosh.io/d/stemcells/bosh-aws-xen-hvm-ubuntu-trusty-go_agent?v=3312.20

jhunt commented 4 years ago

Trusty stemcells have been EOL'd since April of 2019.

sogajeffrey commented 4 years ago

Yeah, we know. This is an old cluster. We're moving off it in the coming year.