sourcegraph / deploy

The standard way to deploy Sourcegraph

AMIs: v5.0.6 -> v5.1.x Migrator fails on upgrade if not rebooted, introducing schema drift #56

Open DaedalusG opened 1 year ago

DaedalusG commented 1 year ago

Migrator fails on upgrade

Issue reported in v4.4

During an AMI upgrade performed via the standard upgrade procedure, some schema drift may be introduced if the instance is not rebooted as instructed in step 10.
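
After that reboot, the on-boot migrator job's logs show whether drift was introduced. A minimal check (the job-name suffix is instance-specific and shown here as a placeholder; find the real name with kubectl get jobs):

kubectl get jobs
kubectl logs job/migrator-<suffix>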

Reproduction

An instance is initialized in v4.4.

(screenshot: instance initialized at v4.4)

Upgrade to v4.5.1 - reboot

(screenshot: instance upgraded to v4.5.1 and rebooted)

Observe failing migrator pods after startup and attachment of the old volume:

migrator-ivmtf-qdbmk                         0/1     Error     0               45m
migrator-ivmtf-dkkmg                         0/1     Error     0               45m
migrator-ivmtf-n7g4d                         0/1     Error     0               44m
migrator-ivmtf-k86r6                         0/1     Error     0               44m
migrator-ivmtf-ccnkt                         0/1     Error     0               43m
otel-collector-754d7c6c4f-rlktn              0/1     Pending   0               6m48s
migrator-ivmtf-gwkbg                         0/1     Error     0               7m13s

Reboot the EC2 machine and check for drift:

[ec2-user@ip-172-31-50-4 ~]$ kubectl logs job/migrator-o10ui
✱ Sourcegraph migrator 5.1.4
ℹ️ Connection DSNs used: frontend => postgres://sg:password@pgsql:5432/sg
Attempting connection to postgres://sg:password@pgsql:5432/sg...
✅ Connection to "postgres://sg:password@pgsql:5432/sg" succeeded
ℹ️ Locating schema description
✅ Schema found in Local file (/schema-descriptions/v4.5.1-internal_database_schema.json).
✅ No drift detected

Upgrade to v5.0.6

[ec2-user@ip-172-31-57-33 ~]$ kubectl logs job/migrator-vu1bz 
✱ Sourcegraph migrator 5.1.4
ℹ️ Connection DSNs used: frontend => postgres://sg:password@pgsql:5432/sg
Attempting connection to postgres://sg:password@pgsql:5432/sg...
✅ Connection to "postgres://sg:password@pgsql:5432/sg" succeeded
💡 Parsed "v5.0.6" from version flag value "5.0.6"
ℹ️ Locating schema description
ℹ️ Reading schema definition in Local file (/schema-descriptions/v5.0.6-internal_database_schema.json)... Schema not found (open /schema-descriptions/v5.0.6-internal_database_schema.json: no such file or directory). Will attempt a fallback source.
✅ Schema found in GitHub (https://raw.githubusercontent.com/sourcegraph/sourcegraph/v5.0.6/internal/database/schema.json).
✅ No drift detected


(screenshot: instance at v5.0.6)

Upgrade to v5.1.4

Version

(screenshot: instance version after the upgrade)

Drift in UI

(screenshot: schema drift reported on the Updates page)

Manual drift check

[ec2-user@ip-172-31-55-120 ~]$ k logs job/migrator-kzvjh
Found 2 pods, using pod/migrator-kzvjh-m68ww
✱ Sourcegraph migrator 5.1.4
ℹ️ Connection DSNs used: frontend => postgres://sg:password@pgsql:5432/sg
Attempting connection to postgres://sg:password@pgsql:5432/sg...
✅ Connection to "postgres://sg:password@pgsql:5432/sg" succeeded
{"SeverityText":"FATAL","Timestamp":1689936589304208831,"InstrumentationScope":"migrator","Caller":"migrator/main.go:29","Function":"main.main","Body":"version assertion failed: \"5.0\" != \"v5.1.4\". Re-invoke with --skip-version-check to ignore this check","Resource":{"service.name":"migrator","service.version":"5.1.4","service.instance.id":"8404e9ce-9ce4-47e1-be7f-d6c00e765e04"},"Attributes":{}}

Database version

sg=# SELECT * FROM versions;
 service  | version |          updated_at           | first_version | auto_upgrade 
----------+---------+-------------------------------+---------------+--------------
 frontend | 5.0.6   | 2023-07-21 10:44:57.253021+00 | 4.4.0         | f
(1 row)
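
The row still reads 5.0.6 even though the instance is running 5.1.4. If a drift check against v5.1.4 comes back clean (as verified later in this thread), the stale row could in principle be corrected by hand. A hypothetical sketch only, not a supported procedure: the pgsql-0 pod and the sg user/database come from the logs above, while the pgsql container name is assumed.

# Hypothetical manual fix; run only after a clean drift check against v5.1.4
kubectl exec -it pgsql-0 -c pgsql -- psql -U sg -d sg \
  -c "UPDATE versions SET version = '5.1.4', updated_at = now() WHERE service = 'frontend';"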

Reboot at version 5.1.4

NAME                                          READY   STATUS    RESTARTS      AGE
otel-collector-64d9c9b6d6-zvbqr               0/1     Pending   0             18m
migrator-kzvjh-m68ww                          0/1     Error     0             8m5s
migrator-kzvjh-kc7bf                          0/1     Error     0             8m1s
migrator-kzvjh-mhzsj                          0/1     Error     0             7m47s
migrator-kzvjh-f7w5j                          0/1     Error     0             7m24s
migrator-kzvjh-twmwz                          0/1     Error     0             6m41s
migrator-kzvjh-xtl42                          0/1     Error     0             5m17s
[ec2-user@ip-172-31-55-120 ~]$ k logs migrator-kzvjh-m68ww 
unable to retrieve container logs for containerd://784ba2c7dd1081b77f4f314b4ddfe12922a0f3fc1ad3b488b9da885d2a71ba34
[ec2-user@ip-172-31-55-120 ~]$ k describe pod migrator-kzvjh-m68ww 
Name:             migrator-kzvjh-m68ww
Namespace:        default
Priority:         0
Service Account:  default
Node:             sourcegraph-0/172.31.55.120
Start Time:       Fri, 21 Jul 2023 10:49:48 +0000
Labels:           app.kubernetes.io/instance=sourcegraph-migrator
                  app.kubernetes.io/name=sourcegraph-migrator
                  controller-uid=b7655183-4018-4bbe-a81a-93c71ccb9488
                  deploy=sourcegraph
                  job=migrator
                  job-name=migrator-kzvjh
Annotations:      kubectl.kubernetes.io/default-container: migrator
Status:           Failed
IP:               10.10.0.99
IPs:
  IP:           10.10.0.99
Controlled By:  Job/migrator-kzvjh
Containers:
  migrator:
    Container ID:  containerd://784ba2c7dd1081b77f4f314b4ddfe12922a0f3fc1ad3b488b9da885d2a71ba34
    Image:         index.docker.io/sourcegraph/migrator:5.1.4@sha256:b871f4d32dee8ae757e3a66e5e0b75b0f2d6e04d6c598f1f0540a8e93648715b
    Image ID:      docker.io/sourcegraph/migrator@sha256:b871f4d32dee8ae757e3a66e5e0b75b0f2d6e04d6c598f1f0540a8e93648715b
    Port:          <none>
    Host Port:     <none>
    Args:
      drift
      --db=frontend
      --version=v5.1.4
    State:          Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Fri, 21 Jul 2023 10:49:48 +0000
      Finished:     Fri, 21 Jul 2023 10:49:49 +0000
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     500m
      memory:  100M
    Requests:
      cpu:     100m
      memory:  50M
    Environment:
      PGDATABASE:               <set to the key 'database' in secret 'pgsql-auth'>            Optional: false
      PGHOST:                   <set to the key 'host' in secret 'pgsql-auth'>                Optional: false
      PGPASSWORD:               <set to the key 'password' in secret 'pgsql-auth'>            Optional: false
      PGPORT:                   <set to the key 'port' in secret 'pgsql-auth'>                Optional: false
      PGUSER:                   <set to the key 'user' in secret 'pgsql-auth'>                Optional: false
      CODEINTEL_PGDATABASE:     <set to the key 'database' in secret 'codeintel-db-auth'>     Optional: false
      CODEINTEL_PGHOST:         <set to the key 'host' in secret 'codeintel-db-auth'>         Optional: false
      CODEINTEL_PGPASSWORD:     <set to the key 'password' in secret 'codeintel-db-auth'>     Optional: false
      CODEINTEL_PGPORT:         <set to the key 'port' in secret 'codeintel-db-auth'>         Optional: false
      CODEINTEL_PGUSER:         <set to the key 'user' in secret 'codeintel-db-auth'>         Optional: false
      CODEINSIGHTS_PGDATABASE:  <set to the key 'database' in secret 'codeinsights-db-auth'>  Optional: false
      CODEINSIGHTS_PGHOST:      <set to the key 'host' in secret 'codeinsights-db-auth'>      Optional: false
      CODEINSIGHTS_PGPASSWORD:  <set to the key 'password' in secret 'codeinsights-db-auth'>  Optional: false
      CODEINSIGHTS_PGPORT:      <set to the key 'port' in secret 'codeinsights-db-auth'>      Optional: false
      CODEINSIGHTS_PGUSER:      <set to the key 'user' in secret 'codeinsights-db-auth'>      Optional: false
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-vwwrt (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  kube-api-access-vwwrt:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  10m   default-scheduler  Successfully assigned default/migrator-kzvjh-m68ww to sourcegraph-0
  Normal  Pulled     10m   kubelet            Container image "index.docker.io/sourcegraph/migrator:5.1.4@sha256:b871f4d32dee8ae757e3a66e5e0b75b0f2d6e04d6c598f1f0540a8e93648715b" already present on machine
  Normal  Created    10m   kubelet            Created container migrator
  Normal  Started    10m   kubelet            Started container migrator

Checking drift against the v5.0.6 schema

helm upgrade --install --set "migrator.args={drift,--db=frontend,--version=v5.0.6,--skip-version-check}" sourcegraph-migrator sourcegraph/sourcegraph-migrator --version 5.1.4

Actual logs omitted, but this drift output is the same as the drift registered on the Updates page.

Summary

On upgrade from v5.0.6 to v5.1.x, the migrator isn't correctly initializing and setting the database state to the correct version, though it is likely still running schema migrations. Either the migrations are being applied correctly and the schema drift on the Updates page is the result of a bad versions table entry, or the schema migrations aren't being run by the up command at all.
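
One way to distinguish the two cases (a sketch; the pod, user, and database names are taken from the logs above, the pgsql container name is assumed, and the helm invocation mirrors the one used later in this thread):

# What does the versions table claim?
kubectl exec -it pgsql-0 -c pgsql -- psql -U sg -d sg -c "SELECT service, version FROM versions;"
# Drift check against the *target* version, bypassing the stale version assertion:
helm upgrade --install \
  --set "migrator.args={drift,--db=frontend,--version=v5.1.4,--skip-version-check}" \
  sourcegraph-migrator sourcegraph/sourcegraph-migrator --version 5.1.4

If the drift check comes back clean, the versions entry is stale (the first case); if it reports drift, the up command never ran (the second).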

Given these conditions, once the exact cause of the up operation's failure is identified, this can likely be resolved manually by correct use of the upgrade command. One hypothesis as to the root cause of this issue is the tagging of a 5.0.6 image set in the sourcegraph/deploy repo: the migrator image definitions in the sourcegraph/sourcegraph repo may not correctly handle the extra/missing version.

DaedalusG commented 1 year ago

A manual run of the drift check, ignoring the version tag check, shows no drift. This still needs to be verified, but it indicates that the drift isn't real and that the UI is reporting drift due to an inaccurate entry in the versions table.

[ec2-user@ip-172-31-55-120 ~]$ helm upgrade --install --set "migrator.args={drift,--db=frontend,--version=v5.1.4,--skip-version-check}" sourcegraph-migrator sourcegraph/sourcegraph-migrator --version 5.1.4
WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /etc/rancher/k3s/k3s.yaml
WARNING: Kubernetes configuration file is world-readable. This is insecure. Location: /etc/rancher/k3s/k3s.yaml
Release "sourcegraph-migrator" has been upgraded. Happy Helming!
NAME: sourcegraph-migrator
LAST DEPLOYED: Fri Jul 21 11:14:49 2023
NAMESPACE: default
STATUS: deployed
REVISION: 9
TEST SUITE: None
[ec2-user@ip-172-31-55-120 ~]$ k get jobs
NAME             COMPLETIONS   DURATION   AGE
migrator-nxavh   1/1           4s         6s
[ec2-user@ip-172-31-55-120 ~]$ k logs job/migrator-nxavh
✱ Sourcegraph migrator 5.1.4
ℹ️ Connection DSNs used: frontend => postgres://sg:password@pgsql:5432/sg
Attempting connection to postgres://sg:password@pgsql:5432/sg...
✅ Connection to "postgres://sg:password@pgsql:5432/sg" succeeded
ℹ️ Locating schema description
ℹ️ Reading schema definition in Local file (/schema-descriptions/v5.1.4-internal_database_schema.json)... Schema not found (open /schema-descriptions/v5.1.4-internal_database_schema.json: no such file or directory). Will attempt a fallback source.
✅ Schema found in GitHub (https://raw.githubusercontent.com/sourcegraph/sourcegraph/v5.1.4/internal/database/schema.json).
✅ No drift detected

Inferring that migrations were run correctly but the versions table wasn't updated correctly.

DaedalusG commented 1 year ago

During testing of this issue, it appears the failing migrator init jobs eventually completed, correcting the instance version:

[ec2-user@ip-172-31-55-120 ~]$ k get pods
NAME                                          READY   STATUS      RESTARTS       AGE
otel-collector-64d9c9b6d6-zvbqr               0/1     Pending     0              39m
codeinsights-db-0                             2/2     Running     4 (23m ago)    39m
embeddings-5fd8c4f865-jnbgm                   1/1     Running     2 (23m ago)    39m
github-proxy-6f66d84fcf-78xlf                 1/1     Running     2 (23m ago)    39m
grafana-0                                     1/1     Running     1 (23m ago)    38m
cadvisor-skvmq                                1/1     Running     2 (23m ago)    39m
gitserver-0                                   1/1     Running     10 (21m ago)   39m
symbols-5d8fb9f887-nftjx                      1/1     Running     4 (23m ago)    181d
repo-updater-8567589bcc-pxk4b                 1/1     Running     4 (23m ago)    39m
blobstore-5fbfd6dcf7-8mwgn                    1/1     Running     2 (23m ago)    39m
worker-66c45bbd77-vcbg5                       1/1     Running     2 (23m ago)    39m
sourcegraph-frontend-6f5f7f796c-dr4m7         1/1     Running     1 (23m ago)    39m
prometheus-9944457d7-q6xkl                    1/1     Running     2 (23m ago)    39m
codeintel-db-0                                2/2     Running     4 (23m ago)    38m
syntect-server-5cbd47df6b-hrd6w               1/1     Running     2 (23m ago)    39m
searcher-0                                    1/1     Running     2 (23m ago)    38m
indexed-search-1                              2/2     Running     4 (23m ago)    38m
executor-batches-codeintel-6d4998cdfb-58jq6   1/1     Running     2 (23m ago)    39m
redis-store-6dfd9dd9f9-52ccm                  2/2     Running     4 (23m ago)    38m
pgsql-0                                       2/2     Running     4 (23m ago)    38m
otel-collector-6cc9b7dd4-z9h8l                1/1     Running     2 (23m ago)    62m
searcher-785f9f5ddb-dtmt7                     1/1     Running     2 (23m ago)    181d
sourcegraph-frontend-6f5f7f796c-gr7sj         1/1     Running     1 (23m ago)    39m
precise-code-intel-worker-9b49484c6-djsbf     1/1     Running     2 (23m ago)    39m
symbols-0                                     1/1     Running     2 (23m ago)    37m
node-exporter-whgvk                           1/1     Running     2 (23m ago)    39m
otel-agent-bnnj2                              1/1     Running     2 (23m ago)    39m
indexed-search-0                              2/2     Running     2 (23m ago)    37m
redis-cache-76d699955b-m4dtg                  2/2     Running     4 (23m ago)    38m
migrator-nxavh-ph877                          0/1     Completed   0              4m44s

Drift appears to have resolved and the versions table is now correctly set.

So the root cause here seems to be an issue in the migrator's init and its run of the up command. Admins encountering this issue are advised to try rebooting their EC2 machine; running migrator up manually may also resolve it (a sketch follows below). Further reproduction of this issue and its resolution will be explored.
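
A manual up run would look much like the drift invocations above. A sketch assuming the same chart, release name, and flag style; the job-name suffix is a placeholder:

helm upgrade --install --set "migrator.args={up,--db=frontend}" \
  sourcegraph-migrator sourcegraph/sourcegraph-migrator --version 5.1.4
kubectl get jobs
kubectl logs job/migrator-<suffix>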

Admins experiencing this issue are advised to check for "orphaned"/errored migrator pods.
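
A quick check (the job=migrator label is taken from the pod description above; the grep variant is a looser alternative):

kubectl get pods -l job=migrator --field-selector=status.phase=Failed
kubectl get pods | grep -E 'migrator.*Error'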