jfrog / charts

JFrog official Helm Charts
https://jfrog.com/integration/helm-repository/
Apache License 2.0

Artifactory single node won't start, claims join key is missing. #1917

Open artm opened 2 months ago

artm commented 2 months ago

Is this a request for help?: yes

Is this a BUG REPORT or ~FEATURE REQUEST~? (choose one): BUG REPORT

Version of Helm and Kubernetes: helm 3.12.0, RKE2 kubernetes 1.28.10+rke2r1

Which chart: artifactory, 107.90.9

Which product license (Enterprise/Pro/oss): Pro

JFrog support reference (if already raised with support team): -

What happened: While deploying artifactory to a new cluster, we are unable to get it running with the helm chart configuration as close to the running version as possible.

This is the initial installation of Artifactory on this cluster. An older installation is running on a different cluster with an older Kubernetes version and a slightly different configuration.

We tried to install the same version of the chart as on the old cluster (107.59.18) and several versions between that and the latest one.

One difference between the two installations is that we use the built-in PostgreSQL on the old cluster and an external PostgreSQL cluster on the new one.

What you expected to happen: artifactory starts.

How to reproduce it (as minimally and precisely as possible):

helm override file:

artifactory:
  unifiedSecretInstallation: false
  admin:
    secret: artifactory-config
    dataKey: admin-password
  license: 
    secret: artifactory-config
    dataKey: license-key
  masterKeySecretName: artifactory-config
  joinKeySecretName: artifactory-config

the rest of the configuration is probably irrelevant for the problem, but here it is for completeness:

  resources:
    requests:
      memory: 4Gi
      cpu: "2"
    limits:
      memory: 6Gi
      cpu: "4"
  javaOpts:
    xms: 4g
    xmx: 4g
  persistence:
    size: 30Gi
  customVolumes: |
    - name: artifactory-backup
      persistentVolumeClaim:
        claimName: artifactory-backup
    - name: old-artifactory-backup
      nfs:
        server: nfs.xxx.xxx
        path: /xxx/xxx/xxx
  customVolumeMounts: |
    - name: artifactory-backup
      mountPath: /opt/backup-data
    - name: old-artifactory-backup
      mountPath: /opt/old-backup-data    
postgresql:
  enabled: false
database:
  type: postgresql
  driver: org.postgresql.Driver
  secrets:
    user:  
      name: artifactory-config
      key: db-username
    password:
      name:  artifactory-config
      key: db-password
    url:
      name:  artifactory-config
      key: db-url
ingress:
  enabled: true
  defaultBackend:
    enabled: false
  annotations:
    ingress.kubernetes.io/force-ssl-redirect: "true"
    ingress.kubernetes.io/proxy-body-size: "0"
    ingress.kubernetes.io/proxy-read-timeout: "600"
    ingress.kubernetes.io/proxy-send-timeout: "600"
nginx:
  enabled: false

with a kubernetes secret artifactory-config with the fields admin-password, license-key, master-key and join-key. I'm not sure about unifiedSecretInstallation, but we tried both true and false.
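
For reference, a minimal sketch of what such a Secret could look like. The key names are the ones referenced in the override file above, plus master-key and join-key. Every value is a placeholder, and the idea that both keys are hex strings (for example generated with `openssl rand -hex 32`) is an assumption based on the chart documentation, not something stated in this report.

```yaml
# Sketch only: all values are placeholders, not real credentials.
apiVersion: v1
kind: Secret
metadata:
  name: artifactory-config
type: Opaque
stringData:                      # plain text here; Kubernetes stores it base64-encoded under data
  admin-password: "<admin-password>"
  license-key: "<license-text>"
  master-key: "<hex-string>"     # assumption: e.g. output of `openssl rand -hex 32`
  join-key: "<hex-string>"       # assumption: e.g. output of `openssl rand -hex 32`
  db-username: "<db-user>"
  db-password: "<db-password>"
  db-url: "jdbc:postgresql://<db-host>:5432/<db-name>"
```

Using stringData avoids hand-encoding values; if data is used instead, each value has to be base64-encoded.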

Anything else we need to know:

$ kubectl logs statefulset/artifactory -c artifactory | grep join
2024-08-30T12:00:30.235Z [shell] [INFO ] [] [artifactoryCommon.sh:93       ] [main] - Bootstrap joinKey found in [/opt/jfrog/artifactory/var/bootstrap/access/etc/security/join.key:]. Deleting original
2024-08-30T12:01:02.178Z [jfrt ] [INFO ] [ead12e87bca3bcb5] [o.j.s.c.KeyUtils:108          ] [art-init            ] - Cluster join: Join key is missing. Pending for 5 seconds with 60 seconds timeout
2024-08-30T12:01:07.180Z [jfrt ] [INFO ] [ead12e87bca3bcb5] [o.j.s.c.KeyUtils:108          ] [art-init            ] - Cluster join: Join key is missing. Pending for 5 seconds with 60 seconds timeout
2024-08-30T12:01:12.183Z [jfrt ] [INFO ] [ead12e87bca3bcb5] [o.j.s.c.KeyUtils:108          ] [art-init            ] - Cluster join: Join key is missing. Pending for 5 seconds with 60 seconds timeout
2024-08-30T12:01:17.188Z [jfrt ] [INFO ] [ead12e87bca3bcb5] [o.j.s.c.KeyUtils:108          ] [art-init            ] - Cluster join: Join key is missing. Pending for 5 seconds with 60 seconds timeout
2024-08-30T12:01:22.190Z [jfrt ] [INFO ] [ead12e87bca3bcb5] [o.j.s.c.KeyUtils:108          ] [art-init            ] - Cluster join: Join key is missing. Pending for 5 seconds with 60 seconds timeout
2024-08-30T12:01:27.194Z [jfrt ] [INFO ] [ead12e87bca3bcb5] [o.j.s.c.KeyUtils:108          ] [art-init            ] - Cluster join: Join key is missing. Pending for 5 seconds with 60 seconds timeout
2024-08-30T12:01:32.196Z [jfrt ] [INFO ] [ead12e87bca3bcb5] [o.j.s.c.KeyUtils:108          ] [art-init            ] - Cluster join: Join key is missing. Pending for 5 seconds with 60 seconds timeout
2024-08-30T12:01:37.199Z [jfrt ] [INFO ] [ead12e87bca3bcb5] [o.j.s.c.KeyUtils:108          ] [art-init            ] - Cluster join: Join key is missing. Pending for 5 seconds with 60 seconds timeout
2024-08-30T12:01:42.201Z [jfrt ] [INFO ] [ead12e87bca3bcb5] [o.j.s.c.KeyUtils:108          ] [art-init            ] - Cluster join: Join key is missing. Pending for 5 seconds with 60 seconds timeout
2024-08-30T12:01:47.203Z [jfrt ] [INFO ] [ead12e87bca3bcb5] [o.j.s.c.KeyUtils:108          ] [art-init            ] - Cluster join: Join key is missing. Pending for 5 seconds with 60 seconds timeout
2024-08-30T12:01:52.206Z [jfrt ] [INFO ] [ead12e87bca3bcb5] [o.j.s.c.KeyUtils:108          ] [art-init            ] - Cluster join: Join key is missing. Pending for 5 seconds with 60 seconds timeout
2024-08-30T12:01:57.292Z [jfrt ] [ERROR] [ead12e87bca3bcb5] [ctoryContextConfigListener:131] [art-init            ] - Application could not be initialized: Cluster join: Failed resolving join key; Missing join key

The last error in full:

2024-08-30T12:01:57.292Z [jfrt ] [ERROR] [ead12e87bca3bcb5] [ctoryContextConfigListener:131] [art-init            ] - Application could not be initialized: Cluster join: Failed resolving join key; Missing join key
java.lang.reflect.InvocationTargetException: null
        at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:77)
        at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.base/java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:499)
        at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:480)
        at org.artifactory.lifecycle.webapp.servlet.ArtifactoryContextConfigListener.configure(ArtifactoryContextConfigListener.java:272)
        at org.artifactory.lifecycle.webapp.servlet.ArtifactoryContextConfigListener$1.run(ArtifactoryContextConfigListener.java:127)
Caused by: org.springframework.beans.factory.BeanInitializationException: Failed to initialize bean 'AccessServiceImpl'.; nested exception is java.lang.IllegalStateException: Cluster join: Failed resolving join key; Missing join key
        at org.artifactory.spring.ArtifactoryApplicationContext.initReloadableBeans(ArtifactoryApplicationContext.java:359)
        at org.artifactory.spring.ArtifactoryApplicationContext.refresh(ArtifactoryApplicationContext.java:309)
        at org.artifactory.spring.ArtifactoryApplicationContext.<init>(ArtifactoryApplicationContext.java:183)
        ... 7 common frames omitted
Caused by: java.lang.IllegalStateException: Cluster join: Failed resolving join key; Missing join key
        at org.jfrog.security.common.KeyUtils.keyResolutionFailure(KeyUtils.java:152)
        at org.jfrog.security.common.KeyUtils.waitForKey(KeyUtils.java:81)
        at org.jfrog.access.key.join.JoinKeyBootstrapper.initJoinKey(JoinKeyBootstrapper.java:35)
        at org.jfrog.access.key.join.JoinKeyBootstrapper.getJoinKey(JoinKeyBootstrapper.java:29)
        at org.jfrog.access.client.AccessClientBootstrap.<init>(AccessClientBootstrap.java:90)
        at org.jfrog.access.client.AccessClientBootstrap.<init>(AccessClientBootstrap.java:132)
        at org.artifactory.security.access.AccessServiceImpl.bootstrapAccessClient(AccessServiceImpl.java:1549)
        at org.artifactory.security.access.AccessServiceImpl.lambda$bootstrapAccessClient$26(AccessServiceImpl.java:1508)
        at io.vavr.control.Try.mapTry(Try.java:634)
        at io.vavr.control.Try.map(Try.java:585)
        at org.artifactory.security.access.AccessServiceImpl.bootstrapAccessClient(AccessServiceImpl.java:1508)
        at org.artifactory.security.access.AccessServiceImpl.initAccessService(AccessServiceImpl.java:577)
        at org.artifactory.security.access.AccessServiceImpl.initAccessClientIfNeeded(AccessServiceImpl.java:565)
        at org.artifactory.security.access.AccessServiceImpl.init(AccessServiceImpl.java:548)
        at org.artifactory.spring.ArtifactoryApplicationContext.initReloadableBeans(ArtifactoryApplicationContext.java:357)
        ... 9 common frames omitted

let us know if you need more information.

Momotoculteur commented 2 months ago

Hello, same here.....

Edit :

Before on a working version

ls /var/opt/jfrog/artifactory/etc/security
main.key   join.key

On latest version :

ls /var/opt/jfrog/artifactory/etc/security
main.key

We see these logs in the copy-system-configuration init container:

[......other logs useless]
Copy joinKey to /var/opt/jfrog/artifactory/bootstrap/access/etc/security
Copy masterKey to /var/opt/jfrog/artifactory/etc/security

So previously both keys were together in the same folder, but that has changed here.

reespozzi commented 2 months ago

@Momotoculteur which version did you have this working on before? We're facing the same issue

Momotoculteur commented 2 months ago

@Momotoculteur which version did you have this working on before? We're facing the same issue

@reespozzi for the moment we stay at 107.84.21 (helm chart)

shettypriy commented 2 months ago

@Momotoculteur I installed chart version 107.84.21 (helm chart) and I am still getting the error Join key is missing. Could you please share the helm values file? I see the join key in the location var/bootstrap/access/etc/security/join.key but I am unable to find it in /opt/jfrog/router/var/etc/security

artm commented 2 months ago

@reespozzi for the moment we stay at 107.84.21 (helm chart)

Have you gotten to this version by gradually upgrading an existing installation or have you installed it from scratch at some point?

It seems the problem lies with the bootstrap code which is only executed on a new installation, but is skipped when an existing installation is detected. The next thing we want to try is to copy the whole data volume and the database to the new installation prior to starting artifactory to avoid executing the bootstrap. I also want to read the bootstrap code to try to understand what might be going wrong.

artm commented 2 months ago

I wasn't able to find the bootstrap code in JFrog's GitHub projects; I searched for the error messages we are getting. We will now attempt to copy the old data volume / database before starting Artifactory.

ak-mustafa commented 2 months ago

I am facing the same issue. Although I copied the join key to the correct path by connecting to the pod, it doesn't work. Do we really have to use master and join keys if we don't use HA?

reespozzi commented 2 months ago

We did get this fully working in the end. We're using PVCs in AKS, enabled via persistence in our values file. This option also seems the best fit for what the JFrog bootstrap is doing in the background. We originally tried customPersistentVolumeClaims and some file moving, but had no luck. If you're doing this and using storage other than postgres, like azure managed disks, you'll also need to set allowNonPostgresql to true in your system.yaml file. You can do this via helm.

Most importantly, I don't think the join key or master key matter for the non-HA installation, but yes we had to supply them. We created secrets in AKS for both with identical values for join-key and master-key as the data key. We then referenced them via joinKeySecretName and masterKeySecretName in our chart. Once we did this and reinstalled everything from scratch, things started to work.

We also saw this join key error when setting an admin password as a sealed secret but not having the format correct, the format is weird and expects this bootstrap.creds='admin@127.0.0.1=Password' -- this article helped us move past it
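
A minimal sketch of that format, assuming the admin credentials live in their own Secret (the name artifactory-admin-creds is hypothetical) and are wired in through artifactory.admin.secret and artifactory.admin.dataKey as in the override file at the top of this issue. The password is a placeholder.

```yaml
# Sketch only: secret name and password are placeholders.
# File 1: Kubernetes Secret holding the bootstrap credentials.
apiVersion: v1
kind: Secret
metadata:
  name: artifactory-admin-creds    # hypothetical name
type: Opaque
stringData:
  # Format quoted in this thread: <user>@<ip>=<password>
  bootstrap.creds: "admin@127.0.0.1=Password"
---
# File 2: matching Helm values fragment (not a Kubernetes manifest).
artifactory:
  admin:
    secret: artifactory-admin-creds
    dataKey: bootstrap.creds
```

The point is that the value stored under the data key is the whole user@ip=password line, not just the bare password.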

Please also note: if you see an error about the cluster join key being missing, it is sometimes a red herring caused by a failed bootstrap. If you look at the artifactory container logs, you'll see the real error that's breaking things. For instance, we sometimes had bad config in our configmap, and randomly started seeing the [security_keys.go:185 ] [main ] [] - Cluster join: Join key is missing. Pending for 15 seconds with 5m0s timeout error again. This even shows up from time to time when the pods are healthy, and then they eventually start.

Another note, because we deploy via flux and we have PVCs, we were often deleting these and letting them get rebuilt from scratch along with the artifactory deployment. I'm not sure exactly what's persisted to data storage in terms of keys, but if you're not bothered about your data in there, could be worth a fresh deployment once you define the keys yourself. I say this because at times when we tried to place a join.key file in the filesystem, our pods would crash out saying the join key did not match. It's also not ideal to have to manually create files each time you rebuild, so I think the secret definitions are the best way to go, as looking through the chart templates, bootstrapping seems to handle all the moving around for you if they are defined as secrets. And make sure they are built with the correct structure:

apiVersion: v1
data:
  join-key: some-base64-val
kind: Secret
metadata:
....

We are doing a fresh install on 107.84.21 , haven't tried any other versions yet

ak-mustafa commented 2 months ago

@reespozzi We did the same deployment from scratch and it worked well. But if the pod restarts for some reason, it doesn't work again. Did you try restarting the pod?

reespozzi commented 2 months ago

Yes, we've restarted the pod, deleted the Helm release etc. and it all spins back up for us. We don't touch anything manually during the install. Perhaps the error will show in the artifactory container logs.

artm commented 2 months ago

I'm not sure I can see what exactly in @reespozzi 's case solves the problem

artm commented 2 months ago

Found your configmap. It contains artifactory.config.import.xml; where do its contents come from?

reespozzi commented 2 months ago

I'm not sure I can see what exactly in @reespozzi 's case solves the problem

  • using PVCs: same as us, we just omit the storage class, since we only have one and it's the default
  • if you're doing this, you'll also need to set allowNonPostgresql to true in your system.yaml file - if we're doing what exactly, using PVCs? We are using external postgresql, it feels wrong to set this to true?
  • what is it that supplying a config map fixes? what should be in the config map?
  • we also supply master / join key via secrets like you do

Of course if you're using postgres you won't set this to true - Maybe to clarify sorry, our pvc is for an azure managed disk, this is why we need the system.yaml change. Can you see errors in the artifactory container? This is where we saw it complaining about initialising our storage, maybe there's a postgres error in there for you

config map is just general stuff for our config about repositories or security, probably could've been left out the conversation, not directly related to the secrets

artm commented 2 months ago

ok, we'll try to read through the logs again. For now we were focused on the messages about the join key, but if I understand you well it could be a false symptom of something else going wrong during the bootstrap. Thank you for sharing your insights!

artm commented 2 months ago

Aha, you were correct @reespozzi, I initially missed this warning:

.evidence key is misplaced or doesnt apply at this location
.access.runOnArtifactoryTomcat key is misplaced or doesnt apply at this location
.federation key is misplaced or doesnt apply at this location
yaml validation failed
2024-09-11T09:20:46.070Z [shell] [WARN ] [] [installerCommon.sh:819        ] [main] - System.yaml validation failed

Database connection check failed Could not determine database type

so the error is in the database configuration.

reespozzi commented 2 months ago

@artm ah nice, it's weird that seemingly all bootstrap / setup errors result in artifactory logs repeatedly spitting out join-key errors, and it's really misleading

artm commented 2 months ago

similar problem: https://github.com/jfrog/charts/issues/1823

we also first see successful database configuration:

2024-09-11T09:20:45.371Z [shell] [INFO ] [] [systemYamlHelper.sh:621       ] [main] - Resolved .shared.database.type (postgresql) from /opt/jfrog/artifactory/var/etc/system.yaml
2024-09-11T09:20:45.514Z [shell] [INFO ] [] [systemYamlHelper.sh:621       ] [main] - Resolved JF_SHARED_DATABASE_URL (jdbc:postgresql://dbhost:5432/artifactory) from environment variable
2024-09-11T09:20:45.602Z [shell] [INFO ] [] [systemYamlHelper.sh:621       ] [main] - Resolved JF_SHARED_DATABASE_PASSWORD (__sensitive_key_hidden___) from environment variable
2024-09-11T09:20:45.684Z [shell] [INFO ] [] [systemYamlHelper.sh:621       ] [main] - Resolved .artifactory.database.maxOpenConnections (80) from /opt/jfrog/artifactory/var/etc/system.yaml
2024-09-11T09:20:45.800Z [shell] [INFO ] [] [systemYamlHelper.sh:621       ] [main] - Resolved .access.database.maxOpenConnections (80) from /opt/jfrog/artifactory/var/etc/system.yaml

but later on:

2024-09-11T09:20:46.070Z [shell] [WARN ] [] [installerCommon.sh:819        ] [main] - System.yaml validation failed

...

ak-mustafa commented 2 months ago

after installation from scratch and couple of restarts.

###   All services started successfully in 81.823 seconds   ###
############################################################### 

I have only updated the values file as below:

      storageClassName: "managed-premium"
      size: 50Gi

previously

persistence:
      size: 50Gi

reespozzi commented 2 months ago

FYI: this also works for us on 107.90.9 - I just ran an upgrade and no issues. So don't think there is a particular versioning problem here

artm commented 2 months ago

I still don't understand what exactly goes wrong. We clean up the database between installation attempts and eventually it gets filled with 90 tables, so the setup process is able to connect to and use the database. Sounds like the Database connection check failed Could not determine database type is another red herring.

artm commented 2 months ago

We have tried to start the new instance with the copy of the data volume from the existing installation, but we still get Cluster join: Join key is missing. Pending for 5 seconds with 60 seconds timeout and no ready artifactory.

Right before that error we see:

{"log_name":"tomcat-catalina.log","app":{"datetime":"2024-09-12T12:23:00.685Z","service":"tomcat","loglevel":"SEVERE","class":"org.glassfish.jersey.server.spring.SpringComponentProvider","message":"None or multiple beans found in Spring context for type class org.artifactory.rest.resource.federation.FederatedStatusResource, skipping the type."}}

could this be the actual error that causes bootstrap to go wrong?

gitta-jfrog commented 2 months ago

Hi @artm

I reviewed this thread and I have some ideas about what might be happening.

When the Artifactory service (jfrt) prints the log line "[jfrt ] [INFO ] [ead12e87bca3bcb5] [o.j.s.c.KeyUtils:108 ] [art-init ] - Cluster join: Join key is missing. Pending for 5 seconds with 60 seconds timeout", that can indicate jfrt is waiting for jfac (the Access service) to be up and running. (Yes, this message can be misleading; we are working to improve it.)

Following Artifactory 7.90, we separated Access into a different container (see Individual JVM for Access Service).

In some clusters, you might need to add a resource request/limit per container, in order to give a container more resources than the default (which might not be enough). Can you confirm you have some default resources per container in your cluster? You can also investigate access-service.log to see whether the Access service behaves as expected (up and running) while jfrt is trying to reach out and resolve the join key.
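
A minimal sketch of such a per-container setting in the Helm values, assuming the chart accepts a resources block under access in the same shape as artifactory.resources in the original override file. That key path and the sizes below are assumptions for illustration; check the values.yaml of your chart version before relying on them.

```yaml
# Sketch only: sizes are illustrative, not a recommendation.
access:
  resources:            # assumption: per-container resources are exposed under access.resources
    requests:
      memory: 1Gi
      cpu: "500m"
    limits:
      memory: 2Gi
      cpu: "1"
```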

Thanks!

Jacobsjohagra commented 1 month ago

I found a solution that worked for me and hopefully it can help you as well. For me the issue was in the database. I'm running psql. Here's what I did. First, reindex the database:

REINDEX DATABASE <your_database_name>;

Delete the access_db_check data: DELETE FROM access_db_check;

Reinstall Artifactory: helm install --namespace <artifactory> --create-namespace jfrog jfrog/jfrog-platform -f <values.yaml>

Momotoculteur commented 1 month ago

you might need to add Resource Request/Limit per container, in order to provide a container more resources than the default (which might not be enough) - Can you confirm you have some default resources per container in your cluster? You can also investigate access-service.log to see if the Access service behave as expected (up and running) while jfrt is trying to reach out and resolve the Join Key.

I have access.enabled set to false in my chart values. Do I need to enable this service? What does this service do?

Momotoculteur commented 1 month ago

Fixed this issue by enabling the following JFrog microservices: jfConnect, metadata, and access (the most important).
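
A minimal sketch of what that could look like in the values file. Only access.enabled is named explicitly in this thread; the metadata and jfconnect key names are assumptions, so compare them with the values.yaml of your chart version.

```yaml
# Sketch only: key names other than access.enabled are assumptions.
access:
  enabled: true       # reported above as the most important one
metadata:
  enabled: true
jfconnect:
  enabled: true
```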