neo4j / helm-charts

[Bug]: Can't connect to DB when deploying in Kubernetes #156

Closed MaxTeiger closed 1 year ago

MaxTeiger commented 1 year ago

Contact Details

max.teiger@gmail.com

What happened?

I followed the documentation to deploy Neo4J using the Helm Chart + Kubernetes.

Everything seems to work fine until the last step. I open http://localhost:7475/browser and try to connect to the database using the password I set in my values.yaml file, but it takes a long time and I end up with an error:

ServiceUnavailable: WebSocket connection failure. Due to security constraints in your web browser, the reason for the failure is not available to this Neo4j Driver. Please use your browsers development console to determine the root cause of the failure. Common reasons include the database being unavailable, using the wrong connection URL or temporary network problems. If you have enabled encryption, ensure your browser is configured to trust the certificate Neo4j is configured to use. WebSocket `readyState` is: 3

After that failure in the UI, I get these logs in the console where I run the port-forward:

E0327 18:52:37.288063   39231 portforward.go:406] an error occurred forwarding 7687 -> 7687: error forwarding port 7687 to pod 58729ca7e766e6effb669ccba80ea819750d9ae8b074bff661b8dd68a12c1ca4, uid : failed to execute portforward in network namespace "/var/run/netns/cni-3e5161ef-b0cf-9dd2-cf84-433ca67fd05c": read tcp4 127.0.0.1:45936->127.0.0.1:7687: read: connection reset by peer
E0327 18:52:37.289441   39231 portforward.go:234] lost connection to pod

Meanwhile, the logs of the pod seem OK and don't show any error:

Changed password for user 'neo4j'. IMPORTANT: this change will only take effect if performed before the database is started for the first time.
2023-03-27 14:52:02.823+0000 INFO Command expansion is explicitly enabled for configuration
2023-03-27 14:52:02.828+0000 WARN Unrecognized setting. No declared setting with name: server.panic.shutdown_on_panic.
2023-03-27 14:52:02.924+0000 INFO Starting...
2023-03-27 14:52:05.736+0000 INFO This instance is ServerId{b6eb9040} (b6eb9040-44ba-46eb-85f8-1cb4c7e0f760)
2023-03-27 14:52:09.930+0000 INFO ======== Neo4j 5.5.0 ========
2023-03-27 14:52:17.868+0000 INFO Bolt enabled on 0.0.0.0:7687.
2023-03-27 14:52:22.328+0000 INFO Remote interface available at http://localhost:7474/
2023-03-27 14:52:22.332+0000 INFO id: F4D2410117C4D8F1DDCF1FB045D2821BB7E26D8961F763D6BDA3701B75922BDF
2023-03-27 14:52:22.333+0000 INFO name: system
2023-03-27 14:52:22.333+0000 INFO creationDate: 2023-03-27T14:52:12.331Z
2023-03-27 14:52:22.334+0000 INFO Started.

Here is a list of resources deployed using the chart in ArgoCD (I tried using ingress to avoid the problem but still occurs): image

I am using the chart neo4j/neo4j from the .tgz available on your release page.

image

Do you have any clue on what I did wrong ?

Thank you for your time ! Have a great day

Chart Name

Cluster

Chart Version

5.5.0

Environment

Google Cloud Platform

harshitsinghvi22 commented 1 year ago

Hi @MaxTeiger

I just tried the documentation steps on GCP and everything seems to be working just fine.

Neo4j Browser is loading as it should. However, I did notice a bug in the logs, which will be fixed in the upcoming release, but that should not be the reason for your issue.

I have used the .tgz of 5.5.0 from the github releases section

Thanks, Harshit

MaxTeiger commented 1 year ago

And did you manage to connect to the database from the UI using localhost port forward ?

harshitsinghvi22 commented 1 year ago

yes, i was able to connect to the database from the UI using port-forward

MaxTeiger commented 1 year ago

Ok thanks, I'll give it another shot this afternoon.

Did you use 0.5cpu & 1Gb memory for the resources of the pod ?

harshitsinghvi22 commented 1 year ago

@MaxTeiger you can use the config below to install the enterprise edition of the Neo4j Helm chart. By default it installs a load balancer for you, so you would not need port forwarding.

neo4j:
  name: my-standalone
  resources:
    cpu: "0.5"
    memory: "2Gi"

  # Uncomment to set the initial password
  password: "my-initial-password"

  # Uncomment to use enterprise edition
  edition: "enterprise"
  acceptLicenseAgreement: "yes"

volumes:
  data:
    mode: "dynamic"
    dynamic:
      # In GKE;
      # * premium-rwo provisions SSD disks (recommended)
      # * standard-rwo provisions balanced SSD-backed disks
      # * standard provisions HDD disks
      storageClassName: premium-rwo

yes, in my previous reply i have tried the config as mentioned in the documentation

neo4j:
  name: my-standalone
  resources:
    cpu: "0.5"
    memory: "2Gi"

  # Uncomment to set the initial password
  #password: "my-initial-password"

  # Uncomment to use enterprise edition
  #edition: "enterprise"
  #acceptLicenseAgreement: "yes"

volumes:
  data:
    mode: "dynamic"
    dynamic:
      # In GKE;
      # * premium-rwo provisions SSD disks (recommended)
      # * standard-rwo provisions balanced SSD-backed disks
      # * standard provisions HDD disks
      storageClassName: premium-rwo

ojhughes commented 1 year ago

It looks like the LoadBalancer service is not being created? I suspect this is an issue because ArgoCD uses helm template to install the k8s resources. You might need to create the LoadBalancer service manually in this case

MaxTeiger commented 1 year ago

As I use the community version, isn't it normal that the load balancer service isn't created ?

ojhughes commented 1 year ago

I just tried, and port forwarding definitely works for me:

kubectl port-forward services/standalone tcp-bolt tcp-http

ojhughes commented 1 year ago

Actually I did see the same error as you after a minute and had to restart the port forward. I would recommend just creating a separate service for the DB

MaxTeiger commented 1 year ago

Thank you for the tip. I just created another service targeting port 7687 (tcp-bolt).

Here is its definition in YAML:

apiVersion: v1
kind: Service
metadata:
  labels:
    app: neo4j
    argocd.argoproj.io/instance: staging-neo4j
    helm.neo4j.com/instance: neo4j
    helm.neo4j.com/neo4j.name: neo4j
    helm.neo4j.com/service: default
  name: neo4j-bolt
  namespace: default
spec:
  ports:
    - name: tcp-bolt
      port: 7687
      protocol: TCP
      targetPort: 7687
  publishNotReadyAddresses: false
  selector:
    app: neo4j
    helm.neo4j.com/instance: neo4j
  type: ClusterIP

I then run the port forwards (in different terminals, of course) and get the following output:

❯ kubectl port-forward svc/neo4j 7475:7474
Forwarding from 127.0.0.1:7475 -> 7474
Forwarding from [::1]:7475 -> 7474
Handling connection for 7475
Handling connection for 7475
Handling connection for 7475
Handling connection for 7475
...
❯ kubectl port-forward svc/neo4j-bolt 7687:7687
Forwarding from 127.0.0.1:7687 -> 7687
Forwarding from [::1]:7687 -> 7687
Handling connection for 7687
Handling connection for 7687
Handling connection for 7687
Handling connection for 7687
Handling connection for 7687
Handling connection for 7687
Handling connection for 7687
E0329 12:10:06.512509   26712 portforward.go:406] an error occurred forwarding 7687 -> 7687: error forwarding port 7687 to pod 95d188b7f0e1f9332e4a7cf00707cb743e32fd57b7f8f43de39cd9d827122d88, uid : failed to execute portforward in network namespace "/var/run/netns/cni-4d7497c8-5e94-bc6b-2bf2-5605987ad804": read tcp4 127.0.0.1:41318->127.0.0.1:7687: read: connection reset by peer
Handling connection for 7687
E0329 12:10:06.513054   26712 portforward.go:346] error creating error stream for port 7687 -> 7687: use of closed network connection
E0329 12:10:06.513391   26712 portforward.go:234] lost connection to pod

After 1 or 2 minutes the port forward stops and I am disconnected from the DB.

But now I have logs on the pod (I had time to run the :clear and :help commands on the DB before it crashed):

ERROR [bolt-357] Terminating connection due to unexpected error
org.neo4j.bolt.protocol.error.streaming.BoltStreamingWriteException: Failed to finalize batch: Cannot write result response
at org.neo4j.bolt.protocol.common.transaction.result.ResultHandler.onFinish(ResultHandler.java:116) ~[neo4j-bolt-5.5.0.jar:5.5.0]
at org.neo4j.bolt.protocol.common.fsm.AbstractStateMachine.after(AbstractStateMachine.java:126) ~[neo4j-bolt-5.5.0.jar:5.5.0]
at org.neo4j.bolt.protocol.common.fsm.AbstractStateMachine.process(AbstractStateMachine.java:99) ~[neo4j-bolt-5.5.0.jar:5.5.0]
at org.neo4j.bolt.protocol.common.connector.connection.AtomicSchedulingConnection.lambda$submit$4(AtomicSchedulingConnection.java:112) ~[neo4j-bolt-5.5.0.jar:5.5.0]
at org.neo4j.bolt.protocol.common.connector.connection.AtomicSchedulingConnection.executeJob(AtomicSchedulingConnection.java:335) ~[neo4j-bolt-5.5.0.jar:5.5.0]
at org.neo4j.bolt.protocol.common.connector.connection.AtomicSchedulingConnection.doExecuteJobs(AtomicSchedulingConnection.java:269) ~[neo4j-bolt-5.5.0.jar:5.5.0]
at org.neo4j.bolt.protocol.common.connector.connection.AtomicSchedulingConnection.executeJobs(AtomicSchedulingConnection.java:211) ~[neo4j-bolt-5.5.0.jar:5.5.0]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539) ~[?:?]
at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
at java.lang.Thread.run(Thread.java:833) ~[?:?]
Caused by: io.netty.channel.StacklessClosedChannelException
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(Object, ChannelPromise)(Unknown Source) ~[netty-transport-4.1.87.Final.jar:4.1.87.Final]
Caused by: io.netty.channel.unix.Errors$NativeIoException: writevAddresses(..) failed: Connection reset by peer

Do you have any clue ? Do not hesitate if you need more information Thank you 🙂
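To pin down when the forward described above actually drops, a small probe can poll the local end of the port-forward. This is a minimal sketch, not part of the chart or the documentation: the function names `local_port_is_listening` and `watch` are hypothetical, and the defaults (127.0.0.1, port 7687) are taken from the port-forward output in this thread. It only reports whether something is still accepting TCP connections locally, which is enough to notice that `kubectl port-forward` has exited after "lost connection to pod". It does not prove that the pod side is healthy.

```python
import socket
import time


def local_port_is_listening(host: str = "127.0.0.1", port: int = 7687,
                            timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened.

    This only checks the local listener (here, the kubectl port-forward
    process); it says nothing about whether Bolt on the pod answers.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


def watch(host: str = "127.0.0.1", port: int = 7687,
          interval: float = 5.0, checks: int = 60) -> None:
    """Print a timestamped status line every `interval` seconds."""
    for _ in range(checks):
        state = "up" if local_port_is_listening(host, port) else "DOWN"
        print(f"{time.strftime('%H:%M:%S')} {host}:{port} {state}")
        time.sleep(interval)
```

Calling `watch()` in a third terminal while the two port-forwards run gives a timestamp for the moment the local listener disappears, which can then be matched against the pod logs.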

ojhughes commented 1 year ago

Hi @MaxTeiger this seems to be an issue that others have encountered

Out of interest, are you using an M1 Mac?

ojhughes commented 1 year ago

~This is strange.. try entering http://127.0.0.1:7474/browser/ instead of http://localhost:7474/browser/. That seems to work for me~ Nope it still crashes after some time

MaxTeiger commented 1 year ago

Using 127.0.0.1 seems to work for me: after 5 minutes, still no crash.

I am using an Intel i9 Mac.

I created two ingresses, one for the DB service I created and one for the HTTP service. Once I have tested them and set up the firewall correctly for my LB, I will let you know whether this solves the problem 🙂

Thank you
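On the localhost versus 127.0.0.1 observation above: one difference between the two is that `localhost` may resolve to both an IPv6 and an IPv4 address, and the port-forward output earlier in the thread shows listeners on both `127.0.0.1` and `[::1]`. The thread does not establish that this is the cause (one participant still saw the forward crash with 127.0.0.1), so the following is only a sketch for checking what a hostname resolves to on a given machine. The helper name `resolved_addresses` is hypothetical.

```python
import socket
from typing import List


def resolved_addresses(host: str, port: int = 7687) -> List[str]:
    """Return the distinct IP addresses `host` resolves to, in resolver order."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    addresses: List[str] = []
    for _family, _type, _proto, _canonname, sockaddr in infos:
        ip = sockaddr[0]  # first element is the address for IPv4 and IPv6
        if ip not in addresses:
            addresses.append(ip)
    return addresses


if __name__ == "__main__":
    # Output depends on the local resolver configuration.
    print("localhost ->", resolved_addresses("localhost"))
    print("127.0.0.1 ->", resolved_addresses("127.0.0.1"))
```

If `localhost` lists `::1` first, a client may try the IPv6 listener before the IPv4 one, whereas a literal `127.0.0.1` always uses IPv4.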

harshitsinghvi22 commented 1 year ago

@MaxTeiger we have introduced the loadbalancer for community as well now with the following PR https://github.com/neo4j/helm-charts/pull/167

We will update the documentation once the above is released this week.

thanks for bringing this to our notice. Hope the above PR resolves your issue and allows you to connect to Neo4j

harshitsinghvi22 commented 1 year ago

closing this now as the documentation is updated.

luchillo17 commented 1 year ago

I can't find anything in the docs about running that load balancer behind an Ingress. For example, if I want my DB exposed at <hostname>/neo4j or <hostname>/db, I can use the rewrite/redirect feature of nginx to render the Neo4j browser properly. When connecting, however, the server's advertised bolt address uses localhost without any path, and I don't know what to do here:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: neo4j-ingress-http
  namespace: neo4j
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /$2
    # nginx.ingress.kubernetes.io/configuration-snippet: |
    #   rewrite ^(/neo4j)$ $1/ redirect;
spec:
  rules:
    - http:
        paths:
          - pathType: Prefix
            path: /neo4j/http(/|$)(.*)
            backend:
              service:
                name: neo4j-db-lb-neo4j
                port:
                  name: http
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: neo4j-ingress-bolt
  namespace: neo4j
  annotations:
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    nginx.ingress.kubernetes.io/force-ssl-redirect: "true"
    nginx.ingress.kubernetes.io/use-regex: "true"
    nginx.ingress.kubernetes.io/rewrite-target: /$2
    nginx.ingress.kubernetes.io/configuration-snippet: |
      rewrite ^(/neo4j.7687)$ $1/ redirect;
spec:
  rules:
    - http:
        paths:
          - pathType: Prefix
            path: /neo4j.7687(/|$)(.*)
            backend:
              service:
                name: neo4j-db-lb-neo4j
                port:
                  name: tcp-bolt

To be precise the bolt port does connect properly over HTTP, but it then advertises a web socket URL under ws://localhost:7687 which is both missing the path & adding a port, regardless of what I do in Ingress, so I can't find a way to connect to it...
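The advertised address described above can be inspected directly. As far as I know, the Neo4j HTTP endpoint returns a small JSON discovery document at its root, and Neo4j Browser reads the Bolt address from it. The sketch below parses such a document and splits each Bolt URI into its parts. The field names `bolt_direct` and `bolt_routing` are my recollection of that response and should be treated as assumptions, `SAMPLE` is an illustrative document rather than output captured from this deployment, and `advertised_bolt_endpoints` is a hypothetical helper.

```python
import json
from urllib.parse import urlsplit

# Illustrative discovery document (assumed shape, not captured output).
SAMPLE = (
    '{"bolt_direct": "bolt://localhost:7687", '
    '"bolt_routing": "neo4j://localhost:7687"}'
)


def advertised_bolt_endpoints(discovery_json: str) -> dict:
    """Split each advertised Bolt URI into scheme, host, port, and path."""
    doc = json.loads(discovery_json)
    endpoints = {}
    for key in ("bolt_direct", "bolt_routing"):  # assumed field names
        uri = doc.get(key)
        if uri is None:
            continue
        parts = urlsplit(uri)
        endpoints[key] = {
            "scheme": parts.scheme,
            "host": parts.hostname,
            "port": parts.port,
            "path": parts.path,  # empty string when the URI has no path
        }
    return endpoints


if __name__ == "__main__":
    for name, parts in advertised_bolt_endpoints(SAMPLE).items():
        print(name, parts)
```

The empty `path` matches what the comment reports: the advertised address is a host and port only. The setting that I believe controls it, `server.bolt.advertised_address`, also takes a `host:port` value, so a path prefix such as `/neo4j` has nowhere to go. Under that assumption, exposing Bolt on its own hostname or port is likely to be easier than a path-based Ingress rule.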