ray-project / kuberay

A toolkit to run Ray applications on Kubernetes
Apache License 2.0

[Bug] Unable to open a custom application port #2203

Closed anovv closed 5 days ago

anovv commented 6 days ago

KubeRay Component

ray-operator, Others

What happened + What you expected to happen

I have an actor listening on a TCP port for incoming connections from actors on other nodes; they locate it by its node-ip:port address (ZMQ transport). When running locally (a local Ray instance with all actors on the same node) everything works, but it fails when running on Kubernetes.

I assumed the problem was that the port was not exposed, so I added the following to the Helm chart:

worker:
  ports:
    - containerPort: 1234
      name: 'test-tcp'

Unfortunately, this did not fix the problem.

What am I doing wrong?

Reproduction script

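I ran a sample test:

- Launch a simple HTTP server in a pod:

>>> from http.server import HTTPServer, SimpleHTTPRequestHandler
>>> httpd = HTTPServer(('localhost', 1234), SimpleHTTPRequestHandler)
>>> httpd.serve_forever()

- Run in the same pod (in a different shell, since the previous one is blocked):

>>> import requests
>>> requests.get('http://localhost:1234').text
'<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">\n...Directory listing for /...'

This works fine.
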
>>> requests.get('http://10.244.3.21:1234').text
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.10/site-packages/urllib3/connection.py", line 174, in _new_conn
    conn = connection.create_connection(
  File "/home/ray/anaconda3/lib/python3.10/site-packages/urllib3/util/connection.py", line 95, in create_connection
    raise err
  File "/home/ray/anaconda3/lib/python3.10/site-packages/urllib3/util/connection.py", line 85, in create_connection
    sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused

This fails.

You can see that the server works (the localhost-based request succeeds), but using node_ip + port fails.

Using node_ip + port from another node/pod fails as well.

Anything else

I'm running minikube on an M2 Mac, with KubeRay 1.1.1 and Ray 2.22.0.

anovv commented 6 days ago

A separate issue: setting worker.ports on the head node spec overrides all existing ports (dashboard, prometheus, client, etc.) instead of appending to them.

andrewsykim commented 6 days ago

Just to clarify, when you say node_ip + port, you are referring to the pod IP and not the Kubernetes node, right?

anovv commented 6 days ago

Correct, pod_ip == node_ip here, not the Kubernetes node.

andrewsykim commented 6 days ago

Your example code is binding the server to localhost:

>>> from http.server import HTTPServer, SimpleHTTPRequestHandler
>>> httpd = HTTPServer(('localhost', 1234), SimpleHTTPRequestHandler)
>>> httpd.serve_forever()

Which explains why you can only reach it from localhost. Can you run it with 0.0.0.0 and test it again?
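
A minimal sketch of the suggested change, using the same standard-library server as above. The helper name `serve_in_background` and the use of port 0 (letting the OS pick a free port) are illustrative assumptions, not part of the original report; in the pod you would pass 1234 instead.

```python
import http.client
import threading
from http.server import HTTPServer, SimpleHTTPRequestHandler


def serve_in_background(host: str, port: int) -> HTTPServer:
    """Start a simple HTTP server on (host, port) in a daemon thread.

    Hypothetical helper for illustration only.
    """
    httpd = HTTPServer((host, port), SimpleHTTPRequestHandler)
    threading.Thread(target=httpd.serve_forever, daemon=True).start()
    return httpd


# '0.0.0.0' binds to all IPv4 interfaces, so the server accepts connections
# addressed to the pod IP as well as to localhost. Binding to 'localhost'
# accepts loopback connections only.
httpd = serve_in_background("0.0.0.0", 0)  # port 0 is an assumption for the demo
bound_host, bound_port = httpd.server_address[:2]

# Fetch the directory listing over loopback to confirm the server is up.
conn = http.client.HTTPConnection("127.0.0.1", bound_port, timeout=5)
conn.request("GET", "/")
status = conn.getresponse().status
conn.close()

httpd.shutdown()
httpd.server_close()
```

With the server bound this way, the same request against the pod IP (10.244.3.21 in the report) should no longer be refused, assuming nothing else such as a NetworkPolicy blocks pod-to-pod traffic.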

anovv commented 6 days ago

Thanks @andrewsykim, this solved the issue; it works now. I also found that you don't even need to add ports to worker.ports: any port works without any configuration on the Kubernetes side.
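
One way to check this from another pod is a plain TCP probe. The sketch below uses only the standard library; `can_connect` is a hypothetical helper name, and the throwaway local listener stands in for the real actor port so the snippet runs anywhere.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "connection refused", timeouts, and unreachable hosts.
        return False


# Demo against a throwaway listener bound to all interfaces.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("0.0.0.0", 0))  # port 0: let the OS pick a free port
listener.listen(1)
port = listener.getsockname()[1]

reachable = can_connect("127.0.0.1", port)
listener.close()
```

From a second pod, the equivalent call would be `can_connect("10.244.3.21", 1234)`, using the pod IP and port from the report.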

kevin85421 commented 6 days ago

@anovv Can I close this issue?

anovv commented 5 days ago

@kevin85421, closed, thanks!