ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Core] Enormous file handler and memory leaks on windows #47459

Open RBrearton opened 3 months ago

RBrearton commented 3 months ago

What happened + What you expected to happen

As part of a .cmd script on Windows, we're starting up ray with CALL ray start --head --num-cpus 64. Afterwards, the script spins up a service that makes use of ray and stays alive indefinitely.

We noticed that, after moving to ray, our resource usage on the machine massively increased. Most obviously, Ray seems to leak file handles at an alarming rate; we've reproduced this on a few Windows machines, all of which happen to be using 2.33.0. Ray is also leaking memory, albeit more slowly, and we have only had one machine taken down by this.

If it helps, we observed file handle leaks on Windows 10 Enterprise version 22H2, and file handle + memory leaks on Microsoft Windows Server 2019 Standard 10.0.17763. It's entirely possible that we would also observe memory leaks on the Windows Enterprise machines, but those machines are running ray with far fewer cores.

Details

Fortunately, we have extremely detailed system logs on one of the machines that exhibits these issues. I'm very happy to give as much information as required to help solve the issue, but our data is sensitive, so I'll have to clean everything first.

The machine from which the following data was extracted was the Windows Server machine. This data is associated with a single ray worker, of which there were 64 on that machine.

The dip in resource usage on September 1st at around 2pm came from a full restart of ray, required after the system ran out of memory.

For context, our usage of ray was extremely light during this period. I've included the working_set_peak plot primarily because it shows quite nicely when we're actually using ray. The little bump on Aug 27th was light usage between midnight and around 5am. It was then nearly completely unused until about 6:30am on Aug 31st.

(Four screenshots: resource-usage plots for a single worker over the period described above, including working_set_peak.)

Versions / Dependencies

2.33.0

Reproduction script

ray start --head --num-cpus 64

Issue Severity

High: It blocks me from completing my task.

anyscalesam commented 2 months ago

@mattip can you take a look here; recommend a quick call with @RBrearton if he is available...

RBrearton commented 2 months ago

I'm happy to help any way I can by call. I'll be available 00:30-14:30 PDT tomorrow (I'm guessing you're US west coast - either way, I'll be online 8:30am until 10:30pm UK time).

anyscalesam commented 2 months ago

@mattip did you get a chance to sync up with @RBrearton on this? Can you pull me into the meeting invite if y'all find a slot; would love to listen in...

mattip commented 2 months ago

We met last week. The problem reproduces even with 10 CPUs, so I should be able to debug locally. I have an idea for a simpler reproducer; hopefully I will try it out soon.

anyscalesam commented 2 months ago

Terrific - TY for the update @mattip

mattip commented 2 months ago

Here is a small reproducer that puts some data into the local store, uses it, and then just waits:

```python
import numpy as np
import ray
import time

ray.init(num_cpus=10, num_gpus=0)

@ray.remote
def f(data1, data2):
    return data1 + data2

data1 = np.ones((3000, 3000))
data2 = np.ones((3000, 3000))

data1_ref = ray.put(data1)
data2_ref = ray.put(data2)

futures = [f.remote(data1_ref, data2_ref) for i in range(4)]
print(np.sum(ray.get(futures)))  # 72000000.0

time.sleep(60 * 60 * 10)  # ten hours
```

After about 30 minutes, this is what I see in Task Manager: the number of handles allocated to 7 python processes continues to climb.

(Task Manager screenshot showing the handle counts of the python processes climbing.)
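Rather than eyeballing Task Manager, the climb can also be sampled programmatically. Here is a minimal sketch using psutil (the PID, sample count, and interval are placeholders, not values from the original investigation); `num_handles()` is Windows-only, so the sketch falls back to `num_fds()` on POSIX:

```python
import time

import psutil

def watch(pid, samples=5, interval=60):
    """Sample a process's open-handle count repeatedly and print the delta."""
    p = psutil.Process(pid)
    counts = []
    for _ in range(samples):
        # num_handles() counts all kernel handles on Windows;
        # num_fds() counts open file descriptors on POSIX.
        n = p.num_handles() if psutil.WINDOWS else p.num_fds()
        delta = "" if not counts else f" ({n - counts[-1]:+d})"
        print(f"pid={pid} handles={n}{delta}")
        counts.append(n)
        time.sleep(interval)
    return counts

if __name__ == "__main__":
    # Example: watch this process itself, three quick samples.
    watch(psutil.Process().pid, samples=3, interval=1)
```

On a leaking worker the printed deltas would stay persistently positive even while the process is idle.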

Looking at the command line for those processes, they are some of the default_worker processes. Using psutil, I can print the PID and cmdline for all the python processes with `default_worker` in their command line.

psutils output ``` {5368: 'd:\\CPython310\\python.exe ' '\\ray\\_private\\workers\\default_worker.py ' '--node-ip-address=127.0.0.1 --node-manager-port=53289 ' '--object-store-name=tcp://127.0.0.1:56236 ' '--raylet-name=tcp://127.0.0.1:58024 --redis-address=None ' '--metrics-agent-port=55081 --runtime-env-agent-port=54230 ' '--logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 ' '--runtime-env-agent-port=54230 --gcs-address=127.0.0.1:53722 ' '--session-name=session_2024-09-11_16-34-06_976666_13356 ' '--temp-dir=C:\\Users\\matti\\AppData\\Local\\Temp\\ray ' '--webui=127.0.0.1:8265 ' '--cluster-id=3f888bb561e4207771c7a1396a2fe6bf8289b855f95ca7efe587b73b ' '--startup-token=4 --worker-launch-time-ms=1726061649746 ' '--node-id=4def0359bc13fdde283ed81cf2cf71eef358a525c21aa687bccc4c93 ' '--runtime-env-hash=0', 5588: 'd:\\temp\\ray_venv\\Scripts\\python.exe ' '\\ray\\_private\\workers\\default_worker.py ' '--node-ip-address=127.0.0.1 --node-manager-port=53289 ' '--object-store-name=tcp://127.0.0.1:56236 ' '--raylet-name=tcp://127.0.0.1:58024 --redis-address=None ' '--metrics-agent-port=55081 --runtime-env-agent-port=54230 ' '--logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 ' '--runtime-env-agent-port=54230 --gcs-address=127.0.0.1:53722 ' '--session-name=session_2024-09-11_16-34-06_976666_13356 ' '--temp-dir=C:\\Users\\matti\\AppData\\Local\\Temp\\ray ' '--webui=127.0.0.1:8265 ' '--cluster-id=3f888bb561e4207771c7a1396a2fe6bf8289b855f95ca7efe587b73b ' '--startup-token=2 --worker-launch-time-ms=1726061649732 ' '--node-id=4def0359bc13fdde283ed81cf2cf71eef358a525c21aa687bccc4c93 ' '--runtime-env-hash=0', 6260: 'd:\\temp\\ray_venv\\Scripts\\python.exe ' '\\ray\\_private\\workers\\default_worker.py ' '--node-ip-address=127.0.0.1 --node-manager-port=53289 ' '--object-store-name=tcp://127.0.0.1:56236 ' '--raylet-name=tcp://127.0.0.1:58024 --redis-address=None ' '--metrics-agent-port=55081 --runtime-env-agent-port=54230 ' '--logging-rotate-bytes=536870912 
--logging-rotate-backup-count=5 ' '--runtime-env-agent-port=54230 --gcs-address=127.0.0.1:53722 ' '--session-name=session_2024-09-11_16-34-06_976666_13356 ' '--temp-dir=C:\\Users\\matti\\AppData\\Local\\Temp\\ray ' '--webui=127.0.0.1:8265 ' '--cluster-id=3f888bb561e4207771c7a1396a2fe6bf8289b855f95ca7efe587b73b ' '--startup-token=6 --worker-launch-time-ms=1726061651495 ' '--node-id=4def0359bc13fdde283ed81cf2cf71eef358a525c21aa687bccc4c93 ' '--runtime-env-hash=0', 6280: 'd:\\CPython310\\python.exe ' '\\ray\\_private\\workers\\default_worker.py ' '--node-ip-address=127.0.0.1 --node-manager-port=53289 ' '--object-store-name=tcp://127.0.0.1:56236 ' '--raylet-name=tcp://127.0.0.1:58024 --redis-address=None ' '--metrics-agent-port=55081 --runtime-env-agent-port=54230 ' '--logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 ' '--runtime-env-agent-port=54230 --gcs-address=127.0.0.1:53722 ' '--session-name=session_2024-09-11_16-34-06_976666_13356 ' '--temp-dir=C:\\Users\\matti\\AppData\\Local\\Temp\\ray ' '--webui=127.0.0.1:8265 ' '--cluster-id=3f888bb561e4207771c7a1396a2fe6bf8289b855f95ca7efe587b73b ' '--startup-token=0 --worker-launch-time-ms=1726061649718 ' '--node-id=4def0359bc13fdde283ed81cf2cf71eef358a525c21aa687bccc4c93 ' '--runtime-env-hash=0', 6664: 'd:\\temp\\ray_venv\\Scripts\\python.exe ' '\\ray\\_private\\workers\\default_worker.py ' '--node-ip-address=127.0.0.1 --node-manager-port=53289 ' '--object-store-name=tcp://127.0.0.1:56236 ' '--raylet-name=tcp://127.0.0.1:58024 --redis-address=None ' '--metrics-agent-port=55081 --runtime-env-agent-port=54230 ' '--logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 ' '--runtime-env-agent-port=54230 --gcs-address=127.0.0.1:53722 ' '--session-name=session_2024-09-11_16-34-06_976666_13356 ' '--temp-dir=C:\\Users\\matti\\AppData\\Local\\Temp\\ray ' '--webui=127.0.0.1:8265 ' '--cluster-id=3f888bb561e4207771c7a1396a2fe6bf8289b855f95ca7efe587b73b ' '--startup-token=5 --worker-launch-time-ms=1726061649758 ' 
'--node-id=4def0359bc13fdde283ed81cf2cf71eef358a525c21aa687bccc4c93 ' '--runtime-env-hash=0', 7464: 'd:\\CPython310\\python.exe ' '\\ray\\_private\\workers\\default_worker.py ' '--node-ip-address=127.0.0.1 --node-manager-port=53289 ' '--object-store-name=tcp://127.0.0.1:56236 ' '--raylet-name=tcp://127.0.0.1:58024 --redis-address=None ' '--metrics-agent-port=55081 --runtime-env-agent-port=54230 ' '--logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 ' '--runtime-env-agent-port=54230 --gcs-address=127.0.0.1:53722 ' '--session-name=session_2024-09-11_16-34-06_976666_13356 ' '--temp-dir=C:\\Users\\matti\\AppData\\Local\\Temp\\ray ' '--webui=127.0.0.1:8265 ' '--cluster-id=3f888bb561e4207771c7a1396a2fe6bf8289b855f95ca7efe587b73b ' '--startup-token=6 --worker-launch-time-ms=1726061651495 ' '--node-id=4def0359bc13fdde283ed81cf2cf71eef358a525c21aa687bccc4c93 ' '--runtime-env-hash=0', 8868: 'd:\\CPython310\\python.exe ' '\\ray\\_private\\workers\\default_worker.py ' '--node-ip-address=127.0.0.1 --node-manager-port=53289 ' '--object-store-name=tcp://127.0.0.1:56236 ' '--raylet-name=tcp://127.0.0.1:58024 --redis-address=None ' '--metrics-agent-port=55081 --runtime-env-agent-port=54230 ' '--logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 ' '--runtime-env-agent-port=54230 --gcs-address=127.0.0.1:53722 ' '--session-name=session_2024-09-11_16-34-06_976666_13356 ' '--temp-dir=C:\\Users\\matti\\AppData\\Local\\Temp\\ray ' '--webui=127.0.0.1:8265 ' '--cluster-id=3f888bb561e4207771c7a1396a2fe6bf8289b855f95ca7efe587b73b ' '--startup-token=5 --worker-launch-time-ms=1726061649758 ' '--node-id=4def0359bc13fdde283ed81cf2cf71eef358a525c21aa687bccc4c93 ' '--runtime-env-hash=0', 11920: 'd:\\temp\\ray_venv\\Scripts\\python.exe ' '\\ray\\_private\\workers\\default_worker.py ' '--node-ip-address=127.0.0.1 --node-manager-port=53289 ' '--object-store-name=tcp://127.0.0.1:56236 ' '--raylet-name=tcp://127.0.0.1:58024 --redis-address=None ' '--metrics-agent-port=55081 
--runtime-env-agent-port=54230 ' '--logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 ' '--runtime-env-agent-port=54230 --gcs-address=127.0.0.1:53722 ' '--session-name=session_2024-09-11_16-34-06_976666_13356 ' '--temp-dir=C:\\Users\\matti\\AppData\\Local\\Temp\\ray ' '--webui=127.0.0.1:8265 ' '--cluster-id=3f888bb561e4207771c7a1396a2fe6bf8289b855f95ca7efe587b73b ' '--startup-token=0 --worker-launch-time-ms=1726061649718 ' '--node-id=4def0359bc13fdde283ed81cf2cf71eef358a525c21aa687bccc4c93 ' '--runtime-env-hash=0', 12020: 'd:\\CPython310\\python.exe ' '\\ray\\_private\\workers\\default_worker.py ' '--node-ip-address=127.0.0.1 --node-manager-port=53289 ' '--object-store-name=tcp://127.0.0.1:56236 ' '--raylet-name=tcp://127.0.0.1:58024 --redis-address=None ' '--metrics-agent-port=55081 --runtime-env-agent-port=54230 ' '--logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 ' '--runtime-env-agent-port=54230 --gcs-address=127.0.0.1:53722 ' '--session-name=session_2024-09-11_16-34-06_976666_13356 ' '--temp-dir=C:\\Users\\matti\\AppData\\Local\\Temp\\ray ' '--webui=127.0.0.1:8265 ' '--cluster-id=3f888bb561e4207771c7a1396a2fe6bf8289b855f95ca7efe587b73b ' '--startup-token=3 --worker-launch-time-ms=1726061649738 ' '--node-id=4def0359bc13fdde283ed81cf2cf71eef358a525c21aa687bccc4c93 ' '--runtime-env-hash=0', 12176: 'd:\\temp\\ray_venv\\Scripts\\python.exe ' '\\ray\\_private\\workers\\default_worker.py ' '--node-ip-address=127.0.0.1 --node-manager-port=53289 ' '--object-store-name=tcp://127.0.0.1:56236 ' '--raylet-name=tcp://127.0.0.1:58024 --redis-address=None ' '--metrics-agent-port=55081 --runtime-env-agent-port=54230 ' '--logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 ' '--runtime-env-agent-port=54230 --gcs-address=127.0.0.1:53722 ' '--session-name=session_2024-09-11_16-34-06_976666_13356 ' '--temp-dir=C:\\Users\\matti\\AppData\\Local\\Temp\\ray ' '--webui=127.0.0.1:8265 ' 
'--cluster-id=3f888bb561e4207771c7a1396a2fe6bf8289b855f95ca7efe587b73b ' '--startup-token=3 --worker-launch-time-ms=1726061649738 ' '--node-id=4def0359bc13fdde283ed81cf2cf71eef358a525c21aa687bccc4c93 ' '--runtime-env-hash=0', 12444: 'd:\\temp\\ray_venv\\Scripts\\python.exe ' '\\ray\\_private\\workers\\default_worker.py ' '--node-ip-address=127.0.0.1 --node-manager-port=53289 ' '--object-store-name=tcp://127.0.0.1:56236 ' '--raylet-name=tcp://127.0.0.1:58024 --redis-address=None ' '--metrics-agent-port=55081 --runtime-env-agent-port=54230 ' '--logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 ' '--runtime-env-agent-port=54230 --gcs-address=127.0.0.1:53722 ' '--session-name=session_2024-09-11_16-34-06_976666_13356 ' '--temp-dir=C:\\Users\\matti\\AppData\\Local\\Temp\\ray ' '--webui=127.0.0.1:8265 ' '--cluster-id=3f888bb561e4207771c7a1396a2fe6bf8289b855f95ca7efe587b73b ' '--startup-token=1 --worker-launch-time-ms=1726061649723 ' '--node-id=4def0359bc13fdde283ed81cf2cf71eef358a525c21aa687bccc4c93 ' '--runtime-env-hash=0', 12712: 'd:\\CPython310\\python.exe ' '\\ray\\_private\\workers\\default_worker.py ' '--node-ip-address=127.0.0.1 --node-manager-port=53289 ' '--object-store-name=tcp://127.0.0.1:56236 ' '--raylet-name=tcp://127.0.0.1:58024 --redis-address=None ' '--metrics-agent-port=55081 --runtime-env-agent-port=54230 ' '--logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 ' '--runtime-env-agent-port=54230 --gcs-address=127.0.0.1:53722 ' '--session-name=session_2024-09-11_16-34-06_976666_13356 ' '--temp-dir=C:\\Users\\matti\\AppData\\Local\\Temp\\ray ' '--webui=127.0.0.1:8265 ' '--cluster-id=3f888bb561e4207771c7a1396a2fe6bf8289b855f95ca7efe587b73b ' '--startup-token=1 --worker-launch-time-ms=1726061649723 ' '--node-id=4def0359bc13fdde283ed81cf2cf71eef358a525c21aa687bccc4c93 ' '--runtime-env-hash=0', 12892: 'd:\\CPython310\\python.exe ' '\\ray\\_private\\workers\\default_worker.py ' '--node-ip-address=127.0.0.1 --node-manager-port=53289 ' 
'--object-store-name=tcp://127.0.0.1:56236 ' '--raylet-name=tcp://127.0.0.1:58024 --redis-address=None ' '--metrics-agent-port=55081 --runtime-env-agent-port=54230 ' '--logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 ' '--runtime-env-agent-port=54230 --gcs-address=127.0.0.1:53722 ' '--session-name=session_2024-09-11_16-34-06_976666_13356 ' '--temp-dir=C:\\Users\\matti\\AppData\\Local\\Temp\\ray ' '--webui=127.0.0.1:8265 ' '--cluster-id=3f888bb561e4207771c7a1396a2fe6bf8289b855f95ca7efe587b73b ' '--startup-token=2 --worker-launch-time-ms=1726061649732 ' '--node-id=4def0359bc13fdde283ed81cf2cf71eef358a525c21aa687bccc4c93 ' '--runtime-env-hash=0', 13476: 'd:\\temp\\ray_venv\\Scripts\\python.exe ' '\\ray\\_private\\workers\\default_worker.py ' '--node-ip-address=127.0.0.1 --node-manager-port=53289 ' '--object-store-name=tcp://127.0.0.1:56236 ' '--raylet-name=tcp://127.0.0.1:58024 --redis-address=None ' '--metrics-agent-port=55081 --runtime-env-agent-port=54230 ' '--logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 ' '--runtime-env-agent-port=54230 --gcs-address=127.0.0.1:53722 ' '--session-name=session_2024-09-11_16-34-06_976666_13356 ' '--temp-dir=C:\\Users\\matti\\AppData\\Local\\Temp\\ray ' '--webui=127.0.0.1:8265 ' '--cluster-id=3f888bb561e4207771c7a1396a2fe6bf8289b855f95ca7efe587b73b ' '--startup-token=4 --worker-launch-time-ms=1726061649746 ' '--node-id=4def0359bc13fdde283ed81cf2cf71eef358a525c21aa687bccc4c93 ' '--runtime-env-hash=0'} ```

There are 14 processes, even though I asked for 10 workers via ray.init and launched 4 remote tasks. Only 7 of them leak handles.
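The scan described above can be sketched with psutil roughly as follows (a sketch, not the exact script used in the investigation; the function name and the handle-count reporting are my additions):

```python
import psutil

def scan_ray_workers():
    """Map PID -> (handle count, cmdline) for default_worker processes."""
    workers = {}
    for p in psutil.process_iter(["pid", "name", "cmdline"]):
        try:
            cmdline = " ".join(p.info["cmdline"] or [])
            if "default_worker" in cmdline:
                # Windows exposes a total handle count; POSIX an fd count.
                n = p.num_handles() if psutil.WINDOWS else p.num_fds()
                workers[p.info["pid"]] = (n, cmdline)
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            continue  # process exited or is not accessible; skip it
    return workers

if __name__ == "__main__":
    for pid, (n, cmd) in sorted(scan_ray_workers().items()):
        print(f"{pid}: {n} handles  {cmd[:60]}")
```

Running this periodically while the reproducer sleeps would show which of the default_worker processes accumulate handles.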

I will keep digging.

mattip commented 2 months ago

It seems to be connected with putting objects into the store and calling a remote function, regardless of whether the remote uses the store and regardless of object size. Using this script (imports and flag definitions added so it runs standalone):

```python
import numpy as np
import ray
import time

ray.init(num_cpus=10, num_gpus=0)

# Toggle these to explore which combination triggers the leak.
call_remote = True
use_put = True
use_get = True

@ray.remote
def f(data1, data2):
    return data1 + data2

data1 = np.ones((3, 3))
data2 = np.ones((3, 3))

if use_put:
    data1_ref = ray.put(data1)
    data2_ref = ray.put(data2)

if call_remote:
    if use_put and use_get:
        futures = [f.remote(data1_ref, data2_ref) for i in range(4)]
    else:
        futures = [f.remote(data1, data2) for i in range(4)]
    print(np.sum(ray.get(futures)))  # 72.0

time.sleep(60 * 60 * 10)  # ten hours
```

I see that:

| call_remote | use_put | use_get | leak? |
|---|---|---|---|
| True | True | True | Yes |
| True | True | False | Yes |
| False | --- | --- | No |
| --- | False | --- | No |

Edit: add sleep to the script

mattip commented 2 months ago

There are 35 threads in the leaking process. Could someone provide a little guidance on which of those are connected with checking/managing the object store?

anyscalesam commented 1 month ago

@jjyao @rkooo567

rkooo567 commented 1 month ago

@mattip have you been investigating this issue? I'd love to talk and provide more guidance

mattip commented 1 month ago

@rkooo567 yes let's meet. I sent you a DM on the quansight/anyscale slack channel

mattip commented 1 month ago

From what I can tell, using this one-liner to find open file handles on Linux, running the test script on Linux does not leak handles:

```shell
for pid in $(ps xa | grep ray | cut -f1 -d" "); \
    do echo -n "$pid "; lsof -a -p $pid | wc -l; done
```

jjyao commented 1 week ago

Downgrading to P1 as it only affects Windows.

jjyao commented 1 week ago

@mattip Do you know for those leaked fds on windows, are they files, sockets or something else?

RBrearton commented 1 week ago

> Downgrading to P1 as it only affects Windows.

I appreciate that the majority of ray users are running ray on Linux, but it still seems surprising to me that a memory leak capable of taking down our beefy worker (>200GB of RAM) every other week is anything other than a P0 issue.

We would be serious about scaling up a large amount of compute using ray, but we can't justify this while it leaks so heavily.

Rightly or wrongly, I suspect that a large number of industry users will test ray on Windows. It isn't a great look when a typical ~100-core system leaks ~10,000,000 file handles per day.

mattip commented 1 week ago

> Do you know for those leaked fds on windows, are they files, sockets or something else?

From what I can tell using the Windows APIs to inspect the handles, they seem to be "something else". Threads, processes, and other kernel object types all fall into that class.
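For what it's worth, a rough way to split a process's handle count into files, sockets, and "something else" with psutil (a sketch under the assumption that anything psutil cannot attribute to a file or socket lands in the remainder; it cannot name the specific kernel object types the way dedicated Windows tooling can):

```python
import psutil

def classify_handles(pid=None):
    """Roughly bucket a process's handles into files, sockets, and other."""
    p = psutil.Process(pid)  # current process when pid is None
    total = p.num_handles() if psutil.WINDOWS else p.num_fds()
    files = len(p.open_files())
    try:
        sockets = len(p.net_connections())  # psutil >= 6.0
    except AttributeError:
        sockets = len(p.connections())      # older psutil
    # The remainder covers events, threads, process handles, and other
    # kernel objects that psutil cannot enumerate individually.
    return {"total": total, "files": files, "sockets": sockets,
            "other": total - files - sockets}

if __name__ == "__main__":
    print(classify_handles())
```

If the "other" bucket is the one growing, tools like Sysinternals Handle or Process Explorer would be needed to see the exact object types.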

jjyao commented 1 week ago

@mattip do you mind doing a bisect to see which version of Ray first exhibits the leak?

jjyao commented 1 week ago

Hi @RBrearton, it's an overall P1 issue, but it's P0 for @mattip (our Windows expert) to work on.