RBrearton opened this issue 3 months ago
@mattip can you take a look here; recommend a quick call with @RBrearton if he is available...
I'm happy to help any way I can, including a call. I'll be available 00:30-14:30 PDT tomorrow (I'm guessing you're on the US west coast; either way, I'll be online 8:30am until 10:30pm UK time).
@mattip did you get a chance to sync up with @RBrearton on this? Can you pull me into the meeting invite if y'all find a slot; would love to listen in...
We met last week. The problem reproduces even with 10 CPUs, so I should be able to debug locally. I have an idea for a simpler reproducer; hopefully I will try it out soon.
Terrific - TY for the update @mattip
Here is a small reproducer that puts some data into the local object store, uses it, and then just waits:
```python
import numpy as np
import ray
import time

ray.init(num_cpus=10, num_gpus=0)

@ray.remote
def f(data1, data2):
    return data1 + data2

data1 = np.ones((3000, 3000))
data2 = np.ones((3000, 3000))
data1_ref = ray.put(data1)
data2_ref = ray.put(data2)
futures = [f.remote(data1_ref, data2_ref) for i in range(4)]
print(np.sum(ray.get(futures)))  # 72000000.0
time.sleep(60 * 60 * 10)  # ten hours
```
After about 30 minutes, this is what I see in Task Manager: the number of handles allocated to 7 python processes continues to climb.
Looking at the command line for those processes, they are some of the default_worker processes. Using psutil, I can print out the PID and the cmdline for all the python processes with `default_worker` in them.
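Something along these lines should do it (a rough sketch using psutil's process iteration; the exact snippet used in this investigation isn't shown in the thread):

```python
# Hypothetical sketch: list PID and command line of Ray's default_worker
# processes using psutil. Not the exact code used in this investigation.
import psutil

for proc in psutil.process_iter(["pid", "name", "cmdline"]):
    cmdline = proc.info["cmdline"] or []
    name = (proc.info["name"] or "").lower()
    if "python" in name and any("default_worker" in arg for arg in cmdline):
        print(proc.info["pid"], " ".join(cmdline))
```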
There are 14 such processes, even though I asked for 10 workers via ray.init and launched 4 remote tasks. Only 7 of them leak handles.
I will keep digging.
It seems to be connected with putting objects into the store and calling a remote function, regardless of whether the remote function uses the store and regardless of the objects' size. Using this script:
```python
import numpy as np
import ray
import time

# Flags toggled between runs to produce the table below.
call_remote = True
use_put = True
use_get = True

ray.init(num_cpus=10, num_gpus=0)  # init assumed, matching the earlier reproducer

@ray.remote
def f(data1, data2):
    return data1 + data2

data1 = np.ones((3, 3))
data2 = np.ones((3, 3))
if use_put:
    data1_ref = ray.put(data1)
    data2_ref = ray.put(data2)
if call_remote:
    if use_put and use_get:
        futures = [f.remote(data1_ref, data2_ref) for i in range(4)]
    else:
        futures = [f.remote(data1, data2) for i in range(4)]
    print(np.sum(ray.get(futures)))  # 72.0 for these 3x3 arrays
time.sleep(60 * 60 * 10)  # ten hours
```
I see that:

| call_remote | use_put | use_get | leak? |
|---|---|---|---|
| True | True | True | Yes |
| True | True | False | Yes |
| False | --- | --- | No |
| --- | False | --- | No |
Edit: added sleep to the script.
There are 35 threads in the process that is leaking. Could someone provide a little guidance on which of those are connected with checking/managing the object store?
@jjyao @rkooo567
@mattip have you been investigating this issue? I'd love to talk and provide more guidance.
@rkooo567 yes let's meet. I sent you a DM on the quansight/anyscale slack channel
From what I can tell, using this one-liner to find open file handles on Linux, running the test script on Linux does not leak handles:
```sh
$ for pid in $(ps xa | grep ray | cut -f1 -d" "); \
  do echo -n "$pid "; lsof -a -p $pid | wc -l; done
```
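A rough cross-platform equivalent using psutil (a sketch, not the exact tooling used in this thread) would count open fds on Linux and open handles on Windows for each Ray-related process:

```python
# Hypothetical sketch: count open fds (POSIX) or handles (Windows) for each
# process whose command line mentions "ray". Not the exact code used here.
import psutil

for proc in psutil.process_iter(["pid", "cmdline"]):
    cmdline = " ".join(proc.info["cmdline"] or [])
    if "ray" not in cmdline:
        continue
    try:
        count = proc.num_handles() if psutil.WINDOWS else proc.num_fds()
        print(proc.info["pid"], count, cmdline[:80])
    except (psutil.AccessDenied, psutil.NoSuchProcess):
        pass
```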
Downgrading to P1 as it only affects Windows.
@mattip Do you know, for those leaked fds on Windows, are they files, sockets, or something else?
> Downgrading to P1 as it only affects Windows.
I appreciate that the majority of Ray users are running Ray on Linux, but it still seems surprising to me that a memory leak capable of taking down our beefy worker (>200 GB of RAM) every other week is anything other than a P0 issue.

We would seriously consider scaling up a large amount of compute using Ray, but we can't justify this while it leaks so heavily.

Rightly or wrongly, I suspect that a large number of industry users will test Ray on Windows. It isn't a great look when a typical ~100-core system leaks ~10,000,000 file handles per day.
> Do you know, for those leaked fds on Windows, are they files, sockets, or something else?
From what I can tell using the Windows APIs to inspect the handles, they seem to be "something else": threads, processes, and other object types all fall into that class.
@mattip do you mind doing a bisect to see which version of Ray starts to have the leak?
Hi @RBrearton, it's an overall P1 issue, but it's P0 for @mattip (our Windows expert) to work on.
What happened + What you expected to happen
As part of a .cmd script on Windows, we're starting up Ray with `CALL ray start --head --num-cpus 64`. Afterwards, the script spins up a service that makes use of Ray and stays alive indefinitely.

We noticed that, after moving to Ray, our resource usage on the machine massively increased. Most obvious is that Ray seems to leak file handles at an alarming rate; we've reproduced this on a few Windows machines, all of which happen to be using 2.33.0. Ray is also leaking memory, albeit more slowly, and we have only had one machine taken down by this.
If it helps, we observed file handle leaks on Windows 10 Enterprise version 22H2, and file handle + memory leaks on Microsoft Windows Server 2019 Standard 10.0.17763. It's entirely possible that we would observe memory leaks on the Windows Enterprise machines as well, but those machines are running Ray with far fewer cores.
Details
Fortunately, we have extremely detailed system logs on one of the machines that exhibits these issues. I'm very happy to give as much information as required to help solve the issue, but our data is sensitive, so I'll have to clean everything first.
The machine from which the following data was extracted was the Windows Server machine. This data is associated with a single Ray worker, of which there were 64 on that machine.
The dip in resource usage on September 1st at around 2pm came from a full restart of Ray, required after the system ran out of memory.
For context, our usage of Ray was extremely light during this period. I've included the `working_set_peak` plot primarily because it shows quite nicely when we're actually using Ray. The little bump on Aug 27th was light usage between midnight and around 5am. It was then nearly completely unused until about 6:30am on Aug 31st.

Versions / Dependencies
2.33.0
Reproduction script
```sh
ray start --head --num-cpus 64
```
Issue Severity
High: It blocks me from completing my task.