Open · vishal00100 opened this issue 3 years ago
This looks bad. Can you check if this is reproducible when you purely keep calling ray.put?
Sure. I'll let this code run overnight and post results.
```python
import ray
import numpy as np
from tqdm import tqdm
import time

from ray.util.metrics import Count, Gauge


def run():
    put_count = Count(name="repro_put_count")
    put_gauge = Gauge(name="repro_put_gauge")
    for ridx in tqdm(range(1000000000), desc="Main loop"):
        stime = time.time()
        object_ref = ray.put(np.zeros((3, 3000, 5000, 1)))
        put_gauge.record((time.time() - stime) * 1000)
        put_count.record(1.)


if __name__ == '__main__':
    print(ray.init(_metrics_export_port=58391))
    print(ray.cluster_resources())
    run()
```
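As a side note (not part of the original exchange), the payload in this repro is sizable, which explains the tens-of-milliseconds baseline: a `(3, 3000, 5000, 1)` float64 array is 360 MB per `ray.put` call. A quick check:

```python
import numpy as np

# Size of the payload used in the repro script above.
arr = np.zeros((3, 3000, 5000, 1))  # float64 by default
print(arr.nbytes)        # 3 * 3000 * 5000 * 1 elements * 8 bytes = 360000000
print(arr.nbytes / 1e6)  # 360.0 MB per ray.put call
```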
The purely-`ray.put` code seems fine.
Hmm, interesting. Can you then try ray.put in a worker? (Same code, but you run ray.put inside an actor class and keep calling it.)
Sure.
```python
import ray
import numpy as np
from tqdm import tqdm
import time

from ray.util.metrics import Count, Gauge


@ray.remote
class Actor1:
    def __init__(self):
        self.put_count = Count(name="repro_put_count")
        self.put_gauge = Gauge(name="repro_put_gauge")

    def do_stuff(self):
        stime = time.time()
        object_ref = ray.put(np.zeros((3, 3000, 5000, 1)))
        self.put_gauge.record((time.time() - stime) * 1000)
        self.put_count.record(1.)
        return object_ref


def run():
    actor1 = Actor1.remote()
    for ridx in tqdm(range(1000000000), desc="Main loop"):
        object_ref = ray.get(actor1.do_stuff.remote())


if __name__ == '__main__':
    print(ray.init(_metrics_export_port=58391))
    print(ray.cluster_resources())
    run()
```
I ran this code for ~20 hours and I see similar spikes here.
P.S.: There are a few periods with no metrics. That was because the Prometheus server was not able to access the metrics during that time. However, the code above was running the entire time.
Hmm interesting. We will take a look at what might be the cause soon!
Hey @vishal00100, thanks again for trying all of this. I'd like to ask you one last thing: is it possible to see if this happens in other versions? Which version are you using?
No problem, and thanks for the quick response. I'm using 1.1.0. Which version would you like me to try?
@vishal00100 Can you try the nightly wheels (https://docs.ray.io/en/master/installation.html)? I'd like to know if this was a regression as well. (Sorry for the many requests, haha.)
Ideally, it'll be great if you can test them with
No worries. I will post results as I gather them.
I ran with the nightly wheel for a few hours.
```python
>>> ray.__version__
'1.2.0.dev0'
```
It's been running for ~3 hours, but I think it's quite clear that this is trending in the same direction as version 1.1.0. I'll switch to 1.0.0 and run it again.
@rkooo567 Looks like 1.0.0 doesn't include support for custom metrics. Is there a workaround you can suggest? I can keep track of metrics in my code if necessary.
Ah, yeah. It was introduced after that version. Maybe you should collect manually. Sorry for the inconvenience 😢
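One lightweight way to collect manually (a sketch only, not something suggested in the thread; the class name and file path here are arbitrary) is to append each sample to a CSV file, which keeps the per-record cost constant no matter how many samples have been written:

```python
import csv
import time


class FileGauge:
    """Minimal file-backed stand-in for ray.util.metrics.Gauge: one CSV row per sample."""

    def __init__(self, name, path=None):
        self.name = name
        self.path = path or f'/tmp/{name}.csv'
        self.start = time.time()
        with open(self.path, 'w', newline='') as f:
            csv.writer(f).writerow(['name', 'time', 'value'])

    def record(self, value):
        # Appending a single row is O(1) per sample, so recording stays cheap
        # even after millions of samples.
        with open(self.path, 'a', newline='') as f:
            csv.writer(f).writerow([self.name, int(time.time() - self.start), value])


g = FileGauge('repro_put_gauge')
g.record(23.5)
```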
@vishal00100 How's this going now? I am thinking of taking a look at it in the next 2 weeks, once you can verify this.
My apologies. I wasn't able to spend time on this sooner.
I modified my code to keep track of timing metrics for older versions of Ray.
```python
import ray
import numpy as np
from tqdm import tqdm
import time

try:
    from ray.util.metrics import Count, Gauge
except ImportError:
    process_start_time = time.time()
    print(f"Count and Gauge are not available in ray version {ray.__version__}.")
    import pandas as pd

    class Gauge:
        def __init__(self, name):
            self.name = name
            self.file_name = f'/tmp/{self.name}.pkl'
            pd.DataFrame([], columns=['name', 'time', 'value']).to_pickle(self.file_name)

        def record(self, value: float):
            # We read the DataFrame from the file each time so we don't add the
            # overhead of maintaining a large DataFrame in memory.
            df = pd.read_pickle(self.file_name)
            # Append one row: metric name, elapsed seconds since process start, value.
            df.loc[len(df)] = [self.name, int(time.time() - process_start_time), value]
            df.to_pickle(self.file_name)

    class Count(Gauge):
        # Same file-backed recording as Gauge; only the metric name differs.
        pass


@ray.remote
class Actor1:
    def __init__(self):
        self.put_count = Count(name="repro_put_count")
        self.put_gauge = Gauge(name="repro_put_gauge")

    def do_stuff(self):
        stime = time.time()
        object_ref = ray.put(np.zeros((3, 3000, 5000, 1)))
        self.put_gauge.record((time.time() - stime) * 1000)
        self.put_count.record(1.)
        return object_ref


def run():
    actor1 = Actor1.remote()
    for ridx in tqdm(range(1000000000), desc="Main loop"):
        object_ref = ray.get(actor1.do_stuff.remote())


if __name__ == '__main__':
    print(ray.init(_metrics_export_port=58391))
    print(ray.cluster_resources())
    run()
```
I'm seeing a similar pattern with 1.0.0. I'll run this code again with 0.8.6.
I tried running the above script (modified below) on an i3.8xl instance (latest DLAMI, pytorch_p36 python env, Ray nightly) for 24 hours. I did not observe any increase in latency over time (things stayed bounded at ~100ms max). The average rose slightly, from 23ms to 27ms, but that seemed pretty negligible to me.
Is it possible the problem is a system-specific issue (for example, running low on memory, or some operating-system issue)? I'm downgrading the priority, since the issue seems to only occur in specific environments.
Note: I also opted for print()-ing the times instead of using the metrics system; it's possible the metrics add overhead that increases the latency, which would not show up here since I used print()s only.
```python
import ray
import numpy as np
from tqdm import tqdm
import time

process_start_time = time.time()
print(f"Count and Gauge are not available in ray version {ray.__version__}.")
import pandas as pd


class Gauge:
    def __init__(self, name):
        self.name = name
        self.file_name = f'/tmp/{self.name}.pkl'
        pd.DataFrame([], columns=['name', 'time', 'value']).to_pickle(self.file_name)

    def record(self, value: float):
        # We read the DataFrame from the file each time so we don't add the
        # overhead of maintaining a large DataFrame in memory.
        df = pd.read_pickle(self.file_name)
        # Append one row: metric name, elapsed seconds since process start, value.
        df.loc[len(df)] = [self.name, int(time.time() - process_start_time), value]
        df.to_pickle(self.file_name)


class Count(Gauge):
    # Same file-backed recording as Gauge; only the metric name differs.
    pass


@ray.remote
class Actor1:
    def __init__(self):
        self.put_count = Count(name="repro_put_count")
        self.put_gauge = Gauge(name="repro_put_gauge")
        self.i = 0

    def do_stuff(self):
        stime = time.time()
        object_ref = ray.put(np.zeros((3, 3000, 5000, 1)))
        self.i += 1
        # Print the latency instead of recording metrics, to rule out metric overhead.
        print(self.i, (time.time() - stime) * 1000)
        # self.put_gauge.record(
        # self.put_count.record(1.)
        return object_ref


def run():
    actor1 = Actor1.remote()
    for ridx in tqdm(range(1000000000), desc="Main loop"):
        object_ref = ray.get(actor1.do_stuff.remote())


if __name__ == '__main__':
    print(ray.init(_metrics_export_port=58391))
    print(ray.cluster_resources())
    run()
```
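The printed `index latency_ms` lines can be post-processed into summary statistics like the mean and p99 discussed in this thread. A sketch (not part of the thread; `summarize` is a hypothetical helper, and the sample data below is synthetic):

```python
import numpy as np


def summarize(lines):
    """Compute mean/max/p99 from 'index latency_ms' lines printed by the actor."""
    latencies = np.array([float(line.split()[1]) for line in lines])
    return {
        'mean_ms': latencies.mean(),
        'max_ms': latencies.max(),
        'p99_ms': np.percentile(latencies, 99),
    }


# Synthetic example: mostly ~25 ms, with a 200 ms spike every 100th call.
sample = [f"{i} {25.0 if i % 100 else 200.0}" for i in range(1, 1001)]
print(summarize(sample))
```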
@ericl Is there a reason why we see spikes in the latency of the ray.put() call? In the graph you posted above, an operation that typically takes < 25 ms seems to spike to 75, 100, and even 200 ms quite frequently. I want to better understand why that happens and how the p99 latency evolves over time.
Just a guess, but it could be something like periodic Python GC, or something from the OS.
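To check the GC hypothesis (a diagnostic sketch, not from the thread), Python's `gc.callbacks` hook can time each collection, and on Unix `resource.getrusage` reports peak RSS; logging both alongside the put latencies would show whether the spikes coincide with GC pauses or with memory growth:

```python
import gc
import time
import resource  # Unix-only; ru_maxrss is KiB on Linux, bytes on macOS

gc_pauses_ms = []
_gc_start = {}


def _gc_timer(phase, info):
    # gc.callbacks invokes this with phase "start"/"stop" around each collection.
    if phase == "start":
        _gc_start['t'] = time.time()
    elif phase == "stop" and 't' in _gc_start:
        gc_pauses_ms.append((time.time() - _gc_start['t']) * 1000)


gc.callbacks.append(_gc_timer)

# Force one collection to demonstrate; in the repro this instrumentation
# would run inside the actor alongside the ray.put timing.
gc.collect()
peak_rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(f"gc pauses so far: {gc_pauses_ms}, peak RSS: {peak_rss}")
gc.callbacks.remove(_gc_timer)
```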
ray.put slows down over time.

I have a simple setup with 2 actors. The first actor places raw and preprocessed images in shared memory, and the second actor runs predictions on the preprocessed images.

I notice that the `ray.put()` call in `Camera#get`:

```python
frame_ref = ray.put(np.zeros((3, 3000, 5000, 1)))
```

starts to slow down over time. This trend looks a bit concerning. Any ideas about what's happening here?