Open imperio-wxm opened 1 year ago
Any progress on this issue?
The file "ray_demo.py" should be in the code search path "ray.job.code-search-path". Can you check that it is?
@SongGuyang Hi, I'm sure ray_demo.py is in the "ray.job.code-search-path" path. I ran it many times in a row; only a few runs failed and the rest succeeded. There were no environment or code changes during this period. The file is always in the path; I just send the request multiple times.
I see. Is your test running on a single node or on multiple nodes?
@SongGuyang What do single-node and multi-node refer to? The Ray cluster has multiple nodes, and the Spring Boot service also has multiple nodes, exposed through a single entry point by Nginx load balancing.
@SongGuyang Hi, any progress? After testing, tasks do not have this problem; it only appears with actors.
I have no idea what causes it. Could you provide an easy way for us to reproduce it?
@SongGuyang
Hi, running this code reproduces the problem: ModuleNotFoundError: No module named 'ray_demo'.
Is there something wrong with the way this code is written?
# ray_demo.py
import ray
from typing import List

@ray.remote
class Counter(object):
    def __init__(self):
        self.value = 0

    def increment(self):
        self.value += 1
        return self.value
public static void main(String[] args) throws Exception {
    int loopNum = Integer.parseInt(args[0]);
    String searchPath = args[1];
    for (int i = 0; i < loopNum; i++) {
        System.setProperty("ray.address", "ip:6379");
        System.setProperty("ray.job.code-search-path", searchPath);
        if (!Ray.isInitialized()) {
            Ray.init();
        }
        PyActorClass actorClass = PyActorClass.of("ray_demo", "Counter");
        PyActorHandle actor = Ray.actor(actorClass).remote();
        ObjectRef objRef1 = actor.task(PyActorMethod.of("increment", int.class)).remote();
        System.out.println("increment count by java,result = " + objRef1.get());
        ObjectRef objRef2 = actor.task(PyActorMethod.of("increment", int.class)).remote();
        System.out.println("increment count by java,result = " + objRef2.get());
        actor.kill();
        Thread.sleep(2000);
    }
}
Exception in thread "main" io.ray.api.exception.RayActorException: The actor 736014eef257297e65efdbe9fa050000 died unexpectedly before finishing this task.
at io.ray.runtime.object.ObjectSerializer.deserializeActorException(ObjectSerializer.java:257)
at io.ray.runtime.object.ObjectSerializer.deserialize(ObjectSerializer.java:104)
at io.ray.runtime.object.ObjectStore.get(ObjectStore.java:140)
at io.ray.runtime.AbstractRayRuntime.get(AbstractRayRuntime.java:144)
at io.ray.runtime.AbstractRayRuntime.get(AbstractRayRuntime.java:125)
at io.ray.runtime.AbstractRayRuntime.get(AbstractRayRuntime.java:120)
at io.ray.api.Ray.get(Ray.java:98)
at io.ray.runtime.object.ObjectRefImpl.get(ObjectRefImpl.java:77)
1:job_id:f9050000
2023-07-30 19:38:46,724 ERROR worker.py:861 -- Worker exits with an exit code None. The worker may have exceeded K8s pod memory limits.
Traceback (most recent call last):
File "python/ray/_raylet.pyx", line 1796, in ray._raylet.task_execution_handler
File "python/ray/_raylet.pyx", line 1656, in ray._raylet.execute_task_with_cancellation_handler
File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/_private/function_manager.py", line 559, in load_actor_class
actor_class = self._load_actor_class_from_local(
File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/_private/function_manager.py", line 619, in _load_actor_class_from_local
object = self.load_function_or_class_from_local(module_name, class_name)
File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/_private/function_manager.py", line 139, in load_function_or_class_from_local
module = importlib.import_module(module_name)
File "/home/ray/anaconda3/lib/python3.8/importlib/__init__.py", line 127, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
File "<frozen importlib._bootstrap>", line 991, in _find_and_load
File "<frozen importlib._bootstrap>", line 973, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'ray_demo'
An unexpected internal error occurred while the worker was executing a task.
Traceback (most recent call last):
File "python/ray/_raylet.pyx", line 1796, in ray._raylet.task_execution_handler
File "python/ray/_raylet.pyx", line 1656, in ray._raylet.execute_task_with_cancellation_handler
File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/_private/function_manager.py", line 559, in load_actor_class
actor_class = self._load_actor_class_from_local(
File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/_private/function_manager.py", line 619, in _load_actor_class_from_local
object = self.load_function_or_class_from_local(module_name, class_name)
File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/_private/function_manager.py", line 139, in load_function_or_class_from_local
module = importlib.import_module(module_name)
File "/home/ray/anaconda3/lib/python3.8/importlib/__init__.py", line 127, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
File "<frozen importlib._bootstrap>", line 991, in _find_and_load
File "<frozen importlib._bootstrap>", line 973, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'ray_demo'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "python/ray/_raylet.pyx", line 1838, in ray._raylet.task_execution_handler
SystemExit
Is "ip:6379" a valid address? Have you already started a ray cluster before you ran the code? Can you show all of the command lines?
@SongGuyang The environment is correct: the first few loop iterations run correctly, and then an error is reported and the run is interrupted. The IP and the cluster are not the problem.
The code is the for loop above. I can reproduce the failure reliably with it, so please try it.
// 10 is loopNum
// search_path is ray_demo.py path
java -cp ray-demo-1.0-SNAPSHOT.jar ActorMain 10 "search_path"
increment count by java,result = 1
increment count by java,result = 2 // first loop
increment count by java,result = 1
increment count by java,result = 2 // second loop
increment count by java,result = 1
increment count by java,result = 2 // third loop
Exception in thread "main" io.ray.api.exception.RayActorException: The actor 2b4cfbd8d18444e38e54519e00060000 died unexpectedly before finishing this task.
at io.ray.runtime.object.ObjectSerializer.deserializeActorException(ObjectSerializer.java:257)
at io.ray.runtime.object.ObjectSerializer.deserialize(ObjectSerializer.java:104)
at io.ray.runtime.object.ObjectStore.get(ObjectStore.java:140)
at io.ray.runtime.AbstractRayRuntime.get(AbstractRayRuntime.java:144)
at io.ray.runtime.AbstractRayRuntime.get(AbstractRayRuntime.java:125)
at io.ray.runtime.AbstractRayRuntime.get(AbstractRayRuntime.java:120)
at io.ray.api.Ray.get(Ray.java:98)
at io.ray.runtime.object.ObjectRefImpl.get(ObjectRefImpl.java:77)
@SongGuyang Hi, can you reproduce the problem as described in my previous reply?
Can I reproduce this on a single machine, that is, on a single-node Ray cluster? And can you provide the whole code? I think the code above is only a fragment.
@SongGuyang Hi, it cannot be reproduced on a single-node Ray cluster; it works well there.
The jar has too many dependencies, so I am only providing the source jar. It contains a single ActorMain.java file; that is all the code, nothing more.
ray-actor-demo-1.0-SNAPSHOT.jar.zip
It can be reproduced in a cluster environment. A head node, nodeA, has been started.
I start nodeB and execute the command ray start --address=nodeA:6379 to connect to the head. Then I can see on the Ray dashboard that nodeB has registered and connected.
Next, I run ray-actor-demo-1.0-SNAPSHOT.jar on nodeB.
java -cp ray-actor-demo-1.0-SNAPSHOT.jar com.wxmimperio.ray.ActorMain 10 "/opt/python/files" "nodeA:6379"
I get this error:
increment count by java,result = 1
increment count by java,result = 2
increment count by java,result = 1
increment count by java,result = 2
Exception in thread "main" io.ray.api.exception.RayActorException: The actor 1ec43d7e8127ca542f8d82597e060000 died unexpectedly before finishing this task.
at io.ray.runtime.object.ObjectSerializer.deserializeActorException(ObjectSerializer.java:257)
at io.ray.runtime.object.ObjectSerializer.deserialize(ObjectSerializer.java:104)
at io.ray.runtime.object.ObjectStore.get(ObjectStore.java:140)
at io.ray.runtime.AbstractRayRuntime.get(AbstractRayRuntime.java:144)
at io.ray.runtime.AbstractRayRuntime.get(AbstractRayRuntime.java:125)
at io.ray.runtime.AbstractRayRuntime.get(AbstractRayRuntime.java:120)
at io.ray.api.Ray.get(Ray.java:98)
at io.ray.runtime.object.ObjectRefImpl.get(ObjectRefImpl.java:77)
at com.wxmimperio.ray.ActorMain.main(ActorMain.java:32)
@imperio-wxm
I'm going to try to reproduce the problem.
@imperio-wxm You can upload your code to GitHub and give me the link, so that I can set up a local environment to reproduce and troubleshoot the problem. Thanks.
@JackyMa1997 I don't know what kind of code you want; this is all that is needed to reproduce the problem. Create an empty project, use ActorMain.java as the entry point, package it into a jar, and run it. It's that simple, with no other code.
You can't reproduce it with these two files? ActorMain.java.zip ray_demo.py.zip
@JackyMa1997 You can open this project directly in an IDE and run mvn clean package.
ray-actor-demo.zip
@JackyMa1997 Hi, any progress on this issue? I have found that tasks under high concurrency have similar problems.
@imperio-wxm I could not reproduce this problem, and I have no idea what causes it. I used exactly the same environment here. I'm going to try again later.
@JackyMa1997 Hi, how does Java send the code to the Ray node? Can you give me a link to the source code?
Is the problem related to when the .py file is put in place? The current order is: first set searchPath, then call Ray.init(), and finally make the remote call. Are these three steps interchangeable? After calling Ray.init(), can the .py files in searchPath be deleted and re-added? In other words, is dynamic loading supported?
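Whether Ray workers pick up files added after Ray.init() is not answered in this thread. At the plain Python level, though, a module file written into a directory that is already on sys.path can be imported once the import caches are cleared. A minimal sketch of that behavior, using only the standard library (the module name late_demo and the helper name are hypothetical, and this says nothing about how Ray itself manages its workers):

```python
import importlib
import os
import sys
import tempfile


def import_after_late_write(module_name="late_demo"):
    """Put a directory on sys.path first, create the module afterwards, then import it."""
    search_dir = tempfile.mkdtemp()
    sys.path.insert(0, search_dir)
    try:
        # The directory is on sys.path but still empty, so this import fails.
        try:
            importlib.import_module(module_name)
            found_before = True
        except ModuleNotFoundError:
            found_before = False

        # Write the module only now, after the search path was registered.
        with open(os.path.join(search_dir, module_name + ".py"), "w") as f:
            f.write("VALUE = 42\n")

        # Path finders cache directory listings; clear them so the new file is seen.
        importlib.invalidate_caches()
        module = importlib.import_module(module_name)
        return found_before, module.VALUE
    finally:
        # Leave the interpreter state as it was found.
        sys.path.remove(search_dir)
        sys.modules.pop(module_name, None)


if __name__ == "__main__":
    print(import_after_late_write())
```

A worker process that has already imported a module keeps the loaded copy in sys.modules, so deleting the file afterwards does not unload it in that process.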
I'm reproducing the same phenomenon here. Are you sure it's the same phenomenon?
@JackyMa1997
Thank you for reproducing it. The Java client error is the same, but what matters is the error message in the Ray actor: ModuleNotFoundError: No module named 'ray_demo'
@imperio-wxm I followed your steps and used your code to reproduce it. After checking, I found that /opt/python/files was only accessible to root on my side, so nodeB could not see the code search path. I changed the code search path and the code works. Please check whether the code search path on your side has a permission problem, and then run the code again.
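The permission check suggested above can be run by hand on each node, as the same OS user that runs the Ray workers. The following is a minimal sketch using only the Python standard library; the function name diagnose_search_path is hypothetical, and the path and module name are the ones used in this thread:

```python
import importlib.machinery
import os


def diagnose_search_path(search_path, module_name):
    """Report whether module_name can be found in search_path by the current user."""
    exists = os.path.isdir(search_path)
    # Listing a directory needs read permission; opening files in it needs execute.
    accessible = exists and os.access(search_path, os.R_OK | os.X_OK)
    # PathFinder searches only the given entries, so sys.path cannot mask the result.
    spec = importlib.machinery.PathFinder.find_spec(module_name, [search_path])
    return {
        "exists": exists,
        "accessible": accessible,
        "module_found": spec is not None,
    }


if __name__ == "__main__":
    print(diagnose_search_path("/opt/python/files", "ray_demo"))
```

If module_found is False on any node where an actor can be scheduled, actors placed on that node would be expected to fail with the ModuleNotFoundError shown earlier, while actors placed on other nodes succeed. That would match the intermittent failures reported here.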
What happened + What you expected to happen
I have a long-running Spring Boot service. Every HTTP request triggers this method, which creates an actor and kills it after execution. Sometimes it executes successfully and sometimes it fails.
my code:
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "python/ray/_raylet.pyx", line 1838, in ray._raylet.task_execution_handler
SystemExit