Hoeze closed this issue 1 year ago
It seems like the applyInPandas function starts a new local cluster in every evaluation. How can I make it re-use the actors' cluster connection?
hi @Hoeze, `applyInPandas` starts Python workers, and these workers are not connected to Ray. An actor is itself a process, so it's not really possible to "reuse" its session. Also, connecting to Ray in each Python worker is fine in itself; the actual problem is that the workers cannot resolve the `obj_ref`, because it is not registered in their session. To solve this, I suggest defining an actor that holds all the object refs, and having the PySpark Python workers connect to Ray using the same namespace as the driver program. These workers can then look up the actor by name and fetch the objects from it.
You can refer to our `_convert_by_udf` function in `python/raydp/spark/dataset.py`.
Closing as stale.
The following code snippet dead-locks:
What is my mistake?