Kedro with custom execution engine? (Ray)

kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.

https://kedro.org

Apache License 2.0

Kedro with custom execution engine? (Ray) #479

Open crypdick opened 3 years ago

crypdick commented 3 years ago

Potential user here. I'm interested in using Kedro, but we use Ray Distributed instead of PySpark for our execution engine. Do your pipelines support this?

921kiyo commented 3 years ago

Hi, thank you for your interest in Kedro! We don't natively support a Ray DataSet or a RayRunner so far, but you can add both yourself as a custom dataset and a custom runner. See https://kedro.readthedocs.io/en/latest/06_nodes_and_pipelines/02_pipelines.html?#using-a-custom-runner for using a custom runner, and https://kedro.readthedocs.io/en/latest/07_extend_kedro/01_custom_datasets.html for custom datasets.

crypdick commented 3 years ago

Update: I confirmed that Ray works fine :)

yetudada commented 3 years ago

Just a quick note on this. @crypdick followed this tutorial from @dataengineerone, except that instead of using multiprocessing inside each node he used Ray.

Thanks for investigating this @crypdick ✨
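
A minimal sketch of the pattern described above: the node stays an ordinary Python function and fans its work out internally. The names `square` and `fan_out_node` and the `use_ray` flag are illustrative assumptions, not taken from the tutorial. The thread pool is only a local stand-in so the example runs where Ray is not installed.

```python
from concurrent.futures import ThreadPoolExecutor


def square(x: int) -> int:
    """Hypothetical per-item work; stands in for any pure function."""
    return x * x


def fan_out_node(items: list[int], use_ray: bool = False) -> list[int]:
    """A plain function, usable as a Kedro node, that parallelises internally."""
    if use_ray:
        # Imported lazily so the sketch still runs where Ray is not installed.
        import ray

        remote_square = ray.remote(square)
        # .remote() returns object references; ray.get() blocks for the results.
        return ray.get([remote_square.remote(x) for x in items])

    # Local stand-in for the Ray path: a thread pool with the same shape.
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(square, items))


if __name__ == "__main__":
    print(fan_out_node([1, 2, 3]))
```

Because the function signature is unchanged, Kedro itself does not need to know that Ray is involved; the trade-off is that parallelism lives inside a node rather than across nodes.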

Harsh-Maheshwari commented 1 year ago

Can we add official support for Ray?

astrojuanlu commented 1 year ago

We happen to have documentation about Kedro on Dask https://kedro.readthedocs.io/en/stable/deployment/dask.html and apparently people have made Kedro on Ray work. Maybe we could reopen this issue and turn it into a documentation one? Happy to work on this in 6 to 8 weeks.

Harsh-Maheshwari commented 1 year ago

@crypdick and @astrojuanlu Could I just add the @ray.remote decorator to my nodes and have everything work without any other changes? With Ray>1.5 we no longer have to call ray.init() explicitly.

@yetudada The tutorial is very old, and its approach is not very scalable.

I think we should be able to do this with Hooks and some other Kedro features. I am a beginner in both Ray and Kedro, so any help or guidance on an approach would be appreciated.
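
On the `@ray.remote` question: a function decorated with `@ray.remote` cannot be called directly; it has to be invoked through `.remote()`, which returns an object reference rather than the result. Since a runner calls the node function as a normal callable, the decorator alone is unlikely to work unchanged. One possible workaround, sketched under assumptions, is a small adapter that hides the `.remote()` / `get` pair. `as_blocking` and `FakeRemote` are hypothetical names; `FakeRemote` only imitates the shape of a Ray remote function so the example runs without Ray.

```python
from typing import Any, Callable


def as_blocking(remote_fn: Any, get: Callable[[Any], Any]) -> Callable[..., Any]:
    """Wrap an object exposing .remote(...) into an ordinary blocking callable.

    With Ray, `remote_fn` would be a @ray.remote function and `get` would be ray.get.
    """

    def call(*args: Any, **kwargs: Any) -> Any:
        ref = remote_fn.remote(*args, **kwargs)  # schedule the task
        return get(ref)  # block until the result is available

    return call


class FakeRemote:
    """Hypothetical stand-in for a Ray remote function, for illustration only."""

    def __init__(self, fn: Callable[..., Any]) -> None:
        self._fn = fn

    def remote(self, *args: Any, **kwargs: Any) -> Callable[[], Any]:
        # Real Ray returns an ObjectRef; here a zero-argument thunk plays that role.
        return lambda: self._fn(*args, **kwargs)


def add(a: int, b: int) -> int:
    return a + b


blocking_add = as_blocking(FakeRemote(add), get=lambda ref: ref())
```

With Ray installed, the equivalent call would be `as_blocking(ray.remote(add), ray.get)`. This only moves each node's execution onto a Ray worker while the caller waits; whether independent nodes run concurrently still depends on the runner.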

crypdick commented 1 year ago

I am not sure, @Harsh-Maheshwari. I stopped using Kedro years ago in favor of Metaflow.

astrojuanlu commented 10 months ago

Reopening this as a documentation issue.

stichbury commented 10 months ago

We were asked about this again this week https://linen-slack.kedro.org/t/15736818/is-there-any-docs-that-explains-how-kedro-can-be-integrated-#70729160-84c6-4750-a59f-b3571e2e026b

Maybe time to write this up.

astrojuanlu commented 6 months ago

Yesterday I met @IvanNardini and we discussed that it would be nice to bring this back to life at some point 😃

Harsh-Maheshwari commented 6 months ago

@astrojuanlu @stichbury Ray would be a great addition to Kedro. I moved to a purely Ray code base because of difficulties in executing Kedro with Ray. Documentation exploring the integration between the two would help us a lot.

astrojuanlu commented 5 months ago

Btw here's an early prototype from a hackathon https://github.com/kedro-org/kedro/pull/995

noklam commented 5 months ago

@Harsh-Maheshwari Could you share what difficulties you had using Ray with Kedro? Which parts of Ray are you using?

Harsh-Maheshwari commented 4 months ago

Hi @noklam, sorry for the delayed response.

I am using Ray mostly for distributed computing. In the context of Kedro, we should be able to run a node across various workers. Kedro shouldn't have to manage the scheduling or the cluster side of things.

Let's say I have a partitioned dataset with 10_000 parquet files. I should be able to start a remote Ray cluster, connect my local Kedro project to that cluster, and schedule a pipeline to run on each worker. Each worker runs the same pipeline but on a different batch of parquet files, and the results are stored in a new partitioned dataset according to my catalog. All of the scheduling and work distribution should be managed by the Ray head node.
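
A minimal sketch of this batching idea, under stated assumptions. The input mirrors the shape Kedro's PartitionedDataset hands to a node (a dictionary of partition id to load function), but here the loaders return integers instead of reading parquet files. `batched`, `process_batch`, and `run_partitioned` are hypothetical names, and the thread pool is a local stand-in for submitting each batch as a Ray task.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterator, List


def batched(keys: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most `size` partition ids."""
    for start in range(0, len(keys), size):
        yield keys[start:start + size]


def process_batch(loaders: Dict[str, Callable[[], int]]) -> Dict[str, int]:
    """Hypothetical per-worker step: load each partition and transform it."""
    return {pid: load() * 2 for pid, load in loaders.items()}


def run_partitioned(
    partitions: Dict[str, Callable[[], int]],
    batch_size: int,
    max_workers: int = 4,
) -> Dict[str, int]:
    """Split partitions into batches, process them concurrently, merge outputs."""
    keys = sorted(partitions)
    batches = [{k: partitions[k] for k in chunk} for chunk in batched(keys, batch_size)]
    results: Dict[str, int] = {}
    # Stand-in for distributing batches to Ray workers.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for out in pool.map(process_batch, batches):
            results.update(out)
    return results


# Ten fake partitions; each loader returns an int instead of reading a file.
fake_partitions = {f"part-{i:05d}": (lambda i=i: i) for i in range(10)}
outputs = run_partitioned(fake_partitions, batch_size=3)
```

The merged dictionary keeps one entry per input partition, which is also the shape a partitioned output dataset would expect when saving.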

A nice-to-have feature would be resumability: if the run fails for any reason, restart from where we left off.
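
One simple way to approximate this restart behaviour, sketched under assumptions, is to skip every partition whose output already exists. The in-memory `store` dictionary stands in for the output partitioned dataset, and `pending` and `run_with_resume` are hypothetical names.

```python
from typing import Callable, Dict, Iterable, List, Set


def pending(all_ids: Iterable[str], done_ids: Set[str]) -> List[str]:
    """Return the partition ids that still have no output, in sorted order."""
    return [pid for pid in sorted(all_ids) if pid not in done_ids]


def run_with_resume(
    partitions: Dict[str, Callable[[], int]],
    store: Dict[str, int],
) -> int:
    """Process only partitions missing from `store`; return how many were processed."""
    todo = pending(partitions, set(store))
    for pid in todo:
        store[pid] = partitions[pid]() + 1  # hypothetical transform
    return len(todo)


inputs = {f"part-{i}": (lambda i=i: i) for i in range(5)}
store = {"part-0": 1, "part-1": 2}  # pretend these were written before a crash
processed = run_with_resume(inputs, store)
```

This assumes each partition's output is written atomically, so that the presence of an output really means the partition finished.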

noklam commented 4 months ago

@Harsh-Maheshwari

"I am using Ray mostly for distributed computing. In the context of Kedro, we should be able to run a node across various workers."

Why does https://github.com/kedro-org/kedro/issues/479#issuecomment-674116048 fail to solve your problem?

Or is this a problem specific to PartitionedDataset? https://github.com/kedro-org/kedro/issues/1413

I can see that the approach you suggest would work, but it's not clear to me why it is better.

Harsh-Maheshwari commented 4 months ago

@noklam

I have just described the use case. Right now I am not sure how to integrate Ray with Kedro,

so I don't know if this is the best or only way to do it.

What I can say is that if the integration between Kedro and Ray were a bit more native, we could, for example, use different Ray autoscalers for different Kedro nodes.

noklam commented 4 months ago

@Harsh-Maheshwari I am no expert in Ray, so I need some examples to understand what's not working and how Kedro can make this easier.

Maybe we just need a kedro-ray plugin and nothing has to change in Kedro. I will leave this to someone more experienced with Ray.

astrojuanlu commented 1 week ago

In https://github.com/astrojuanlu/workshop-from-zero-to-mlops I described how to execute Kedro pipelines in Ray using Prefect.

An alternative method would be creating a custom runner, but maybe it's good to leave that to an orchestrator instead?
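
To make the custom-runner option more concrete, here is a toy sketch of what such a runner has to do: execute nodes in dependency order and hand ready nodes to a pool of workers. This is not Kedro's runner API; `run_graph` and the node names are hypothetical, and the thread pool again stands in for Ray tasks.

```python
from concurrent.futures import ThreadPoolExecutor
from graphlib import TopologicalSorter
from typing import Any, Callable, Dict, Set


def run_graph(
    nodes: Dict[str, Callable[[Dict[str, Any]], Any]],
    deps: Dict[str, Set[str]],
    max_workers: int = 4,
) -> Dict[str, Any]:
    """Run each node once all of its upstream nodes have finished.

    nodes: name -> callable taking a dict of upstream results.
    deps:  name -> set of upstream node names.
    """
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    results: Dict[str, Any] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while sorter.is_active():
            # Submit every node whose dependencies are satisfied as one wave.
            futures = {
                name: pool.submit(nodes[name], {d: results[d] for d in deps[name]})
                for name in sorter.get_ready()
            }
            for name, future in futures.items():
                results[name] = future.result()
                sorter.done(name)
    return results


nodes = {
    "load": lambda up: [1, 2, 3],
    "double": lambda up: [x * 2 for x in up["load"]],
    "total": lambda up: sum(up["double"]),
}
deps = {"load": set(), "double": {"load"}, "total": {"double"}}
results = run_graph(nodes, deps)
```

The sketch waits for each wave to finish before scheduling the next one, which an orchestrator would typically handle more flexibly, along with retries and monitoring.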

pascalwhoop commented 3 days ago

Just adding a comment to A) follow and B) mention that we may look into this in the coming months, mostly because doing batch embeddings in Spark with modern embedding models from Hugging Face is a pain.