kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0

[KED-949] Kedro-Glue Plugin #57

Closed njgerner closed 3 years ago

njgerner commented 5 years ago

Description

Create a plugin to allow for easy conversion and deployment of Kedro pipelines to AWS Glue.

Context

Kedro is great for localized development but currently lacks a way to deploy pipelines on AWS. Kedro-Docker and Kedro-Airflow are great solutions to this problem but it would be great to be able to use AWS native ETL tools.

Possible Implementation

Implementation should be fairly similar to Kedro-Airflow, with alterations tailored to the AWS API. The general idea would be to convert a Kedro pipeline's nodes to CodeGenNode structures and to capture the relationships between nodes in CodeGenEdge structures.

For more information see: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-etl-script-generation.html
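A rough sketch of that conversion, under stated assumptions. It uses a plain `SimpleNode` tuple in place of real Kedro node objects so that it runs without Kedro or boto3 installed. `SimpleNode`, `to_glue_dag` and the `"KedroNode"` node type are hypothetical names made up for illustration; only the dictionary keys (`Id`, `NodeType`, `Args`, `Source`, `Target`, `TargetParameter`) follow the CodeGenNode and CodeGenEdge shapes described in the Glue docs linked above.

```python
from typing import Dict, List, NamedTuple, Tuple


class SimpleNode(NamedTuple):
    """Hypothetical stand-in for a Kedro node: a name plus dataset names."""
    name: str
    inputs: List[str]
    outputs: List[str]


def to_glue_dag(nodes: List[SimpleNode]) -> Tuple[List[dict], List[dict]]:
    """Build CodeGenNode-like and CodeGenEdge-like dicts from nodes.

    An edge is drawn from producer to consumer whenever one node's
    output dataset is another node's input.
    """
    # Map each dataset name to the node that produces it.
    producer: Dict[str, str] = {}
    for n in nodes:
        for out in n.outputs:
            producer[out] = n.name

    dag_nodes = [
        {
            "Id": n.name,
            "NodeType": "KedroNode",  # assumption: placeholder, not a known Glue node type
            "Args": [{"Name": "inputs", "Value": ",".join(n.inputs)}],
        }
        for n in nodes
    ]
    dag_edges = [
        {"Source": producer[inp], "Target": n.name, "TargetParameter": inp}
        for n in nodes
        for inp in n.inputs
        if inp in producer  # free inputs (raw data) have no producing node
    ]
    return dag_nodes, dag_edges


example_nodes = [
    SimpleNode("clean", ["raw"], ["cleaned"]),
    SimpleNode("train", ["cleaned"], ["model"]),
]
dag_nodes, dag_edges = to_glue_dag(example_nodes)
```

The two lists are shaped so they could be passed as `DagNodes` and `DagEdges` to the Glue script-generation API. Whether Glue would generate anything useful for a custom node type has not been verified here.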

yetudada commented 5 years ago

Thanks for raising this issue @njgerner!

yetudada commented 4 years ago

@njgerner You'll be excited to learn that kedro-airflow was open-sourced last week. Check it out: https://github.com/quantumblacklabs/kedro-airflow

921kiyo commented 4 years ago

I've updated the title with our internal ticket number to keep track of this more easily. :)

i25959341 commented 4 years ago

Can I have a jab at this?

yetudada commented 4 years ago

Hi @i25959341! I guess it's all yours. Let us know if you need help constructing it. You can have a look at our documentation about how plugins are constructed here and you can refer to the kedro-airflow, kedro-docker and kedro-viz plugins here.

njgerner commented 4 years ago

@i25959341 happy to contribute if you manage to get something off the ground, let me know!

i25959341 commented 4 years ago

Will do!! I might get stuck, as I am not that familiar with Glue.

i25959341 commented 4 years ago

Sorry guys, I might have to leave this one to others :(

sarchila commented 4 years ago

I've looked into this briefly and am concerned that there is currently no elegant way of running a Kedro node as a pyspark AWS Glue job.

From point (6) here:

If your script requires additional libraries or files, you can specify them as follows:

Python library path Comma-separated Amazon Simple Storage Service (Amazon S3) paths to Python libraries that are required by the script.

Note

Only pure Python libraries can be used. Libraries that rely on C extensions, such as the pandas Python Data Analysis Library, are not yet supported.

Because Kedro depends on pandas - among other libraries that rely on C extensions - it doesn't seem possible to bundle Kedro and its dependencies for use in the Glue environment, unless it's possible to have a more bare-bones Kedro designed only for pyspark use that does not include pandas and other such dependencies.

Am I missing something here or is there currently no way of running a Kedro node in Glue?

yetudada commented 4 years ago

Hi @sarchila, thanks for raising this point. Kedro does support an entirely pyspark workflow, documented here, which I suspect was going to be the premise for the AWS Glue plugin.

sarchila commented 4 years ago

Hi @yetudada, thanks for the prompt response. Yes, I have been kicking the tires with Kedro and like the support for an entirely pyspark workflow, so great work on that front. The trouble for me now is that the framework still requires pandas, numpy, etc. regardless of whether those io DataSets are used in one's pipeline or not.

That means I can't provide Kedro as a library to AWS Glue, because its dependency list includes libraries that break on Glue for relying on C extensions.

One thought this raises for me is the possibility of having a version of Kedro that is essentially a pure python 'Kedro Core' library with no io or contrib.io datasets built-in (besides the core AbstractDataSet), leaving each of those to be pip installed separately as io plugins based on one's needs.

That would make it so that I can provide this hypothetical Kedro core library to Glue and not worry that it's going to choke on trying to include pandas or numpy (as I can't use any of those io DataSets anyways in Glue).
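One way to picture the "core plus optional io plugins" idea is a core that never imports pandas or numpy at module load time and only imports a backend when a dataset that needs it is requested. The sketch below is illustrative only and is not how Kedro is implemented: `OPTIONAL_DATASETS` and `require_backend` are hypothetical names, and the dataset names are used purely as dictionary keys.

```python
import importlib
from typing import Dict, Tuple

# Hypothetical registry: dataset name -> (backend module, install hint).
OPTIONAL_DATASETS: Dict[str, Tuple[str, str]] = {
    "JSONLocalDataSet": ("json", "part of the standard library"),
    "CSVLocalDataSet": ("pandas", "pip install pandas"),
    "SparkDataSet": ("pyspark", "pip install pyspark"),
}


def require_backend(dataset_name: str):
    """Import the backend module for a dataset only when it is requested.

    Importing this file never pulls in pandas or pyspark, so a pure-Python
    core could be shipped to an environment where those are unavailable.
    """
    module_name, hint = OPTIONAL_DATASETS[dataset_name]
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{dataset_name} needs '{module_name}' ({hint})"
        ) from exc
```

With this shape, a pipeline that only touches pyspark-backed datasets would never trigger an import of pandas, and a missing backend fails with a clear message at the point of use instead of at install time.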

yetudada commented 4 years ago

Thanks for mentioning this @sarchila. I see what you mean about how pandas, numpy, etc. are included in the Kedro dependencies 🤔

That said, we are looking into a better io and contrib.io management system, so your feedback will help shape this. Have you been able to use Kedro outside of the need to construct a kedro-glue workflow?

sarchila commented 4 years ago

Have you been able to use Kedro outside of the need to construct a kedro-glue workflow?

Yes, I've found the framework to be very intuitive to get ramped up and use. That's part of the reason that I'm doggedly trying to figure out how this can be deployed to Glue. 🙂

My only points of friction getting ramped up so far have been around:

tolomea commented 4 years ago

The ability to include a node that doesn't process any data per se, for example to trigger a task to send an email after processing some data

We don't have explicit non-data dependencies, although we have discussed them as a possible feature. In the absence of that, your two main options are:

1: You said "after processing some data", so you could just have the output of that be an input to the email node but ignore it in the email node.

2: If you want to avoid the cost of the load, you could instead have a "dummy" dataset between the two nodes that is just an in-memory or JSON dataset that you write some small meaningless thing to.
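A minimal sketch of the "dummy" dataset option, assuming a toy runner in place of Kedro itself so that it runs with only the standard library. `Step`, `run`, and the dataset names (`processing_done`, `email_status`) are hypothetical; the point is only that the email step is ordered after processing because it declares the marker dataset as an input.

```python
from typing import Callable, Dict, List, NamedTuple


class Step(NamedTuple):
    """Hypothetical stand-in for a Kedro node."""
    func: Callable
    inputs: List[str]
    outputs: List[str]


def run(steps: List[Step], catalog: Dict[str, object]) -> List[str]:
    """Run each step once all of its inputs exist; return the execution order."""
    order: List[str] = []
    pending = list(steps)
    while pending:
        ready = [s for s in pending if all(i in catalog for i in s.inputs)]
        if not ready:
            raise ValueError("Unsatisfied inputs; check dataset names")
        for step in ready:
            result = step.func(*(catalog[name] for name in step.inputs))
            if len(step.outputs) == 1:
                catalog[step.outputs[0]] = result
            elif step.outputs:
                catalog.update(zip(step.outputs, result))
            order.append(step.func.__name__)
            pending.remove(step)
    return order


def process(raw):
    # Second return value is the small, meaningless marker.
    return [x * 2 for x in raw], "done"


def send_email(_marker):
    # The marker is ignored; it exists only to order this step after `process`.
    return "email sent"


catalog = {"raw": [1, 2, 3]}
steps = [
    Step(send_email, ["processing_done"], ["email_status"]),  # listed first on purpose
    Step(process, ["raw"], ["cleaned", "processing_done"]),
]
order = run(steps, catalog)
```

Even though `send_email` is listed first, it cannot run until `processing_done` exists, so the order is driven entirely by the dataset names.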

uwaisiqbal commented 4 years ago

I just came across this issue, and it turns out that a Kedro-Glue plugin would make things considerably easier for a project I'm currently working on. I've been doing some research, and according to this answer on SO it is possible to install pandas and numpy on Glue - https://stackoverflow.com/a/54852126/1112091

sarchila commented 4 years ago

Hi @uwaisiqbal, I went down that rabbit hole a few weeks ago, and it turns out that SO answer only applies to AWS Glue jobs of the "Python Shell" job type.

There are two types of AWS Glue jobs: Spark jobs and Python shell jobs. See https://docs.aws.amazon.com/glue/latest/dg/add-job.html#create-job

My attempt here was to get Kedro working with AWS Glue "Spark" type jobs. One has less flexibility with the environment in "Spark" type jobs and cannot install third party libraries in the manner described in that SO answer.
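For reference, the two job types differ in the `Command` name given when the job is created. The sketch below only builds the keyword arguments one might pass to boto3's Glue `create_job` call and does not contact AWS. The helper name `glue_job_definition`, the job name, the role, and the S3 paths are hypothetical; `glueetl`, `pythonshell` and `--extra-py-files` are the names used in the Glue documentation.

```python
from typing import Dict, List, Optional


def glue_job_definition(
    name: str,
    role: str,
    script_s3_path: str,
    spark: bool = True,
    extra_py_files: Optional[List[str]] = None,
) -> Dict[str, object]:
    """Build keyword arguments for a Glue job without calling AWS."""
    command = {
        # "glueetl" is a Spark job; "pythonshell" is a Python shell job.
        "Name": "glueetl" if spark else "pythonshell",
        "ScriptLocation": script_s3_path,
    }
    default_args: Dict[str, str] = {}
    if extra_py_files:
        # Comma-separated S3 paths, matching the "Python library path" setting.
        default_args["--extra-py-files"] = ",".join(extra_py_files)
    return {
        "Name": name,
        "Role": role,
        "Command": command,
        "DefaultArguments": default_args,
    }


# Hypothetical names and paths, for illustration only.
spark_job = glue_job_definition(
    "kedro-node-job",
    "MyGlueRole",
    "s3://my-bucket/scripts/run_node.py",
    extra_py_files=["s3://my-bucket/libs/kedro_core.zip"],
)
```

For a Spark job, whatever is listed under `--extra-py-files` has to be pure Python, which is the constraint discussed above.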

yetudada commented 4 years ago

It's finally possible to build Kedro-Glue! Commit ecd7277 has addressed removal of pandas and numpy from our core dependencies. The next major release will have this change. Good luck with the plugin! Let us know if you need help.

yetudada commented 3 years ago

Hi @njgerner! We hope that you're well. We haven't seen any movement on this ticket so we'll be closing it. But do let us know if you need more help with this and we'll be happy to re-open it.