Closed njgerner closed 3 years ago
Thanks for raising this issue @njgerner!
@njgerner You'll be excited to learn that kedro-airflow was open-sourced last week. Check it out: https://github.com/quantumblacklabs/kedro-airflow
I've updated the title with our internal ticket number to keep track of this more easily. :)
Can I have a jab at this?
Hi @i25959341! I guess it's all yours. Let us know if you need help constructing it. You can have a look at our documentation about how plugins are constructed here and you can refer to the kedro-airflow, kedro-docker and kedro-viz plugins here.
@i25959341 happy to contribute if you manage to get something off the ground, let me know!
Will do!! Might get stuck, I am not that familiar with Glue.
Sorry guys, I might have to leave this one to others :(
I've looked into this briefly and am concerned that there is currently no elegant way of running a Kedro node as a pyspark AWS Glue job.
From point (6) here:
If your script requires additional libraries or files, you can specify them as follows:
Python library path: Comma-separated Amazon Simple Storage Service (Amazon S3) paths to Python libraries that are required by the script.
Note: Only pure Python libraries can be used. Libraries that rely on C extensions, such as the pandas Python Data Analysis Library, are not yet supported.
Because Kedro depends on pandas - among other libraries that rely on C extensions - it doesn't seem possible to bundle Kedro and its dependencies for use in the Glue environment, unless it's possible to have a more bare-bones Kedro designed only for pyspark use that does not include pandas and other such dependencies.
Am I missing something here or is there currently no way of running a Kedro node in Glue?
Hi @sarchila, thanks for raising this point. Kedro does support an entirely pyspark workflow, documented here, which I suspect was going to be the premise for the AWS Glue plugin.
Hi @yetudada, thanks for the prompt response. Yes, I have been kicking the tires with Kedro and like the support for an entirely pyspark workflow, so great work on that front. The trouble for me now is that the framework still requires pandas, numpy, etc. regardless of whether those io DataSets are used in one's pipeline or not.
That makes it so that I can't provide Kedro as a library to AWS Glue, because its dependency list includes libraries that break on Glue because they rely on C extensions.
One thought this raises for me is the possibility of having a version of Kedro that is essentially a pure python 'Kedro Core' library with no io or contrib.io datasets built-in (besides the core AbstractDataSet), leaving each of those to be pip installed separately as io plugins based on one's needs.
That would make it so that I can provide this hypothetical Kedro core library to Glue and not worry that it's going to choke on trying to include pandas or numpy (as I can't use any of those io DataSets anyways in Glue).
Thanks for mentioning this @sarchila. I see what you mean about how pandas, numpy, etc. are included in the Kedro dependencies 🤔
We are, though, looking into a better io and contrib.io management system, so your feedback will help shape this. Have you been able to use Kedro outside of the need to construct a kedro-glue workflow?
Have you been able to use Kedro outside of the need to construct a kedro-glue workflow?
Yes, I've found the framework to be very intuitive to get ramped up and use. That's part of the reason that I'm doggedly trying to figure out how this can be deployed to Glue. 🙂
My only points of friction getting ramped up so far have been around:
The ability to include a node that doesn't process any data per se (e.g. trigger a task to send an email after processing some data)
We don't have explicit non-data dependencies, although we have discussed it as a possible feature. In the absence of that, your two main options are:
1. You said "after processing some data", so you could just have the output of that be an input to the email node but ignore it in the email node.
2. If you want to avoid the cost of the load, you could instead have a "dummy" dataset between the two nodes that is just an in-memory or JSON dataset that you write some small meaningless thing to.
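For illustration, option 2 could be sketched roughly as follows. This is plain Python, not Kedro's API: `Step`, `run`, `outbox` and the dataset names are hypothetical stand-ins, and the point is only that a tiny marker output is enough to order the email step after the processing step.

```python
from typing import Any, Callable, Dict, List, NamedTuple


class Step(NamedTuple):
    """Hypothetical stand-in for a pipeline node (not Kedro's own class)."""
    func: Callable[..., Any]
    inputs: List[str]
    outputs: List[str]


outbox: List[str] = []  # stands in for a real mail call in this sketch


def process_data(raw: List[int]) -> tuple:
    """Return the processed data plus a tiny marker that only signals completion."""
    cleaned = [x * 2 for x in raw]
    return cleaned, {"status": "done"}


def send_email(processing_done: Dict[str, str]) -> None:
    """Take the marker only to be ordered after process_data; its content is ignored."""
    outbox.append("processing finished")


def run(steps: List[Step], catalog: Dict[str, Any]) -> Dict[str, Any]:
    """Run each step once all of its named inputs exist in the catalog."""
    pending = list(steps)
    while pending:
        ready = [s for s in pending if all(name in catalog for name in s.inputs)]
        if not ready:
            raise ValueError("some steps have inputs that nothing produces")
        for step in ready:
            result = step.func(*(catalog[name] for name in step.inputs))
            if not step.outputs:
                values = ()
            elif len(step.outputs) == 1:
                values = (result,)
            else:
                values = tuple(result)
            catalog.update(zip(step.outputs, values))
            pending.remove(step)
    return catalog


# The email step is listed first on purpose: order comes from the data
# dependency on the dummy "processing_done" marker, not from list position.
steps = [
    Step(send_email, ["processing_done"], []),
    Step(process_data, ["raw"], ["cleaned", "processing_done"]),
]
catalog = run(steps, {"raw": [1, 2, 3]})
```

The marker is deliberately small, so passing it between the two steps costs next to nothing compared with reloading the processed data.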
I just came across this issue and it turns out that a Kedro-Glue plugin would make things considerably easier for a project I'm currently working on. I've been doing some research and according to this answer on SO it is possible to install pandas and numpy on Glue - https://stackoverflow.com/a/54852126/1112091
Hi @uwaisiqbal, I went down that rabbit hole with this a few weeks ago, and it turns out that SO answer only applies for AWS Glue jobs that are of "Python Shell" job type.
There are two types of AWS Glue jobs: Spark jobs and python shell jobs. See https://docs.aws.amazon.com/glue/latest/dg/add-job.html#create-job
My attempt here was to get Kedro working with AWS Glue "Spark" type jobs. One has less flexibility with the environment in "Spark" type jobs and cannot install third party libraries in the manner described in that SO answer.
It's finally possible to build Kedro-Glue! Commit ecd7277 has addressed removal of pandas and numpy from our core dependencies. The next major release will have this change. Good luck with the plugin! Let us know if you need help.
Hi @njgerner! We hope that you're well. We haven't seen any movement on this ticket so we'll be closing it. But do let us know if you need more help with this and we'll be happy to re-open it.
Description
Create a plugin to allow for easy conversion and deployment of Kedro pipelines to AWS Glue.
Context
Kedro is great for localized development but currently lacks a way to deploy pipelines on AWS. Kedro-Docker and Kedro-Airflow are great solutions to this problem but it would be great to be able to use AWS native ETL tools.
Possible Implementation
Implementation should be fairly similar to Kedro-Airflow but with obvious alterations tailored towards the AWS API. The general idea would be to convert a Kedro pipeline's nodes to CodeGenNode structures, as well as capturing the relationship between nodes in CodeGenEdge structures.
For more information see: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-etl-script-generation.html
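A rough sketch of that conversion, under stated assumptions: `NodeSpec` and `to_glue_dag` are hypothetical names, the pipeline is reduced to node names with input and output dataset names, and the dictionary keys (`Id`, `NodeType`, `Args`, `Source`, `Target`, `TargetParameter`) follow my reading of the CodeGenNode and CodeGenEdge structures in the linked Glue documentation. The `NodeType` value and the way inputs are packed into `Args` are placeholders, not something Glue is known to accept as-is.

```python
from typing import Dict, List, NamedTuple, Tuple


class NodeSpec(NamedTuple):
    """Hypothetical, simplified view of a pipeline node: a name plus dataset names."""
    name: str
    inputs: List[str]
    outputs: List[str]


def to_glue_dag(nodes: List[NodeSpec]) -> Tuple[List[Dict], List[Dict]]:
    """Build CodeGenNode-like and CodeGenEdge-like dicts from node specs."""
    # Map each dataset name to the node that produces it.
    producer = {out: n.name for n in nodes for out in n.outputs}

    dag_nodes = [
        {
            "Id": n.name,
            "NodeType": "KedroNode",  # placeholder value, an assumption
            "Args": [{"Name": "inputs", "Value": ",".join(n.inputs)}],
        }
        for n in nodes
    ]

    # One edge per input that another node produces; inputs with no
    # producer (raw data loaded from storage) get no edge.
    dag_edges = [
        {"Source": producer[i], "Target": n.name, "TargetParameter": i}
        for n in nodes
        for i in n.inputs
        if i in producer
    ]
    return dag_nodes, dag_edges


nodes = [
    NodeSpec("clean", ["raw"], ["cleaned"]),
    NodeSpec("train", ["cleaned"], ["model"]),
]
dag_nodes, dag_edges = to_glue_dag(nodes)
```

The resulting lists are shaped like the DagNodes and DagEdges described on that documentation page; whether Glue can generate a usable script from arbitrary Kedro node functions is a separate question this sketch does not answer.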