great-expectations / great_expectations

Always know what to expect from your data.
https://docs.greatexpectations.io/
Apache License 2.0
10.02k stars 1.55k forks

AWS Glue Catalog Data Connector Support #4945

Closed lccasagrande closed 2 years ago

lccasagrande commented 2 years ago

Is your feature request related to a problem? Please describe. It would be awesome to have a DataConnector to list all databases and tables available in AWS Glue Data Catalog. Instead of using the InferredAssetS3DataConnector to infer all data assets in S3 by matching a regex pattern, the framework could provide a GlueDataCatalogDataConnector to list all data assets. This would be similar to what InferredAssetSqlDataConnector already does.

Describe the solution you'd like The idea is to have two new DataConnectors (e.g. InferredAssetGlueCatalogDataConnector and ConfiguredAssetGlueCatalogDataConnector) that use the AWS Glue Data Catalog to list all available data assets instead of inferring them from S3. With boto3, it is possible to retrieve not just the database and table names but also the S3 location where the data is stored. Therefore, we can use the SparkDFExecutionEngine with PathBatchSpec, just as InferredAssetS3DataConnector does, without installing any additional packages. The other option is to use awswrangler, a package that can also list all databases and tables in the Glue Catalog.
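The boto3 lookup described above could be sketched roughly as follows. This is only an illustration, not the plugin mentioned in this issue: the function name `list_glue_assets` and the shape of the returned records are assumptions, and the function takes an already-created Glue client (for example `boto3.client("glue")`) so it can be exercised without AWS credentials.

```python
from typing import Any, Dict, List


def list_glue_assets(glue_client: Any) -> List[Dict[str, str]]:
    """Return one record per Glue table: database, table, and S3 location.

    `glue_client` is assumed to be a boto3 Glue client, e.g. boto3.client("glue").
    The function name and record layout are hypothetical, for illustration only.
    """
    assets: List[Dict[str, str]] = []

    # Page through every database in the catalog.
    database_pages = glue_client.get_paginator("get_databases").paginate()
    for database_page in database_pages:
        for database in database_page["DatabaseList"]:
            # Page through every table in this database.
            table_pages = glue_client.get_paginator("get_tables").paginate(
                DatabaseName=database["Name"]
            )
            for table_page in table_pages:
                for table in table_page["TableList"]:
                    # Some entries (e.g. views) may have no storage location.
                    location = table.get("StorageDescriptor", {}).get("Location", "")
                    assets.append(
                        {
                            "database": database["Name"],
                            "table": table["Name"],
                            "location": location,
                        }
                    )
    return assets
```

Each returned `location` is the kind of S3 path that could then be handed to a path-based batch spec, which is what makes the Spark execution engine usable without extra packages.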

Describe alternatives you've considered The first alternative I found is to use InferredAssetS3DataConnector, but we would have to define multiple data connectors when the data lake has different path patterns, which can become really messy. The other alternative is to connect the framework to Athena through the InferredAssetSqlDataConnector, but this would require us to use another service just to access the AWS Glue Data Catalog. The solution I am using today is a custom plugin I developed to extend the framework and provide this integration with AWS Glue Data Catalog. It would be great if we could have this integration out of the box.

Additional context I am opening this feature request because it would be awesome to have this integration out of the box, so other teams would not need to develop a custom plugin to get this functionality like I had to. If you folks are interested in having this feature, I am open to sharing what I developed. I have developed a series of plugins to extend the integration between the framework and AWS services (e.g. SNS, Glue Job and Glue Data Catalog). Great Expectations is an amazing framework and easy to work with, whether creating expectations or custom plugins. Nice work, folks!

dcupif commented 2 years ago

Hey @lccasagrande, would love to check out what you've done. I am currently looking for solutions for automated data quality tests, and since we are using the Glue Data Catalog, I can definitely relate to your post! Ideally, I would like to run expectation suites in a serverless way in Glue, pointing to datasets referenced in the Glue Data Catalog. :+1:

talagluck commented 2 years ago

Thanks so much for opening this issue and providing this feedback, @lccasagrande! This sounds like an amazing contribution! Would you please open up a draft PR with it, and then we can continue the conversation there? If you have any questions or need any guidance prior to doing that, please let us know. Looking forward!

darkCoffy commented 2 years ago

@lccasagrande very interested in this! Currently trying to integrate with glue and not having much luck...

lccasagrande commented 2 years ago

Nice to hear that, I will try to submit a PR by the end of this weekend. I was able to integrate with Glue by using both RuntimeDataConnector and InferredAssetS3DataConnector. When using the inferred connector, I had to use Glue 2.0 instead of the newest version. In the execution engine, I had to set force_reuse_spark_context to True.
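For readers unsure where that flag goes, here is a minimal sketch of a V3-API datasource configuration with `force_reuse_spark_context` set on the execution engine. It is not the commenter's actual configuration: the datasource name, connector name, bucket, prefix, and regex are all placeholder assumptions.

```python
# Hypothetical datasource config; every name, bucket, and pattern below is a
# placeholder chosen for illustration, not taken from this thread.
datasource_config = {
    "name": "my_glue_job_datasource",
    "class_name": "Datasource",
    "execution_engine": {
        "class_name": "SparkDFExecutionEngine",
        # Reuse the Spark context that the Glue job has already created
        # instead of letting the engine stop and restart it.
        "force_reuse_spark_context": True,
    },
    "data_connectors": {
        "my_inferred_s3_connector": {
            "class_name": "InferredAssetS3DataConnector",
            "bucket": "my-example-bucket",
            "prefix": "data/",
            "default_regex": {
                # The captured group becomes the data asset name.
                "pattern": r"data/(.*)\.parquet",
                "group_names": ["data_asset_name"],
            },
        },
    },
}
```

A dictionary like this would typically be passed to the data context when registering the datasource; the exact call depends on the Great Expectations version in use.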

keerthiis commented 2 years ago

@lccasagrande - We are also very interested in this and are trying to integrate with Glue. Would love to contribute !

lccasagrande commented 2 years ago

@talagluck I have just opened a PR to add support for Glue Data Catalog (#5123). Lemme know if I need to change or fix anything.

darkcofy commented 2 years ago

@lccasagrande thanks for the suggestion to drop to Glue 2.0. Would it be possible to share a snippet of the config for the data connector object in the data config? I'm not sure about force_reuse_spark_context.

lccasagrande commented 2 years ago

Hey @alfredHerdwatch, happy to help. If you want, you can create a new thread on Stack Overflow and we can discuss more there. I will share a snippet with you asap.

darkcofy commented 2 years ago

https://stackoverflow.com/questions/72293200/great-expectations-v3-api-in-aws-glue-3-0 @lccasagrande TIA!