Hey @lccasagrande, would love to check out what you've done. I am currently looking for solutions for automated data quality tests, and since we are using the Glue Data Catalog, I can definitely relate to your post! Ideally, I would like to run expectation suites in a serverless way in Glue, pointing at datasets referenced in the Glue Data Catalog. :+1:
Thanks so much for opening this issue and providing this feedback, @lccasagrande! This sounds like an amazing contribution! Would you please open up a draft PR with it, and then we can continue the conversation there? If you have any questions or need any guidance prior to doing that, please let us know. Looking forward!
@lccasagrande very interested in this! Currently trying to integrate with Glue and not having much luck...
Nice to hear that! I will try to submit a PR by the end of this weekend. I was able to integrate with Glue by using both the RuntimeDataConnector and the InferredAssetS3DataConnector. When using the inferred connector I had to use Glue 2.0 instead of the newest version. In the execution engine I had to set force_reuse_spark_context to True.
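For anyone attempting the same setup, here is a minimal sketch of what such a datasource config could look like with the V3 (Batch Request) API. The datasource name, connector names, bucket, prefix, and regex are placeholders, not taken from this thread:

```python
import great_expectations as ge

# Hypothetical config; all names, the bucket, and the prefix are placeholders.
datasource_config = {
    "name": "glue_spark_datasource",
    "class_name": "Datasource",
    "execution_engine": {
        "class_name": "SparkDFExecutionEngine",
        # Reuse the SparkContext already created by the Glue job instead of
        # letting Great Expectations try to start a new one.
        "force_reuse_spark_context": True,
    },
    "data_connectors": {
        "runtime_connector": {
            "class_name": "RuntimeDataConnector",
            "batch_identifiers": ["run_id"],
        },
        "inferred_s3_connector": {
            "class_name": "InferredAssetS3DataConnector",
            "bucket": "my-data-lake-bucket",
            "prefix": "raw/",
            "default_regex": {
                "pattern": r"(.*)/.*\.parquet",
                "group_names": ["data_asset_name"],
            },
        },
    },
}

context = ge.get_context()
context.add_datasource(**datasource_config)
```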
@lccasagrande - We are also very interested in this and are trying to integrate with Glue. Would love to contribute!
@talagluck I have just opened a PR to add support for the Glue Data Catalog (#5123). Let me know if I need to change or fix anything.
@lccasagrande thanks for the suggestion to drop to Glue 2.0. Would it be possible to share a snippet of the config for the data connector object in the datasource config? I'm not sure about force_reuse_spark_context.
Hey @alfredHerdwatch, happy to help. If you want, you can create a new thread on Stack Overflow and we can discuss more there. I will share a snippet with you as soon as possible.
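In the meantime, a rough sketch of how the RuntimeDataConnector from a config like the one above could be used inside a Glue job, assuming the table is readable through the Glue Data Catalog via Spark and that `context` and the datasource from the earlier sketch already exist. All database, table, suite, and identifier names are placeholders:

```python
from great_expectations.core.batch import RuntimeBatchRequest

# `spark` is the SparkSession created by the Glue job; the database, table,
# suite, and run_id values below are placeholders.
df = spark.table("my_glue_database.my_table")  # table registered in the Glue Data Catalog

batch_request = RuntimeBatchRequest(
    datasource_name="glue_spark_datasource",
    data_connector_name="runtime_connector",
    data_asset_name="my_glue_database.my_table",
    runtime_parameters={"batch_data": df},   # pass the in-memory Spark DataFrame
    batch_identifiers={"run_id": "manual-run-1"},
)

validator = context.get_validator(
    batch_request=batch_request,
    expectation_suite_name="my_suite",
)
results = validator.validate()
```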
**Is your feature request related to a problem? Please describe.**
It would be awesome to have a `DataConnector` that lists all databases and tables available in the AWS Glue Data Catalog. Instead of using the `InferredAssetS3DataConnector` to infer all data assets in S3 by matching a regex pattern, the framework could provide a `GlueDataCatalogDataConnector` that lists all data assets, similar to what the `InferredAssetSqlDataConnector` already does.

**Describe the solution you'd like**
The idea is to have two new `DataConnector`s (e.g. `InferredAssetGlueCatalogDataConnector` and `ConfiguredAssetGlueCatalogDataConnector`) that use the AWS Glue Data Catalog to list all available data assets instead of inferring them from S3. Using boto3, it is possible to retrieve not just the database and table names but also the S3 location where the data is stored (a minimal boto3 sketch of this lookup appears below). Therefore, we can use the `SparkDFExecutionEngine` with a `PathBatchSpec`, just as the `InferredAssetS3DataConnector` does, to list all data assets without installing any additional packages. The other option is to use awswrangler, which is a great package for listing all databases and tables from the Glue Catalog.

**Describe alternatives you've considered**
The first alternative I found is the `InferredAssetS3DataConnector`, but we would have to define multiple data connectors when the data lake has different path patterns, which can become really messy. The other alternative is to connect the framework to Athena through the `InferredAssetSqlDataConnector`, but this would require using another service just to reach the AWS Glue Data Catalog. The solution I use today is a custom plugin that extends the framework and provides this integration with the AWS Glue Data Catalog. It would be great if we could have this integration out of the box.

**Additional context**
I am opening this feature request because it would be awesome to have this integration out of the box, so other teams would not need to develop a custom plugin to get this functionality like I had to. If you folks are interested in this feature, I am open to sharing what I developed. I have built a series of plugins to extend the integration between the framework and AWS services (e.g. SNS, Glue Jobs, and the Glue Data Catalog). Great Expectations is an amazing framework and easy to work with, whether creating expectations or custom plugins. Nice work, folks!
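For reference, a minimal boto3 sketch of the catalog lookup described above: listing the tables in a Glue database and resolving each one to its S3 location, which a connector of this kind could hand to the Spark engine as path-based batches. The database name is a placeholder:

```python
import boto3

glue = boto3.client("glue")

# List every table in a Glue database and resolve its S3 location.
paginator = glue.get_paginator("get_tables")
for page in paginator.paginate(DatabaseName="my_glue_database"):  # placeholder database
    for table in page["TableList"]:
        asset_name = f"{table['DatabaseName']}.{table['Name']}"
        s3_location = table["StorageDescriptor"]["Location"]  # e.g. s3://bucket/prefix/
        print(asset_name, "->", s3_location)
```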