ibis-project / ibis

the portable Python dataframe library
https://ibis-project.org
Apache License 2.0
5.15k stars 590 forks

feat: AWS Athena backend (and general AWS connections) #7682

Closed: lostmygithubaccount closed this 2 months ago

lostmygithubaccount commented 10 months ago

Consider how to support generic AWS authentication and backends for AWS services, including but not limited to Athena.


Hi @lostmygithubaccount,

Thanks for the reply, and sorry for the delayed response. I was a bit occupied with other work, so I wasn't able to spend time on this.

I had a look at the postgresql backend, but I'm wondering how to make a connection to Athena using postgresql. At the moment all I have is AWS credentials such as AccessKeyId and SecretAccessKey, and I am not sure how to pass these in the args.

If possible, could you please post a sample code snippet that makes a connection to AWS Athena using the postgresql backend?

Originally posted by @uramith in https://github.com/ibis-project/ibis/discussions/7229#discussioncomment-7649716
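For reference, Athena is queried through the AWS APIs rather than the PostgreSQL wire protocol, so the postgresql backend has no place to put an access key pair. A minimal sketch of connecting with just those credentials, assuming the third-party PyAthena package is installed, might look like the following. `ATHENA_S3_STAGING_DIR` is a hypothetical variable name made up for this sketch, and the `us-east-1` fallback is arbitrary; the `AWS_*` names are the standard AWS environment variables.

```python
import os


def athena_connect_kwargs(env=os.environ):
    """Collect PyAthena connection arguments from environment variables.

    ATHENA_S3_STAGING_DIR is a hypothetical name used only in this sketch;
    it should point at an S3 location where Athena may write query results.
    """
    kwargs = {
        "aws_access_key_id": env["AWS_ACCESS_KEY_ID"],
        "aws_secret_access_key": env["AWS_SECRET_ACCESS_KEY"],
        "s3_staging_dir": env["ATHENA_S3_STAGING_DIR"],
        # Arbitrary fallback region, chosen only for illustration.
        "region_name": env.get("AWS_DEFAULT_REGION", "us-east-1"),
    }
    # Temporary credentials also carry a session token.
    if env.get("AWS_SESSION_TOKEN"):
        kwargs["aws_session_token"] = env["AWS_SESSION_TOKEN"]
    return kwargs


def run_query(sql, env=os.environ):
    """Run a query through PyAthena's DB-API cursor and return all rows."""
    # Imported lazily so the helper above works without PyAthena installed.
    from pyathena import connect

    cursor = connect(**athena_connect_kwargs(env)).cursor()
    cursor.execute(sql)
    return cursor.fetchall()
```

With the variables exported, `run_query("SELECT 1")` should return a single row, assuming the account has Athena access and write permission on the staging location. This goes through PyAthena directly and not through an Ibis backend.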

jayceslesar commented 9 months ago

Possibly this could build off of https://github.com/laughingman7743/PyAthena? I remember suggesting this a long time ago, but there were concerns about how it could be tested without, you know, having an AWS account in CI, etc.

EnkiNibiru commented 9 months ago

This feature would be valuable to me too. It'd probably be good to reuse functionality that's already common and built out in other AWS-maintained packages.

For example, the AWS SDK for Pandas uses boto3 sessions for authentication. That includes a default session, which is a nice way to connect to AWS as long as the environment is already configured for other AWS tools such as the AWS CLI. I tend to rely on its credential lookup order to autoload from a credentials/config file created with the AWS CLI and to refresh any session tokens, but I know others may prefer refreshing the standard AWS environment variables instead.

One other pro of this approach is that the config/credentials files used by boto3 sessions are also what pyarrow uses for its authentication into AWS when reading/writing parquet files on S3. So this may work nicely with to_parquet/read_parquet and S3 file systems as well. It's also what PyAthena, mentioned above, uses. In practice this is just nice to work with in my experience: get the AWS authentication working once, and then I can use the same configs for multiple packages (AWS SDK for Pandas, PyArrow, boto3, PyAthena, etc.).
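To illustrate the lookup order described above, here is a deliberately simplified sketch using only the standard library: environment variables first, then a profile in the shared credentials file. It is not boto3's implementation, which checks more sources (the config file, SSO, instance roles, and so on); the function name and the shape of the returned dict are made up for illustration.

```python
import configparser
import os
from pathlib import Path


def resolve_aws_credentials(env=os.environ, credentials_path=None, profile=None):
    """Return a dict of credentials, or None if neither source has them."""
    # 1. Standard AWS environment variables take priority.
    key = env.get("AWS_ACCESS_KEY_ID")
    secret = env.get("AWS_SECRET_ACCESS_KEY")
    if key and secret:
        return {
            "source": "env",
            "access_key": key,
            "secret_key": secret,
            "token": env.get("AWS_SESSION_TOKEN"),
        }

    # 2. Fall back to the INI-style shared credentials file.
    path = Path(
        credentials_path
        or env.get("AWS_SHARED_CREDENTIALS_FILE")
        or Path.home() / ".aws" / "credentials"
    )
    profile = profile or env.get("AWS_PROFILE", "default")
    parser = configparser.ConfigParser()
    parser.read(path)  # a missing file is silently ignored
    if parser.has_section(profile):
        section = parser[profile]
        if "aws_access_key_id" in section and "aws_secret_access_key" in section:
            return {
                "source": "file",
                "access_key": section["aws_access_key_id"],
                "secret_key": section["aws_secret_access_key"],
                "token": section.get("aws_session_token"),
            }
    return None
```

In a real backend it would likely be simpler to accept a boto3 session (or the keyword arguments used to build one) and let boto3 do this resolution itself.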

Separate from the authentication topic, the AWS SDK for Pandas might make for a good SQL backend for Athena as well, as it implements the standard SQL operators directly in the Athena and Glue services. That would likely mean the Ibis connection object needs to cover some config options, the main one being the different approaches to getting data from AWS back to the Python session, which have a big impact on performance. But if all we need is authentication, then the SQL dialect in Trino (what Athena is based on) ought to get us pretty close too.

Hope the references above are helpful if this gets picked up, thanks!

cpcloud commented 9 months ago

Agreed that we should work towards making it easier to add support for backends that are ostensibly derivative of existing systems.

It's very likely that we won't get to this until after #7580 (or a sequence of its changes) is merged and released, as we'd like to get away from sqlalchemy before supporting more backends.

gforsyth commented 2 months ago

If someone wants to try handing a pyathena DB-API connection object to the trino.from_connection constructor and see if that's tractable, we can look at what else might be required to make this work. The docs on PyAthena reference dumping query results as CSV to a bucket and then downloading that CSV -- if that's the pattern, we would probably hold off on adding this until there is proper ADBC support.
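A rough sketch of that experiment, untested against a real Athena account, could look like this. It assumes both ibis (with the Trino backend) and PyAthena are installed; the helper name `athena_via_trino` is made up here, and nothing guarantees that the Trino backend's queries will actually run against Athena.

```python
def athena_via_trino(s3_staging_dir, region_name):
    """Wrap a PyAthena DB-API connection in ibis's Trino backend (experimental)."""
    # Lazy imports: both packages are optional dependencies for this experiment.
    import ibis
    from pyathena import connect

    # With no explicit keys, PyAthena relies on boto3's default credential lookup.
    raw_con = connect(s3_staging_dir=s3_staging_dir, region_name=region_name)

    # from_connection is the constructor named above; whether the Trino backend's
    # metadata queries work against Athena is what this is meant to find out.
    return ibis.trino.from_connection(raw_con)
```

Calling it with a staging location and region (for example a hypothetical `s3://my-bucket/athena-results/` and `us-east-1`) and then trying `list_tables()` on the result would be a reasonable first check of what breaks.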