laughingman7743 / PyAthena

PyAthena is a Python DB API 2.0 (PEP 249) client for Amazon Athena.
MIT License

Impl Polars cursor #436

Open laughingman7743 opened 1 year ago

laughingman7743 commented 1 year ago

https://www.pola.rs/
https://pypi.org/project/polars/
https://pola-rs.github.io/polars/py-polars/html/reference/

darkcofy commented 1 year ago

polars cursor would be a godsend!

sacundim commented 1 year ago

Polars uses Arrow as its memory representation, so, as I understand it, supporting Polars in PyAthena is mostly just a syntactic shortcut, right? Polars' documentation for the from_arrow method says:

This operation will be zero copy for the most part. Types that are not supported by Polars may be cast to the closest supported type.

So except for that note about unsupported types, the following code should have basically no overhead already today:

import polars as pl
import pyathena
from pyathena.arrow.cursor import ArrowCursor

cursor = pyathena.connect(
    s3_staging_dir="s3://YOUR_S3_BUCKET/path/to/",
    region_name="us-west-2",
    cursor_class=ArrowCursor,
).cursor()

# This should be zero-copy most of the time
polars_df = pl.from_arrow(cursor.execute("SELECT * FROM many_rows").as_arrow())

I actually tried PyAthena → Arrow → Polars in this fashion the other day, so I can at least confirm that it is functional: it populates a working Polars DataFrame. I did not verify anything about copying or performance overhead.
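
If the two-step conversion above gets repeated a lot, it could be wrapped in a small helper. The following is only a sketch: `query_to_polars` and its `from_arrow` parameter are hypothetical names, not part of PyAthena, and the cursor is assumed to behave like the ArrowCursor used above (`execute()` returns an object with an `as_arrow()` method).

```python
from typing import Any, Callable, Optional


def query_to_polars(
    cursor: Any,
    sql: str,
    from_arrow: Optional[Callable[[Any], Any]] = None,
) -> Any:
    """Run `sql` on an Arrow-backed cursor and hand the Arrow table to Polars.

    Hypothetical helper, not a PyAthena API. `from_arrow` defaults to
    `polars.from_arrow`; it is a parameter only so that the conversion
    step can be swapped out (for example in tests).
    """
    if from_arrow is None:
        # Import lazily so Polars is only required when it is actually used.
        import polars as pl

        from_arrow = pl.from_arrow

    # Same two steps as the snippet above: fetch as Arrow, then convert.
    arrow_table = cursor.execute(sql).as_arrow()
    return from_arrow(arrow_table)
```

With the `cursor` from the snippet above, usage would be `polars_df = query_to_polars(cursor, "SELECT * FROM many_rows")`. The note about unsupported types being cast still applies, since the helper only forwards to `from_arrow`.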

mazzma12 commented 5 months ago

Hi, may I ask which version of pyarrow you are using? I get the following error with version 15.0.0:

OperationalError: When reading information for key 'test/670672a1-dab2-4635-ba3b-1c6a16dc0b6f.csv' in bucket '{BucketName}': AWS Error NETWORK_CONNECTION during HeadObject operation: curlCode: 28, Timeout was reached

Or does anything else come to mind that could explain this error? Thank you.

sacundim commented 5 months ago

This was many months ago, so I no longer recall which version I was using. But your error is, to all appearances, a network connectivity problem; it says so right there in the message.

mazzma12 commented 5 months ago

Yes, I totally agree, but it's cryptic to me since the same query works with another cursor (PandasCursor, for example).
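
One way to pin down a "works with one cursor, fails with another" report is to run the same call through each cursor and record the outcome side by side. This is a minimal sketch: `compare_cursors` is a hypothetical helper, and the cursor objects in the commented usage are assumed to have been created elsewhere with the respective `cursor_class`.

```python
from typing import Callable, Dict


def compare_cursors(runners: Dict[str, Callable[[], object]]) -> Dict[str, str]:
    """Call each runner and record whether it succeeded or how it failed."""
    results: Dict[str, str] = {}
    for name, run in runners.items():
        try:
            run()
            results[name] = "ok"
        except Exception as exc:
            # Record the failure instead of stopping at the first one.
            results[name] = f"{type(exc).__name__}: {exc}"
    return results


# Usage sketch (needs real cursors, so it is left commented out):
# report = compare_cursors({
#     "ArrowCursor": lambda: arrow_cursor.execute("SELECT 1").fetchall(),
#     "PandasCursor": lambda: pandas_cursor.execute("SELECT 1").fetchall(),
# })
```

If only the Arrow entry reports a timeout, that points at the Arrow-side S3 access rather than at general connectivity.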

laughingman7743 commented 4 months ago

FAILED tests/pyathena/arrow/test_cursor.py::TestArrowCursor::test_executemany[arrow_cursor1] - pyathena.error.OperationalError: When getting information for key 'tmp/unload/20240316/a580fb77-99b1-49c8-8f70-cc3eaf663089' in bucket 'laughingman7743-athena': AWS Error NETWORK_CONNECTION during HeadObject operation: curlCode: 28, Timeout was reached
FAILED tests/pyathena/arrow/test_cursor.py::TestArrowCursor::test_fetchall[arrow_cursor0] - pyathena.error.OperationalError: When reading information for key 'tmp/93571118-03bb-4b01-9772-4b1f99dc9f61.csv' in bucket 'laughingman7743-athena': AWS Error NETWORK_CONNECTION during HeadObject operation: curlCode: 28, Timeout was reached
FAILED tests/pyathena/arrow/test_cursor.py::TestArrowCursor::test_executemany_fetch[arrow_cursor1] - pyathena.error.OperationalError: When getting information for key 'tmp/unload/20240316/f7a80ee7-26c4-4103-bf49-c94b29c6eea0' in bucket 'laughingman7743-athena': AWS Error NETWORK_CONNECTION during HeadObject operation: curlCode: 28, Timeout was reached
FAILED tests/pyathena/arrow/test_async_cursor.py::TestAsyncArrowCursor::test_fetchall[async_arrow_cursor0] - pyathena.error.OperationalError: When reading information for key 'tmp/573dd21d-a3fe-4bbe-a7b5-aa1807dfd2a6.csv' in bucket 'laughingman7743-athena': AWS Error NETWORK_CONNECTION during HeadObject operation: curlCode: 28, Timeout was reached
FAILED tests/pyathena/arrow/test_cursor.py::TestArrowCursor::test_complex_unload_as_arrow[arrow_cursor0] - pyathena.error.OperationalError: When getting information for key 'tmp/unload/20240316/f23455fe-6929-439d-864b-d52b55b7be7a' in bucket 'laughingman7743-athena': AWS Error NETWORK_CONNECTION during HeadObject operation: curlCode: 28, Timeout was reached
FAILED tests/pyathena/arrow/test_cursor.py::TestArrowCursor::test_iterator[arrow_cursor0] - pyathena.error.OperationalError: When reading information for key 'tmp/b94d398d-1cac-45b2-b5d3-9210897b6d5f.csv' in bucket 'laughingman7743-athena': AWS Error NETWORK_CONNECTION during HeadObject operation: curlCode: 28, Timeout was reached
FAILED tests/pyathena/arrow/test_async_cursor.py::TestAsyncArrowCursor::test_fetchall[async_arrow_cursor1] - pyathena.error.OperationalError: When getting information for key 'tmp/unload/20240316/c823bf59-cd6b-4e0c-9600-690de08d3f18' in bucket 'laughingman7743-athena': AWS Error NETWORK_CONNECTION during HeadObject operation: curlCode: 28, Timeout was reached
FAILED tests/pyathena/arrow/test_async_cursor.py::TestAsyncArrowCursor::test_iterator[async_arrow_cursor1] - pyathena.error.OperationalError: When getting information for key 'tmp/unload/20240316/c3433f82-a070-4044-be85-1d18786d1311' in bucket 'laughingman7743-athena': AWS Error NETWORK_CONNECTION during HeadObject operation: curlCode: 28, Timeout was reached
FAILED tests/pyathena/arrow/test_cursor.py::TestArrowCursor::test_arraysize[arrow_cursor1] - pyathena.error.OperationalError: When getting information for key 'tmp/unload/20240316/64c7bec3-8281-4807-ae95-fb19ca8d0159' in bucket 'laughingman7743-athena': AWS Error NETWORK_CONNECTION during HeadObject operation: curlCode: 28, Timeout was reached
FAILED tests/pyathena/arrow/test_cursor.py::TestArrowCursor::test_iceberg_table - pyathena.error.OperationalError: When reading information for key 'tmp/bbf31c91-e845-4ccf-8b8b-588147fcf4e7.csv' in bucket 'laughingman7743-athena': AWS Error NETWORK_CONNECTION during HeadObject operation: curlCode: 28, Timeout was reached
FAILED tests/pyathena/arrow/test_async_cursor.py::TestAsyncArrowCursor::test_arraysize[async_arrow_cursor0] - pyathena.error.OperationalError: When reading information for key 'tmp/76e1c3c4-ed18-4a18-ad6f-3fe9dc5db8a1.csv' in bucket 'laughingman7743-athena': AWS Error NETWORK_CONNECTION during HeadObject operation: curlCode: 28, Timeout was reached
FAILED tests/pyathena/arrow/test_async_cursor.py::TestAsyncArrowCursor::test_arraysize[async_arrow_cursor1] - pyathena.error.OperationalError: When getting information for key 'tmp/unload/20240316/4f059472-9026-49ac-892f-cfd47e5eac81' in bucket 'laughingman7743-athena': AWS Error NETWORK_CONNECTION during HeadObject operation: curlCode: 28, Timeout was reached
FAILED tests/pyathena/arrow/test_cursor.py::TestArrowCursor::test_complex_unload[arrow_cursor0] - pyathena.error.OperationalError: When getting information for key 'tmp/unload/20240316/170fbcee-1269-4367-8d3e-b8e43f838b79' in bucket 'laughingman7743-athena': AWS Error NETWORK_CONNECTION during HeadObject operation: curlCode: 28, Timeout was reached
FAILED tests/pyathena/arrow/test_async_cursor.py::TestAsyncArrowCursor::test_description[async_arrow_cursor1] - pyathena.error.OperationalError: When getting information for key 'tmp/unload/20240316/baf70311-3540-4dce-ae43-8aecc19566b1' in bucket 'laughingman7743-athena': AWS Error NETWORK_CONNECTION during HeadObject operation: curlCode: 28, Timeout was reached
FAILED tests/pyathena/arrow/test_async_cursor.py::TestAsyncArrowCursor::test_query_execution[async_arrow_cursor0] - pyathena.error.OperationalError: When reading information for key 'tmp/0723649b-bb5c-450b-baa2-d52ea3d8a7aa.csv' in bucket 'laughingman7743-athena': AWS Error NETWORK_CONNECTION during HeadObject operation: curlCode: 28, Timeout was reached

The errors above occurred when I ran the tests in my local environment. 🤔 They do not occur in GitHub Actions. https://github.com/laughingman7743/PyAthena/issues/520

laughingman7743 commented 4 months ago

https://github.com/apache/arrow/issues/36007