moj-analytical-services / pydbtools

Python version of dbtools
https://moj-analytical-services.github.io/pydbtools/
10 stars 2 forks source link

CCDE-293 pydbtools needs to be updated to latest awswrangler #50

Open isichei opened 2 years ago

isichei commented 2 years ago

There was an error where aws wrangler would not convert parquet dates correctly. This has been fixed in AWS Data Wrangler 2.12.0 (PR that fixed it for context) so we need to point pydbtools to at least this release.

Check that it functions correctly run the workaround that I posted in the PR:

import awswrangler as wr
df = wr.athena.read_sql_query(
    "SELECT mojap_end_datetime xhibit_v1.cases where mojap_latest_record LIMIT 10",
    database="database",
    pyarrow_additional_kwargs={"coerce_int96_timestamp_unit": "ms", "timestamp_as_object": True}
)
df.head()

You can use any SQL query that has an mojap_end_datetime as the default value is 2999-01-01 00:00:00 for latest records. This query should give back correct timestamps (2999-01-01 00:00:00).

If it works you will need to update the pyproject.toml to the relevant awrangler dependency.

isichei commented 2 years ago

Blocked by #51

mratford commented 2 years ago

I don't have access to the xhibit_v1 database at the moment but this fails on the following query.

releases = wr.athena.read_sql_query(
    """
    SELECT DISTINCT C.prison, C.offender_id, S.date_of_release
    FROM nomis_ao.core AS C
    LEFT JOIN nomis_ao.sentences AS S
    ON S.extract_datetime = C.extract_datetime AND S.record_number = C.record_number
    WHERE S.date_of_release IS NOT NULL
    """,
    database = "nomis_ao",
    ctas_approach = False,
    pyarrow_additional_kwargs={
        "coerce_int96_timestamp_unit": "ms", 
        "timestamp_as_object": True
    }
)

gives the AttributeError: Can only use .dt accessor with datetimelike values error.

For reference filtering the year does work

releases = wr.athena.read_sql_query(
    """
    SELECT DISTINCT C.prison, C.offender_id, S.date_of_release
    FROM nomis_ao.core AS C
    LEFT JOIN nomis_ao.sentences AS S
    ON S.extract_datetime = C.extract_datetime AND S.record_number = C.record_number
    WHERE S.date_of_release IS NOT NULL AND YEAR(S.date_of_release) < 2100
    """,
    database = "nomis_ao",
    ctas_approach = False,
    pyarrow_additional_kwargs={
        "coerce_int96_timestamp_unit": "ms", 
        "timestamp_as_object": True
    }
)

It seems not to like years after 2500 - this data looks like typos from prison staff.

mratford commented 2 years ago

https://github.com/moj-analytical-services/pydbtools/tree/give_db_name