laughingman7743 / PyAthena

PyAthena is a Python DB API 2.0 (PEP 249) client for Amazon Athena.
MIT License
461 stars 105 forks source link

If a value for a partition key is None, to_sql doesn't warn you and no data is written #531

Closed AJM10565 closed 6 months ago

AJM10565 commented 6 months ago

Suppose you had some code:

from pyathena.pandas.util import to_sql
import pandas as pd

df = pd.DataFrame({
    "name": ["Bob"],
    "day": ["Monday"],
    "time": ["morning"],
    "dt": ["2022-01-01"],
    "hr": ["00"],
    "source": ["source1"],
    "agent": [None]
})

to_sql(df, "table_name", conn, "s3://bucket_name/path", schema="schema_name", index=False, if_exists="append", partitions=["dt", "hr", "source", "agent"])

When you look at the info logs, you'd see something like:

to_parquet: s3://bucket_name/path/dt=2022-01-01/hr=00/source=source1/agent=None/6a1b10b7-338e-48d8-ba72-cdbfa6a61084

2024-03-29 14:09:37,093 - pyathena.pandas.util - INFO - 14788 - CREATE EXTERNAL TABLE IF NOT EXISTS `schema_name`.`table_name` (
`name` STRING,
`day` STRING,
`time` STRING
)
PARTITIONED BY (
`dt` STRING,
`hr` STRING,
`source` STRING,
`agent` STRING
)
STORED AS PARQUET
LOCATION 's3://bucket_name/path/'

Notice how in the ALTER TABLE statement, agent is 'None'. This results in no data being written to the table, but no warning or error is raised.

Expected behavior: If a partition key has a value of None, either raise an error or warning, or don't include that key in the partition.

Actual behavior: The partition key with a None value is included in the partition with a value of 'None', resulting in no data being written to the table and no warning or error being raised. The line you don't see in the logs is the alter table line, which looks something like this:

2024-03-29 14:09:38,439 - pyathena.pandas.util - INFO - 16134 - ALTER TABLE schema_name.table_name ADD IF NOT EXISTS PARTITION (dt = '2022-01-01', hr = '00', source = 'source1', agent = 'None') LOCATION 's3://bucket_name/path/dt=2022-01-01/hr=00/source=source1/agent=None/'

But its missing from the logs, so its easy to spot when the write fails. It would be nice if a warning log could be added to the effect of

for partition_key in paritition_keys: if partition_key is None: self.logger.warning(f"Partition key: {partition_key} is None, no data will be written to the table.")

Let me know if you have any other suggestions or feedback for the issue report.