sfu-db / connector-x

Fastest library to load data from DB to DataFrames in Rust and Python
https://sfu-db.github.io/connector-x
MIT License

`read_sql` with `return_type` in `"arrow2", "polars", "polars2"` returns a wrong date when the date is `>= 2262/04/12` #495

Open romiof opened 1 year ago

romiof commented 1 year ago

What language are you using?

Python

What version are you using?

connectorx 0.3.2a5

What database are you using?

Tested with Oracle and SQL Server.

What dataframe are you using?

polars 0.17.12

Can you describe your bug?

Selecting date columns whose values are in the far future produces inconsistent results. For instance, any date >= 2262/04/12 reproduces this bug.

What are the steps to reproduce the behavior?

# ORACLE DB
import connectorx as cx
query = """SELECT TO_DATE('2262/04/11', 'YYYY/MM/DD') AS col_ok 
                , TO_DATE('2262/04/12', 'YYYY/MM/DD') AS col_bugs_here
                , TO_DATE('2299/01/01', 'YYYY/MM/DD') AS col_bugs_here_too
          FROM dual"""

df = cx.read_sql(ENGINE_PATH, query, return_type="arrow2")  # ENGINE_PATH: Oracle connection string
print(df)

pyarrow.Table
COL_OK: timestamp[ns]
COL_BUGS_HERE: timestamp[ns]
COL_BUGS_HERE_TOO: timestamp[ns]
----
COL_OK: [[2262-04-11 00:00:00.000000000]]
COL_BUGS_HERE: [[1677-09-21 00:25:26.290448384]]
COL_BUGS_HERE_TOO: [[1714-06-13 00:25:26.290448384]]

The dates in the second and third columns blow up. This happens when return_type is None, "arrow2", "polars", or "polars2". With return_type = "arrow", however, the result is correct:

df = cx.read_sql(ENGINE_PATH, query, return_type="arrow")

pyarrow.Table
COL_OK: date64[ms]
COL_BUGS_HERE: date64[ms]
COL_BUGS_HERE_TOO: date64[ms]
----
COL_OK: [[2262-04-11]]
COL_BUGS_HERE: [[2262-04-12]]
COL_BUGS_HERE_TOO: [[2299-01-01]]

I hit this error through Polars when loading data from a database whose columns contain this kind of date.

Testing in SQL Server with query = "SELECT CAST('2262-04-12' AS DATETIME2)" triggers the same bug.
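
A hedged sketch of that SQL Server repro; the connection string MSSQL_URL is my placeholder, and the expected wrapped value is inferred from the Oracle output above:

import connectorx as cx

MSSQL_URL = "mssql://username:password@server:1433/database"  # placeholder
df = cx.read_sql(MSSQL_URL, "SELECT CAST('2262-04-12' AS DATETIME2) AS d",
                 return_type="arrow2")
print(df)  # presumably wraps to 1677-09-21 00:25:26... as in the Oracle case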

What is the error?

A very weird date is selected instead of the original date. Maybe some kind of memory overflow?

alexander-beedie commented 1 year ago

It seems it's forcing the precision to ns; most databases only support ms or us precision though, so this is a bit overkill (and results in your overflow/wraparound) 😅
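
For context (my aside, not from the thread, but easy to verify): Arrow's timestamp[ns] stores nanoseconds since the Unix epoch in a signed 64-bit integer, so the last representable instant is 2262-04-11 23:47:16.854775807 UTC, which is exactly where the reported cutoff sits. A quick plain-Python check:

import datetime

# i64::MAX nanoseconds since the Unix epoch is the ceiling of timestamp[ns]
max_ns = 2**63 - 1
epoch = datetime.datetime(1970, 1, 1)
print(epoch + datetime.timedelta(microseconds=max_ns // 1000))
# 2262-04-11 23:47:16.854775 -> any later instant overflows and wraps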

klaerik commented 12 months ago

I was using 0.3.1 and didn't see this behavior, so it is something new with 0.3.2. It's an issue with arrow2 and polars output.

This seems similar to the issue Pandas has using nanosecond precision everywhere, which causes us lots of issues with far-future dates in the database, particularly the end dates in type II dimensions.
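
For comparison (again my aside, easy to verify): pandas' nanosecond-based Timestamp documents exactly that range, and the garbled 1677-09-21 values earlier in this thread land just past its minimum, the classic signature of i64 wraparound:

import pandas as pd

# the representable range of a signed-64-bit-nanosecond timestamp
print(pd.Timestamp.min)  # 1677-09-21 00:12:43.145224193
print(pd.Timestamp.max)  # 2262-04-11 23:47:16.854775807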

Note that the new version does fix another bug I came here to post about: the Oracle DATE datatype is actually a timestamp, but connectorx was converting it to date32. 0.3.1 truncated DATE columns with time components and returned them in date32[day] format, which is also wrong. 0.3.2 returns timestamps for Oracle DATE columns, which is right, but at the overly granular nanosecond precision that now garbles the dates.

0.3.1 - Large DATE values come in correctly but return datatype truncates time

>>> query = """
...     select
...         date '9999-12-31' as extreme_date,
...         date '2023-01-01' as regular_date,
...         cast(timestamp '2023-01-01 12:34:56' as date) as date_w_time
...     from dual
... """
>>>
>>> connectorx.__version__
'0.3.1'
>>> df = connectorx.read_sql(
...     conn=url,
...     query=query,
...     return_type="arrow2",
... )
>>> 
>>> df
pyarrow.Table
EXTREME_DATE: date32[day]
REGULAR_DATE: date32[day]
DATE_W_TIME: date32[day]
----
EXTREME_DATE: [[9999-12-31]]
REGULAR_DATE: [[2023-01-01]]
DATE_W_TIME: [[2023-01-01]]
>>>

0.3.2 - Large DATE values are garbled due to nanosecond precision, but are timestamps

>>> query = """
...     select 
...         date '9999-12-31' as extreme_date,
...         date '2023-01-01' as regular_date,
...         cast(timestamp '2023-01-01 12:34:56' as date) as date_w_time
...     from dual
... """
>>>
>>> connectorx.__version__
'0.3.2'
>>> df = connectorx.read_sql(
...     conn=url,
...     query=query,
...     return_type='arrow2',
... )
>>> 
>>> df
pyarrow.Table
EXTREME_DATE: timestamp[ns]
REGULAR_DATE: timestamp[ns]
DATE_W_TIME: timestamp[ns]
----
EXTREME_DATE: [[1816-03-29 05:56:08.066277376]]
REGULAR_DATE: [[2023-01-01 00:00:00.000000000]]
DATE_W_TIME: [[2023-01-01 12:34:56.000000000]]
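
As a sanity check (mine, not connectorx internals): the garbled EXTREME_DATE above is exactly what signed 64-bit nanosecond wraparound predicts for 9999-12-31:

import datetime

# nanoseconds since the Unix epoch for 9999-12-31 00:00:00, as exact integers
days = (datetime.date(9999, 12, 31) - datetime.date(1970, 1, 1)).days
ns = days * 86400 * 10**9
wrapped = (ns + 2**63) % 2**64 - 2**63  # simulate i64 two's-complement wraparound
epoch = datetime.datetime(1970, 1, 1)
print(epoch + datetime.timedelta(microseconds=wrapped // 1000))
# -> 1816-03-29 05:56:08.066277, matching the garbled EXTREME_DATE above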

klaerik commented 12 months ago

Looking into this a little further, it appears this was already an issue in earlier versions with timestamp fields: nanosecond precision fails for very large or very small dates. It was just masked in Oracle in versions prior to 0.3.2 due to the other bug, which converted Oracle DATEs (which, again, are really timestamps) into date32[day] format.

romiof commented 12 months ago

@klaerik this bug appears with any date or datetime >= 2262/04/12

I'm using Polars, and I worked around it by using the cx return_type = "arrow" and then polars.from_arrow.
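
A minimal sketch of that workaround, reusing ENGINE_PATH and query from the repro at the top of this issue:

import connectorx as cx
import polars as pl

# read via the "arrow" return type (date64[ms] above, so no ns overflow),
# then hand the Arrow table to Polars
tbl = cx.read_sql(ENGINE_PATH, query, return_type="arrow")
df = pl.from_arrow(tbl)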