pacman82 / arrow-odbc-py

Read Apache Arrow batches from ODBC data sources in Python

Support reading datetime2 as timestamp[ms] #69

Closed jonashaag closed 11 months ago

jonashaag commented 11 months ago

I have a column that is datetime2(prec=27, scale=7, length=8). However, all values are actually dates:

```sql
-- Result: 0
SELECT COUNT(*)
FROM ...
WHERE (DATEDIFF_BIG(nanosecond, '1800-01-01', col) % 86400000000000) > 0
```

If the column contains dates outside the range of timestamp[ns], loading the data with arrow-odbc-py fails.

I wonder if and how arrow-odbc-py could automatically use a timestamp type with a larger range than timestamp[ns] in this case.

For example, we could add an option to allow truncation of timestamps.
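For context: Arrow's timestamp[ns] is an int64 count of nanoseconds since the Unix epoch, which limits it to roughly the years 1677–2262, while datetime2 covers years 0001–9999. A quick standard-library sketch of the timestamp[ns] bounds:

```python
import datetime

# timestamp[ns] is an int64 count of nanoseconds since the Unix epoch.
# timedelta only has microsecond resolution, which is enough to see the bounds.
epoch = datetime.datetime(1970, 1, 1)
lo = epoch + datetime.timedelta(microseconds=-(2**63) // 1000)
hi = epoch + datetime.timedelta(microseconds=(2**63 - 1) // 1000)
print(lo, hi)  # roughly 1677-09-21 .. 2262-04-11
```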

pacman82 commented 11 months ago

The Rust crate allows passing an Arrow schema, in case the application developer has insight that is not reflected in the schema reported by the driver. Arrow does have a Date type, doesn't it?

jonashaag commented 11 months ago

I'd be more interested in having something done automatically, since I'm writing generic/library code.

pacman82 commented 11 months ago

It already works automatically if the type reported by the driver is SQL_TYPE_DATE. If the schema information is wrong, or the driver has problems relaying it precisely, input from the application developer is needed.

pacman82 commented 11 months ago

> I'd be more interested in having something done automatically, since I'm writing generic/library code.

I think I need to better understand what you mean by this. Would a function which returns the automatically deduced Arrow schema be helpful?

You could then apply generic logic to that schema to your heart's content, together with the ability to feed it back into the creation of the reader.
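For illustration, that generic logic could be a plain pyarrow schema transformation. A minimal sketch that widens every timestamp[ns] field to timestamp[ms] (the function producing the deduced schema does not exist yet at this point, so only the transformation is shown):

```python
import pyarrow as pa

def widen_timestamps(schema: pa.Schema) -> pa.Schema:
    """Replace every timestamp[ns] field with timestamp[ms]; leave others as-is."""
    fields = [
        f.with_type(pa.timestamp("ms")) if f.type == pa.timestamp("ns") else f
        for f in schema
    ]
    return pa.schema(fields)
```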

pacman82 commented 11 months ago

Alternatively: what do you think about a function you can pass in which manipulates the schema?

What I currently rule out are any decisions based on the values of the fields in the table themselves; that would really mess with the stream-based nature of the package. Still, applications are free to restart the stream based on what they see.

jonashaag commented 11 months ago

Yeah, that would work well! If there were a way to pass the first batch of data to the callback, that might be even more general purpose, although I don't know what representation could be used for the data in that case, other than all-strings.

Just a note: I'm not aware of any other library that uses a callback to modify the schema. Usually you can only pass the final schema directly, without a callback. It probably still is the better design for arrow-odbc-py!

jonashaag commented 11 months ago

Btw, I'm also fine with fully solving this problem on my end, by dynamically fetching the schema and adding casts to the query.
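A minimal sketch of that workaround, with placeholder table, column, and connection string: since every value in the column is a pure date, casting it to DATE in the query makes the driver report SQL_TYPE_DATE, which arrow-odbc already maps to an Arrow date type on its own:

```python
from arrow_odbc import read_arrow_batches_from_odbc

# "my_table", "col", and the connection string are placeholders.
reader = read_arrow_batches_from_odbc(
    query="SELECT CAST(col AS DATE) AS col FROM my_table",
    connection_string="Driver={ODBC Driver 18 for SQL Server};...",
)
for batch in reader:
    ...  # each item is a pyarrow.RecordBatch
```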

pacman82 commented 11 months ago

arrow-odbc 2.1.0 has been released. It allows for specifying a schema parameter in order to override the desired target Arrow schema. In theory, a generic application could already be written on top of that: one could instantiate the reader twice, the first time to obtain the automatically deduced schema, and the second time with the manipulated, actually desired target schema. Of course, this involves an extra roundtrip to the database and is wasteful.
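A minimal sketch of the new parameter, using the column from the original report; table name and connection string are placeholders:

```python
import pyarrow as pa
from arrow_odbc import read_arrow_batches_from_odbc

# Override the deduced timestamp[ns] with timestamp[ms], whose int64
# millisecond range easily covers all of datetime2 (years 0001-9999).
desired = pa.schema([("col", pa.timestamp("ms"))])
reader = read_arrow_batches_from_odbc(
    query="SELECT col FROM my_table",
    connection_string="Driver={ODBC Driver 18 for SQL Server};...",
    schema=desired,
)
```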

If I were to go the second step, I would probably provide a function to just create the cursor. The user could then inspect and manipulate the schema. With a second function call, the cursor would then be turned into the reader.

jonashaag commented 11 months ago

Thank you! For now, I consider this fixed (assuming it works, which I'll try out soon).