trinodb / trino-python-client

Python client for Trino
Apache License 2.0
309 stars 151 forks source link

Documentation of `type_code` #348

Open abhiaagarwal opened 1 year ago

abhiaagarwal commented 1 year ago

Describe the feature

I'm working with adding some plumbing between our backend system and Trino, and we're trying to derive more type information to speed up Python. Trino gives access to column information with the cursor.description method, but it gives a type_code integer. I've been struggling to figure out what type_code maps to, and my google-fu has failed me. I'd appreciate any sort of guidance.

Describe alternatives you've considered

I've been running queries to try and manually figure out type_code, but it hasn't been exhaustive.

Are you willing to submit PR?

abhiaagarwal commented 1 year ago

I did a bit of investigation by myself, and discovered a few things:

1) The standard follows PEP-249, the DBAPI standard, so it's technically correct to have type_code returned rather than the type itself. 2) While type_code should be an integer, it is actually a string. For example, a BOOLEAN trino value returns boolean. I'll attach a table below of the various mappings I've found.

I think the best solution is to change type_code to actually an integer, and then introduce a mapping function/table from these integers to their string representation, both trino and python-based types. This can also help reduce the boilerplate in functions like RowMapperFactory. The constructs relevant to PEP249 type_code matching are already available.

I'm willing to open a PR for this, but it might be a somewhat wide-reaching change.

ASIDE: some types, and their type_code representation Trino Type type_code
BOOLEAN boolean
TINYINT tinyint
SMALLINT smallint
INTEGER integer
BIGINT bigint
REAL real
DOUBLE double
DECIMAL decimal(10, 7)
VARCHAR varchar
CHAR char(1)
VARBINARY varbinary
JSON json
DATE date
TIMESTAMP timestamp(0)
TIMESTAMP_TIMEZONE timestamp(3) with time zone
INTERVAL_MONTH INTERVAL YEAR TO MONTH
INTERVAL_DAY INTERVAL DAY TO SECOND
ARRAY array(integer)
IPADDRESS ipaddress
UUID uuid
hashhar commented 1 year ago

Nice find @abhiagarwal01, this makes sense to fix. It'll be a breaking change but it'll make us follow the PEP correctly and allow third party tools to work better which make use of this information.

We'd appreciate if you want to submit a PR for this. Just beware that it probably won't be in the upcoming release since we like to include breaking changes in separate releases.

abhiaagarwal commented 1 year ago

@hashhar Thank you! I'll likely be refactoring any type-specific logic to a separate types.py file (and this can be used in the sqlalchemy module as well).

hashhar commented 1 year ago

Is it possible to do this in two steps - separate PRs or different commits in same PR? Fix the type codes being used and any REQUIRED changes. Refactor sqlalchemy and dbapi.py to reduce boilerplate type handling?

abhiaagarwal commented 1 year ago

I think the idea of a type_code requires some level of refactoring, given that this library currently has no logical way to group multiple different data types. What I'll do is first a minimalistic PR that just fixes the type_code by defining constants in constants.py, and then add a separate PR that does any major refactoring. Does that sound good?

hashhar commented 1 year ago

That sounds ideal. Thanks.