sfu-db / connector-x

Fastest library to load data from DB to DataFrames in Rust and Python
https://sfu-db.github.io/connector-x
MIT License

Support JSONB fallback for polars #442

Open sslivkoff opened 1 year ago

sslivkoff commented 1 year ago

The problem

When loading postgres JSONB columns to polars, connectorx currently throws

RuntimeError: No conversion rule from JSONB(true) to connectorx::destinations::arrow::typesystem::ArrowTypeSystem.

Related: #121

The solution

For sqlite tables with JSON columns, connectorx simply loads these columns as strings without parsing them into JSON. This is a nice fallback, and it would be nice to have the same behavior for postgresql as well instead of throwing an error.

This would be very helpful when loading a set of rows that happen to include JSON columns: the query would no longer break, and you could still decode the JSON text in the application if you need it.
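If such a fallback existed, decoding in the application could look roughly like the sketch below (hypothetical: it assumes the note column comes back as a polars String column holding raw JSON text, which is not current connectorx behavior):

import json

import polars as pl

# hypothetical: 'note' arrives as a String column containing raw JSON text
df = pl.DataFrame({'note': ['{"version": 2}', '{"version": 3}']})

# decode only where the application actually needs the parsed values
decoded = [json.loads(s) for s in df['note'].to_list()]
print(decoded)  # [{'version': 2}, {'version': 3}]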

Minimal example to reproduce

# create table and insert rows

import psycopg

connect_str = 'dbname={dbname} user={username}'

create_sql = 'CREATE TABLE IF NOT EXISTS example ( id BIGINT, name TEXT, note JSONB )'
insert_sql = 'INSERT INTO example VALUES (%s, %s, %s)'
insert_rows = [
    [1, 'first', psycopg.types.json.Jsonb({'version': 2})],
    [2, 'second', psycopg.types.json.Jsonb({'version': 3})],
    [3, 'third', psycopg.types.json.Jsonb({'version': 2})],
]
with psycopg.connect(connect_str) as conn:
    with conn.cursor() as cursor:
        cursor.execute(create_sql)
        cursor.executemany(insert_sql, insert_rows)
# query data with connectorx

import connectorx

result = connectorx.read_sql(
    conn='postgresql://{username}@localhost:5432/{dbname}',
    query='SELECT * FROM example',
    return_type='polars',
)

>>> RuntimeError: No conversion rule from JSONB(true) to connectorx::destinations::arrow::typesystem::ArrowTypeSystem.
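A workaround that already works today is to cast the JSONB column to text inside the query, so connectorx sees an ordinary TEXT column (a minimal sketch; {username} and {dbname} are placeholders as above):

import connectorx

# casting note to text sidesteps the missing JSONB conversion rule
result = connectorx.read_sql(
    conn='postgresql://{username}@localhost:5432/{dbname}',
    query='SELECT id, name, note::text AS note FROM example',
    return_type='polars',
)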
universalmind303 commented 1 year ago

Polars now has a native "Binary" type. It seems like a better fit, and more performant, to simply pass the JSONB data through as the Binary type.
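For illustration, if JSONB were mapped to a polars Binary column as suggested, recovering the JSON in the application might look like this sketch (assumptions: the bytes are UTF-8 encoded JSON text; none of this reflects current connectorx behavior):

import json

import polars as pl

# hypothetical: 'note' arrives as a Binary column holding UTF-8 JSON bytes
df = pl.DataFrame({'note': [b'{"version": 2}', b'{"version": 3}']})

decoded = [json.loads(b.decode('utf-8')) for b in df['note'].to_list()]
print(decoded)  # [{'version': 2}, {'version': 3}]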

marcosdetry commented 1 year ago

The same happens with the Redshift "SUPER" type, which is where JSON data is usually stored; at the moment it gives a "type not implemented" error.
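A similar query-side workaround may apply for Redshift, assuming the connectorx redshift:// connection scheme and Redshift's JSON_SERIALIZE function are available (an unverified sketch; the table and column names here are made up):

import connectorx

# JSON_SERIALIZE converts the SUPER column to VARCHAR, so it arrives as a plain string
result = connectorx.read_sql(
    conn='redshift://{username}:{password}@{host}:5439/{dbname}',
    query='SELECT id, JSON_SERIALIZE(payload) AS payload FROM events',
    return_type='polars',
)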