pacman82 / arrow-odbc-py

Read Apache Arrow batches from ODBC data sources in Python
MIT License

how to deal with varchar(max) columns in mssql #56

Open TheDataScientistNL opened 11 months ago

TheDataScientistNL commented 11 months ago

Hi, I am using polars==0.19.7, which now includes ODBC support through arrow-odbc-py (arrow-odbc==1.2.8).

When running the code shown below, arrow-odbc raises an error.

import polars as pl

USERNM = ''
PWD = ''
DBNAME = ''
HOST = ''
PORT = ''

CONN = f"Driver={{ODBC Driver 17 for SQL Server}};Server={HOST};Port={PORT};Database={DBNAME};Uid={USERNM};Pwd={PWD}"

df = pl.read_database(
    connection=CONN,
    query="SELECT varchar_max_col FROM [dbo].[tablname]",
)

with the error being:

_arrow_odbc.error.Error: There is a problem with the SQL type of the column with name: varchar_maxcol and index 0: ODBC reported a size of '0' for the column. This might indicate that the driver cannot specify a sensible upper bound for the column. E.g. for cases like VARCHAR(max). Try casting the column into a type with a sensible upper bound. The type of the column causing this error is Varchar { length: 0 }.

I can easily resolve this by editing the query:

df = pl.read_database(
    connection=CONN,
    query="SELECT CAST(varchar_max_col AS VARCHAR(100)) AS varchar_max_col FROM [dbo].[tablname]",
)

(The other option is to change the column type in the database, but that is not something you always want to, or can, do.)

However, since varchar(max) columns still occur frequently in databases, I was wondering whether arrow-odbc could support this natively, i.e. detect varchar(max) columns and fetch them without throwing an error.

I hope this is the right place to ask the question, because I am not sure if this is arrow-odbc related or ODBC driver related...

pacman82 commented 11 months ago

Hello @TheDataScientistNL ,

the best way to deal with VARCHAR(max) is to set the max_text_size parameter. See the documentation here: https://arrow-odbc.readthedocs.io/en/latest/arrow_odbc.html#arrow_odbc.read_arrow_batches_from_odbc

You are not using read_arrow_batches_from_odbc directly but via polars, where I think this integration was added only yesterday. Please ask the maintainers of polars how to forward this parameter, or use arrow-odbc directly.
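For reference, a minimal sketch of the direct route via read_arrow_batches_from_odbc and its max_text_size parameter, reusing the connection string and column name from the example above (the 4096-character bound is an arbitrary placeholder; pick one that fits your data):

import polars as pl
from arrow_odbc import read_arrow_batches_from_odbc

# Cap unbounded text columns at 4096 characters so the driver can
# allocate fixed-size fetch buffers; longer values are truncated.
reader = read_arrow_batches_from_odbc(
    query="SELECT varchar_max_col FROM [dbo].[tablname]",
    connection_string=CONN,
    max_text_size=4096,
)

# Each batch is a pyarrow.RecordBatch; collect them into a polars DataFrame.
df = pl.concat([pl.from_arrow(batch) for batch in reader])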

Best, Markus

pacman82 commented 11 months ago

> I hope this is the right place to ask the question, because I am not sure if this is arrow-odbc related or ODBC driver related...

Neither; it is related to the ODBC standard itself. It is an inherent limitation of the API. Avoid VARCHAR(max), TEXT, or similar unbounded types in schema declarations if you want fast bulk fetches. I take back what I said earlier: the best way to deal with this is to fix the schema, if possible.

alexander-beedie commented 11 months ago

And I was so hoping to avoid a mystery-meat **kwargs pass-through for all the different connection flavours we now support 🤣 I'll think about the cleanest thing we can expose.
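To illustrate the trade-off being weighed here (both spellings below are hypothetical sketches, not actual polars API at the time of this thread):

# opaque pass-through: every extra keyword is forwarded blindly to the driver layer
df = pl.read_database(connection=CONN, query=QUERY, max_text_size=4096)

# explicit container: still forwarded, but discoverable and documentable
df = pl.read_database(connection=CONN, query=QUERY, execute_options={"max_text_size": 4096})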

pacman82 commented 11 months ago

Just typing on my phone right now, so I will keep it short. I can sympathise with that. I wouldn't recommend a passthrough at all.