pacman82 / arrow-odbc

Fill Apache Arrow record batches from an ODBC data source in Rust.
MIT License

When setting a max binary/text size, choose the larger value? #96

Closed: mach-kernel closed this issue 6 months ago

mach-kernel commented 7 months ago

Hello! Thank you for these crates, it was really easy to get up and running!!

I am trying to read from an ODBC data source (Simba/Databricks) backed by a columnar store that has a particularly large column (a file, potentially several megabytes). Unfortunately, something is wrong with the result set metadata: the reported column size is too small for the values actually returned. The same happens with large text columns (e.g. a base64 representation of bytes):

Bin case:

Varbinary { length: Some(32767) }
called `Result::unwrap()` on an `Err` value: ArrowError { source: ExternalError(TooLargeValueForBuffer { indicator: Some(73456), buffer_index: 3 }) }

Text case: the result set metadata reports e.g. Varchar(255) for 10k+ characters of text.
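For reference, this is roughly how I am configuring the caps via arrow-odbc's OdbcReaderBuilder. A sketch from memory: the method names with_max_text_size / with_max_binary_size and the 1 MiB values are my assumptions, so please double check against the crate docs.

```rust
use arrow_odbc::{odbc_api::Cursor, OdbcReader, OdbcReaderBuilder};

/// Build a record batch reader with explicit caps on variadic columns.
/// The 1 MiB values are placeholders; pick something that fits your data.
fn reader_with_caps<C: Cursor>(cursor: C) -> Result<OdbcReader<C>, arrow_odbc::Error> {
    OdbcReaderBuilder::new()
        // Upper bound (in bytes) for the buffer element size of text columns.
        .with_max_text_size(1024 * 1024)
        // Upper bound for VARBINARY / BLOB-like columns.
        .with_max_binary_size(1024 * 1024)
        .build(cursor)
}
```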


EDIT: Bummer, looks like the driver-provided metadata is incorrect:

Query: select raw_email from messages

statement.execute().unwrap();
let mut desc = ColumnDescription::default();
// Ask the driver to describe column 1 of the result set.
statement.describe_col(1, &mut desc).unwrap();
ColumnDescription { name: [114, 97, 119, 95, 101, 109, 97, 105, 108], data_type: Varbinary { length: Some(32767) }, nullability: Nullable }

thread 'tokio-runtime-worker' panicked at crates/db_connection_pool/src/dbconnection/odbcconn.rs:142:78:
called `Result::unwrap()` on an `Err` value: ArrowError { source: ExternalError(TooLargeValueForBuffer { indicator: Some(842374), buffer_index: 0 }) }

Is it correct to assume the metadata is coming from SQLDescribeCol, and by extension from the driver?


I understand the semantics as written are to choose an upper bound (which this change undoes for large results with accurate sizes), but I wanted to ask whether an affordance could be made for choosing the larger of the two values (maybe a Quirk like "result_set_alloc_max_sizes")? With that change I am able to read result batches fine, whereas when the buffer is too small we cannot proceed at all. I will try to assemble some test case data; I figure this is enough to start the conversation 😸.
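To make the ask concrete, the selection logic I have in mind is roughly the following. This is a plain illustration only, not arrow-odbc internals; the function names are made up and the "current" behaviour is my reading of the semantics described above.

```rust
/// Current behaviour as I understand it: the configured value is an upper
/// bound, so a driver that under-reports column sizes still ends up with a
/// buffer that is too small for the actual data.
fn current_element_size(reported: usize, configured_max: usize) -> usize {
    reported.min(configured_max)
}

/// Proposed quirk: trust whichever value is larger, so an under-reporting
/// driver still gets a buffer big enough to hold the values it returns.
fn quirk_element_size(reported: usize, configured: usize) -> usize {
    reported.max(configured)
}
```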

pacman82 commented 6 months ago

Hello @mach-kernel ,

Yeah, this is IMHO a bug in the Simba drivers. They should report an upper bound. However, I can roughly understand why they implemented it this way: many software packages cannot deal with really large buffers. To my knowledge it is even worse, though (indirectly deduced from logs in bug reports submitted to me): the Simba drivers do not even handle the truncation correctly. This can lead to silent data loss. Not the best option.

However, if I remember correctly, there may be an option that can be passed in the connection string to replace the default reported buffer size of 255 with a value of your choosing.
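Something along these lines, writing from memory: the attribute key below is a placeholder, not the real name; the exact option is documented by Simba/Databricks, and this sketch assumes a recent odbc-api with ConnectionOptions.

```rust
use odbc_api::{ConnectionOptions, Environment};

fn connect() -> Result<(), odbc_api::Error> {
    let env = Environment::new()?;
    // Driver-specific attributes are simply appended to the connection string.
    // "SomeStringLengthOption" is a placeholder; look up the real key that
    // overrides the reported default column length of 255 in the driver docs.
    let conn_str = "DSN=Databricks;SomeStringLengthOption=1048576;";
    let _conn = env.connect_with_connection_string(conn_str, ConnectionOptions::default())?;
    Ok(())
}
```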

Best, Markus

pacman82 commented 6 months ago

This PR I have to reject. For ODBC drivers that are implemented correctly, the user needs the ability to define an upper bound, not a lower bound. Reversing the meaning, as done here, would break a lot of code.

I am open to giving the user fine-grained control over buffer sizes for individual columns.

mach-kernel commented 6 months ago

Markus,

Thank you very much for pointing us in the right direction! If anybody else runs into this, the parameters are:

pacman82 commented 6 months ago

Hello @mach-kernel ,

you are welcome. Thanks for sharing your findings!