pacman82 / arrow-odbc

Fill Apache Arrow record batches from an ODBC data source in Rust.
MIT License

TooLargeValueForBuffer error despite increasing max_bytes_per_batch and setting with_max_text_size to absurd numbers #97

Closed: matthewgapp closed this issue 6 months ago

matthewgapp commented 6 months ago

Hey, @pacman82! Again, I can't thank you enough for this lib!

Running into an issue where we continue getting the following error despite increasing MAX_TEXT_SIZE to a value greater than 1384. Any help would be much appreciated!

Screenshot of the error output:

Arrow { source: ExternalError(TooLargeValueForBuffer { indicator: Some(1384), buffer_index: 1 }) }

As additional context, I've tried increasing MAX_BYTES_PER_BATCH to some absurd numbers, but no dice. I've also tried removing with_fallable_allocations and removing with_max_text_size.
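For context, the reader is built roughly like this (a simplified sketch: connection setup is omitted, the sizes are placeholders for the values I actually tried, and the exact signatures may differ by arrow-odbc version):

```rust
use arrow_odbc::{odbc_api::Cursor, Error, OdbcReader, OdbcReaderBuilder};

// Sketch of the configuration described above. `cursor` comes from
// executing the query with odbc_api; the sizes are placeholders.
fn build_reader<C: Cursor>(cursor: C) -> Result<OdbcReader<C>, Error> {
    OdbcReaderBuilder::new()
        // Upper bound on the memory used by a single record batch.
        .with_max_bytes_per_batch(512 * 1024 * 1024)
        // Upper bound / fallback element size for variadic text columns.
        .with_max_text_size(65_536)
        .build(cursor)
}
```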

Update: the SQL type for the problematic field was coming in as VARCHAR(255). Converted to octets (UTF-8), that yields an expected length of 1020 (255 × 4). As you can see in the error message, the actual length of the value was 1384. The source is Databricks, so perhaps Databricks isn't respecting its own schema? As a workaround, I modified this line to take the max of max_str_len and the length reported by the type. But that causes issues with my other sources, because the row size is now estimated as roughly max_str_len times the number of columns, which turns into a huge number.

So, to accommodate the messiness of the world I'm working with, I think I'd prefer, as a user of this API, to be able to provide both an expected text size and a max text size. The expected text size would be used to allocate the buffers, and the max text size would be used to return an error (or panic).
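In builder terms, I'm imagining something like the following. This is pure pseudocode: with_expected_text_size is not a real arrow-odbc method, it just sketches the shape of the proposal.

```rust
// Hypothetical API sketch -- `with_expected_text_size` does NOT exist in
// arrow-odbc; it only illustrates the expected-vs-max idea.
let reader = OdbcReaderBuilder::new()
    // Pre-allocate the transit buffers for text columns at this size.
    .with_expected_text_size(1_024)
    // Hard ceiling: values longer than this would still surface as an
    // error (or panic) instead of being truncated silently.
    .with_max_text_size(32_768)
    .build(cursor)?;
```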

matthewgapp commented 6 months ago

I provided an update above. I can take a stab at implementing the expected_size hint feature if you think it would be a useful addition to the API. I often don't know what value to pick, so I pick a large one, but that requires a ton of up-front allocation, which is slow. My suggestion is a middle ground: allocate up front for the expected size, but keep the max as a limit that still turns into an error in the worst case.

pacman82 commented 6 months ago

This has been cropping up more often recently. Are you by any chance using a Simba ODBC driver? See: https://github.com/pacman82/arrow-odbc/pull/96

By default, the Simba drivers always report 255 as the column size instead of correctly reporting that they do not know the upper bound. Since the application thinks it knows the column size, it does not apply the configured max text size, which is intended as a fallback for exactly this case.
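To spell the fallback out, the buffer sizing decision behaves roughly like this. This is a simplified illustration of the behavior described above, not the actual code in the crate.

```rust
// Simplified illustration of the behavior described above; not the
// crate's actual implementation.
fn text_buffer_len(driver_reported_size: usize, configured_max_text_size: usize) -> usize {
    if driver_reported_size == 0 {
        // The driver admits it does not know the upper bound, so the
        // configured max text size is used as a fallback.
        configured_max_text_size
    } else {
        // The driver claims to know the size (Simba: 255 by default), so
        // the buffer is sized from that claim and the configured max text
        // size never applies. Longer values then surface as
        // TooLargeValueForBuffer.
        driver_reported_size
    }
}
```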

In my eyes it could make sense to let the user override the driver-reported values. I don't see how a distinction between an expected and a max size would help, though. Anything larger than the preallocated buffers won't work.

matthewgapp commented 6 months ago

Thanks @pacman82, I set DefaultStringColumnLength=0 in the simba config file and everything works as expected (since this lib uses the max length setting when it encounters a zero-length text column).
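For anyone else hitting this with the Databricks/Simba driver, the whole fix was that one driver setting. I set it in the Simba .ini config file; I believe many Simba-based drivers also accept it as a connection-string attribute, but that part is an assumption, so check your driver's docs. Roughly:

```rust
// Assumed connection-string form of the workaround; I actually set
// DefaultStringColumnLength=0 in the Simba driver's .ini config file.
// Driver name, host, and port are placeholders for my setup.
const CONNECTION_STRING: &str = concat!(
    "Driver=Simba Spark ODBC Driver;",
    "Host=<your-databricks-host>;",
    "Port=443;",
    // 0 means "length unknown", which makes arrow-odbc fall back to the
    // configured max text size for those columns.
    "DefaultStringColumnLength=0;",
);
```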

Fwiw, I didn't realize that anything over the preallocated buffers wouldn't work. The expected size configuration option makes no sense if the underlying buffers can't be dynamically extended.

pacman82 commented 6 months ago

Hello @matthewgapp,

Yeah, sadly, extending the buffers in case of truncation is not in the cards. I have prototypes for this using scrollable cursors, but it's not their intended use, and even where drivers support it, the behavior is inconsistent between the ones that do.

ODBC 4.0 seems to specify this behavior, but I am unsure how the standard is governed, and when or if it will ever be rolled out.