pacman82 / arrow-odbc-py

Read Apache Arrow batches from ODBC data sources in Python
MIT License

A value (at least one) is too large to be written into the allocated buffer without truncation #52

Closed leijer closed 1 year ago

leijer commented 1 year ago

When a query against Databricks has a string column and a value in that column is longer than 256 characters, the following error is returned:

arrow_odbc.error.Error: External error: A value (at least one) is too large to be written into the allocated buffer without truncation.

This error occurs even when setting max_text_size to a size larger than the string data in the column.

reader = read_arrow_batches_from_odbc(
    query=sql_query,
    connection_string=self.odbc_connection_string,
    batch_size=batch_size,
    max_text_size=1024000,
)

odbc driver is Simba Spark ODBC Driver

odbc_connection_string = f"""
    Driver=Simba Spark ODBC Driver;
    HOST={dbx_hostname};
    PORT=443;
    Schema=default;
    SparkServerType=3;
    AuthMech=3;
    UID=token;
    PWD={dbx_token};
    ThriftTransport=2;
    SSL=1;
    HTTPPath={dbx_endpoint};
"""

pacman82 commented 1 year ago

Hello @leijer ,

Thanks for filing a bug report.

I'll probably have a hard time reproducing this myself, so I would need your help here.

Could you enable logging and share the log output with me?

from arrow_odbc import log_to_stderr

log_to_stderr(3)  # Enable logging with verbosity 3 ~ Debug

# Other code using arrow_odbc

In addition to that, the relational types of the table you are querying would be helpful, i.e. the types used in a CREATE TABLE statement.

Best, Markus

pacman82 commented 1 year ago

Hello @leijer ,

I followed a hunch and indeed found a bug which could cause false positives in reporting truncation. Of course, it remains to be seen whether it is related to your issue.

A fix has already been released. arrow-odbc 1.2.5 is ready to be tested by you.

Best, Markus

leijer commented 1 year ago

Hi Markus, that was so fast. I'll need a day or so to test this out. Thanks!

leijer commented 1 year ago

Hi Markus, unfortunately I am getting the same error, and I am not seeing any additional logs with verbosity 3 ~ Debug.

File "c:....venv\lib\site-packages\arrow_odbc**reader.py", line 76, in next** raise_on_error(error) File "c:....venv\lib\site-packages\arrow_odbc\error.py", line 30, in raise_on_error raise Error(error_out) arrow_odbc.error.Error: External error: A value (at least one) is too large to be written into the allocated buffer without truncation.

pacman82 commented 1 year ago

There should be some log message emitted to standard error before that, unless you capture that stream somehow. There must be at least one diagnostic ODBC record emitted.

Without additional information I would just have to take the error at face value, which would read as: there is at least one string larger than max_text_size=1024000 in the data you query. The error is intended to be raised so users do not destroy information without knowing. Admittedly, though, there is no simple boolean in the artefact yet to just order arrow-odbc to truncate the string.

Best, Markus

leijer commented 1 year ago

Hi Markus. The max length is 1056 characters, in just one row. When I exclude this row the query runs fine. I do get a separate log entry of "DEBUG - ODBC Environment created." with log_to_stderr turned on, but that is it.

I've tested select LEFT(column, x) to work out where the error starts occurring, and it is at 511 characters, so it is probably running up against a 512 byte limit somewhere. OS: Windows Server 2019, Windows 10 Enterprise.
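For reference, the probing looked roughly like the sketch below (table and column names are placeholders, not the real schema):

from arrow_odbc import read_arrow_batches_from_odbc

# Placeholder table/column names; shrink the column until the error disappears.
for n in (1056, 512, 511, 510, 256):
    try:
        reader = read_arrow_batches_from_odbc(
            query=f"SELECT LEFT(long_col, {n}) AS probe FROM my_table",
            connection_string=odbc_connection_string,
            batch_size=1000,
        )
        for batch in reader:
            pass
        print(f"LEFT(long_col, {n}): ok")
    except Exception as error:
        print(f"LEFT(long_col, {n}): {error}")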

Simba driver can be downloaded here: https://www.databricks.com/spark/odbc-drivers-download Databricks is free to try: https://www.databricks.com/try-databricks#account

Here is some code to create a table which can run inside a notebook.

from pyspark.sql import SparkSession
import pandas as pd
from delta.tables import *

spark = SparkSession.builder.appName("TEST").master("local").getOrCreate()

spark.sql("CREATE DATABASE IF NOT EXISTS dbx_table_demo")
spark.sql("DROP TABLE IF EXISTS dbx_table_demo.table_creation_demo")

df = pd.DataFrame({'my_strong_col': ['a long string...', None]})

# automatically create databricks table from the dataframe
df_spark = spark.createDataFrame(df)
df_spark.write \
    .mode("append") \
    .saveAsTable("dbx_table_demo.table_creation_demo")

pacman82 commented 1 year ago

Hello @leijer ,

thanks for the additional detail. Frankly, I am baffled. Here is where my bewilderment is coming from. This is the only place in the entire ODBC bindings where this error is raised:

if error_for_truncation
    && result == SqlResult::SuccessWithInfo(())
    && buffer.has_truncated_values()
{
    Err(Error::TooLargeValueForBuffer)
}

SqlResult::SuccessWithInfo indicates that diagnostic records are present, yet we see none. The check is actually an optimization: the expensive truncation check is only performed if there is at least one diagnostic record. This should work, because every truncation should cause one. So right there the driver seems to be doing something sketchy. However, so far all we would do is perform a needlessly expensive check. has_truncated_values compares the size indicated by the driver with the size the buffer could hold at maximum.
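To illustrate the idea, here is a simplified Python sketch of that comparison (not the actual Rust code in odbc-api):

def has_truncated_values(indicators, max_bytes_per_value):
    # Sketch only: each indicator is the byte size the driver reports for the
    # full value. If it exceeds what the pre-bound buffer can hold per element,
    # that value was truncated.
    return any(
        indicator is not None and indicator > max_bytes_per_value
        for indicator in indicators
    )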

I currently suspect one or more bugs in the ODBC driver. To verify that suspicion I'll release a new version of arrow-odbc with more details in the error message. This should enable us to see what (size) indicator the driver returned. If it is smaller, then the fault is likely to be somewhere in arrow-odbc or the odbc-api bindings. If it is a lot larger, then the error is likely in the Simba ODBC driver, or we made a mistake determining the maximum number of characters in that column.

Once we know which it is, we can think about how to fix or work around the situation.

Best, Markus

pacman82 commented 1 year ago

Hello @leijer ,

better diagnostics for this error are still in progress. In the meantime, could you run your test again with the newest version, arrow-odbc 1.2.6? By accident I released the bugfix only for variadic binaries, not text columns. The issue would only occur with NULL values anyway, so at least from your description so far yours seems different. Yet, you could give it a spin if you want. Otherwise I will keep you posted once I have released the better diagnostics.

Best, Markus

leijer commented 1 year ago

Hi @pacman82 , thanks for 1.2.6. I have tested it but get the same error.

pacman82 commented 1 year ago

Thanks for testing. I expected as much from your description, namely that this is something else. I'll forward some details in the error message which should help us decide whether this is a driver error or not.

pacman82 commented 1 year ago

Hello @leijer ,

arrow-odbc 1.2.7 is out. If you run this version you will see a similar error message, but this one will tell you the value of the indicator which caused the truncation. The indicator is a value reported by the ODBC driver which tells you the size in bytes of the variadic value you are fetching. Could you kindly rerun and report the error here?

I am curious to see what length is reported. This is the value indicating how long your ODBC driver says the encoded value is in bytes.

Best, Markus

leijer commented 1 year ago

Hi @pacman82 This is the new error:

arrow_odbc.error.Error: External error: A value (at least one) is too large to be written into the allocated buffer without truncation. Size in bytes indicated by ODBC driver: 2112

A SQL statement querying the length of the field, len(fieldname), returns a length of 1056 characters.

pacman82 commented 1 year ago

Hello @leijer ,

thanks for running again and collaborating!

We are starting to get somewhere. First, let's note that 2112 = 2 x 1056, which is the size in bytes a string of length 1056 might take on a Windows system, which defaults to UTF-16 encoding (in transit; the final arrow buffer would contain UTF-8 again).

I am surprised. I would have suspected the driver to report an unreasonably large value. My understanding is that you set max_text_size=1024000, which is a high enough upper bound. It is, however, only an upper bound. If the driver reports upfront that the maximum element size is smaller than max_text_size, arrow-odbc is mindful and, to save memory, only allocates a buffer that large. For example: a VARCHAR(512) column might only get a transit buffer of 512 or 1024 bytes, even if max_text_size is larger.
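In other words, the sizing rule is roughly the following (a sketch of the behaviour described above, not the actual arrow-odbc code):

def transit_buffer_chars(reported_column_length, max_text_size):
    # max_text_size is only an upper bound; a smaller length reported by the
    # driver wins, to keep the pre-bound transit buffer small.
    if not reported_column_length:
        return max_text_size  # driver gave no usable upper bound
    return min(reported_column_length, max_text_size)

print(transit_buffer_chars(512, 1024000))  # 512, for a VARCHAR(512) column
print(transit_buffer_chars(255, 1024000))  # 255, which is what happens here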

Three potential sources for the bug come to mind:

  1. The max_text_size argument is handled incorrectly.
  2. The allocation code in arrow-odbc is buggy.
  3. The ODBC driver reports a column size that is too small.

I'll double check 1 and 2. If I find nothing I'll add logging for 3. Then we will know.

Best, Markus

pacman82 commented 1 year ago

Hello @leijer ,

arrow-odbc 1.2.8 is released, featuring debug logging around the relational types reported by the ODBC driver. Could you kindly run the new version and share the debug log with me?

Best, Markus

leijer commented 1 year ago

Hi @pacman82 , correct, max_text_size=1024000.

Here is the extra logging:

DEBUG - ODBC driver reported for column 1. Relational type: Varchar { length: 255 }; Nullability: Nullable; Name: 'mt_run_requested_by';
DEBUG - ODBC driver reported for column 2. Relational type: Timestamp { precision: 6 }; Nullability: Nullable; Name: 'mt_run_scheduled_time';
DEBUG - ODBC driver reported for column 3. Relational type: Varchar { length: 255 }; Nullability: Nullable; Name: 'mt_sim_name';
DEBUG - ODBC driver reported for column 4. Relational type: Integer; Nullability: Nullable; Name: 'mt_sim_days';
DEBUG - ODBC driver reported for column 5. Relational type: Integer; Nullability: Nullable; Name: 'mt_sim_paths';
DEBUG - ODBC driver reported for column 6. Relational type: Integer; Nullability: Nullable; Name: 'mt_n_sims';
DEBUG - ODBC driver reported for column 7. Relational type: Varchar { length: 255 }; Nullability: Nullable; Name: 'mt_run_comment';
DEBUG - ODBC driver reported for column 8. Relational type: Integer; Nullability: Nullable; Name: 'dbx_run_id';
DEBUG - ODBC driver reported for column 9. Relational type: Timestamp { precision: 6 }; Nullability: Nullable; Name: 'dbx_run_start_time';
DEBUG - ODBC driver reported for column 10. Relational type: Timestamp { precision: 6 }; Nullability: Nullable; Name: 'dbx_run_finish_time';
DEBUG - ODBC driver reported for column 11. Relational type: Bit; Nullability: Nullable; Name: 'dbx_run_succeeded';
DEBUG - ODBC driver reported for column 12. Relational type: Varchar { length: 255 }; Nullability: Nullable; Name: 'dbx_run_error';
DEBUG - Relational type of column 1: Varchar { length: 255 }
DEBUG - Relational type of column 3: Varchar { length: 255 }
DEBUG - Relational type of column 7: Varchar { length: 255 }
DEBUG - Relational type of column 12: Varchar { length: 255 }
ERROR NoneType: None

File ".venv\lib\site-packages\arrow_odbc\reader.py", line 76, in next raise_on_error(error) File ".venv\lib\site-packages\arrow_odbc\error.py", line 30, in raise_on_error raise Error(error_out) arrow_odbc.error.Error: External error: A value (at least one) is too large to be written into the allocated buffer without truncation. Size in bytes indicated by ODBC driver: 2112

Thanks

pacman82 commented 1 year ago

Okay, my understanding of the issue is increasing. This is the display size reported for each text column:

DEBUG - Relational type of column 1: Varchar { length: 255 }
DEBUG - Relational type of column 3: Varchar { length: 255 }
DEBUG - Relational type of column 7: Varchar { length: 255 }
DEBUG - Relational type of column 12: Varchar { length: 255 }

On the plus side, it matches the column description emitted by the driver above it:

DEBUG - ODBC driver reported for column 1. Relational type: Varchar { length: 255 }; Nullability: Nullable; Name: 'mt_run_requested_by';
DEBUG - ODBC driver reported for column 2. Relational type: Timestamp { precision: 6 }; Nullability: Nullable; Name: 'mt_run_scheduled_time';
DEBUG - ODBC driver reported for column 3. Relational type: Varchar { length: 255 }; Nullability: Nullable; Name: 'mt_sim_name';
DEBUG - ODBC driver reported for column 4. Relational type: Integer; Nullability: Nullable; Name: 'mt_sim_days';
DEBUG - ODBC driver reported for column 5. Relational type: Integer; Nullability: Nullable; Name: 'mt_sim_paths';
DEBUG - ODBC driver reported for column 6. Relational type: Integer; Nullability: Nullable; Name: 'mt_n_sims';
DEBUG - ODBC driver reported for column 7. Relational type: Varchar { length: 255 }; Nullability: Nullable; Name: 'mt_run_comment';
DEBUG - ODBC driver reported for column 8. Relational type: Integer; Nullability: Nullable; Name: 'dbx_run_id';
DEBUG - ODBC driver reported for column 9. Relational type: Timestamp { precision: 6 }; Nullability: Nullable; Name: 'dbx_run_start_time';
DEBUG - ODBC driver reported for column 10. Relational type: Timestamp { precision: 6 }; Nullability: Nullable; Name: 'dbx_run_finish_time';
DEBUG - ODBC driver reported for column 11. Relational type: Bit; Nullability: Nullable; Name: 'dbx_run_succeeded';
DEBUG - ODBC driver reported for column 12. Relational type: Varchar { length: 255 }; Nullability: Nullable; Name: 'dbx_run_error';

We can see that all text columns are reported by the driver to hold at most 255 characters. It is not strictly consistent between different implementations of ODBC drivers whether the column length (here 255) is to be interpreted as binary length or as characters. To be on the safe side, arrow-odbc chooses to interpret it as characters. On Windows, to be independent of the system encoding (which is still likely to be e.g. Latin-1 on western systems rather than UTF-8), a UTF-16 encoding is chosen.

In UTF-16 each codepoint (character) takes at most two 16-bit integers, yet most characters fit neatly into one. So, to be safe and also work for "strange" characters, twice as much memory is allocated as would be required for a "normal" string. This is why you see it breaking above 510 characters: the ODBC driver reports a size of at most 255 characters, and to keep working in the presence of Klingon characters the actual amount is doubled, yet your strings are normal text.
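Spelled out with the numbers from this thread:

reported_length = 255          # characters, as claimed by the driver
bytes_per_utf16_code_unit = 2
worst_case_units_per_char = 2  # surrogate pairs, the "Klingon" case

# Per-value transit buffer: 255 * 2 * 2 = 1020 bytes, which holds at most
# 510 "normal" (single code unit) characters.
buffer_bytes = reported_length * worst_case_units_per_char * bytes_per_utf16_code_unit

# The offending value: 1056 normal characters, i.e. 2112 bytes in UTF-16,
# matching the indicator reported in the error message.
value_bytes = 1056 * bytes_per_utf16_code_unit

print(buffer_bytes, value_bytes)  # 1020 2112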

I find it very suspicious that all text columns are reported to be of length 255 and then do in fact contain larger values. Frankly, it feels like the Simba Spark ODBC Driver is blatantly lying about the length of the data types.

I have to think about this. In any case I would advise you to create a bug for the ODBC driver.

Best, Markus

pacman82 commented 1 year ago

An idea for a workaround you could try. It still depends on the driver playing along, but it may be cheap to try: what would happen if, in your query, you explicitly cast the column to something longer, e.g. CAST AS VARCHAR(6000)? Would this cause the driver to report larger column sizes?
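Something along these lines (the table name is a placeholder; mt_run_comment is just one of the text columns from the log above, used for illustration):

from arrow_odbc import read_arrow_batches_from_odbc

# Placeholder table name; the point is only the explicit CAST in the query text.
query = "SELECT CAST(mt_run_comment AS VARCHAR(6000)) AS mt_run_comment FROM my_table"

reader = read_arrow_batches_from_odbc(
    query=query,
    connection_string=odbc_connection_string,
    batch_size=1000,
    max_text_size=1024000,
)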

Best, Markus

leijer commented 1 year ago

Hi @pacman82, when casting larger or smaller than 255, 255 is still reported. I think we find a similar discussion here for MongoDB: https://stackoverflow.com/questions/26022535/simba-mongo-odbc-driver-returned-data-that-does-not-match-expected-data-length#:~:text=The%20driver%20instead%20uses%20a,to%20something%20larger%2C%20like%20512.

When underlying data is json, it makes sense that the driver is not able to report a real string column length.

Considering your code at some point knows the real length "Size in bytes indicated by ODBC driver: 2112", a solution could be to increase the buffer when you discover that the buffer is no longer big enough.

I'm guessing that pyodbc project would be doing something like that, as pyodbc works when running this query with the same simba driver.

The Simba driver is quite widely used - Google BigQuery, MongoDB, etc.

I have found a work-around, and that is to add DefaultStringColumnLength to the connection string.

Driver=Simba Spark ODBC Driver; HOST=xxxx; PORT=443; Schema=default; SparkServerType=3; AuthMech=3; UID=token; PWD=xxxx; ThriftTransport=2; SSL=1; HTTPPath=xxxx; DefaultStringColumnLength=10000;

Given that it is not possible for the driver to report an accurate length, I think it may be best to handle this on the arrow-odbc side like pyodbc does. Whilst changing DefaultStringColumnLength is a work-around, it doesn't seem like a good solution: a production system can still break when new data is stored that exceeds that length, and setting it to an extremely large value to avoid that would likely cause poor performance. It is effectively varchar(MAX). Does your code have some special buffer-increasing logic when handling varchar(MAX)? Maybe you could re-use that logic and have an arrow-odbc setting to "treat all string columns as varchar(max)" or something.

Cheers

pacman82 commented 1 year ago

Hello @leijer ,

When underlying data is json, it makes sense that the driver is not able to report a real string column length.

Yes. This is not a JSON-specific situation either; the same applies to VARCHAR(max). ODBC has ways to communicate that the upper bound is not known. Snowflake, for example, chooses to report a high upper bound (something like 2 GiB). Both of which are fine. Guessing a lower upper bound is the route the authors of the driver might have taken intentionally, but it is IMHO still wrong. I would suggest configuring a high upper bound in DefaultStringColumnLength. You can use max_str_len in arrow-odbc to lower the upper bound depending on the query you are trying to execute.
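A minimal sketch of that combination (placeholder credentials; note that the arrow-odbc parameter is actually called max_text_size, see the correction further down):

from arrow_odbc import read_arrow_batches_from_odbc

connection_string = (
    "Driver=Simba Spark ODBC Driver;HOST=xxxx;PORT=443;Schema=default;"
    "SparkServerType=3;AuthMech=3;UID=token;PWD=xxxx;"
    "ThriftTransport=2;SSL=1;HTTPPath=xxxx;"
    "DefaultStringColumnLength=10000;"  # generous driver-side upper bound
)

reader = read_arrow_batches_from_odbc(
    query=sql_query,
    connection_string=connection_string,
    batch_size=1000,
    max_text_size=5000,  # arrow-odbc-side cap, sized to the data you actually expect
)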

Considering your code at some point knows the real length "Size in bytes indicated by ODBC driver: 2112", a solution could be to increase the buffer when you discover that the buffer is no longer big enough.

At this point it is too late. arrow-odbc is fast because it avoids IO overhead and unnecessary buffer allocations by using pre-bound buffers. ODBC 3.81 offers no way to fetch the rest of a value which turned out to be too large. ODBC 4.0 will, but I do not know if it will ever be rolled out. The only way to recover at runtime is to re-execute the entire query with a larger max_str_len (given that DefaultStringColumnLength is really high). This could be done at the application level. Due to the side effects some queries might have, I would be hesitant to just do this without the user knowing. I could try to think of ways of piping a sensible error through, though 🤔 .

I'm guessing that pyodbc project would be doing something like that, as pyodbc works when running this query with the same simba driver.

I am guessing pyodbc does not use pre-bound buffers by default. This has performance implications for most drivers. At least I have no idea how to recover from a truncation in the current ODBC standard without re-executing the statement. This artefact is designed around column-based bulk fetching into arrow arrays. For now, slow row-by-row siphoning of data is out of scope. If you do not mind the performance implications, I would recommend just using pyodbc instead.

Whilst changing DefaultStringColumnLength is a work-around it doesn't seem like a good solution as a production system can still break when new data is stored that exceeds that length, or setting to an extremely large value to avoid that would likely cause poor performance. It is effectively varchar(MAX).

This is the long-standing dilemma of bulk fetching. Automatic handling is sadly not possible at the moment.

Best, Markus

pacman82 commented 1 year ago

PS: The situation is not as dire at the application level. There you can usually find a robust solution, e.g. firing a query determining the max value first. Afterwards this value can be used to inform a good value for max_str_len. Such solutions require knowledge of the domain and the schema, though.
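For example, a sketch with placeholder table/column names (and using the parameter under its actual name, max_text_size):

from arrow_odbc import read_arrow_batches_from_odbc

# Placeholder names. First determine the longest value, then size the buffers.
probe = read_arrow_batches_from_odbc(
    query="SELECT MAX(LENGTH(mt_run_comment)) AS max_len FROM my_table",
    connection_string=connection_string,
    batch_size=1,
)
max_len = next(iter(probe)).column(0)[0].as_py() or 255

reader = read_arrow_batches_from_odbc(
    query="SELECT * FROM my_table",
    connection_string=connection_string,
    batch_size=1000,
    max_text_size=max_len,  # upper bound informed by the probe query
)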

leijer commented 1 year ago

I would suggest configuring a high upper bound in DefaultStringColumnLength

Doesn't this mean arrow-odbc will pre-allocate a large buffer e.g. 2 GiB per row and have a huge memory usage? And that is the same way you treat Varchar(MAX)? e.g. if there are 3 varchar(max) columns py-arrow will reserve 6 GiB memory, or if there are 3 varchar(max) and 10 rows, would it reserve 60 GiB memory?

Thanks

pacman82 commented 1 year ago

Yes. There is no good way to solve it on top of the existing ODBC 3.81 standard. You either go row by row and field by field, in which case you can handle arbitrarily large values, or you use the fast block cursors with pre-allocated buffers, which need to be able to hold the largest elements (or accept truncation of the values).

arrow-odbc uses the fast block cursors with pre-allocated buffers.

However, instead of allocating 60 GiB of memory, I would suggest setting max_str_len to a more sane value. This differs from claiming a short relational type in an important way:

Saying a field is VARCHAR(255) and then having longer values in it is a lie. Having a field be VARCHAR(max) and using max_str_len=5000 is a decision: a conscious tradeoff between the risk of truncation and memory usage.

The difference might become especially important to you due to a bug in the Simba ODBC driver: it did not emit any diagnostic record about the truncation. The one error we actually see is not emitted by the driver but by a manual check in the odbc-api bindings, which is quite expensive, because it involves iterating over the entire batch and comparing indicators with maximum column lengths. Future downstream optimizations will remove this check for small columns (like VARCHAR(255)). This means that, due to the combination of wrong metadata and missing diagnostics in the Simba ODBC driver, you might end up with silent data loss. Using configuration to fix one bad decision of the driver might protect you from this.

There is no way to have it fast and work with any length of string. It is an inherent shortcoming of the current ODBC API. I could come up with a better standard (as could many other people), but would there be any drivers for it?

Best, Markus

pacman82 commented 1 year ago

Correction: The parameter is named max_text_size not max_str_len. Sorry

pacman82 commented 1 year ago

Closing this issue, as there is nothing actionable left to do.