Question regarding SQL_DESC_OCTET_LENGTH and multibyte encodings

markdirish commented 3 years ago

Hello,

I help maintain the ODBC driver for IBM i and we have a question regarding SQLColumnAttribute getting SQL_DESC_OCTET_LENGTH when it comes to multibyte encodings.

There are many code pages on IBM i that store non-ASCII characters as single byte. When these are transferred to the client, the user may (and probably will) define the client character set as UTF-8. Because UTF-8 is a variable-width encoding from 1 to 4 bytes, the length of this conversion can't be known until the data is actually returned from a query (some characters may still be single byte, some may be 2 bytes, all the way up to 4 bytes).

The question: What should SQL_DESC_OCTET_LENGTH return in this situation? If a field is defined as CHAR(10), it may take up to 40 bytes to transfer if it is encoded using some exotic code page and converted for the client to UTF-8. Should SQL_DESC_OCTET_LENGTH return 40? That seems reasonable, but the Transfer Octet Length documentation notes for character data:

The defined or the maximum (for variable type) length of the column in bytes. This is the same value as the descriptor field SQL_DESC_OCTET_LENGTH.

With variable here not meaning variable-byte encoding, but variable field length (VARCHAR instead of CHAR). It would seem that it should also apply to variable byte encodings, but I just wanted to get some clarification.

Thanks!

edwig commented 3 years ago

Hello Mark,

Good question!!

It is my understanding that the "octet_length" of a character field in a database - and thusly the SQL_DESC_OCTET_LENGTH in the ODBC specification - point back to the physical buffer storing the character field. Intentionally in the early days of MS-Windows with an UTF-16 encoding a "WCHAR(10)" database column had a octet length of 20. Because the UTF-16 encoding needs 2 bytes per character.

The UTF-8 encoding is so to speak the ugly sister in the UTF family. UTF-16 and UTF-32 (e.g. Chinese) have a certified multiplicator for the buffer length. A 32 bit (UTF-32) WCHAR(10) field would need 40 characters of buffer.

Of course UTF-8 characters expand to 2, 3 or 4 characters or do not expand at all!. So the physical buffer you are speaking about can be anything between 10 and 40 bytes, depending on the data. You just cannot be sure.

But strange things happen in ODBC. Standard software would crash if we do not multiply the character buffers by 2, even if we use regular 8 byte character strings as ISO8859 encoding. Somewhere in between the driver manager or the driver will expand our 8 byte character strings to 16 byte wide-character-strings for MS-Windows. Even if we never use UTF-16 at all. It became regular practice befor a call to SQLBindCol to multiply the buffer size by 2 or 4, just to be sure. Just because we have the memory and are afraid that we will crash in a production situation.

Maby we need a SQL_DESC_MAX_OCTET_LENGTH, or something of the sort?

kadler commented 2 years ago

The UTF-8 encoding is so to speak the ugly sister in the UTF family.

Personally, I'd say UTF-16 is the ugly one, since it tried to be a constant-width encoding but admitted defeat and had the messy surrogate pair mechanism tacked on, making it variable width. UTF-8 has been consistent in being variable width the whole time and never pretended it was something it's not.

Back to the question at hand, though, we have done more investigation here and decided to just change our driver to return the maximum conversion size when the SQL_DESC_OCTET_LENGTH is queried.

The doc that Mark linked above states:

The transfer octet length of a column may be different than the number of bytes required to store the data on the data source.

I'm not sure if this use case is what is meant by that line in the docs or not, but this change should fix the majority of modern application uses for our driver (notably PHP and Node.js).

If the ODBC spec would like to add a SQL_DESC_MAX_OCTET_LENGTH, we'd be fine with that, but this repo seems quite dead so I'm not holding my breath.

microsoft / ODBC-Specification

Question regarding SQL_DESC_OCTET_LENGTH and multibyte encodings #144