two-byte characters in char and varchar are truncated to one byte

tediousjs / tedious

Node TDS module for connecting to SQL Server databases.

http://tediousjs.github.io/tedious/

MIT License

two-byte characters in char and varchar are truncated to one byte #1294

Open Ceshion opened 3 years ago

Ceshion commented 3 years ago

Duplicate of #723

Char and varchar are currently configured to read the values passed to them purely as ASCII, but char and varchar columns in MSSQL can support the entire BMP (i.e. one- and two-byte UTF-16 characters). On read this is handled by decoding records with iconv, but currently, storing a two-byte character such as "\u2021" (‡) in a char or varchar column truncates that character to only its second byte, so an incorrect character such as "\u0021" (!) is stored instead.
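A minimal sketch of the truncation described above, assuming the parameter value is serialized with Node's "ascii" encoding (the variable names are illustrative and not taken from tedious):

```typescript
import { Buffer } from "node:buffer";

// "‡" (U+2021) is one UTF-16 code unit: high byte 0x20, low byte 0x21.
const input = "\u2021";

// When encoding, Node's "ascii" behaves like "latin1" and keeps only
// the low byte of each code unit.
const asciiBytes = Buffer.from(input, "ascii");

// Decoding the single stored byte yields "!" (U+0021) instead of "‡".
const roundTripped = asciiBytes.toString("latin1");

console.log(asciiBytes, JSON.stringify(roundTripped)); // <Buffer 21> "!"
```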

We could resolve this by encoding char and varchar parameters as "ucs2" with nchar and nvarchar type IDs when serializing them into the RPC stream, as in the attached PR and similar to the approach used by ADO and JDBC per the linked issue. Is there any reason not to do this?

Changing only the encoding does not work: downstream from tedious, the bytes appear to be read as separate characters.
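A small sketch of why the encoding alone is not enough. It assumes the receiver treats every byte as its own character, and uses "latin1" as a stand-in for whatever single-byte code page the column really has:

```typescript
import { Buffer } from "node:buffer";

// Encoding "‡" as UCS-2 produces two bytes, little-endian: 0x21 0x20.
const ucs2Bytes = Buffer.from("\u2021", "ucs2");

// If the type ID still says varchar, each byte is read as one character,
// so the value comes back as "!" followed by a space.
const misread = ucs2Bytes.toString("latin1");

console.log(ucs2Bytes, JSON.stringify(misread)); // <Buffer 21 20> "! "
```

This is why the proposal pairs the "ucs2" encoding with the nchar and nvarchar type IDs rather than changing the encoding alone.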

arthurschreiber commented 3 years ago

👋 @Ceshion Hey there!

> Is there any reason not to do this?

Well, yeah, because it's not doing what it's supposed to. 😬 Your solution "works", but it basically puts the task of converting from multibyte characters to single byte characters on the server, instead of handling this at the client level. If a user wants this conversion to happen on the database level, they should be using nvarchar/nchar directly. 😅

SQL Server's varchar and char do not store UTF-16 encoded characters; they store characters in whatever encoding is specified on the database / table / column. Some of these encodings support multibyte characters, but those don't have to correspond to the bytes used in UTF-16 / UCS2.

Based on the discussion over in #723, I think the proper fix would be to:

1. Allow specifying the target encoding / collation for varchar and char parameters.
2. Transcode passed-in values from UCS2 to the correct target encoding.
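The two steps above could look roughly like the following. This is only a sketch under stated assumptions: `encodeCp1252Sketch` and `CP1252_EXTRAS` are hypothetical names, the table covers just a handful of Windows-1252 code points, and a real implementation would take the code page from the collation and use a complete conversion table (for example via iconv).

```typescript
import { Buffer } from "node:buffer";

// Hypothetical, deliberately partial table for the 0x80-0x9F range of
// Windows-1252. A real table has many more entries.
const CP1252_EXTRAS: Record<number, number> = {
  0x20ac: 0x80, // €
  0x2020: 0x86, // †
  0x2021: 0x87, // ‡
  0x2022: 0x95, // •
};

// Transcode a JavaScript (UTF-16) string into Windows-1252 bytes.
// Characters that this sketch cannot map become "?" (0x3f).
function encodeCp1252Sketch(value: string): Buffer {
  const out: number[] = [];
  for (let i = 0; i < value.length; i++) {
    const unit = value.charCodeAt(i);
    if (unit <= 0x7f || (unit >= 0xa0 && unit <= 0xff)) {
      // ASCII and U+00A0..U+00FF share their byte value in Windows-1252.
      out.push(unit);
      continue;
    }
    // Treat a surrogate pair as one unmappable character.
    const next = i + 1 < value.length ? value.charCodeAt(i + 1) : 0;
    if (unit >= 0xd800 && unit <= 0xdbff && next >= 0xdc00 && next <= 0xdfff) {
      i++;
    }
    out.push(CP1252_EXTRAS[unit] ?? 0x3f);
  }
  return Buffer.from(out);
}

console.log(encodeCp1252Sketch("a\u2021\u00e9")); // <Buffer 61 87 e9>
```

With a code-page-aware encoder, "‡" becomes the single byte 0x87 rather than being truncated to 0x21.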

What do you think? Does that make sense to you? 🙇‍♂️

Ceshion commented 3 years ago

Hi @arthurschreiber! I appreciate you bearing with me here. I have only been learning most of the relevant info about encodings over the past few days, but I'm trying my best to keep up! Yes, I see the error in my explanation and understanding: a column can't store the entire BMP, and its characters are single-byte within a specific code page. Thank you for pointing that out 😁

🤔 My initial thought had been that the server already has the information it needs to convert whichever multibyte characters it can for a given column (based on the collation, which it knows about). So it would take less negotiation to let the server convert what is technically a Unicode parameter to whatever it should be, whereas a client would need to get that info from the server anyway (or else know specific details about the server ahead of time), wouldn't it? Not to say of course that it wouldn't technically be right, but I had interpreted the decision to use nchar/nvarchar by default in ADO and JDBC to be based around that logic, and it seems reasonable to me.

What do you think? Should we still put the responsibility of interpreting the correct codepage for a column (in a specific table and database) on the client?

arthurschreiber commented 3 years ago

> Not to say of course that it wouldn't technically be right, but I had interpreted the decision to use nchar/nvarchar by default in ADO and JDBC to be based around that logic, and it seems reasonable to me.

That is a valid decision to make, and it's something that I think the application that uses tedious can already do, by using nvarchar / nchar parameters directly. This has the same effect as the patch you proposed, with the additional "benefit" of not muddying the waters between n(var)char and (var)char.

This probably requires better documentation, something along the lines of "if you just want to send JavaScript string values to the database, use nvarchar and nchar, even if your target columns are varchar and char". 🤷

On another note, I still think we should fix the varchar and char data types to encode strings correctly from UCS2 to the collation the user specifies (or to the default database collation if no explicit collation was specified), and maybe also support passing in Buffer values (where no conversion would happen because we can assume the user knows what they're doing).
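A rough sketch of that behaviour, with every name hypothetical (`Encoder`, `toVarcharBytes`, and the options object are not tedious APIs). Buffers pass through untouched, and strings are encoded with the explicit encoder if one is given, otherwise with the database default. "latin1" stands in for a real code page encoder here.

```typescript
import { Buffer } from "node:buffer";

// Hypothetical encoder type: turns a JavaScript string into code page bytes.
type Encoder = (value: string) => Buffer;

interface VarcharOptions {
  explicit?: Encoder;       // derived from a user-specified collation, if any
  databaseDefault: Encoder; // derived from the database's default collation
}

function toVarcharBytes(value: string | Buffer, options: VarcharOptions): Buffer {
  // A Buffer is assumed to be correctly encoded already: no conversion.
  if (Buffer.isBuffer(value)) {
    return value;
  }
  const encode = options.explicit ?? options.databaseDefault;
  return encode(value);
}

// Stand-in encoder for illustration only.
const latin1: Encoder = (value) => Buffer.from(value, "latin1");

console.log(toVarcharBytes("\u00e9", { databaseDefault: latin1 })); // <Buffer e9>
console.log(toVarcharBytes(Buffer.from([0x87]), { databaseDefault: latin1 })); // <Buffer 87>
```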

Ceshion commented 3 years ago

Oh I agree, allowing a consumer to specify a code page for (var)char (and text, for that matter) definitely seems preferable to automatically interpreting whatever is sent as already correctly encoded. It also looks like we do currently support Buffer values: https://github.com/tediousjs/tedious/blob/ceb73d35cf5e5779e89dfb3f1f6138a775ea015a/src/data-types/varchar.ts#L102-L119

So I can take the idea of using unicode parameters instead higher up the chain (again 😅) and work on adding encoding options.

arthurschreiber commented 3 years ago

@Ceshion have you tried the latest tedious@12.2.0? Tedious now comes with proper collation support for varchar/char/text (as "proper" as SQL Server allows us to be). It also now comes with UTF-8 encoding support when used together with SQL Server 2019 or Azure SQL.
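For reference, a short sketch of what UTF-8 encoding means for the character from the original report. It only shows the byte representation in Node and makes no claim about how tedious implements this internally:

```typescript
import { Buffer } from "node:buffer";

// In UTF-8, "‡" (U+2021) occupies three bytes: 0xE2 0x80 0xA1.
const utf8Bytes = Buffer.from("\u2021", "utf8");

// Decoding the same bytes as UTF-8 restores the original character.
const decoded = utf8Bytes.toString("utf8");

console.log(utf8Bytes, decoded === "\u2021"); // <Buffer e2 80 a1> true
```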

Can this issue be closed?