Open Ceshion opened 3 years ago
👋 @Ceshion Hey there!
Is there any reason not to do this?
Well, yeah, because it's not doing what it's supposed to. 😬 Your solution "works", but it basically puts the task of converting from multibyte characters to single-byte characters on the server instead of handling it at the client level. If a user wants this conversion to happen at the database level, they should be using nvarchar/nchar directly. 😅
SQL Server's varchar and char do not store UTF-16 encoded characters; they store characters in whatever encoding is specified on the database / table / column. Some of these encodings support multibyte characters, but those bytes don't have to correspond to the bytes used in UTF-16 / UCS-2.
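To make the point above concrete, here is a minimal sketch comparing the byte a code page would store for "‡" with the bytes of the same character in UTF-16LE. The `CP1252_SAMPLE` table and `encodeCp1252Sample` helper are hypothetical and cover only a few Windows-1252 code points for illustration; they are not part of tedious.

```typescript
// Hypothetical, partial slice of Windows-1252 (code point -> byte).
// A real encoder would cover the whole 0x80-0x9F range.
const CP1252_SAMPLE: Record<number, number> = {
  0x20ac: 0x80, // €
  0x2020: 0x86, // †
  0x2021: 0x87, // ‡
};

// Encode a string to single bytes using the sample table above.
function encodeCp1252Sample(value: string): number[] {
  const bytes: number[] = [];
  for (let i = 0; i < value.length; i++) {
    const unit = value.charCodeAt(i);
    const mapped = CP1252_SAMPLE[unit];
    if (unit < 0x80 || (unit >= 0xa0 && unit <= 0xff)) {
      bytes.push(unit); // ASCII and the Latin-1 range share their byte values
    } else if (mapped !== undefined) {
      bytes.push(mapped);
    } else {
      bytes.push(0x3f); // "?" for anything this sample cannot represent
    }
  }
  return bytes;
}

// The same character as UTF-16LE: low byte first, then high byte.
const utf16leBytes: number[] = [0x2021 & 0xff, 0x2021 >> 8];
const codePageBytes: number[] = encodeCp1252Sample("\u2021");
```

Here `codePageBytes` is the single byte 0x87, while `utf16leBytes` is 0x21 0x20, so neither UTF-16 byte matches what a Windows-1252 varchar column would hold.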
Based on the discussion over in #723, I think the proper fix would be to:
varchar and char parameters.
What do you think? Does that make sense to you? 🙇‍♂️
Hi @arthurschreiber! I appreciate you bearing with me here. I have only been learning most of the relevant info about encodings in the past few days, but I'm trying my best to keep up! Yes, I see the error in my explanation and understanding: we can't store the entire BMP in one column, and characters are single-byte, just on a specific code page. Thank you for pointing that out 😁
🤔 My initial thought had been that because the server already has the information it needs to convert whichever multibyte characters it can for a given column (based on the collation, which it knows about), it would take less negotiation to just let it convert what is technically a Unicode parameter to whatever it should be. A client would need to get that info from the server anyway (or else know specific details about the server ahead of time), wouldn't it? Not to say of course that it wouldn't technically be right, but I had interpreted the decision to use nchar/nvarchar by default in ADO and JDBC to be based around that logic, and it seems reasonable to me.
What do you think? Should we still put the responsibility of interpreting the correct codepage for a column (in a specific table and database) on the client?
Not to say of course that it wouldn't technically be right, but I had interpreted the decision to use nchar/nvarchar by default in ADO and JDBC to be based around that logic, and it seems reasonable to me.
That is a valid decision to make, and it's something that I think an application that uses tedious can already do by using nvarchar / nchar parameters directly. This has the same effect as the patch you proposed, with the additional "benefit" of not muddying the waters between n(var)char and (var)char.
This probably requires better documentation, something along the lines of "if you just want to send JavaScript string values to the database, use nvarchar and nchar, even if your target columns are varchar and char". 🤷
On another note, I still think we should fix the varchar and char data types to encode strings correctly from UCS2 to the encoding of the collation the user specifies (or of the default database collation if no explicit collation was specified), and maybe also support passing in Buffer values (where no conversion would happen because we can assume the user knows what they're doing).
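The proposal above could be sketched roughly as follows. This is only an illustration of the dispatch, not tedious's actual implementation: `toVarcharBytes`, `CodePageEncoder`, and `asciiOnlyEncoder` are hypothetical names, and the toy encoder stands in for a real collation-aware one.

```typescript
// Hypothetical encoder type: turns a JavaScript (UTF-16) string into the
// bytes of whatever code page the parameter's collation selects.
type CodePageEncoder = (value: string) => Uint8Array;

// Strings are converted with the supplied encoder. Binary values
// (Node's Buffer is a Uint8Array) are passed through untouched, on the
// assumption that the caller already encoded them correctly.
function toVarcharBytes(value: string | Uint8Array, encode: CodePageEncoder): Uint8Array {
  if (typeof value !== "string") {
    return value;
  }
  return encode(value);
}

// Toy stand-in encoder: ASCII only, "?" (0x3F) for everything else.
const asciiOnlyEncoder: CodePageEncoder = (value) => {
  const out = new Uint8Array(value.length);
  for (let i = 0; i < value.length; i++) {
    const unit = value.charCodeAt(i);
    out[i] = unit < 0x80 ? unit : 0x3f;
  }
  return out;
};

const encoded = toVarcharBytes("a\u2021", asciiOnlyEncoder);
const raw = new Uint8Array([0x87]);
const passedThrough = toVarcharBytes(raw, asciiOnlyEncoder);
```

Replacing an unmappable character with "?" rather than truncating it is a design choice in this sketch; a real implementation might prefer to raise an error instead.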
Oh I agree, allowing a consumer to specify a code page for var/char (and text for that matter) definitely seems preferable to automatically interpreting whatever is sent as already correctly encoded. It also looks like we do currently support Buffer values:
https://github.com/tediousjs/tedious/blob/ceb73d35cf5e5779e89dfb3f1f6138a775ea015a/src/data-types/varchar.ts#L102-L119
So I can take the idea of using unicode parameters instead higher up the chain (again 😅) and work on adding encoding options.
@Ceshion have you tried the latest tedious@12.2.0? Tedious now comes with proper collation support for varchar/char/text (as "proper" as SQL Server allows us to be). It also now comes with UTF-8 encoding support when used together with SQL Server 2019 or Azure SQL.
Can this issue be closed?
Duplicate of #723
Char and varchar are currently configured to read the values passed to them purely as ascii, but char and varchar columns in MSSQL can support the entire BMP (i.e. one- and two-byte UTF-16 characters). On read this is handled by decoding records with iconv, but currently, storing a two-byte character such as "\u2021" (‡) in a char or varchar column truncates that character to only the second byte, resulting in an incorrect character being stored, such as "\u0021" (!).
We could resolve this by encoding char and varchar parameters as "ucs2" with nchar and nvarchar type IDs when serializing them into the RPC stream, as in the attached PR and similar to the approach used by ADO and JDBC per the linked issue. Is there any reason not to do this?
Only changing the encoding does not work, since it seems like downstream from tedious the bytes are read as separate characters.
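The truncation described above can be reproduced without a database. The sketch below uses two hypothetical helpers, `encodeLowByte` and `encodeUcs2`, which mirror what Node documents for writing a string with the "ascii"/"latin1" encodings (each UTF-16 code unit is cut down to its low byte) and with "ucs2" (two little-endian bytes per code unit). It is not the tedious code path itself.

```typescript
// Keep only the low byte of every UTF-16 code unit, as an "ascii"
// write does for characters outside the single-byte range.
function encodeLowByte(value: string): number[] {
  const bytes: number[] = [];
  for (let i = 0; i < value.length; i++) {
    bytes.push(value.charCodeAt(i) & 0xff);
  }
  return bytes;
}

// UCS-2 / UTF-16LE: low byte first, then high byte, for every code unit.
function encodeUcs2(value: string): number[] {
  const bytes: number[] = [];
  for (let i = 0; i < value.length; i++) {
    const unit = value.charCodeAt(i);
    bytes.push(unit & 0xff, unit >> 8);
  }
  return bytes;
}

const truncated = encodeLowByte("\u2021");            // one byte: 0x21
const storedChar = String.fromCharCode(truncated[0]); // "!"
const wide = encodeUcs2("\u2021");                    // two bytes: 0x21, 0x20
```

The last sentence of the description also follows from this: if the two "ucs2" bytes are sent under a single-byte type ID, the receiver reads 0x21 and 0x20 as two separate characters, "!" and a space.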