hgourvest / node-firebird

Pure javascript and asynchronous Firebird client for Node.js.
Mozilla Public License 2.0
254 stars 125 forks source link

Request: Add de option to set the default encoding from utf-8 to latin1 #336

Open nickolasdeluca opened 2 months ago

nickolasdeluca commented 2 months ago

When you set the encoding to WIN1252 you need to change the encoding from UTF-8 to latin1 manually on the package files.

Sometimes we forget to change this setting when updating the production environment and it causes a bit of a headache until we finally figure out what happened.

Would it be possible to also change this encoding when setting the connection?

braaar commented 2 months ago

I agree, this is a sorely needed feature. What's the point of the encoding option if you have to modify the source code to make this work?

braaar commented 2 months ago

This is a duplicate of #283

Fixing this would solve:

mreis1 commented 1 month ago

This is not just about changing the constant in index.js

In fact, if you change the constant in your index.js you are just telling the node-firebird to take the buffer content and convert it from latin1 (ISO8859_1) to utf8 But, the same way, you are telling node-firebird lib that strings you provide are in latin1 which is not true - node uses utf8 no matter what. So for basic characters you wont run in any issues but with special characters you realize that characters you insert doesn't represent what you meant to insert on the first place and depending on your connection encoding, you may also run in transliteration issues.

For latin1, a common example of special character is the € symbol. More information is available here:

image

[1] https://www.firebirdsql.org/file/community/ppts/fbcon11/FbCon2011-Charsets-Heymann.pdf

The fix involves working with a connection in encoding 'NONE' and requires strings to be transliterated by a middleware function (i.g: iconv-lite) that would take the utf8 value and convert it to the target encoding in it's purest way which means - writting the buffer directly to the database without any other transliteration.

nickolasdeluca commented 1 month ago

@mreis1 basically you're saying that by modifying the index.js file of node-firebird to latin1 we're just being lucky? because it has been working great so far with our database that is WIN1252 encoded.

mreis1 commented 1 month ago

@nickolasdeluca That's right. If you want to benefit from WIN1252 which is the case of most users using ISO8859_1, it wont be enough.

On the image below you see 2 rows that demonstrate how the data is being stored using different methods of insertion:

The string sent to the database is this one which contains those special characters in win1252 encoding:

const v = 'Héllõ-sámplè-ã§a€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ'.slice(0,33);

The first row shows how data looks once stored by hacking the DEFAULT_ENCODING. The second row shows the insertion of data that was converted from utf8 to win1252 and inserted in hex format.

image

As you can see, the second row accurately represents the data sent to the database (v const above).

The insertion above was made using a connection with charset none.

Switching the connection encoding to win1252 results in: Error: Arithmetic exception, numeric overflow, or string truncation, Cannot transliterate character between character sets I get this because my database default encoding is ISO8859_1.

Changing the connection to ISO8859_1 works fine because no transliteration will occur as long connection and db share the same encoding settings

So here's an isolated explanation that compares the data sent by the database by hacking the index.js and forcing LATIN1 when you wish to work with win1252 encoding.

var x = Buffer.alloc(33)
x.write('Héllõ-sámplè-ã§a€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ'.slice(0,33),'latin1')
x.toString('hex')
 => <Buffer 48 e9 6c 6c f5 2d 73 e1 6d 70 6c e8 2d e3 a7 61 ac 1a 92 1e 26 20 21 c6 30 60 39 52 7d 18 19 1c 1d>

This is what you need to send in order to have the data correctly represented in WIN1252 database.

=> <Buffer 48 e9 6c 6c f5 2d 73 e1 6d 70 6c e8 2d e3 a7 61 80 82 83 84 85 86 87 88 89 8a 8b 8c 8e 91 92 93 94>

If you dont care about those extra characters in your database, then setting latin1 in index.js will do the trick for you.

mateusvieites commented 1 month ago

@mreis1 i think iconv-lite solve this.

import iconv from 'iconv-lite';
const str = 'Héllõ-sámplè-ã§a€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ'.slice(0, 33);

//Error
const x = Buffer.alloc(33)
x.write(str,'latin1')
console.log(x)
=> <Buffer 48 e9 6c 6c f5 2d 73 e1 6d 70 6c e8 2d e3 a7 61 ac 1a 92 1e 26 20 21 c6 30 60 39 52 7d 18 19 1c 1d>

//Error
const y = Buffer.alloc(33);
const latin1String = Buffer.from(str, 'latin1');
y.write(latin1String.toString('latin1'), 0, 'latin1');
console.log(y);
=> <Buffer 48 e9 6c 6c f5 2d 73 e1 6d 70 6c e8 2d e3 a7 61 ac 1a 92 1e 26 20 21 c6 30 60 39 52 7d 18 19 1c 1d>

//Sucess
const win1252Buffer = iconv.encode(str, 'win1252');
console.log(win1252Buffer);
=><Buffer 48 e9 6c 6c f5 2d 73 e1 6d 70 6c e8 2d e3 a7 61 80 82 83 84 85 86 87 88 89 8a 8b 8c 8e 91 92 93 94>

BTW

import iconv from 'iconv-lite';
const win1252Chars = [
    0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x0E, 0x0F,
    0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1A, 0x1B, 0x1C, 0x1D, 0x1E, 0x1F,
    0x20, 0x21, 0x22, 0x23, 0x24, 0x25, 0x26, 0x27, 0x28, 0x29, 0x2A, 0x2B, 0x2C, 0x2D, 0x2E, 0x2F,
    0x30, 0x31, 0x32, 0x33, 0x34, 0x35, 0x36, 0x37, 0x38, 0x39, 0x3A, 0x3B, 0x3C, 0x3D, 0x3E, 0x3F,
    0x40, 0x41, 0x42, 0x43, 0x44, 0x45, 0x46, 0x47, 0x48, 0x49, 0x4A, 0x4B, 0x4C, 0x4D, 0x4E, 0x4F,
    0x50, 0x51, 0x52, 0x53, 0x54, 0x55, 0x56, 0x57, 0x58, 0x59, 0x5A, 0x5B, 0x5C, 0x5D, 0x5E, 0x5F,
    0x60, 0x61, 0x62, 0x63, 0x64, 0x65, 0x66, 0x67, 0x68, 0x69, 0x6A, 0x6B, 0x6C, 0x6D, 0x6E, 0x6F,
    0x70, 0x71, 0x72, 0x73, 0x74, 0x75, 0x76, 0x77, 0x78, 0x79, 0x7A, 0x7B, 0x7C, 0x7D, 0x7E, 0x7F,
    0x80, 0x81, 0x82, 0x83, 0x84, 0x85, 0x86, 0x87, 0x88, 0x89, 0x8A, 0x8B, 0x8C, 0x8D, 0x8E, 0x8F,
    0x90, 0x91, 0x92, 0x93, 0x94, 0x95, 0x96, 0x97, 0x98, 0x99, 0x9A, 0x9B, 0x9C, 0x9D, 0x9E, 0x9F,
    0xA0, 0xA1, 0xA2, 0xA3, 0xA4, 0xA5, 0xA6, 0xA7, 0xA8, 0xA9, 0xAA, 0xAB, 0xAC, 0xAD, 0xAE, 0xAF,
    0xB0, 0xB1, 0xB2, 0xB3, 0xB4, 0xB5, 0xB6, 0xB7, 0xB8, 0xB9, 0xBA, 0xBB, 0xBC, 0xBD, 0xBE, 0xBF,
    0xC0, 0xC1, 0xC2, 0xC3, 0xC4, 0xC5, 0xC6, 0xC7, 0xC8, 0xC9, 0xCA, 0xCB, 0xCC, 0xCD, 0xCE, 0xCF,
    0xD0, 0xD1, 0xD2, 0xD3, 0xD4, 0xD5, 0xD6, 0xD7, 0xD8, 0xD9, 0xDA, 0xDB, 0xDC, 0xDD, 0xDE, 0xDF,
    0xE0, 0xE1, 0xE2, 0xE3, 0xE4, 0xE5, 0xE6, 0xE7, 0xE8, 0xE9, 0xEA, 0xEB, 0xEC, 0xED, 0xEE, 0xEF,
    0xF0, 0xF1, 0xF2, 0xF3, 0xF4, 0xF5, 0xF6, 0xF7, 0xF8, 0xF9, 0xFA, 0xFB, 0xFC, 0xFD, 0xFE, 0xFF
];

const uint8Array = new Uint8Array(win1252Chars);
const decoder = new TextDecoder('windows-1252');
const str = decoder.decode(uint8Array);

//I know this is wrong but...
const x = Buffer.alloc(33)
x.write(str,'latin1');
console.log(x);
=> <Buffer 00 01 02 03 04 05 06 07 08 09 0a 0b 0c 0d 0e 0f 10 11 12 13 14 15 16 17 18 19 1a 1b 1c 1d 1e 1f 20>

const y = Buffer.alloc(33);
const latin1String = Buffer.from(str, 'latin1');
y.write(latin1String.toString('latin1'), 0, 'latin1');
console.log(y);
=> <Buffer 00 01 02 03 04 05 06 07 08 09 0a 0b 0c 0d 0e 0f 10 11 12 13 14 15 16 17 18 19 1a 1b 1c 1d 1e 1f 20>

const win1252Buffer = iconv.encode(str, 'win1252');
console.log(win1252Buffer);
=> <Buffer 00 01 02 03 04 05 06 07 08 09 0a 0b 0c 0d 0e 0f 10 11 12 13 14 15 16 17 18 19 1a 1b 1c 1d 1e 1f 20 21 22 23 24 25 26 27 28 29 2a 2b 2c 2d 2e 2f 30 31 ... 206 more bytes>

Sources: https://firebirdsql.org/file/documentation/chunk/en/refdocs/fblangref30/fblangref30-datatypes-chartypes.html https://www.firebirdsql.org/file/community/ppts/fbcon11/FbCon2011-Charsets-Heymann.pdf https://stackoverflow.com/questions/28594498/converting-a-string-from-utf8-to-latin1-in-nodejs https://nodejs.org/api/buffer.html#static-method-bufferfromstring-encoding

mreis1 commented 1 month ago

@mateusvieites i've been iconv-lite for a while for that purpose but making iconv a dependency of node-firebird doesn't seem right. Having a hooks configuration for encoding and decoding would be much more reliable and flexible leaving the libs' user the possibility to chose if he wants to use iconv or pick an alternative that accomplishes the same task.

I'll see if i can make a PR for this, to be honest i already have a working fork. Just need some tweaks and some testing to ensure that no side effects were created.