ibarwick / firebird_fdw

A PostgreSQL foreign data wrapper (FDW) for Firebird - latest version 1.4.0 (2024-05-11)
https://sql-info.de/postgresql/firebird-fdw/index.html
Other
36 stars 9 forks source link

Character encoding issues with ISO-8859-1 fields #12

Closed jonathands closed 1 year ago

jonathands commented 4 years ago

FIREBIRD VERSION: 2.5.3 POSTGRE VERSION: 10.11

Some forewords: I'm using this firebird_fdw to query data from a database for which I have no control over how the data is stored nor it's structure, also I'm not a C developer, I can compile libs and understand the code, but not much else.

I've set up the data wrapper and have been querying the database for almost 3 months now, so it works but I had to write the queries converting every varchar to ISO-8859-1 with convert_from(COLUMNNAME::BYTEA, 'ISO8859-1') which is annoying but works fine for me for now...

I'm still trying to fix this to avoid having to deal with the conversion in the future, I tried forcing LATIN1 or ISO8859-1 (and even NONE since it's the original fb database's default) into the client_encoding on firebirdGetConnection() inside connection.c , by changing the lines:

val[i] = GetDatabaseEncodingName(); to val[i] = 'LATIN1'; or val[i] = 'ISO8859-1';

to see if maybe I could try and implement a flag to force the encoding on the connection, but data still comes jumbled,

DBEAVER will show the following exception on the collumn: "DBCException: SQL Error [22000]: Invalid character data was found. This is most likely caused by stored data containing characters that are invalid for the character set the database was created in. The most common example of this is storing 8bit data in a SQL_ASCII database."

PSQL will just ignore the character and show the rest of the string, like this query for the name of the contact:

NOME_CONTATO | NOME_CONTATO::BYTEA | convert_from(NOME_CONTATO::BYTEA, 'ISO8859-1')
ANDRIA | \x414e4452c94941 | ANDRÉIA |

I'm kind of stuck in the mud here, any tips would be appreciated, even some guidance on testing the issue

mkgrgis commented 4 years ago

Please show your situation in DBs

  1. From firebird: what is in attribute RDB$CHARACTER_SET_NAME of RDB$DATABASE table
  2. From Postgresql 2.1 SHOW SERVER_ENCODING; 2.2 select datname, pg_encoding_to_char(encoding) from pg_database; Then @ibarwick can decide, is the problem in firebird_fdw or outside of this program.
jonathands commented 4 years ago

SELECT RDB$CHARACTER_SET_NAME FROM RDB$DATABASE is NONE

SHOW SERVER_ENCODING is UTF8 select datname, pg_encoding_to_char(encoding) from pg_database; is UTF8 for mydatabase

mkgrgis commented 4 years ago

Unfortunately, the author, @ibarwick, rarely writes here. In Your situation, the program cannot correctly process input characters, since NONE indicates a refusal converting. It seems, Your Postgres "believes" that the data will come in ASCII. I think that you have almost correct chosen the direction of your search. Your attempt to change the code is also like solving a problem val [i] = GetDatabaseEncodingName (); to val [i] = 'LATIN1'; or val [i] = 'ISO8859-1'; at https://github.com/ibarwick/firebird_fdw/blob/master/src/connection.c But there is no using of "client_encoding" in the program https://github.com/ibarwick/firebird_fdw/search?q=client_encoding&unscoped_q=client_encoding It's problem... There is a comment of @ibarwick

    /* Here we're taking a calculated risk and passing the PostgreSQL
     * encoding name directly to Firebird. Firebird seems to be pretty
     * good at parsing encoding names but it's possible there might be
     * errors with more obscure encoding combinations.
     */

I think the comment is about your situation. It looks like You have a problem in writing/passing the name of the Firebird's encoding or in determining the actual encoding of Firebird. LATIN1 and ISO8859-1 are similar but not consistent standards. It may be that convert_from (NOME_CONTATO :: BYTEA, 'ISO8859-1') is a working, but not entirely correct form. I ask you to analyze the texts (varchar/char) in the Firebird and find out more characters that are ignoring by psql SELECT. Please show 2-3 samples of BYTEA in base64 by analogy with the previous sample. It's hard to decode \x414e4452c94941 for me. For DBeaver, please describe the encoding settings in the connections: pg and fb.

jonathands commented 4 years ago

DBeaver is connected using UTF-8 to PostgreSQL and ISO-8859-1 to Firebird, I can't quite test all the special characters from ISO-8859-1 set, but I could match the following with this table: https://cs.stanford.edu/people/miles/iso8859.html

É : C9 Ã : C3 Ç : C7

\x 4a 4f c3 4f 20 44 4f 53 20 50 41 53 53 4f 53                                        
\x 4a 4f 53 c9 20 52 49 42 41 4d 41 52 20 42 41 52 42 4f 53 41 20 45 20 53 49 4c 56 41 
\x 52 55 59 20 47 4f 4e c7 41 4c 56 45 53 20 44 45 20 41 4c 4d 45 49 44 41             

image

(I've put my test strings above in an image to avoid people names ending up on the search for this PR and biting me on the ass in the future)

I have forced the variations I could find for LATIN1 in encnames.c from witch libfq takes the encoding names to no avail.

I think my current PostgreSQL database being UTF-8 is factoring in with the problem, probably the FDW reads the data with the encoding it's given alright , but it makes no diference since the database gives the data in UTF-8 when I query it anyway which does not fit the stored bytes.

If I'm not mistaken and this is the case, I would need to create on the option on the FDW and actually make the conversion routine from my LATIN1 database to UTF-8 inside , but It might me out of my league.

PS: Not to worry about any delay or timing, I know everyone is busy this days and I'm grateful for all your help, as this is a nice to solve issue but not system breaking , I do have a lot in my plate too and will try my best to provide my insight as I try to deal with this.

ibarwick commented 4 years ago

PS: Not to worry about any delay or timing, I know everyone is busy this days and I'm grateful for your help, but I do understand as this is a nice to solve issue , but I do have a lot in my plate too.

Thank you for your understanding, I am aware of this and will look when I have a chance, unfortunately work, family and other responsibilities limit the amount of time available. It might be a few weeks more before I get round to looking at this.

mkgrgis commented 4 years ago

Thanks @jonathands. I believe that at the moment the description of the situation is complete. Since I am now creating several Firebirds databases, I will do an experiment with your data.

mkgrgis commented 4 years ago

@jonathands Unfortunately, I was not able to detect character loss problems in my environment. Please do a similar test with your environment. My environment: Firebird 3.0 (linux) + PosgreSQL 12 (linux): no problems.

bash, creating the Firebird databases

t=NONE;
echo "CREATE DATABASE '/var/lib/firebird/test_$t.fdb'
USER 'sysdba' PASSWORD 'acessoestranho'
PAGE_SIZE = 8192
DEFAULT CHARACTER SET $t;" | isql-fb;
t=UTF8;
echo "CREATE DATABASE '/var/lib/firebird/test_$t.fdb'
USER 'sysdba' PASSWORD 'acessoestranho'
PAGE_SIZE = 8192
DEFAULT CHARACTER SET $t;" | isql-fb;
t=LATIN1;
echo "CREATE DATABASE '/var/lib/firebird/test_$t.fdb'
USER 'sysdba' PASSWORD 'acessoestranho'
PAGE_SIZE = 8192
DEFAULT CHARACTER SET $t;" | isql-fb;
-- Firebird SQL for all new databases 
CREATE TABLE "Pessoas aleatórias" (
"Nome" varchar(128) NOT NULL, 
"Id" integer NOT NULL
);
ALTER TABLE "Pessoas aleatórias" ADD CONSTRAINT PESSOAS_A_PK PRIMARY KEY ("Id");
INSERT INTO "Pessoas aleatórias" ("Nome", "Id")
VALUES('Conceição', 1);
INSERT INTO "Pessoas aleatórias" ("Nome", "Id")
VALUES('Dores', 2);
INSERT INTO "Pessoas aleatórias" ("Nome", "Id")
VALUES('Céu', 3);
INSERT INTO "Pessoas aleatórias" ("Nome", "Id")
VALUES('Luz', 4);
INSERT INTO "Pessoas aleatórias" ("Nome", "Id")
VALUES('Fátima', 5);
INSERT INTO "Pessoas aleatórias" ("Nome", "Id")
VALUES('Antônia', 6);

Than we use firebird fdw

-- postgres SQL, for administrator
CREATE EXTENSION firebird_fdw;

CREATE FOREIGN DATA WRAPPER "firebird"
       HANDLER firebird_fdw_handler
       VALIDATOR firebird_fdw_validator;

-- LATIN1
CREATE SERVER "verificar_LATIN1"
       FOREIGN DATA WRAPPER "firebird"
       OPTIONS (
         address 'localhost',
         database '/var/lib/firebird/test_LATIN1.fdb'
       );

CREATE USER MAPPING FOR "postgres"
       SERVER "verificar_LATIN1"
       OPTIONS (
         username 'sysdba', 
         password 'acessoestranho'
       );

CREATE FOREIGN TABLE "Pessoas aleatórias LATIN1"(
"Nome"           VARCHAR(128) OPTIONS (column_name 'Nome'),
"Id"             INTEGER OPTIONS (column_name 'Id')
)
SERVER "verificar_LATIN1"
OPTIONS ( table_name 'Pessoas aleatórias');

select  * from "Pessoas aleatórias LATIN1";

-- UTF-8
CREATE SERVER "verificar_UTF-8"
       FOREIGN DATA WRAPPER "firebird"
       OPTIONS (
         address 'localhost',
         database '/var/lib/firebird/test_UTF8.fdb'
       );

CREATE USER MAPPING FOR "postgres"
       SERVER "verificar_UTF-8"
       OPTIONS (
         username 'sysdba', 
         password 'acessoestranho'
       );

CREATE FOREIGN TABLE "Pessoas aleatórias UTF-8"(
"Nome"           VARCHAR(128) OPTIONS (column_name 'Nome'),
"Id"             INTEGER OPTIONS (column_name 'Id')
)
SERVER "verificar_UTF-8"
OPTIONS ( table_name 'Pessoas aleatórias');

select  * from "Pessoas aleatórias UTF-8";

-- NONE
CREATE SERVER "verificar_NONE"
       FOREIGN DATA WRAPPER "firebird"
       OPTIONS (
         address 'localhost',
         database '/var/lib/firebird/test_NONE.fdb'
       );

CREATE USER MAPPING FOR "postgres"
       SERVER "verificar_NONE"
       OPTIONS (
         username 'sysdba', 
         password 'acessoestranho'
       );

CREATE FOREIGN TABLE "Pessoas aleatórias NONE"(
"Nome"           VARCHAR(128) OPTIONS (column_name 'Nome'),
"Id"             INTEGER OPTIONS (column_name 'Id')
)
SERVER "verificar_NONE"
OPTIONS ( table_name 'Pessoas aleatórias');

select  * from "Pessoas aleatórias NONE";

We can read in https://www.firebirdsql.org/file/community/ppts/fbcon11/FbCon2011-Charsets-Heymann.pdf

NONE: Plain octets, no character set applied. With this character set setting, Firebird is unable to perform conversion operations like UPPER() correctly on anything other than the standard 26 latin letters.

My result from psql for my postgres database "Verificação de codificação"

Verificação de codificação=# select  * from "Pessoas aleatórias NONE";
   Nome    | Id 
-----------+----
 Conceição |  1
 Dores     |  2
 Céu       |  3
 Luz       |  4
 Fátima    |  5
 Antônia   |  6
 NONE      |  0
(7 строк)

Verificação de codificação=# select  * from "Pessoas aleatórias UTF-8";
   Nome    | Id 
-----------+----
 Conceição |  1
 Dores     |  2
 Céu       |  3
 Luz       |  4
 Fátima    |  5
 Antônia   |  6
 UTF-8     |  0
(7 строк)

Verificação de codificação=# select  * from "Pessoas aleatórias LATIN1";
   Nome    | Id 
-----------+----
 Conceição |  1
 Dores     |  2
 Céu       |  3
 Luz       |  4
 Fátima    |  5
 Antônia   |  6
 LATIN1    |  0
(7 строк)

Verificação de codificação=#

Postgres

SHOW SERVER_ENCODING;
 server_encoding 
-----------------
 UTF8

Firebird SELECT RDB$CHARACTER_SET_NAME FROM RDB$DATABASE diffrerent by example.

mkgrgis commented 4 years ago

@jonathands, there is additional tests. Im my envoronment there is no problems with loss characters. My version of the issue is You have any character transformation on Firebird side (DB view?), but need move this transformation to PostgreSQL.

CREATE FOREIGN TABLE "Pessoas aleatórias ↑↓ LATIN1"(
"Nome"           VARCHAR(128),
"Nome ↑"         VARCHAR(128),
"Nome ↓"         VARCHAR(128),
"Id"             INTEGER
)
SERVER "verificar_LATIN1"
OPTIONS(
  query 'SELECT "Nome", upper("Nome"), lower("Nome"), "Id" FROM "Pessoas aleatórias"'
);

select  * from "Pessoas aleatórias ↑↓ LATIN1";

CREATE FOREIGN TABLE "Pessoas aleatórias ↑↓ UTF-8"(
"Nome"           VARCHAR(128),
"Nome ↑"         VARCHAR(128),
"Nome ↓"         VARCHAR(128),
"Id"             INTEGER
)
SERVER "verificar_UTF-8"
OPTIONS(
  query 'SELECT "Nome", upper("Nome"), lower("Nome"), "Id" FROM "Pessoas aleatórias"'
);

select  * from "Pessoas aleatórias ↑↓ UTF-8";

CREATE FOREIGN TABLE "Pessoas aleatórias ↑↓ NONE"(
"Nome"           VARCHAR(128),
"Nome ↑"         VARCHAR(128),
"Nome ↓"         VARCHAR(128),
"Id"             INTEGER
)
SERVER "verificar_NONE"
OPTIONS(
  query 'SELECT "Nome", upper("Nome"), lower("Nome"), "Id" FROM "Pessoas aleatórias"'
);

select  * from "Pessoas aleatórias ↑↓ NONE";

CREATE FOREIGN TABLE "Pessoas aleatórias ↑↓ NONE text"(
"Nome"           TEXT,
"Nome ↑"         TEXT,
"Nome ↓"         TEXT,
"Id"             INTEGER
)
SERVER "verificar_NONE"
OPTIONS(
  query 'SELECT "Nome", upper("Nome"), lower("Nome"), "Id" FROM "Pessoas aleatórias"'
);

select  * from "Pessoas aleatórias ↑↓ NONE text";

It will be usefull, if You verify original form of NOME_CONTATO in Firebird by Firebird SQL for Your source object. What is original type of NOME_CONTATO in Firebird?

select rdb$view_source
from rdb$relations
where rdb$relation_name = ' SOURCE '
mkgrgis commented 4 years ago

@jonathands, have You tested Your environment by https://github.com/ibarwick/firebird_fdw/issues/12#issuecomment-601588374 ?

jonathands commented 4 years ago

I'm sorry for the delay, I'm no longer part of that specific project and things are really busy right now, I'll try to simulate the issue by myself and get back at you

mkgrgis commented 3 years ago

I'm unable to detect this issue for Firebird 3.0 and firebird_fdw 1.2.1. I think there was a problem with upper or lower function in Frirebird's view, which was a source for firebird_fdw. In Firebird it was NONE default character set and Firebird don't know how to deal with non-ASCII characters. @jonathands , can you test the issue with new version of firebird_fdw or close this issue?

AndreyMai commented 1 year ago

I have a same problem. I need to convert all strings after receiving data: convert_from(replace(“STRING_COLUMN”, '\', '\')::BYTEA, 'WIN1251')

mkgrgis commented 1 year ago

@AndreyMai , please write us Your environment configuration: versions of Firebird, PostgreSQL and firebird_fdw. Then, please, show problematic for You results for tests from https://github.com/ibarwick/firebird_fdw/issues/12#issuecomment-598652389 and https://github.com/ibarwick/firebird_fdw/issues/12#issuecomment-601588374 It'll great if your tests will use for data and table names not pt_BR but Your's native country names (Win1251 is an old cyrillic encoding for BG, BY, RU, SR, UA). Please don't forget to change LATIN1 to pseudonym for Win1251.

AndreyMai commented 1 year ago

Versions: FB: 3.0.3. Charset of database is NONE. PostgreSQL 14.5. Server encoding is UTF8. firebird_fdw 1.2.3

Unfortunately I can't recreate the FB database. This is very old production system (~20 years old) with big data storage. But I successfully connected to the FB database via ODBC driver. The default encoding ‘WIN1251’ is set in the ODBC source settings. And no problem with strings. See the picture in the next post.

Where i can set a similar setting for firebird_fdw?

AndreyMai commented 1 year ago

1DE20EBD-3621-42E6-A4D6-F33C2E9FB397

mkgrgis commented 1 year ago

Thanks, @AndreyMai! FB 3.0.3 + PostgreSQL 14.5 is a good configuration with no problems with data transfer. We can concentrate on encodings and char's transformations.

Unfortunately I can't recreate the FB database. This is very old production system (~20 years old) with big data storage.

My reply wasn't about recreating of the old FB DB but about creating a new smallest FB DB's for tests. Can You?

So, maybe You can add a table to old production FB DB and show a results for select * from isql-fb, psql+firebind_fdw and some ODBC client? First we must see what the transformation is wrong and what is the wrong result for You.

-- Firebird SQL
CREATE TABLE "Люди" (
"Имя" varchar(128) NOT NULL, 
"Id" integer NOT NULL
);
ALTER TABLE "Люди" ADD CONSTRAINT "Люди_PK" PRIMARY KEY ("Id");
INSERT INTO "Люди" ("Имя", "Id")
VALUES('Анна', 1);
INSERT INTO "Люди" ("Имя", "Id")
VALUES('Борис', 2);
INSERT INTO "Люди" ("Имя", "Id")
VALUES('Владимир', 3);
INSERT INTO "Люди" ("Имя", "Id")
VALUES('Галина', 4);
INSERT INTO "Люди" ("Имя", "Id")
VALUES('Дарья', 5);
INSERT INTO "Люди" ("Имя", "Id")
VALUES('Егор', 6);

Where i can set a similar setting for firebird_fdw?

Nowhere. The firebird_fdw program bases on PostgreSQL's DB's encoding and Firebird's DB's charset. You can add any additional transformation in pg foreign table as query or column option.

ibarwick commented 1 year ago

I was able to take a quick look at this while it's fresh on my radar. The underlying issue in both cases mentioned here is that the character set in the Firebird database is set to NONE. This is pretty much the equivalent of PostgreSQL's SQL_ASCII, i.e. a pseudo-character set/encoding which enables the user to store pretty much any data they care to input without any kind of validation, so e.g. it's perfectly possible to insert a mix of data in ISO-8859-1 and UTF8 encoding.

This does however mean that Firebird can't know what encoding the data is supposed to be in, so it can't convert the data to whatever encoding the client is requesting. Consqeuently whatever value is set for the client_encoding parameter has absolutely no effect.

Note that in this case the client is PostgreSQL/firebird_fdw, which is expecting to receive data in the PostgreSQL server's default encoding, typically UTF8, but what it is actually receiving in this case is a bunch of bytes which are not a problem if they happen to be stored in the same encoding in Firebird, but which are otherwise effectively bytea values which need to be explictly converted.

This issue does not occur if the Firebird database is created with an explicit character set; in that case Firebird converts the data into the character set requested by the client and everything is fine.

I should note that a similar issue occurs with postgres_fdw and a remote database using SQL_ASCII; in this case it is not possible for PostgreSQL to retrieve data from the remote database which doesn't match the local server's encoding.

For firebird_fdw it might be possible to add a setting (e.g. server_encoding) to explicitly specify the encoding in which data in the Firebird NONE-character set database is expected to be stored, and convert it to the PostgreSQL server's encoding on the fly, however assuming it's feasbile, that would entail development work which I'm not able to prioritise at the moment.

The best solution would be to convert the source Firebird database to use an explicit character set. That might be non-trivial though, I'd guess it would involve creating an entirely new database and possibly cleaning up any data not compatible with the specified character set.

Otherwise, the workaround with convert_from() is available.

It is also possible to create the PostgreSQL database with an encoding to match that of the data in the Firebird NONE-character set database, though there might be issues if there's any data in an incompatible encoding in Firebird.

tl;dr version: avoid NONE and SQL_ASCII.

mkgrgis commented 1 year ago

@ibarwick , a new paragraph for README.md ? "Character data transformation depends on encoding of Firebird database"? Can I create with examples or without or You'll add?

mkgrgis commented 1 year ago

NONE Firebird encoding described in new version of documentation, see. https://github.com/ibarwick/firebird_fdw/pull/28. Auto conversion for legacy encodings are added too.

ibarwick commented 1 year ago

It occurred to me there's a possible, if ugly, workaround for accessing (say) LATIN1/ISO-8859-1 data in a Firebird database with encoding NONE from a PostgreSQL database with (say) server encoding UTF8:

Closing this issue as it is now understood and documented.