tds-fdw / tds_fdw

A PostgreSQL foreign data wrapper to connect to TDS databases (Sybase and Microsoft SQL Server)
Other
380 stars 102 forks source link

Add support for new character sets #1

Open GeoffMontee opened 10 years ago

GeoffMontee commented 10 years ago

Currently, tds_fdw can't set the character set for the connection.

The FreeTDS implementation of DB-Library does have the DBSETLCHARSET macro to set the character set:

http://www.freetds.org/reference/a00284.html#ga114

However, it says it doesn't work on TDS 7.0+ connections. This makes it less useful for most versions of MS SQL Server, since those try to use 7.0+ by default:

http://www.freetds.org/userguide/choosingtdsprotocol.htm

The reason it is not supported for 7.0+ is described here:

http://www.freetds.org/userguide/localization.htm

"It is also worth clarifying that TDS 7.0 and above do not accept any specified character set during login, as 4.2 does. A TDS 7.0 login packet uses UCS-2."

There's also the dbsetdefcharset function:

http://www.freetds.org/reference/a00284.html#ga86

But it looks like that might be unimplemented in the current versions of FreeTDS.

Character sets can be set in freetds.conf:

http://www.freetds.org/userguide/freetdsconf.htm

It is probably possible to change character sets with DB-Library in a way that is compatible with versions 7.0+ of TDS also. I should figure out how.

On the PostgreSQL side, I also wonder how BuildTupleFromCStrings will handle character sets with multi-byte characters. I might need to find another way to build tuples.

GeoffMontee commented 10 years ago

From the FreeTDS localization page:

"To learn what character set the client wants, FreeTDS prefers the applicable freetds.conf client charset property. If that is not set, it parses the LANG environment variable. In either case, the found string is passed to iconv(3) (or its built-in replacement). [1]. If neither is found, UCS-2 data are converted to ISO 8859-1."

GeoffMontee commented 10 years ago

One user said that setting "client charset" in freetds.conf to UTF-8 worked.

sumariva commented 9 years ago

Today I got a foreign table returning error related to charset below:

NOTE:  DB-Library notice: Msg #: 5701, Msg state: 2, Msg: Changed database context to 'SIARM'., Server: WINDOWS-0CMIFUI, Process: , Line: 1, Level: 0
NOTE:  DB-Library notice: Msg #: 5703, Msg state: 1, Msg: Changed language setting to us_english., Server: WINDOWS-0CMIFUI, Process: , Line: 1, Level: 0
ERRO:  DB-Library error: DB #: 2403, DB Msg: Some character(s) could not be converted into client's character set.  Unconverted bytes were changed to question marks ('?'), OS #: 0, OS Msg: Sucess, Level: 4
********** Error **********

ERRO: DB-Library error: DB #: 2403, DB Msg: Some character(s) could not be converted into client's character set.  Unconverted bytes were changed to question marks ('?'), OS #: 0, OS Msg: Sucess, Level: 4
SQL state: HV00L

Tried put configuration client charset = UTF-8 on freetds.conf, first on global session then on server session but still same issue. Using tds version = 7.0

Does not message pointing that itself is just an warning? It says that will replace unconvertable chars to question marks.

GeoffMontee commented 9 years ago

Hi @sumariva,

I believe that client charset = UTF-8 in freetds.conf should work, as long as tds version = 7.0 (note: not tds vrsion, which is what you typed up above).

Can you please show the full contents of your freetds.conf?

sumariva commented 9 years ago

Yes, this is my freetds.conf

#   $Id: freetds.conf,v 1.12 2007/12/25 06:02:36 jklowden Exp $
#
# This file is installed by FreeTDS if no file by the same
# name is found in the installation directory.
#
# For information about the layout of this file and its settings,
# see the freetds.conf manpage "man freetds.conf".

# Global settings are overridden by those in a database
# server specific section
[global]
        # TDS protocol version
;       tds version = 4.2

        # Whether to write a TDSDUMP file for diagnostic purposes
        # (setting this to /tmp is insecure on a multi-user system)
;       dump file = /tmp/freetds.log
;       debug flags = 0xffff

        # Command and connection timeouts
;       timeout = 10
;       connect timeout = 10

        # If you get out-of-memory errors, it may mean that your client
        # is trying to allocate a huge buffer for a TEXT field.
        # Try setting 'text size' to a more reasonable limit
        text size = 64512
        client charset = UTF-8

# A typical Sybase server
[egServer50]
        host = symachine.domain.com
        port = 5000
        tds version = 5.0

# A typical Microsoft server
[duquedecaxias]
        host = 192.168.2.30
        port = 1433
        tds version = 7.0
        client charset = UTF-8
GeoffMontee commented 9 years ago

Thanks @sumariva!

The definition of duquedecaxias looks OK to me.

Can you show me the definition of your foreign server and foreign table please?

sumariva commented 9 years ago

Relative to postgresql conversions, there are iconv like functions on strings: http://www.postgresql.org/docs/9.3/static/functions-string.html#FUNCTIONS-STRING-OTHER convert_from will convert to database encoding.

GeoffMontee commented 9 years ago

The character set conversions are done by FreeTDS using the iconv library before PostgreSQL sees the strings, so those functions shouldn't be necessary for this.

sumariva commented 9 years ago

The configured server

CREATE SERVER siarm
FOREIGN DATA WRAPPER tds_fdw
OPTIONS (servername '192.168.2.30',port '1433');

The table mapping system_information tables, here I also noted a weird caracter ASCII(2) at end of a column

CREATE FOREIGN TABLE sgm.information_schema_columns_tds (
    table_catalog varchar
  , table_schema varchar
  , table_name varchar
  , column_name varchar
  , ordinal_position integer
  , column_default varchar
  , is_nullable varchar
  , data_type varchar
  , character_maximum_length integer
  , character_octect_length integer
  , numeric_precision integer
  , numeric_precision_radix integer
  , numeric_scale integer
  , datetime_precision integer
  , interval_type varchar
  , interval_precision integer
  , character_set_catalog varchar
  , character_set_schema varchar
  , character_set_name varchar
  , collation_catalog varchar
  , collation_schema varchar
  , collation_name varchar
  , domain_catalog varchar
  , domain_schema varchar
  , domain_name varchar
  , udt_catalog varchar
  , udt_schema varchar
  , udt_name varchar
  , scope_catalog varchar
  , scope_schema varchar
  , scope_name varchar
  , maximun_cardinality integer
  , dtd_identifier varchar
  /*
  , is_self_referencing varchar
  , is_identity varchar
  , identity_generation varchar
  , identity_start varchar
  , identity_increment varchar
  , identity_maximum varchar
  , identity_minimum varchar
  , identity_cycle varchar
  , is_generated varchar
  , generation_expression varchar
  , is_updatable varchar */
  )
    SERVER siarm
    OPTIONS ( query 'SELECT * FROM information_schema.columns ORDER BY table_schema, table_name, ordinal_position' )

The table as harvested using the system information table

    CREATE FOREIGN TABLE cliente."MV_IPTU_Contribuintes_CadastroHistoricos"
    (
    "Inscricao" integer NOT NULL,
    "ID" smallint NOT NULL,
    "Processo" character varying(12) ,
    "Historico" text NOT NULL,
    "Data" timestamp without time zone ,
    "Departamento" smallint ,
    "Data_Atualizacao" timestamp without time zone
    )
    SERVER siarm
    OPTIONS (table 'MV_IPTU_Contribuintes_CadastroHistoricos' );
GeoffMontee commented 9 years ago

@sumariva,

Your foreign server isn't using the duquedecaxias server definition in freetds.conf because you have servername set to 192.168.2.30. If you want it to use this definition, set servername to duquedecaxias.

Otherwise, you should set tds version = 7.0 and client charset = UTF-8 in the [global] section of freetds.conf. That way, those settings will apply to all connections.

sumariva commented 9 years ago

Humm, updated the freetds.conf as suggested

[global]
        # TDS protocol version
        tds version = 7.0
        client charset = UTF-8
[duquedecaxias]
        host = 192.168.2.30
        port = 1433
        tds version = 7.0
        client charset = UTF-8

but still with same message. I also checked postgresql client_encoding

SHOW client_encoding
UNICODE
GeoffMontee commented 9 years ago

Did you restart the PostgreSQL server after making the change to freetds.conf?

sumariva commented 9 years ago

Yes, the postgresql server has been restarted after each change since I do not known when those parameters are reread by the fdw_tds driver. I trying to discover the 2403 error code meaning at mssql documentation.

GeoffMontee commented 9 years ago

Maybe your configuration is being read correctly, but one of your rows has non-UTF-8 data.

The table mapping system_information tables, here I also noted a weird caracter ASCII(2) at end of a column

What do you mean by this? Are you saying that ASCII(2) is the data type of a column on the MS SQL Server side? Or something else?

sumariva commented 9 years ago

When I executed a select statement on that system information table(this is a database internal table) I got the following result in psql

 table_catalog | table_schema |           table_name           | column_name | ordinal_position | column_default | is_nullable | data_type | character_maximum_length | character_octect_length | numeric_precision | numeric_precision_radix | numeric_scale | datetime_precision | interval_type | interval_precision | character_set_catalog | character_set_schema | character_set_name | collation_catalog | collation_schema | collation_name | domain_catalog | domain_schema | domain_name | udt_catalog | udt_schema | udt_name | scope_catalog | scope_schema | scope_name | maximun_cardinality | dtd_identifier 
---------------+--------------+--------------------------------+-------------+------------------+----------------+-------------+-----------+--------------------------+-------------------------+-------------------+-------------------------+---------------+--------------------+---------------+--------------------+-----------------------+----------------------+--------------------+-------------------+------------------+----------------+----------------+---------------+-------------+-------------+------------+----------+---------------+--------------+------------+---------------------+----------------
 SIARM_CAXIAS  | dbo          | MV_IPTU_Contribuintes_Cadastro | Inscricao   |                1 |                | NO          | int       |                          |                         |                10 |                      10 |             0 |                    |               |                    |                       |                      |                    |                   |                  |                |                |               |             |             |            |          |               |              |            |                     | Y\x02

Note the the \x02 caracter reported at last column named dtd_identifier, what I called ASCII(2). Unfortunely I do not have a mssql client to issue the same query to see if that unicode \x02 is reported also.

sumariva commented 9 years ago

Found 2403 error code defined at freetds as an constant called SYBEICONVI define in file https://github.com/FreeTDS/freetds/blob/master/include/sybdb.h. Do not need to catch this "error code" as an warning and allow normal execution?

GeoffMontee commented 9 years ago

As far as I can tell, the column dtd_identifier doesn't exist in information_schema.columns in MS SQL Server:

https://msdn.microsoft.com/en-us/library/ms188348.aspx

This probably means that this column is being populated with whatever random bytes are already in memory when tds_fdw is fetching results. This looks like a new bug, so I've submitted that here:

https://github.com/GeoffMontee/tds_fdw/issues/25

Either way, this seems completely unrelated to any character set issues you are experiencing.

It still sounds to me like you may have data in one of your rows that is incompatible with UTF-8. Maybe try to find out which column is failing? (e.g. maybe try fetching only one varchar column at a time?)

Also, what version of MS SQL Server are you using? If you are using 2008 or later, maybe try tds version = 7.3:

http://www.freetds.org/userguide/choosingtdsprotocol.htm

GeoffMontee commented 9 years ago

@sumariva,

I don't know if this warning should be treated in a non-fatal manner without knowing exactly what is going wrong here for you. Can you please try to find out what data is failing? You should be able to find out which varchar column is failing by creating two new foreign tables:

CREATE FOREIGN TABLE cliente."MV_IPTU_Contribuintes_CadastroHistoricos_Processo"
(
"Processo" character varying(12)
)
SERVER siarm
OPTIONS (query 'SELECT Processo FROM MV_IPTU_Contribuintes_CadastroHistoricos' );

CREATE FOREIGN TABLE cliente."MV_IPTU_Contribuintes_CadastroHistoricos_Historico"
(
"Historico" text NOT NULL
)
SERVER siarm
OPTIONS (query 'SELECT Historico FROM MV_IPTU_Contribuintes_CadastroHistoricos' );

Then you could figure out which column is failing conversion with:

SELECT * FROM cliente."MV_IPTU_Contribuintes_CadastroHistoricos_Processo";
SELECT * FROM cliente."MV_IPTU_Contribuintes_CadastroHistoricos_Historico";
sumariva commented 9 years ago

I followed yours sugestions and found that the failing column is on second query

SELECT * FROM cliente."MV_IPTU_Contribuintes_CadastroHistoricos_Historico";

beyond 10000 rows fetched.

sumariva commented 9 years ago

Digging into MSSQL information_schema.columns table, I found that the type used on column is text(max??) using iso_1 and COLLATION Latin1_General_CI_AI.

The following attempt to translate the character to a binary reprentation failed with an error reported on footer

CREATE FOREIGN TABLE cliente."MV_IPTU_Contribuintes_CadastroHistoricos_Historico"
(
  "Historico" character varying
)
SERVER siarm
OPTIONS (query 'Select CAST( CAST( Historico AS varchar(max) ) AS varbinary(max) ) AS Historico FROM MV_IPTU_Contribuintes_CadastroHistoricos' );
======= Backtrace: =========
/lib64/libc.so.6(+0x7d4e6)[0x7fa5b5ba54e6]
/usr/lib64/libsybdb.so.5(dbconvert+0x7ae)[0x7fa5b1107fce]
/usr/lib64/postgresql-9.2/lib64/tds_fdw.so(+0x1eaa)[0x7fa5b1361eaa]

If I understood the documentation, when text type is selected, the database stores data on the configured database codepage, that could be an single byte not UCS-2(unicode). Did you known how to tell mssql to convert character on a different codepage on query before freetds receive it?

GeoffMontee commented 9 years ago

You might want to try downloading the most recent commit of tds_fdw and then compiling tds_fdw with DEBUG defined. Then tds_fdw will give you a lot of information about what's going on.

You might also want to consider enabling a FreeTDS log file.

sumariva commented 9 years ago

Just for reference I record the server version used:

Microsoft SQL Server 2014 - 12.0.4100.1 (X64) 
Apr 20 2015 17:29:27 
Copyright (c) Microsoft Corporation
Standard Edition (64-bit) on Windows NT 6.3 <X64> (Build 9600: )
sumariva commented 9 years ago

Currently trying read of column "Historico"(history) as binary

CREATE FOREIGN TABLE cliente."MV_IPTU_Contribuintes_CadastroHistoricos_HistoricoHex"
("Historico" text NOT NULL)
SERVER siarm
OPTIONS (query 'SELECT convert(varbinary(8000), convert(varchar(8000),Historico), 0) AS Historico FROM MV_IPTU_Contribuintes_CadastroHistoricos');

A simple select on postgresql showed that data has arrived.

select
  char_length( "Historico" ), octet_length( "Historico" ), "Historico", encode( "Historico"::bytea, 'hex' )
--, convert_to( "Historico"::bytea, 'iso_8859_1') mssql_varchar_from_varbinary
from
cliente."MV_IPTU_Contribuintes_CadastroHistoricos_HistoricoHex"
limit 100
wprigollopes commented 5 years ago

I had the same problem.

What happened: I'm configured my freetds.conf with default options and imported from a SQL Server the foreign schema into postgresql.

If I fetch some integer columns,freetds works well but, if I fetch some text columns, the error 2403 happens.

To find a solution, I've tried different charsets and, finally, the client_charset to CP1252 works fine. Seems a very similar problem that reported by @sumariva

Follow my freetds.conf (the read-only intent option was used because my postgresql database reads and refresh some materialized views, I'm not writing on SQL server database.

[totvs]
        host = <ip>
        port = 1433
        tds version = 7.4
        database = <dabatabe>
        client charset = CP1252 
        read-only intent yes

It seems that the sql server has data registered with "CP1252" format, so iconv returns an error because the command can't understand the some charset definitions in database. Using "CP1252" the conversion was done correctly and without errors.

Setup info SQL Server version: 2017 - Encoding LATIN1 - Win Postgresql: 10 - Encoding UTF-8 (default) - Ubuntu 18.04 LTS