marijnh / Postmodern

A Common Lisp PostgreSQL programming interface
http://marijnhaverbeke.nl/postmodern

CL-POSTGRES-TRIVIAL-UTF-8:UTF-8-DECODING-ERROR {10084930F3} #340

Closed JosefMachytkaNetApp closed 2 months ago

JosefMachytkaNetApp commented 3 months ago

Hi good people, maybe you can help me.

I am using pgloader to copy data from foreign tables (the source is Sybase ASE 16) created by the tds_fdw extension into local PostgreSQL tables. Within PostgreSQL everything works as expected: I can select data from an FDW table, and I can do INSERT INTO target_table SELECT * FROM fdw_table. But when I start pgloader to copy data from all tables, it fails on some tables with the error message "A thread failed with error: Invalid byte at start of character: 0xFC" (or another 0x.. code).

The error is reported in CL-POSTGRES-TRIVIAL-UTF-8, which seems to be maintained by you.

2024-02-23T08:34:24.097092Z ERROR A thread failed with error: Invalid byte at start of character: 0xFC
2024-02-23T08:34:24.101092Z ERROR Date/time: 2024-02-23-09:34!
An unhandled error condition has been signalled:
   Invalid byte at start of character: 0xFC

Backtrace for: #<SB-THREAD:THREAD "lparallel" RUNNING {10059F0073}>
0: (TRIVIAL-BACKTRACE:PRINT-BACKTRACE-TO-STREAM #<SB-IMPL::STRING-OUTPUT-STREAM {1008493303}>)
1: (TRIVIAL-BACKTRACE:PRINT-BACKTRACE #<CL-POSTGRES-TRIVIAL-UTF-8:UTF-8-DECODING-ERROR {10084930F3}> :OUTPUT NIL :IF-EXISTS :APPEND :VERBOSE NIL)
2: ((LAMBDA (CONDITION) :IN PGLOADER.LOAD:COPY-FROM) #<CL-POSTGRES-TRIVIAL-UTF-8:UTF-8-DECODING-ERROR {10084930F3}>)
3: (LPARALLEL.KERNEL::CONDITION-HANDLER #<CL-POSTGRES-TRIVIAL-UTF-8:UTF-8-DECODING-ERROR {10084930F3}>)
4: (SB-KERNEL::%SIGNAL #<CL-POSTGRES-TRIVIAL-UTF-8:UTF-8-DECODING-ERROR {10084930F3}>)
5: (ERROR CL-POSTGRES-TRIVIAL-UTF-8:UTF-8-DECODING-ERROR :BYTE 252 :MESSAGE "Invalid byte at start of character: 0x~X")
6: (CL-POSTGRES-TRIVIAL-UTF-8:READ-UTF-8-STRING #<unavailable argument> :NULL-TERMINATED #<unavailable argument> :STOP-AT-EOF #<unavailable argument> :CHAR-LENGTH #<unavailable argument> :BYTE-LENGTH #<unavailable argument>)
7: ((LAMBDA (#:G0 #:G1) :IN "/usr/share/common-lisp/source/postmodern/cl-postgres/interpret.lisp") #<SB-SYS:FD-STREAM for "socket, peer: /var/run/postgresql/.s.PGSQL.5432" {1007EED623}> 540)
8: ((LAMBDA (#:G39 PGLOADER.SOURCES:FIELDS) :IN PGLOADER.SOURCES:MAP-ROWS) #<SB-SYS:FD-STREAM for "socket, peer: /var/run/postgresql/.s.PGSQL.5432" {1007EED623}> #(#<CL-POSTGRES::FIELD-DESCRIPTION {1007F0AD63}> #<CL-POSTGRES::FIELD-DESCRIPTION {1007F0AF13}> #<CL-POSTGRES::FIELD-DESCRIPTION {1007F0B0C3}> #<CL-POSTGRES::FIELD-DESCRIPTION {1007F0B273}> #<CL-POSTGRES::FIELD-DESCRIPTION {1007F0B423}> #<CL-POSTGRES::FIELD-DESCRIPTION {1007F0B5D3}> #<CL-POSTGRES::FIELD-DESCRIPTION {1007F0B783}>))
9: (CL-POSTGRES::SEND-QUERY #<SB-SYS:FD-STREAM for "socket, peer: /var/run/postgresql/.s.PGSQL.5432" {1007EED623}> #<unavailable argument> #<unavailable argument>)
10: ((LABELS #:G4 :IN CL-POSTGRES:EXEC-QUERY))
11: ((:METHOD PGLOADER.SOURCES:MAP-ROWS (PGLOADER.SOURCE.PGSQL:COPY-PGSQL)) #<PGLOADER.SOURCE.PGSQL:COPY-PGSQL {1005D33833}> :PROCESS-ROW-FN #<FUNCTION (LAMBDA (PGLOADER.LOAD::ROW) :IN PGLOADER.LOAD:QUEUE-RAW-DATA) {1007EEBD0B}>) [fast-method]

The underlying database should be in UTF-8 according to the content of the Sybase sys tables, but it looks like some data could be in UTF-16.

I tried to configure FreeTDS to convert UTF-16 to UTF-8, but it is unclear whether this conversion really works with TDS protocol version 5.0, which is required for Sybase; the documentation mentions it only for version 7+.

It is really confusing to me that everything works at the PostgreSQL level while this Common Lisp library throws an error. This problem is now a serious blocker for migrations. Could you give me at least some advice?

I have already tried many different recipes from the web; none of them worked.

Thank you very much.


During additional tests I see the code failing with the error messages on both line 108 (https://github.com/marijnh/Postmodern/blob/master/cl-postgres/trivial-utf-8.lisp#L108) and line 135 (https://github.com/marijnh/Postmodern/blob/master/cl-postgres/trivial-utf-8.lisp#L135).

I wrote a short Python script which reads data from those tables and checks whether it can be interpreted as UTF-8 strings; everything works. I also repeatedly checked the data in PostgreSQL: the strings look absolutely correct, and tds_fdw and PostgreSQL can handle them without any problems.
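
For comparison, the same validity check can be sketched in Common Lisp. This is only an illustration, assuming the standalone trivial-utf-8 library, from which the bundled cl-postgres-trivial-utf-8 package is derived:

    (ql:quickload :trivial-utf-8)

    (defun valid-utf-8-p (bytes)
      "Return T if BYTES decodes cleanly as UTF-8, NIL otherwise."
      (handler-case
          (progn (trivial-utf-8:utf-8-bytes-to-string
                  (coerce bytes '(simple-array (unsigned-byte 8) (*))))
                 t)
        (trivial-utf-8:utf-8-decoding-error () nil)))

    (valid-utf-8-p #(104 101 108 108 111)) ; => T, plain ASCII "hello"
    (valid-utf-8-p #(252))                 ; => NIL, the 0xFC byte from the backtrace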

sabracrolleton commented 3 months ago

Hmm. It certainly looks like cl-postgres-trivial-utf-8 THINKS it is getting a utf-16 encoded string coming from the database. At that point it errors out because it was only written to handle ascii and utf-8.

I will take a look, but I cannot promise a quick fix.

svantevonerichsen6906 commented 3 months ago

An #xFC byte in UTF-16 would be part of some Arabic ligature, if I see that right, but maybe that's just what it is. On the other hand, maybe it's some binary data, maybe it's compressed?

JosefMachytkaNetApp commented 3 months ago

Unfortunately I do not know the precise row or column on which the code fails. And as I said, in PostgreSQL or Python it does not cause any problems at all. I have read all those long error messages many times to see whether I can find something useful there, but I do not see anything. It looks like all the important information is missing: #<unavailable argument>. But if you can give me some advice on how to debug it at a deeper level, I will do it.

Backtrace for: #<SB-THREAD:THREAD "lparallel" RUNNING {1005A88303}>
0: (TRIVIAL-BACKTRACE:PRINT-BACKTRACE-TO-STREAM #<SB-IMPL::STRING-OUTPUT-STREAM {100B3964F3}>)
1: (TRIVIAL-BACKTRACE:PRINT-BACKTRACE #<CL-POSTGRES-TRIVIAL-UTF-8:UTF-8-DECODING-ERROR {100B3962E3}> :OUTPUT NIL :IF-EXISTS :APPEND :VERBOSE NIL)
2: ((LAMBDA (CONDITION) :IN PGLOADER.LOAD:COPY-FROM) #<CL-POSTGRES-TRIVIAL-UTF-8:UTF-8-DECODING-ERROR {100B3962E3}>)
3: (LPARALLEL.KERNEL::CONDITION-HANDLER #<CL-POSTGRES-TRIVIAL-UTF-8:UTF-8-DECODING-ERROR {100B3962E3}>)
4: (SB-KERNEL::%SIGNAL #<CL-POSTGRES-TRIVIAL-UTF-8:UTF-8-DECODING-ERROR {100B3962E3}>)
5: (ERROR CL-POSTGRES-TRIVIAL-UTF-8:UTF-8-DECODING-ERROR :BYTE 115 :MESSAGE "Invalid byte 0x~X inside a character.")
6: ((LABELS CL-POSTGRES-TRIVIAL-UTF-8::SIX-BITS :IN CL-POSTGRES-TRIVIAL-UTF-8::GET-UTF-8-CHARACTER) #<unavailable argument>)
7: (CL-POSTGRES-TRIVIAL-UTF-8::GET-UTF-8-CHARACTER #(228 115 101 0) #<unavailable argument> 2)
8: (CL-POSTGRES-TRIVIAL-UTF-8:READ-UTF-8-STRING #<SB-SYS:FD-STREAM for "socket, peer: /var/run/postgresql/.s.PGSQL.5432" {1007747D13}> :NULL-TERMINATED NIL :STOP-AT-EOF T :CHAR-LENGTH #<unavailable argument> :BYTE-LENGTH #<unavailable argument>)

sabracrolleton commented 3 months ago

What operating system are you using?

sabracrolleton commented 3 months ago

Can you post your Python script? Since we do not have access to the data triggering the error in cl-postgres, I would like to set up some kind of test base that passes on the Python side and fails on the Common Lisp side.

JosefMachytkaNetApp commented 3 months ago

Hi, here are details:

But after some tests I think the pyodbc library already does the trick and converts strings into UTF-8, so in the other steps everything works as expected.

But I am trying something different now: I found that pgloader's parallel workers garble the error messages from crashed threads. So I am now testing with just a single worker and checking the output from the crash. It gives much more information now. I will post details if I find something interesting.

JosefMachytkaNetApp commented 3 months ago

Unfortunately my attempts to catch the problematic data in one specific table using only a single worker and a very low prefetch-rows number were not successful. It looks like the error message from the crash does not show the problematic row content, just hundreds of "0"s. But looking at the content of that table, there is a column "value" of the Sybase data type "text", which according to the SAP documentation should be able to store up to 2 GB of "printable characters". On the other hand, Sybase has the special command "writetext" for inserts and updates into this type of column. So the question is what it really stores internally.

When I look into its content, it contains whatever you can imagine, because it stores configuration properties of all types.

Besides the "text" data type, Sybase also has a "unitext" data type, which is dedicated to "Unicode characters", so the "text" data type will most likely use different character sets; so far I have not found any info on how to check which one is really used.

Tds_fdw casts it to the PostgreSQL type "text", and as I mentioned, copying data at the PostgreSQL level using INSERT INTO ... SELECT ... works. It also looks like the pyodbc library treats it as text, and that seems to work.

sabracrolleton commented 3 months ago

Just as a side note, as you know, encoding is not the same as data type. At this point in the process, cl-postgres-trivial-utf-8 is reading a byte stream from pgloader and that byte stream will be different depending on the encoding. So consider the nonsense text string "ÄÖÜÈäes8".

Encoded for utf-8, the octet vector will look like:

(195 132 195 150 195 156 195 136 195 164 101 115 56)

Encoded for utf-16, the octet vector will look like:

(196 0 214 0 220 0 200 0 228 0 101 0 115 0 56 0)

Encoded for latin-1 or ISO-8859-1, the octet vector will look like:

(196 214 220 200 228 101 115 56)

cl-postgres-trivial-utf-8 reading the first vector will generate the expected string. Attempts to read these particular utf-16 or latin-1 encoded vectors will generate "Invalid byte 0x~X inside a character" errors. Other vectors will trigger the "Invalid byte at start of character" error, which gets generated while cl-postgres-trivial-utf-8 is determining the number of bytes in the character. Of course, that does not mean that there is not a problem in cl-postgres-trivial-utf-8. It just explains why my first reaction is that the errors look like a different encoding. By the way, having different encodings in a single database can lead to subtle data corruption issues.
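
These vectors are easy to reproduce. A sketch, assuming the babel library (with its :utf-8, :utf-16le and :latin-1 encoding keywords) for the encoding side and the standalone trivial-utf-8 library for the decoding side:

    (ql:quickload '(:babel :trivial-utf-8))

    ;; Encode the sample string under the three encodings discussed above.
    (babel:string-to-octets "ÄÖÜÈäes8" :encoding :utf-8)
    ;; => #(195 132 195 150 195 156 195 136 195 164 101 115 56)
    (babel:string-to-octets "ÄÖÜÈäes8" :encoding :utf-16le)
    ;; => #(196 0 214 0 220 0 200 0 228 0 101 0 115 0 56 0)
    (babel:string-to-octets "ÄÖÜÈäes8" :encoding :latin-1)
    ;; => #(196 214 220 200 228 101 115 56)

    ;; Only the first vector decodes as UTF-8; the other two signal
    ;; TRIVIAL-UTF-8:UTF-8-DECODING-ERROR, as described above.
    (trivial-utf-8:utf-8-bytes-to-string
     (babel:string-to-octets "ÄÖÜÈäes8" :encoding :utf-8))
    ;; => "ÄÖÜÈäes8"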

sabracrolleton commented 3 months ago

I am still trying to figure out what valid byte sequence might be causing the problem.

Another side note: if you look at https://www.postgresql.org/docs/current/multibyte.html, postgresql itself does not support utf-16. You can prove this yourself by doing the following:

1. Create utf8 csv file

echo ÄÖÜÈäes8 > utf8.csv

Verify that it is utf8

file -i utf8.csv

utf8.csv: text/plain; charset=utf-8

2. Create utf16 csv file

Take the utf8 file, use iconv to convert it to utf-16 as a new file

iconv -f UTF-8 -t UTF-16 utf8.csv -o utf16.csv

Verify that it is utf16

file -i utf16.csv

utf16.csv: text/plain; charset=utf-16le

3. Start psql

4. Create utf8 database

Replace "sabra" with the user of your choice.

CREATE DATABASE "test8"
WITH OWNER "sabra"
ENCODING 'UTF8';

5. Now switch to the test8 database

\c test8

6. Create 2 tables

create table test8 (data text);
create table test16 (data text);

7. Use postgresql copy command to insert the utf8 file into the proper table

\copy test8 from '/home/sabra/tmp/utf8.csv' delimiter ',' csv;

Validate that the data got in and you can see it:

select * from test8;

  data   
----------
 ÄÖÜÈäes8
(1 row)

8. Now try the same thing with the utf16 file

    \copy test16 from '/home/sabra/tmp/utf16.csv' delimiter ',' csv;
ERROR:  invalid byte sequence for encoding "UTF8": 0xff
CONTEXT:  COPY test16, line 1

Postgresql will not import the utf-16 file.

JosefMachytkaNetApp commented 3 months ago

Thank you for that example. Can I somehow debug your function using some Lisp code to catch the problematic data? I tried pgloader -d, but in the debug messages from the crash I somehow do not see anything useful.

sabracrolleton commented 3 months ago

I have never used pgloader, so I am just guessing here.

If you build pgloader from source, does it create a directory ~/quicklisp/dists/quicklisp/software/ ? If so, you are looking for a subdirectory named postmodern-xxx.git/cl-postgres/ where the xxx is some date. The files you would be looking for are trivial-utf-8.lisp and strings-utf-8.lisp. Then you would have to rebuild pgloader from source, and it should pick up the changes you make to those files.

If pgloader does not create that directory, I do not know where it might stash source files.

Someone else might know how you can keep the Lisp instance that pgloader creates running so you can connect to it and recompile the Lisp functions on the fly.

JosefMachytkaNetApp commented 3 months ago

Hi, it looks like I found it: that table is probably storing data in plain ASCII. I did some additional testing and debugging, and at one point Sybase suddenly gave me the error "ASE is now using a multi-byte character set, and the TEXT character counts have not been re-calculated using this character set for table '....'. Use DBCC FIX_TEXT on this table." So it looks like the problem comes from older versions of Sybase. Your function seems to refuse the character codes of special German letters like ä or ü, which is of course completely OK. And now I also see the values of these characters in the error messages from the crashes. It is just a bit confusing to me that PostgreSQL somehow recognizes this problem and converts the data properly; I will look at that too. So please wait with further actions until I find out more. Thank you very much for your help so far. :1st_place_medal: :star: :superhero:

sabracrolleton commented 3 months ago

Plain ASCII is a subset of UTF-8. I am surprised that there is a problem.

svantevonerichsen6906 commented 3 months ago

If it contains umlauts, it's not ASCII, but probably something like ISO 8859-1 (a.k.a. latin-1) or some »Windows codepage« (1252 or thereabouts IIRC).
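
That also matches the second backtrace above: frame 7 shows GET-UTF-8-CHARACTER being handed #(228 115 101 0), and byte 228 (#xE4) is Latin-1 "ä". A sketch of why this fails, again assuming the standalone trivial-utf-8 library:

    ;; As a UTF-8 lead byte, #xE4 (Latin-1 "ä") announces a three-byte
    ;; sequence, so the decoder expects two continuation bytes of the
    ;; form 10xxxxxx. The ASCII "s" (115, #x73) that follows is not one,
    ;; hence "Invalid byte 0x73 inside a character." (frame 5 above).
    (trivial-utf-8:utf-8-bytes-to-string
     (coerce #(228 115 101) '(simple-array (unsigned-byte 8) (*))))
    ;; signals TRIVIAL-UTF-8:UTF-8-DECODING-ERROR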

JosefMachytkaNetApp commented 3 months ago

Yes, it looks like it is latin-1. That was the original default on very old versions of Sybase.

JosefMachytkaNetApp commented 3 months ago

Hi, the column was converted using DBCC FIX_TEXT as that error message requested. All my checks and tests now work with no further error messages, but unfortunately the Lisp function is still failing with the same error. Does your function somehow depend on encoding information read from the source database?

sabracrolleton commented 3 months ago

CL-POSTGRES-TRIVIAL-UTF-8:READ-UTF-8-STRING, as indicated by its name, is predicated on the bytes it is reading being encoded as UTF-8. The error says that it received bytes that are invalid UTF-8 text characters. So, the possibilities are:

  1. The function is wrong about invalid UTF-8 encoded characters or
  2. The function is getting called when it should not be called (something is calling it, but giving it non-UTF-8 encoded characters - e.g. binary data or non-utf-8 encoded data such as Latin-1).

My current bet would be on (2). That means we need to go deeper in the backtrace to see what is calling CL-POSTGRES::ENC-READ-STRING. That is likely to be a function found in cl-postgres/interpret.lisp, possibly CL-POSTGRES::INTERPRET-AS-TEXT, CL-POSTGRES::SET-SQL-READER or something created by the macro CL-POSTGRES::BINARY-READER.

As I understand dbcc fix_text, it only applies to text values, so it would not have done anything to binary data. Assuming it did its job on all the text data, there might be some binary data stored that cl-postgres is misinterpreting, just throwing up its hands and assuming it is text, and then calling ENC-READ-STRING which calls READ-UTF-8-STRING. If we can figure out what that is, we can define a binary interpreter for that type and get you moving again. (You can see the currently defined interpreters in cl-postgres/interpret.lisp beginning on line 137. For example, if Sybase encoded floats in binary form, Postgresql might deal with it properly but cl-postgres currently does not have an interpreter for that.)
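
For illustration, registering a custom reader with cl-postgres looks roughly like the sketch below. The OID 99999 and the pass-through lambda are hypothetical placeholders; a real fix would use the OID of the problematic type and a proper parsing function:

    ;; Sketch only: tell cl-postgres to hand values of the given type
    ;; (identified by OID) to our own function instead of the default
    ;; interpreter.
    (cl-postgres:set-sql-reader 99999           ; hypothetical type OID
                                (lambda (value) ; the value's text form
                                  value))       ; pass it through unparsed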

Does your backtrace show more levels than the 8 shown in your first comment? I am interested in seeing everything it provides.

JosefMachytkaNetApp commented 2 months ago

I modified the Lisp code in the pgloader trivial-utf-8 file to not fail with errors but just print warnings, and built our own version of pgloader with this modification. Almost all rows with these "invalid byte" warnings are now accepted by PostgreSQL; just a very few are rejected because their content cannot be interpreted as a UTF-8 string. Pgloader with the "on error resume next" setting skips these problematic rows and reports them in special log files, and this seems to be a reasonable result for this use case. So I am closing this issue. Thank you.
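
The patch itself edited the library, but the warn-and-substitute idea can be sketched at the call-site level, assuming the standalone trivial-utf-8 library. Note that this coarse version replaces a whole undecodable value with U+FFFD rather than fixing it character by character:

    (defun lenient-utf-8-bytes-to-string (bytes)
      "Decode BYTES as UTF-8; on failure, warn and return U+FFFD instead of signalling."
      (handler-case
          (trivial-utf-8:utf-8-bytes-to-string
           (coerce bytes '(simple-array (unsigned-byte 8) (*))))
        (trivial-utf-8:utf-8-decoding-error (e)
          (warn "Undecodable value, substituting U+FFFD: ~A" e)
          (string (code-char #xFFFD)))))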

sabracrolleton commented 2 months ago

That seems like the appropriate immediate fix, and something that should be optional in the error condition handling for trivial-utf-8. Is there anything common in the problematic rows?

JosefMachytkaNetApp commented 2 months ago

In this use case all warnings are about special German letters encoded in Latin-1. The rows rejected by PostgreSQL show one of these two errors:

Database error 22021: invalid byte sequence for encoding "UTF8": 0xf6 0xb2 0xb5 0xae
Database error 22021: invalid byte sequence for encoding "UTF8": 0xf6 0xb2 0xa4 0xa5

But only a handful of rows out of millions were rejected, so the result is acceptable.

sabracrolleton commented 2 months ago

Thank you. I will think about the best fall-back option when the error gets triggered.