Closed JosefMachytkaNetApp closed 2 months ago
Hmm. It certainly looks like cl-postgres-trivial-utf-8 THINKS it is getting a UTF-16 encoded string coming from the database. At that point it errors out because it was only written to handle ASCII and UTF-8.
I will take a look, but I cannot promise a quick fix.
An #xFC byte in UTF-16 would be part of some Arabic ligature if I see that right, but maybe that's just what it is. On the other hand, maybe it's some binary data, maybe it's compressed?
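For what it's worth, the Arabic-ligature reading can be sanity-checked in Python: a UTF-16 code unit whose high byte is 0xFC (i.e. U+FC00 through U+FCFF) falls inside the Arabic Presentation Forms-A block.

```python
import unicodedata

# A UTF-16 code unit with high byte 0xFC, e.g. U+FC00, lies in the
# Arabic Presentation Forms-A block (U+FB50..U+FDFF).
ch = "\ufc00"
print(hex(ord(ch)))
print(unicodedata.name(ch))  # name contains "ARABIC LIGATURE"
```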
Unfortunately I do not know the precise row or column on which the code fails. And as I said, in PostgreSQL or Python it does not cause any problems at all.
I read all those long error messages many times to see if I can find something useful there, but I do not see anything. It looks like all the important information is missing - #<unavailable argument>. But if you could give me some advice on how to debug it at a deeper level, I will do it.
Backtrace for: #<SB-THREAD:THREAD "lparallel" RUNNING {1005A88303}>
0: (TRIVIAL-BACKTRACE:PRINT-BACKTRACE-TO-STREAM #<SB-IMPL::STRING-OUTPUT-STREAM {100B3964F3}>)
1: (TRIVIAL-BACKTRACE:PRINT-BACKTRACE #<CL-POSTGRES-TRIVIAL-UTF-8:UTF-8-DECODING-ERROR {100B3962E3}> :OUTPUT NIL :IF-EXISTS :APPEND :VERBOSE NIL)
2: ((LAMBDA (CONDITION) :IN PGLOADER.LOAD:COPY-FROM) #<CL-POSTGRES-TRIVIAL-UTF-8:UTF-8-DECODING-ERROR {100B3962E3}>)
3: (LPARALLEL.KERNEL::CONDITION-HANDLER #<CL-POSTGRES-TRIVIAL-UTF-8:UTF-8-DECODING-ERROR {100B3962E3}>)
4: (SB-KERNEL::%SIGNAL #<CL-POSTGRES-TRIVIAL-UTF-8:UTF-8-DECODING-ERROR {100B3962E3}>)
5: (ERROR CL-POSTGRES-TRIVIAL-UTF-8:UTF-8-DECODING-ERROR :BYTE 115 :MESSAGE "Invalid byte 0x~X inside a character.")
6: ((LABELS CL-POSTGRES-TRIVIAL-UTF-8::SIX-BITS :IN CL-POSTGRES-TRIVIAL-UTF-8::GET-UTF-8-CHARACTER) #<unavailable argument>)
7: (CL-POSTGRES-TRIVIAL-UTF-8::GET-UTF-8-CHARACTER #(228 115 101 0) #<unavailable argument> 2)
8: (CL-POSTGRES-TRIVIAL-UTF-8:READ-UTF-8-STRING #<SB-SYS:FD-STREAM for "socket, peer: /var/run/postgresql/.s.PGSQL.5432" {1007747D13}> :NULL-TERMINATED NIL :STOP-AT-EOF T :CHAR-LENGTH #<unavailable argument> :BYTE-LENGTH #<unavailable argument>)
What operating system are you using?
Can you post your Python script? Since we do not have access to the data triggering the error in cl-postgres, I would like to set up some kind of test base that passes on the Python side and fails on the Common Lisp side.
Hi, here are details:
OS: Debian GNU/Linux 12 (bookworm)
Installed packages for the connection to Sybase:
ii  freetds-bin            1.3.17+ds-2        amd64  FreeTDS command-line utilities
ii  freetds-common         1.3.17+ds-2        all    configuration files for FreeTDS SQL client libraries
ii  freetds-dev            1.3.17+ds-2        amd64  MS SQL and Sybase client library (static libs and headers)
ii  postgresql-16-tds-fdw  2.0.3-3.pgdg120+1  amd64  PostgreSQL foreign data wrapper for TDS databases
ii  tdsodbc:amd64          1.3.17+ds-2        amd64  ODBC driver for connecting to MS SQL and Sybase SQL servers
FreeTDS is configured to protocol version 5.0, which seems to be the proper version for Sybase. Unfortunately it looks like the advanced options are available only for 7+.
My Python script is quite long, testing different topics, so just briefly:
But after some tests I think the pyodbc library already does the trick and converts strings into UTF-8, so in the other steps everything works as expected.
But I am trying something different now - I found that pgloader's parallel workers are mangling the error messages from the crashed threads. So I am now testing with just a single worker and checking the output from the crash. It now gives much more information. I will post details if I find something interesting.
Unfortunately my attempts to catch the problematic data in one specific table using only a single worker and a very low prefetch rows number were not successful. It looks like the error message from the crash does not show the problematic row content - just hundreds of "0"s. But looking at the content of that table, there is a column "value" of the Sybase data type "text", which according to the SAP documentation should be able to store up to 2 GB of "printable characters". On the other hand, Sybase has a special command "writetext" for inserts and updates into this type of column. So the question is what it really stores internally.
When I look into its content, it contains whatever you can imagine, because it stores configuration properties of all types. There are entries like:
<?xml version="1.0" encoding="ISO-8859-1"?>
Beside of "text" data type, Sybase also has "unitext" data type, which is dedicated for "Unicode characters", so that "text" data type will most likely use different character sets, but so far I did not find any info how to check which one is really used.
Tds_fdw casts it as the PostgreSQL type "text" and, as I mentioned, copying data on the PostgreSQL level using INSERT INTO ... SELECT ... works. And it looks like the pyodbc library also deals with it as text, and it seems to work.
Just as a side note, as you know, encoding is not the same as data type. At this point in the process, cl-postgres-trivial-utf-8 is reading a byte stream from pgloader and that byte stream will be different depending on the encoding. So consider the nonsense text string "ÄÖÜÈäes8".
Encoded for utf-8, the octet vector will look like:
Encoded for utf-16, the octet vector will look like:
Encoded for latin-1 or ISO-8859-1, the octet vector will look like:
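The three octet vectors themselves are not shown above; under the assumption that the UTF-16 one is little-endian without a BOM, they can be regenerated with a few lines of Python:

```python
s = "ÄÖÜÈäes8"
print(list(s.encode("utf-8")))    # [195, 132, 195, 150, 195, 156, 195, 136, 195, 164, 101, 115, 56]
print(list(s.encode("utf-16-le")))  # [196, 0, 214, 0, 220, 0, 200, 0, 228, 0, 101, 0, 115, 0, 56, 0]
print(list(s.encode("latin-1")))  # [196, 214, 220, 200, 228, 101, 115, 56]
```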
cl-postgres-trivial-utf-8 reading the first vector will generate the expected string. Attempts to read these particular utf-16 or latin-1 encoded vectors will generate "invalid byte 0x~X inside a character" errors. Other vectors will trigger the "invalid byte at start of character" error, which is generated while cl-postgres-trivial-utf-8 is determining the number of bytes in the character. Of course, that does not mean that there is not a problem in cl-postgres-trivial-utf-8. It just explains why my first reaction is that the errors look like a different encoding. By the way, having different encodings in a single database can lead to subtle data corruption issues.
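As a concrete instance of the "invalid byte at start of character" case: 0xFC, the very byte from the original report, is 'ü' in Latin-1, but as a UTF-8 lead byte it would announce a six-byte sequence, which UTF-8 has long disallowed. Python's decoder rejects it the same way:

```python
raw = bytes([0xFC])

# In Latin-1 this single byte is the letter 'ü'.
print(raw.decode("latin-1"))  # ü

# In UTF-8 0xFC is not a legal lead byte at all (six-byte sequences
# were removed from UTF-8 long ago), so decoding fails immediately.
try:
    raw.decode("utf-8")
except UnicodeDecodeError as e:
    print(e.reason)  # invalid start byte
```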
I am still trying to figure out what valid byte sequence might be causing the problem.
Another side note: If you look at https://www.postgresql.org/docs/current/multibyte.html, postgresql itself does not support utf-16. You can prove this yourself doing the following:
echo ÄÖÜÈäes8 > utf8.csv
Verify that it is utf8
file -i utf8.csv
utf8.csv: text/plain; charset=utf-8
Take the utf8 file, use iconv to convert it to utf-16 as a new file
iconv -f UTF-8 -t UTF-16 utf8.csv -o utf16.csv
Verify that it is utf16
file -i utf16.csv
utf16.csv: text/plain; charset=utf-16le
Replace "sabra" with the user of your choice.
CREATE DATABASE "test8"
WITH OWNER "sabra"
ENCODING 'UTF8';
\c test8
create table test8 (data text);
create table test16 (data text);
\copy test8 from '/home/sabra/tmp/utf8.csv' delimiter ',' csv;
Validate that the data got in and you can see it:
select * from test8;
data
----------
ÄÖÜÈäes8
(1 row)
\copy test16 from '/home/sabra/tmp/utf16.csv' delimiter ',' csv;
ERROR: invalid byte sequence for encoding "UTF8": 0xff
CONTEXT: COPY test16, line 1
Postgresql will not import the utf-16 file.
Thank you for that example.
Can I somehow debug your function using some Lisp code to catch problematic data?
I tried pgloader -d, but in the debug messages from the crash I somehow do not see anything useful.
I have never used pgloader, so I am just guessing here.
If you build pgloader from source, does it create a directory ~/quicklisp/dists/quicklisp/software/ ? If so, you are looking for a subdirectory named postmodern-xxx.git/cl-postgres/ where xxx is some date. The files you would be looking for are trivial-utf-8.lisp and strings-utf-8.lisp. Then you would have to rebuild pgloader from source, and it should pick up the changes you make to those files.
If pgloader does not create that directory, I do not know where it might stash source files.
Someone else might know how you can keep the lisp instance that pgloader creates running so you can connect to it and recompile the lisp functions on the fly.
Hi, it looks like I found it - that table is probably storing data in plain ASCII. I did some additional testing and debugging, and at one moment Sybase suddenly gave me an error: ASE is now using a multi-byte character set, and the TEXT character counts have not been re-calculated using this character set for table '....'. Use DBCC FIX_TEXT on this table. So it looks like the problem comes from older versions of Sybase. Your function seems to refuse ASCII codes of special German letters like ä or ü. Which is of course completely OK. And now I also see the ASCII values of these characters in the error messages from the crashes. It is just a bit confusing for me that PostgreSQL somehow recognizes this problem and converts the data properly - I will look at that too. So please wait with further actions until I find more. Thank you very much for your help so far. :1st_place_medal: :star: :superhero:
Plain ASCII is a subset of UTF-8. I am surprised that there is a problem.
If it contains umlauts, it's not ASCII, but probably something like ISO 8859-1 (a.k.a. latin-1) or some »Windows codepage« (1252 or thereabouts, IIRC).
Yes, it looks like it is latin-1. That was the original default on very old versions of Sybase.
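If Latin-1 is confirmed, the per-value fix is a plain transcode. A minimal sketch of what a converter (or a fallback inside the decoder) would have to do:

```python
raw = bytes([228, 115, 101])  # "äse" stored as Latin-1

# Reinterpret the Latin-1 bytes as text, then re-encode as valid UTF-8.
fixed = raw.decode("latin-1").encode("utf-8")
print(list(fixed))  # [195, 164, 115, 101] - now a well-formed UTF-8 "äse"
```

This is lossless because every one of the 256 Latin-1 byte values maps to a Unicode code point, so decoding as Latin-1 can never fail.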
Hi, the column was converted using DBCC FIX_TEXT as that error message requested. All my checks and tests are now working, with no further error messages, but unfortunately the Lisp function is still failing with the same error. Does your function somehow depend on information about the encoding read from the source database?
CL-POSTGRES-TRIVIAL-UTF-8:READ-UTF-8-STRING, as indicated by its name, is predicated on the bytes it is reading being encoded as UTF-8. The error says that it received bytes that are invalid UTF-8 text characters. So, the possibilities are:
My current bet would be on (2). That means we need to go deeper in the backtrace to see what is calling CL-POSTGRES::ENC-READ-STRING. That is likely to be a function found in cl-postgres/interpret.lisp, possibly CL-POSTGRES::INTERPRET-AS-TEXT, CL-POSTGRES::SET-SQL-READER or something created by the macro CL-POSTGRES::BINARY-READER.
As I understand dbcc fix_text, it only applies to text values, so it would not have done anything to binary data. Assuming it did its job on all the text data, there might be some binary data stored that cl-postgres is misinterpreting, just throwing up its hands and assuming it is text, and then calling ENC-READ-STRING which calls READ-UTF-8-STRING. If we can figure out what that is, we can define a binary interpreter for that type and get you moving again. (You can see the currently defined interpreters in cl-postgres/interpret.lisp beginning on line 137. For example, if Sybase encoded floats in binary form, Postgresql might deal with it properly but cl-postgres currently does not have any interpreter for that.)
Does your backtrace show more levels than the 8 shown in your first comment? I am interested in seeing everything it provides.
I modified the lisp code in the pgloader trivial-utf-8 file to not fail with errors but to just print warnings, and built our own version of pgloader with this modification. Almost all rows with these "invalid byte" warnings are now accepted by PostgreSQL; just a very few of them are rejected because their content cannot be interpreted as a UTF-8 string. Pgloader with the "on error resume next" setting skips these problematic rows and reports them into special log files, and this seems to be a reasonable result for this use case. So I am closing this issue. Thank you.
That seems like the appropriate immediate fix, and something that should be optional in the error condition handling for trivial-utf-8. Is there anything common in the problematic rows?
In this use case all the warnings are about special German letters encoded in Latin-1. And the rows rejected by PostgreSQL show one of these 2 errors:
Database error 22021: invalid byte sequence for encoding "UTF8": 0xf6 0xb2 0xb5 0xae
Database error 22021: invalid byte sequence for encoding "UTF8": 0xf6 0xb2 0xa4 0xa5
But only a handful of rows out of millions were rejected, so the result is acceptable.
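For what it's worth, those rejected sequences are structurally invalid UTF-8: lead bytes above 0xF4 are disallowed because they would encode code points beyond U+10FFFF. And decoded as Latin-1 they give 'ö²µ®', which looks more like binary data than German text, consistent with the binary-data theory above:

```python
bad = bytes([0xF6, 0xB2, 0xB5, 0xAE])

# 0xF6 is not a legal UTF-8 lead byte (valid 4-byte sequences start
# with 0xF0..0xF4 only), so PostgreSQL rejects the value outright.
try:
    bad.decode("utf-8")
except UnicodeDecodeError as e:
    print(e.reason)  # invalid start byte

# The Latin-1 reading is not meaningful German text either:
print(bad.decode("latin-1"))  # ö²µ®
```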
Thank you. I will think about the best fall-back option when the error gets triggered.
Hi good people, maybe you can help me.
I am using pgloader to copy data from foreign tables (source is Sybase ASE 16) created by the tds_fdw extension into target PostgreSQL local tables. In PostgreSQL everything works as expected. I can select data from an FDW table, and I can do
INSERT INTO target_table SELECT * FROM fdw_table
, and everything works. But when I start pgloader to copy data from all tables, it fails on some tables with an error message - "A thread failed with error: Invalid byte at start of character: 0xFC" - or other 0x.. codes. The error is reported in CL-POSTGRES-TRIVIAL-UTF-8, which seems to be maintained by you.
The underlying database should be in UTF-8 according to the content of the Sybase sys tables. But it looks like some data could be in UTF-16.
I tried to configure FreeTDS to convert UTF-16 to UTF-8, but it is unclear whether this conversion really works in FreeTDS protocol version 5.0, which is necessary for Sybase. The documentation mentions it only for version 7+.
It is really confusing for me that on the PostgreSQL level everything works, but this Common Lisp library is throwing an error. This problem is now a serious blocker for migrations. Could you give me at least some advice?
I already tried many different recipes from web, none of them worked.
Thank you very much.
During additional tests I see the code failing with error messages both on line 108 - https://github.com/marijnh/Postmodern/blob/master/cl-postgres/trivial-utf-8.lisp#L108 - and on line 135 - https://github.com/marijnh/Postmodern/blob/master/cl-postgres/trivial-utf-8.lisp#L135
I wrote a short Python script which reads the data from those tables and checks whether it can be interpreted as a UTF-8 string - everything works. I also repeatedly checked the data in PostgreSQL; the strings look absolutely correct, and tds_fdw and PostgreSQL can handle them without any problems.
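The validity check from that script boils down to a few lines; this is my reconstruction of the idea, not the original script:

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if raw decodes cleanly as strict UTF-8."""
    try:
        raw.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False

print(is_valid_utf8("äöü".encode("utf-8")))   # True
print(is_valid_utf8("äöü".encode("latin-1")))  # False
```

Note that a check like this only proves the bytes pyodbc hands back are valid UTF-8; if the driver is already transcoding on the way out, it says nothing about what is stored on the wire between Sybase and pgloader.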