sympa-community / sympa

Sympa, Mailing List Management Software
https://www.sympa.community/sympa
GNU General Public License v2.0

sympasoap oddity with utf-8 input #1862

Closed: dpc22 closed this issue 2 weeks ago

dpc22 commented 2 weeks ago

Version

6.2.72

Installation method

My own RPM, derived from the "official" RHEL 9 RPM.

Expected behavior

If someone calls the SOAP "add" method with a GeCOS value which contains non-ASCII characters, the data should be processed as UTF-8.

Actual behavior

The PostgreSQL database back end throws an exception:

Jul 8 09:19:27 lists-2 sympasoap[298198]: err main::#85 > Sympa::WWW::SOAP::Transport::handle#118 > SOAP::Transport::HTTP::CGI::handle#627 > SOAP::Transport::HTTP::Server::handle#459 > SOAP::Server::handle#2844 > (eval)#2878 > (eval)#2893 > Sympa::WWW::SOAP::add#812 > Sympa::Spindle::spin#95 > Sympa::Request::Handler::add::_twist#80 > Sympa::List::add_list_member#3291 > Sympa::DatabaseDriver::PostgreSQL::do_prepared_query#112 > Sympa::Database::do_prepared_query#383 Unable to execute SQL statement "INSERT INTO subscriber_table (subscribed_subscriber, reception_subscriber, update_epoch_subscriber, number_messages_subscriber, date_epoch_subscriber, visibility_subscriber, user_subscriber, comment_subscriber, list_subscriber, robot_subscriber) SELECT ?, ?, ?, ?, ?, ?, ?, ?, ?, ? FROM dual WHERE NOT EXISTS ( SELECT 1 FROM subscriber_table WHERE user_subscriber = ? AND list_subscriber = ? AND robot_subscriber = ? )": (22021) ERROR: invalid byte sequence for encoding "UTF8": 0xa3

"0xa3" is the single byte ISO-8859-1 character "£".

This is correctly encoded using the 2 byte UTF-8 sequence: "0xc2 0xa3" in my SOAP client.

Something has transcoded the UTF-8 input to ISO-8859-1, but the database back end is expecting UTF-8.
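The byte values quoted above can be checked with a short Python sketch. It only illustrates the two encodings of "£" and why a lone 0xa3 byte is rejected as UTF-8; it makes no claim about what Sympa or SOAP::Lite do internally.

```python
# Illustration only: how the pound sign (U+00A3) is represented in each encoding.
pound = "\u00a3"

utf8_bytes = pound.encode("utf-8")         # two bytes: 0xc2 0xa3
latin1_bytes = pound.encode("iso-8859-1")  # one byte: 0xa3

print(utf8_bytes.hex(" "))    # c2 a3
print(latin1_bytes.hex(" "))  # a3

# A lone 0xa3 byte is not valid UTF-8, which matches the PostgreSQL complaint.
try:
    latin1_bytes.decode("utf-8")
    lone_byte_is_valid_utf8 = True
except UnicodeDecodeError:
    lone_byte_is_valid_utf8 = False

print(lone_byte_is_valid_utf8)  # False
```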

Steps to reproduce

SOAP client script (written in Python) available on request.

Additional information

I have an unpleasant feeling that this is in some way related to:

https://github.com/sympa-community/sympa/issues/1407

"This behavior seems due to bug (or buggy behavior) of SOAP::Lite".

(We are using the version of SOAP-Lite that ships with RHEL 9: perl-SOAP-Lite-1.27-8.el9.noarch.)

If I add "Encode::_utf8_off($gecos);" to the add method in lib/Sympa/WWW/SOAP.pm:

sub add {
    my $class    = shift;
    my $listname = shift;
    my $email    = shift;
    my $gecos    = shift;
    my $quiet    = shift;

    Encode::_utf8_off($gecos);

Then things work the way I would expect. However, it isn't clear to me whether this is a safe or sensible thing to do.

ikedas commented 2 weeks ago

Hi @dpc22,

> This is correctly encoded using the 2 byte UTF-8 sequence: "0xc2 0xa3" in my SOAP client.

Please provide a sample of the input data, the client script you used, and an explanation of how you made sure the client encoded it correctly.

dpc22 commented 2 weeks ago

I attach my example Python script, which fails (.txt extension required by GitHub):

sync.py.txt

The equivalent Perl script seems to work:

sync.pl.txt

The only obvious difference is:

$soap->default_ns('urn:sympasoap');

I can't find a direct equivalent of "$soap->default_ns()" in the Zeep library that I am using in Python.

There is: "zeep.set_ns_prefix()", but that takes two arguments.

     |  set_ns_prefix(self, prefix, namespace)
     |      Set a shortcut for the given namespace.

The following didn't help:

zeep.set_ns_prefix(None, 'urn:sympasoap');

I'm afraid I don't know what SOAP namespaces do, so I'm rather blundering around in the dark.

dpc22 commented 2 weeks ago

I'm pretty sure that my Python code was originally derived from: https://pypi.org/project/sympasoap/.

That doesn't seem to do anything with namespaces either.

(Edited to add)

It also has a normalize method which just discards any non-ASCII characters in the GeCOS field before invoking the SOAP add method. Presumably the author ran into the same issue, but didn't come up with a more sensible fix.
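For illustration, an ASCII-only normalisation of that kind might look like the sketch below. This is not the code from the sympasoap package: the helper name strip_non_ascii is hypothetical, and it is shown only to make clear that this approach avoids the error by silently losing data.

```python
def strip_non_ascii(text: str) -> str:
    """Drop every character outside the ASCII range (hypothetical helper)."""
    # errors="ignore" skips anything that cannot be encoded as ASCII.
    return text.encode("ascii", errors="ignore").decode("ascii")


normalized = strip_non_ascii("Test \u00a3")
print(repr(normalized))  # 'Test ' (the pound sign is gone)
```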

dpc22 commented 2 weeks ago

https://docs.python-zeep.org/en/master/transport.html#debugging tells me how to dump the raw XML which is sent to the sympasoap server.
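A minimal sketch of that kind of logging setup, along the lines described on that page, is shown here. It uses only the standard logging module; the logger name zeep.transports is taken from the debug output, while the handler and formatter names are arbitrary choices made for this example.

```python
import logging.config

# Route DEBUG messages from the "zeep.transports" logger to the console,
# so that each HTTP request and response body is printed.
logging.config.dictConfig({
    "version": 1,
    "formatters": {
        "verbose": {"format": "%(name)s: %(message)s"},
    },
    "handlers": {
        "console": {
            "level": "DEBUG",
            "class": "logging.StreamHandler",
            "formatter": "verbose",
        },
    },
    "loggers": {
        "zeep.transports": {
            "level": "DEBUG",
            "propagate": True,
            "handlers": ["console"],
        },
    },
})
```

This configuration needs to run before the SOAP client makes its first call for the request to be logged.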

The raw HTTP POST request was:

zeep.transports: HTTP Post to https://test.lists.cam.ac.uk/sympasoap:
<?xml version='1.0' encoding='utf-8'?>
<soap-env:Envelope xmlns:soap-env="http://schemas.xmlsoap.org/soap/envelope/"><soap-env:Body><ns0:add xmlns:ns0="urn:sympasoap"><list>test-dpc22</list><email>dpc99@cam.ac.uk</email><gecos>Test £</gecos><quiet>true</quiet></ns0:add></soap-env:Body></soap-env:Envelope>

We have <?xml ... encoding='utf-8'?>.

The <gecos> field appears to be correctly encoded as UTF-8: if I send the output to a file and use "od -c", I see the two byte sequence: "0xc2 0xa3" sent by the SOAP client.

0000440   l   >   <   g   e   c   o   s   >   T   e   s   t     302 243
0000460   <   /   g   e   c   o   s   >   <   q   u   i   e   t   >   t

>>> hex(0o302)
'0xc2'
>>> hex(0o243)
'0xa3'
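The same check can be done in one step with a small Python sketch, assuming the request body is exactly the XML shown in the debug output above. It confirms both that the raw bytes contain 0xc2 0xa3 and that an XML parser decodes the gecos element back to "Test £".

```python
import xml.etree.ElementTree as ET

# The request body from the debug output above, encoded as the client sent it.
raw_request = (
    "<?xml version='1.0' encoding='utf-8'?>\n"
    '<soap-env:Envelope xmlns:soap-env="http://schemas.xmlsoap.org/soap/envelope/">'
    '<soap-env:Body><ns0:add xmlns:ns0="urn:sympasoap">'
    "<list>test-dpc22</list><email>dpc99@cam.ac.uk</email>"
    "<gecos>Test \u00a3</gecos><quiet>true</quiet>"
    "</ns0:add></soap-env:Body></soap-env:Envelope>"
).encode("utf-8")

# The two octal values reported by "od -c" are the UTF-8 bytes for the pound sign.
od_bytes = bytes([0o302, 0o243])
print(od_bytes == b"\xc2\xa3")   # True
print(od_bytes in raw_request)   # True

# Parsing the bytes as XML recovers the original text of the gecos element.
root = ET.fromstring(raw_request)
gecos = root.find(".//gecos").text
print(gecos == "Test \u00a3")    # True
```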

I have a dedicated test server, so I can add debugging at the server end if that would be useful. The normal Sympa verbose logging didn't tell me anything.

ikedas commented 2 weeks ago

@dpc22, could you please apply #1592 and check whether that solves the problem?

dpc22 commented 2 weeks ago

Thank you.

That seems to have fixed the problem on my test server.

I did need to add a patch for src/lib/Makefile.in in order to backport your fix from the Git repository to the 6.2.72 release tarball, given:

rename from src/lib/Sympa/WWW/SOAP/Transport.pm
rename to src/lib/Sympa/WWW/SOAP/FastCGI.pm

I will apply the fix to the live system either tomorrow morning or Monday morning.

ikedas commented 2 weeks ago

Duplicate of #1541.

dpc22 commented 2 weeks ago

Okay, that seems to have worked on the live system as well. Thanks for your help here!