nginx-shib / nginx-http-shibboleth

Shibboleth auth request module for nginx
https://github.com/nginx-shib/nginx-http-shibboleth/wiki

charset setup in nginx #29

Closed druppy closed 6 years ago

druppy commented 6 years ago

Description of Issue/Question

We have a nice setup that works really well, except that some claims contain UTF-8 characters that get converted incorrectly.

We use a Python 3 WSGI script with uWSGI as the handler. When parameters are passed from nginx to the uWSGI server, Python (the uWSGI plugin) treats the environ values as latin1 strings, but they are really UTF-8, as that is what Shibboleth provides.

So what is the proper charset setup if we need to use UTF-8?

Setup

We use uWSGI for our WSGI script, running Python 3.

The ini file for uWSGI is quite basic. It only contains the following environment setup, to make sure it understands UTF-8 as much as possible.

[uwsgi]
plugins = python3
...
env = LC_ALL=en_US.UTF-8
env = LANG=en_US.UTF-8
env = PYTHONIOENCODING=UTF-8

Steps to Reproduce Issue

Add claims that hold UTF-8 characters, and see how they end up in the WSGI environ.

This is not easy to reproduce without a full setup of WSGI -> uWSGI -> nginx -> Shibboleth -> IdP.

Versions and Systems

Linux 3.16.0-4-amd64 #1 SMP Debian 3.16.36-1+deb8u1 (2016-09-03) x86_64 GNU/Linux
shibboleth 2.5.3
nginx version: nginx/1.6.2

Yes, yes, it's old stable :-)

davidjb commented 6 years ago

Can you possibly pinpoint where the encoding issue occurs? For instance, echo'ing out the value of each of the Shibboleth attributes at various points in the stack can help identify what's causing your situation. As you say, it's a complex stack and hard to reproduce, but stepping up or down through the stack (eg a return statement in Nginx, print() statements in uWSGI/WSGI, or debug logs in Shibboleth) may help reveal where the issue is.

I haven't had any specific issues related to this, so I can only offer general suggestions unless you've got an isolated test I can run to debug.

druppy commented 6 years ago

I am trying to do exactly that, right now, thanks :-)

I will dive further into the uWSGI code to see whether this is in fact the problem (params being forced to latin1), but I guess it all depends on the CGI handler, and not on your plugin or nginx. I was just asking to see if anyone has any insight that I am lacking :-)

Btw. thanks for a nice nginx plugin, you have liberated us from Apache2 :-)

davidjb commented 6 years ago

My hunch is that it's your backend application, as you've suggested. As a test, you could try manually passing a string containing Unicode from Nginx to your backend (eg uwsgi_param FAKE '🤔🐍';) to pass a known value and see what it turns out like in your uWSGI/Python environment. In my case, I only have a FastCGI backend at this moment, but doing the equivalent with fastcgi_param yields the following in my backend:

b'\xf0\x9f\xa4\x94\xf0\x9f\x90\x8d'

When you .decode('utf8') that byte sequence, you end up with the correct emoji again. Nginx's uWSGI handler might behave differently, but it'd be some more debugging info for you.
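As a small illustration of that check, the sketch below decodes the exact byte sequence shown above in two ways. It only assumes the backend hands you the raw bytes; the variable names are made up for the example.

```python
# Raw bytes as received by the FastCGI backend in the comment above.
raw = b'\xf0\x9f\xa4\x94\xf0\x9f\x90\x8d'

# Decoding as UTF-8 recovers the two emoji that were sent from Nginx.
decoded = raw.decode('utf8')
print(decoded)

# Decoding the same bytes as latin1 never raises, but it produces one
# character per byte, which is the kind of garbling described in this issue.
mojibake = raw.decode('latin1')
print(len(decoded), len(mojibake))
```

If the value seen in the backend has one character per byte like `mojibake` here, the bytes themselves arrived intact and only the decoding step is wrong.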

Definitely also try echo'ing out (eg via https://github.com/openresty/echo-nginx-module) the Shibboleth variables directly from nginx after you've set them with shib_request_set, as that may yield some more helpful data.

druppy commented 6 years ago

Thanks for the input. I have now found a solution, and I firmly believe that this is a uWSGI issue; the solution here is Python 3 related.

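import codecs

def force_utf8(v):
    # Undo the latin1 decoding: escape the string to ASCII bytes,
    # unescape those to the raw bytes, then decode the bytes as UTF-8.
    if isinstance(v, str):
        es = v.encode('unicode_escape')
        return codecs.escape_decode(es)[0].decode('utf-8')
    return v

def application(environ, start_response):
    print(force_utf8(environ['ENCODE_TEST']))

    ...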

It seems to me that uWSGI builds ASCII or latin1 strings from the raw UTF-8 bytes given by nginx, so the encoding of the resulting string is wrong. In my fix I force the string back to a raw escaped byte string (I could not find another way), then decode the escaping (wasteful, I know), and then decode the result as UTF-8, and now it works!

This is not relevant for Shibboleth nor your plugin, but maybe some unhappy developer will one day find this issue when in distress over UTF-8 claims with uWSGI :-)
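The behaviour described here can be reproduced without the full stack. The sketch below is only a simulation under stated assumptions: the claim value is invented, and the latin1 decoding step stands in for what the server does before calling the application (PEP 3333 specifies that WSGI environ strings are bytes decoded as ISO-8859-1). Note that codecs.escape_decode is an undocumented CPython function.

```python
import codecs

# Hypothetical claim value containing non-ASCII characters.
claim = 'S\u00f8ren \u00c6r\u00f8'

# The bytes that travel from nginx to the backend.
wire = claim.encode('utf-8')

# Stand-in for the server side: the bytes are decoded as latin1
# before they are placed in the WSGI environ.
environ_value = wire.decode('latin-1')
print(environ_value == claim)  # the string no longer matches the claim

# The escape round trip from the fix above: escape to ASCII bytes,
# unescape back to the raw bytes, then decode those as UTF-8.
escaped = environ_value.encode('unicode_escape')
recovered = codecs.escape_decode(escaped)[0].decode('utf-8')
print(recovered == claim)
```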

ProbstDJakob commented 3 years ago

Since HTTP headers are historically encoded as ISO-8859-1 (also known as latin1), most web servers decode them this way. Luckily, decoding as ISO-8859-1 never fails, whatever charset the bytes were actually written in, because ISO-8859-1 maps every byte value from 0 to 255 to a character. To decode the headers as UTF-8 in Python, the following snippet should do the trick.

shib_header: str = ...
shib_header.encode("ISO-8859-1").decode("UTF-8")

This might be included in the docs.
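For anyone who wants to apply the snippet above to arbitrary environ values, here is a minimal sketch of a defensive wrapper. The function name and the fallback behaviour are assumptions made for this example, not part of the module or of uWSGI: values that cannot be re-decoded are returned unchanged instead of raising.

```python
def decode_environ_value(value, encoding='utf-8'):
    """Re-decode a WSGI environ string that the server decoded as ISO-8859-1.

    Hypothetical helper: non-string values, and strings that cannot be
    round-tripped, are returned unchanged.
    """
    if not isinstance(value, str):
        return value
    try:
        # Recover the original bytes, then decode them with the real charset.
        return value.encode('iso-8859-1').decode(encoding)
    except (UnicodeEncodeError, UnicodeDecodeError):
        # UnicodeEncodeError: the string holds characters above U+00FF,
        # so it was not produced by an ISO-8859-1 decode.
        # UnicodeDecodeError: the bytes are not valid in the target charset,
        # so the value was probably genuine ISO-8859-1 text.
        return value


# Example: a UTF-8 value as it would appear after ISO-8859-1 decoding.
garbled = 'S\u00f8ren'.encode('utf-8').decode('iso-8859-1')
print(decode_environ_value(garbled))
```

The fallback is a design choice for the sketch: it keeps plain ASCII and genuine latin1 values usable, at the cost of silently passing through anything that is not valid UTF-8.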