Documentation request: Unicode conversions page

jbarlow83 commented 7 years ago

I think it would be helpful to have a new section under Type conversions that describes how pybind11 deals with Unicode conversions in Python 2.7 and 3. (I can't find this documented anywhere.)

jagerman commented 7 years ago

The quick version is that pybind11 loads and casts between std::string and python strings assuming UTF-8 (calling Python core functions to do the interpretation/conversion), and assumes UTF-16 or UTF-32 when using std::wstring (the former if wchar_t is 2 bytes, the latter if 4 bytes).
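As a rough illustration of what those encodings look like in practice, here is a pure-Python sketch (no pybind11 involved). The sample string and variable names are made up for the example, and `ctypes.c_wchar` is only used as a stand-in for checking the platform's wchar_t width.

```python
import ctypes
import sys

text = "naïve €"  # 7 code points, two of them non-ASCII

# Bytes a std::string would hold under the UTF-8 assumption.
utf8 = text.encode("utf-8")

# wchar_t is 2 bytes on some platforms (e.g. Windows) and 4 on others.
wchar_size = ctypes.sizeof(ctypes.c_wchar)
byte_order = "le" if sys.byteorder == "little" else "be"
wide_codec = ("utf-16-" if wchar_size == 2 else "utf-32-") + byte_order

# Bytes a std::wstring would hold under the UTF-16/UTF-32 assumption.
wide = text.encode(wide_codec)

print(len(utf8), wchar_size, wide_codec, len(wide))
```

Every character in the sample is in the Basic Multilingual Plane, so the wide form is exactly one wchar_t per code point on either kind of platform.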

jbarlow83 commented 7 years ago

Here are some more specific questions whose answers I think should be documented:

1. If a bytes (Py2 str) is passed to a C++ function accepting std::string, is it implicitly converted to UTF-8 or left alone?
2. Is it possible to return std::string and have Python receive it as bytes (Py2 str)?
3. In Python 2, will a returned std::string be converted to an unencoded str, a UTF-8 encoded str, or unicode?
4. What happens in Py2 and 3 if a std::string cannot be implicitly converted to/from UTF-8?
5. Is there any way to disable UTF-8 conversion and treat all std::string as bytes (str)?

jagerman commented 7 years ago

I agree that it would be good to have this documented. Based on my reading of the code (the template <> class type_caster<std::string> { in include/pybind11/cast.h), and Python C API documentation, I believe the answers (and remaining questions!) are:

1. When going from Python to C++ std::string (i.e. type_caster<std::string>::load()), if the passed object is a unicode object (or subclass), the created std::string will be the UTF-8 encoding of the unicode string. If you give it bytes, it'll be left alone.
2. When casting a std::string into Python (i.e. returning a std::string), we call PyUnicode_FromStringAndSize on it unconditionally: we have no way to know whether the string came in as bytes or unicode. The documentation for the Python function just says that it interprets it as UTF-8, but it doesn't say what happens if it is passed invalid UTF-8 data.
3. unicode.
4. Good question. The Python C API documentation is remarkably lacking in description of error handling (the Python API is better).
5. Not directly, but you can interact with bytes (or str in Python 2) via the py::bytes class, e.g. by returning a py::bytes(s) where s is a std::string.
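To make the rules above concrete, here is a small pure-Python model of them using Python 3 types. It is not pybind11 code: the helper names (`load_std_string`, `cast_std_string`, `cast_as_bytes`) are hypothetical and only mimic the behavior as described in this thread.

```python
def load_std_string(obj):
    """Model of Python -> std::string: return the bytes C++ would see."""
    if isinstance(obj, str):
        return obj.encode("utf-8")  # unicode input is encoded as UTF-8
    if isinstance(obj, bytes):
        return obj                  # bytes input is left alone
    raise TypeError("expected str or bytes")


def cast_std_string(data):
    """Model of std::string -> Python: always interpreted as UTF-8."""
    return data.decode("utf-8")


def cast_as_bytes(data):
    """Model of returning py::bytes(s): no decoding at all."""
    return bytes(data)


print(load_std_string("é"))           # UTF-8 bytes of the text
print(load_std_string(b"\xff\x00"))   # passed through unchanged
print(cast_std_string(b"\xc3\xa9"))   # decoded back to text
print(cast_as_bytes(b"\xff\x00"))     # stays binary, NUL included
```

The asymmetry is visible here: arbitrary bytes can go in through `load_std_string`, but they only come back out intact through the explicit bytes path.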

jbarlow83 commented 7 years ago

By code inspection it looks like Python will raise a UnicodeDecodeError if PyUnicode_FromStringAndSize fails.

However the current behavior from pybind11 2.0.1 (arguably a bug) is to return this kind of error:

    TypeError: Unable to convert function return value to a Python type! The signature was
        () -> str

It's possibly a bug because it suppresses information that could be used to solve the problem.

My test function was:

    m.def("bad_utf8",
        []() -> std::string {
            return std::string("\xd0\xd0\xd0"); // not utf-8
        }
    );
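For comparison, this is what plain Python reports when asked to decode the same three bytes as UTF-8. It is only a pure-Python analogue of the failure (no pybind11 involved), but it shows the kind of detail that the TypeError hides.

```python
bad = b"\xd0\xd0\xd0"  # same bytes as the test function returns
err = None

try:
    bad.decode("utf-8")
except UnicodeDecodeError as exc:
    err = exc  # keep a reference; `exc` is cleared after the except block
    print(exc)
    print("offset:", exc.start, "data:", exc.object)
```

The exception carries the offending offset (`start`) and the original data (`object`), which is usually enough to track down where the bad string came from.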

jbarlow83 commented 7 years ago

It would also be useful to document what pybind11 does with single character literals and wchar_t in each direction.

jagerman commented 7 years ago

I'm not sure if it should just report a better error, or actually return a bytes in that case. (The latter would make round-tripping of bytes data work, as long as the data didn't happen to be a valid UTF-8 sequence with high-bit bytes).

jbarlow83 commented 7 years ago

My thinking is that there should be a 1:1 correspondence between std::string and Python3 str. It is already true that any str can be represented as a utf-8 encoded std::string. The wrapper code then has the burden of ensuring that any strings generated in C++ are normalized to utf-8 before being returned to Python. (Another thing to explain in documentation.)

From Python, you almost never want a function that sometimes returns str and sometimes bytes. That breaks too many simple things that ought to be simple and reliable:

    print("I talked to C++ and it said: " + wrapped_cpp_sometimes_returns_bytes())
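A quick Python 3 sketch of why that line is fragile. The function below is a hypothetical stand-in for a wrapped C++ function that falls back to bytes when its data is not valid UTF-8; it is not something pybind11 provides.

```python
def sometimes_returns_bytes(valid):
    # Hypothetical stand-in for a wrapped C++ function.
    return "ok" if valid else b"\xd0\xd0\xd0"


greeting = "I talked to C++ and it said: "
message = None

# Works when the function happens to return str...
print(greeting + sometimes_returns_bytes(True))

# ...and fails when it returns bytes, since str + bytes is a TypeError.
try:
    greeting + sometimes_returns_bytes(False)
except TypeError as exc:
    message = str(exc)
    print("TypeError:", message)
```

The caller would have to type-check every return value to be safe, which is the burden a consistent return type avoids.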

Ideally the error would be the underlying UnicodeDecodeError rather than that TypeError, because the former gives the byte offset and offending character sequence.

Round-tripping bytes (possibly containing NULs) could be done with passing and returning py::bytes as you mentioned, and it's conveniently explicit.

jagerman commented 7 years ago

PR #624 addresses the error being propagated back to Python.

I didn't address the documentation (except to add u16/u32 types to the table).

jbarlow83 commented 7 years ago

Well thanks for this, I think the picture is a lot clearer.

I do think pybind11 core devs may want to evaluate whether implicit bytes -> std::string conversion should be allowed, since it is not symmetric with the return path: a returned std::string is automatically converted to str, and getting bytes back requires the py::bytes workaround.

anntzer commented 6 years ago

I agree that it would be nice at least to be able to mark a function as disallowing an implicit bytes ->(utf8)-> std::string. (Here "bytes" and "str" have their Py3 meanings.) An example case would be pathnames: if Python passes in a str, we want to encode it using the filesystem encoding (not necessarily utf-8); if Python passes in a bytes, we should assume os.fsencode() has already been called on it and just pass it along unchanged. If pybind11 always does the cast, I believe we can't distinguish between the two cases (other than taking a py::object as argument and typechecking ourselves).

jbarlow83 commented 6 years ago

You can mark a function as such by accepting py::bytes as the argument. Then you can implement any conversion in the lambda before dispatching to the C++ codebase.

I was thinking the best thing for pathnames would be a special py::pathname type that takes care of all the cases in a version independent way. (Also handling os.PathLike and pathlib paths.)

anntzer commented 6 years ago

Ah, great, thanks. On recent pythons it's just a matter of calling fsencode so it's not too much effort to handle this. If pybind11 is going to have special support for this (not saying it has to, but possibly nice), perhaps it's better to provide casters between pathlikes and std::filesystem instead of inventing its own class...
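For reference, a minimal sketch of the fsencode approach on the Python side (Python 3.6+). The wrapper name `to_fs_bytes` is made up for the example; the idea is that the result could then be handed to a binding that accepts py::bytes.

```python
import os
import pathlib


def to_fs_bytes(path):
    """Hypothetical helper: normalize str, bytes or os.PathLike to bytes."""
    # os.fsencode encodes str with the filesystem encoding, returns bytes
    # unchanged, and calls __fspath__ on path-like objects first.
    return os.fsencode(path)


print(to_fs_bytes("data.bin"))
print(to_fs_bytes(b"data.bin"))
print(to_fs_bytes(pathlib.PurePosixPath("dir/data.bin")))
```

All three input kinds end up as bytes, so the C++ side only has to deal with one representation.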

jbarlow83 commented 6 years ago

I suppose std::filesystem would be better but it requires C++17. Maybe there isn't an elegant solution yet.

Documentation request: Unicode conversions page #591