pybind / pybind11

Seamless operability between C++11 and Python
https://pybind11.readthedocs.io/

Documentation request: Unicode conversions page #591

Closed jbarlow83 closed 7 years ago

jbarlow83 commented 7 years ago

I think it would be helpful to have a new section under Type conversions that describes how pybind11 deals with Unicode conversions in Python 2.7 and 3. (I can't find this documented anywhere.)

jagerman commented 7 years ago

The quick version is that pybind11 loads and casts between std::string and Python strings assuming UTF-8 (calling Python core functions to do the interpretation/conversion), and assumes UTF-16 or UTF-32 when using std::wstring (UTF-16 if wchar_t is 2 bytes, UTF-32 if it is 4 bytes).
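
As a rough illustration of what those three encodings mean for buffer sizes, here is a pure-Python sketch. It uses only Python's built-in codecs and does not call pybind11; the sample string and the little-endian byte order are assumptions chosen for the example.

```python
# Sample text (made up for this sketch): 'é' is outside ASCII,
# and U+1F600 is outside the Basic Multilingual Plane.
text = "h\u00e9llo \U0001F600"

utf8 = text.encode("utf-8")       # the std::string case
utf16 = text.encode("utf-16-le")  # the 2-byte wchar_t case (little-endian assumed, no BOM)
utf32 = text.encode("utf-32-le")  # the 4-byte wchar_t case (little-endian assumed, no BOM)

print(len(text), "code points")
print(len(utf8), "UTF-8 bytes")
print(len(utf16) // 2, "UTF-16 code units")
print(len(utf32) // 4, "UTF-32 code units")

# All three encodings round-trip back to the same Python str.
assert utf8.decode("utf-8") == text
assert utf16.decode("utf-16-le") == text
assert utf32.decode("utf-32-le") == text
```

The emoji takes four bytes in UTF-8 and a surrogate pair (two code units) in UTF-16, so the element count of the C++ string generally differs from len() on the Python side.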

jbarlow83 commented 7 years ago

Here are some more specific questions whose answers I think should be documented:

jagerman commented 7 years ago

I agree that it would be good to have this documented. Based on my reading of the code (the type_caster<std::string> specialization in include/pybind11/cast.h) and the Python C API documentation, I believe the answers (and remaining questions!) are:

jbarlow83 commented 7 years ago

By code inspection it looks like Python will raise a UnicodeDecodeError if PyUnicode_FromStringAndSize fails.

However, the current behavior of pybind11 2.0.1 (arguably a bug) is to raise this kind of error instead:

    TypeError: Unable to convert function return value to a Python type! The signature was
        () -> str

It's possibly a bug because it suppresses information that could be used to solve the problem.

My test function was:

    m.def("bad_utf8",
        []() -> std::string {
            return std::string("\xd0\xd0\xd0"); // not utf-8
        }
    );

jbarlow83 commented 7 years ago

It would also be useful to document what pybind11 does with single character literals and wchar_t in each direction.

jagerman commented 7 years ago

I'm not sure if it should just report a better error, or actually return a bytes in that case. (The latter would make round-tripping of bytes data work, as long as the data didn't happen to be a valid UTF-8 sequence with high-bit bytes).
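
To make the caveat about the bytes fallback concrete, here is a pure-Python sketch of that policy. The function name cast_with_bytes_fallback is hypothetical and this is not pybind11 code; it only mimics "try UTF-8, otherwise hand back the raw bytes".

```python
def cast_with_bytes_fallback(data: bytes):
    """Hypothetical policy: return str if data is valid UTF-8, else the raw bytes."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return data


invalid = b"\xd0\xd0\xd0"            # not valid UTF-8
valid_high = "\u00e9".encode("utf-8")  # b"\xc3\xa9": high-bit bytes, but valid UTF-8

for payload in (invalid, valid_high):
    result = cast_with_bytes_fallback(payload)
    print(repr(payload), "->", type(result).__name__)
```

The invalid payload comes back as bytes and so round-trips, but the second payload comes back as str even though the caller sent bytes, which is the ambiguity described above.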

jbarlow83 commented 7 years ago

My thinking is that there should be a 1:1 correspondence between std::string and Python3 str. It is already true that any str can be represented as a utf-8 encoded std::string. The wrapper code then has the burden of ensuring that any strings generated in C++ are normalized to utf-8 before being returned to Python. (Another thing to explain in documentation.)

From Python, you almost never want a function that sometimes returns str and sometimes bytes. That breaks too many simple things that ought to be simple and reliable:

    print("I talked to C++ and it said: " + wrapped_cpp_sometimes_returns_bytes())

Ideally the error would be the underlying UnicodeDecodeError rather than that TypeError, because the former gives the byte offset and offending character sequence.
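
For reference, this is the kind of detail a UnicodeDecodeError carries. The sketch below is plain Python decoding the same three bytes as the bad_utf8 test function; it does not go through pybind11, so it only shows what information would be available if the underlying error were propagated.

```python
bad = b"\xd0\xd0\xd0"  # same bytes as the bad_utf8 test function

try:
    bad.decode("utf-8")
except UnicodeDecodeError as exc:
    err = exc  # keep a reference; 'exc' itself is cleared after the except block
    print("encoding:", exc.encoding)
    print("offset:  ", exc.start)
    print("bytes:   ", exc.object[exc.start:exc.end])
    print("reason:  ", exc.reason)
else:
    raise AssertionError("expected the decode to fail")
```

The start, end, and object attributes are what let a caller locate the offending bytes, which the generic TypeError does not expose.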

Round-tripping bytes (possibly containing NULs) could be done with passing and returning py::bytes as you mentioned, and it's conveniently explicit.

jagerman commented 7 years ago

PR #624 addresses the error being propagated back to Python.

I didn't address the documentation (except to add u16/u32 types to the table).

jbarlow83 commented 7 years ago

Well thanks for this, I think the picture is a lot clearer.

I do think the pybind11 core devs may want to evaluate whether the implicit bytes -> std::string conversion should be allowed at all. It is not symmetric with the return direction: std::string is automatically converted to str, and returning bytes requires the explicit py::bytes workaround.

anntzer commented 6 years ago

I agree that it would be nice at least to be able to mark a function as disallowing the implicit bytes ->(utf8)-> std::string conversion. (Here "bytes" and "str" have their Py3 meanings.) An example case would be pathnames: if Python passes in a str, we want to encode it using the filesystem encoding (not necessarily utf-8); if Python passes in a bytes, we should assume os.fsencode() has already been called on it and just pass it through unchanged. If pybind11 always does the cast, I believe we can't distinguish between the two cases (other than taking a py::object as argument and typechecking ourselves).

jbarlow83 commented 6 years ago

You can mark a function as such by accepting py::bytes as the argument. Then you can implement any conversion in the lambda before dispatching to the C++ codebase.

I was thinking the best thing for pathnames would be a special py::pathname type that takes care of all the cases in a version independent way. (Also handling os.PathLike and pathlib paths.)

anntzer commented 6 years ago

Ah, great, thanks. On recent pythons it's just a matter of calling fsencode so it's not too much effort to handle this. If pybind11 is going to have special support for this (not saying it has to, but possibly nice), perhaps it's better to provide casters between pathlikes and std::filesystem instead of inventing its own class...
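
A minimal Python-side sketch of the pathname handling discussed above, assuming Python 3.6+ for os.PathLike support. The helper name to_fs_bytes is hypothetical; a wrapper could call something like this and then pass the result to a binding that accepts bytes.

```python
import os
import pathlib


def to_fs_bytes(path):
    """Hypothetical helper: normalize str, bytes, or os.PathLike to filesystem bytes."""
    if isinstance(path, bytes):
        # Assume the caller already encoded it; pass it through untouched.
        return path
    # os.fsencode handles str and os.PathLike using the filesystem encoding.
    return os.fsencode(path)


print(to_fs_bytes("data.txt"))
print(to_fs_bytes(b"\xff\xfe"))  # arbitrary bytes are not reinterpreted
print(to_fs_bytes(pathlib.PurePosixPath("dir/data.txt")))
```

os.fsencode already returns bytes input unchanged, so the explicit isinstance check is only there to make the two cases visible.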

jbarlow83 commented 6 years ago

I suppose std::filesystem would be better but it requires C++17. Maybe there isn't an elegant solution yet.