Closed jbarlow83 closed 7 years ago
The quick version is that pybind11 loads and casts between std::string and Python strings assuming UTF-8 (calling Python core functions to do the interpretation/conversion), and assumes UTF-16 or UTF-32 when using std::wstring (the former if wchar_t is 2 bytes, the latter if 4 bytes).
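To make those encodings concrete, here is a small pure-Python sketch. It is not pybind11 code, and the helper name encoded_sizes is invented for this example. It only shows how many bytes the same text occupies in each of the encodings mentioned above.

```python
def encoded_sizes(text):
    """Return the byte length of `text` in the three encodings discussed above."""
    return {
        "utf-8": len(text.encode("utf-8")),       # what a std::string would hold
        "utf-16": len(text.encode("utf-16-le")),  # std::wstring with 2-byte wchar_t
        "utf-32": len(text.encode("utf-32-le")),  # std::wstring with 4-byte wchar_t
    }


# "h\u00e9llo" has five code points; only the accented one needs 2 bytes in UTF-8.
sizes = encoded_sizes("h\u00e9llo")
```

The explicit "-le" codecs are used so that Python does not prepend a byte order mark, which would inflate the counts.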
Here are some more specific questions whose answers I think should be documented:
If bytes (Py2 str) is passed to a C++ function accepting std::string, is it implicitly converted to UTF-8 or left alone?
[...] std::string and have Python receive it as bytes (Py2 str)?
[...] std::string be converted to unencoded str, UTF-8 encoded str or unicode?
What happens if a std::string cannot be implicitly converted to/from UTF-8?
[...] std::string as bytes (str)?
I agree that it would be good to have this documented. Based on my reading of the code (the template <> class type_caster<std::string> { in include/pybind11/cast.h), and the Python C API documentation, I believe the answers (and remaining questions!) are:
When going from Python to C++ std::string (i.e. type_caster<std::string>::load()), if the passed object is a unicode object (or subclass), the created std::string will be the UTF-8 encoding of the unicode string. If you give it bytes, it'll be left alone.
When casting a std::string into Python (i.e. returning a std::string) we call PyUnicode_FromStringAndSize on it unconditionally: we have no way to know whether the string came in as bytes or unicode. The documentation for the Python function just says that it interprets it as UTF-8, but it doesn't say what happens if it is passed invalid UTF-8 data.
unicode
Good question. The Python C API documentation is remarkably lacking in description of error handling (the Python API is better).
Not directly, but you can interact with bytes (or str in Python 2) via the py::bytes class, e.g. by returning py::bytes(s) where s is a std::string.
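The answers above can be summarised with a rough pure-Python model, written in Python 3 terms (str is the unicode type). This is only a sketch of the behaviour described in this thread, not pybind11's implementation, and the names model_load and model_cast are hypothetical.

```python
def model_load(obj):
    """Model of the bytes a std::string parameter would end up holding."""
    if isinstance(obj, str):
        return obj.encode("utf-8")  # unicode text is encoded as UTF-8
    if isinstance(obj, bytes):
        return obj                  # bytes are passed through untouched
    raise TypeError("expected str or bytes")


def model_cast(raw):
    """Model of returning a std::string: always decoded as UTF-8."""
    return raw.decode("utf-8")


from_str = model_load("caf\u00e9")            # UTF-8 encoded text
from_bytes = model_load(b"\xff\xfe")          # arbitrary bytes, left alone
round_trip = model_cast(model_load(b"abc"))   # bytes in, str out
```

Note the asymmetry in the last line: bytes are accepted on the way in, but the value comes back as str, which is why an explicit py::bytes return is needed to get bytes back.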
By code inspection it looks like Python will raise a UnicodeDecodeError if PyUnicode_FromStringAndSize fails.
However the current behavior from pybind11 2.0.1 (arguably a bug) is to return this kind of error:
TypeError: Unable to convert function return value to a Python type! The signature was
() -> str
It's possibly a bug because it suppresses information that could be used to solve the problem.
My test function was:
m.def("bad_utf8",
    []() -> std::string {
        return std::string("\xd0\xd0\xd0"); // not valid UTF-8
    }
);
It would also be useful to document what pybind11 does with single character literals and wchar_t in each direction.
I'm not sure if it should just report a better error, or actually return a bytes in that case. (The latter would make round-tripping of bytes data work, as long as the data didn't happen to be a valid UTF-8 sequence with high-bit bytes).
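The caveat in the parenthesis can be seen in a short pure-Python sketch of that hypothetical fallback. This is not what pybind11 does; cast_with_bytes_fallback is a made-up name used only to illustrate the idea being floated here.

```python
def cast_with_bytes_fallback(raw):
    """Decode as UTF-8 when possible, otherwise hand the bytes back unchanged."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw


invalid = cast_with_bytes_fallback(b"\xd0\xd0\xd0")  # not UTF-8: stays bytes
ambiguous = cast_with_bytes_fallback(b"\xc3\xa9")    # binary data that happens to be valid UTF-8
```

The second call shows the problem: binary data that happens to form a valid UTF-8 sequence silently comes back as str, so the round trip is not reliable.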
My thinking is that there should be a 1:1 correspondence between std::string and Python3 str. It is already true that any str can be represented as a UTF-8 encoded std::string. The wrapper code then has the burden of ensuring that any strings generated in C++ are normalized to UTF-8 before being returned to Python. (Another thing to explain in documentation.)
From Python, you almost never want a function that sometimes returns str and sometimes bytes. That breaks too many simple things that ought to be simple and reliable:
print("I talked to C++ and it said: " + wrapped_cpp_sometimes_returns_bytes())
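A self-contained Python 3 version of that failure might look like the following. The function reply is a hypothetical stand-in for a wrapped C++ function that returns either str or bytes depending on its data.

```python
def reply(valid):
    """Stand-in for a wrapper that returns str for valid text, bytes otherwise."""
    return "hello" if valid else b"\xd0\xd0\xd0"


def report(valid):
    # The same concatenation as in the print() call above.
    return "I talked to C++ and it said: " + reply(valid)


ok_message = report(True)

try:
    report(False)
    concat_failed = False
except TypeError:
    # Python 3 refuses to concatenate str and bytes.
    concat_failed = True
```

The caller's code is identical in both cases; whether it works depends entirely on the data, which is what makes a mixed return type so fragile.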
Ideally the error would be the underlying UnicodeDecodeError rather than that TypeError, because the former gives the byte offset and offending character sequence.
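For reference, this is the information a UnicodeDecodeError carries when the decoding is done in plain Python. The sketch prefixes the bytes from the test function with some valid text so that the reported offset is non-zero.

```python
data = b"ok \xd0\xd0\xd0"  # three valid ASCII bytes, then invalid UTF-8

try:
    data.decode("utf-8")
    error = None
except UnicodeDecodeError as exc:
    error = exc

# The exception records where decoding failed and what it was decoding.
offset = error.start       # index of the first offending byte
offending = error.object   # the full input that failed to decode
codec = error.encoding
```

None of this is available from the generic TypeError, which only reports the function signature.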
Round-tripping bytes (possibly containing NULs) could be done with passing and returning py::bytes as you mentioned, and it's conveniently explicit.
PR #624 addresses the error being propagated back to Python.
I didn't address the documentation (except to add u16/u32 types to the table).
Well thanks for this, I think the picture is a lot clearer.
I do think pybind11 core devs may want to evaluate whether implicit bytes -> std::string conversion should be allowed, since it is not symmetric with the automatic std::string -> str conversion on return and the required workaround py::bytes -> bytes.
I agree that it would be nice at least to be able to mark a function as disallowing an implicit bytes ->(utf8)-> std::string conversion. (Here "bytes" and "str" have their Py3 meanings.) An example case would be pathnames: if Python passes in a str, we want to encode it using the filesystem encoding (not necessarily UTF-8); if Python passes in a bytes, we should assume os.fsencode() has already been called on it and just pass it accordingly. If pybind11 always does the cast, I believe we can't distinguish between the two cases (other than taking a py::object as argument and typechecking ourselves).
You can mark a function as such by accepting py::bytes as the argument. Then you can implement any conversion in the lambda before dispatching to the C++ codebase.
I was thinking the best thing for pathnames would be a special py::pathname type that takes care of all the cases in a version independent way. (Also handling os.PathLike and pathlib paths.)
Ah, great, thanks. On recent pythons it's just a matter of calling fsencode so it's not too much effort to handle this. If pybind11 is going to have special support for this (not saying it has to, but possibly nice), perhaps it's better to provide casters between pathlikes and std::filesystem instead of inventing its own class...
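On the Python side, the "just call fsencode" approach could be sketched as below, assuming Python 3.6 or later, where os.fsencode accepts str, bytes and os.PathLike objects. The wrapper name to_fs_bytes is hypothetical.

```python
import os
from pathlib import PurePosixPath


def to_fs_bytes(path):
    """Return the filesystem-encoded bytes for a str, bytes or path-like object."""
    # str is encoded with the filesystem encoding, bytes are returned unchanged,
    # and path-like objects are first converted via their __fspath__ method.
    return os.fsencode(path)


from_str = to_fs_bytes("data/file.txt")
from_bytes = to_fs_bytes(b"data/file.txt")
from_pathlike = to_fs_bytes(PurePosixPath("data/file.txt"))
```

A binding could then accept py::bytes on the C++ side and have a thin Python wrapper normalise its argument this way first.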
I suppose std::filesystem would be better but it requires C++17. Maybe there isn't an elegant solution yet.
I think it would be helpful to have a new section under Type conversions that describes how pybind11 deals with Unicode conversions in Python 2.7 and 3. (I can't find this documented anywhere.)