Open barreyra opened 5 years ago
Superficially this is pretty easy to get the data exposed via SWIG in both directions.
The tricky bit is keeping the semantics right - if you used a string_view
it's probably because you didn't want to make a copy. For input argument this is easy enough. For return values it's basically not possible to get something that feels like the target language's String class and has 0 copies in most languages.
So I think in the first instance I'd probably aim to support just two cases of function arguments - value and const reference because those are easy to make exactly right.
I agree the return value is tricky. The safest solution here is to make a copy of the value, unless we can be smart about the object's lifetime.
I've been experimenting with this a little, but not got much useful to push yet as I've not actually decided which approach to run with. We can solve this in essentially 3 ways:
std::string
typemaps. This turns out to fairly rapidly get quite messy though - I ended up making a private sub-class of std::string_view
that let the existing std::string
typemaps work and then deliberately using object slicing to get a real, plain std::string_view
. It's "ok" because we wanted that to happen, but it's rather more hacky than I'd hopedstd::string
typemaps, including per-language support. There are quite a lot of languages though and I only know 5 or 6 well enough to be confident of doing a decent job.std::string
code to make each language give a nicer, pure C++ abstraction around reading/creating strings. This feels pretty intrusive though.For the output case making a copy, but adding a warning is probably the most useful, least bad option.
My 2c: I'd try to refactor the common part of string
and string_view
typemaps in a macro that would be used to define them for both classes. Of course, there would also be an additional part for all the stuff in string
not present in string_view
, but it looks like the separation could be done purely mechanically and so have a decent chance for not breaking something.
As for the output direction, I'd just make a copy. A warning would be annoying and it's not like you could do anything about it anyhow (other than defining a function just for SWIG returning a string
instead of string_view
but what would we gain from forcing people to do this?).
Any updates or workarounds for this? I have a library I am trying to use swig with and it uses string_view quite a bit.
What language are you targeting? I've not touched this in a long time as it's not obvious what the right solution really is here - doing something locally that works for me is simple enough; however what semantics would you expect to see ? Zero copy? Or usable like a native string in your target language? (Those two are pretty much mutually exclusive).
We're not realistically going to achieve returning as a target language string without copying in general (probably some target languages allow wrapping existing data in a string without copying, but most don't). Returning as a SWIG-specific std::string_view
wrapper is going to be clumsy to use in most target languages.
I think it's best to think of the "zero copy" feature as something that's made possible by using std::string_view
rather than guaranteed by using it. Even in C++, returning a std::string_view
doesn't guarantee the result won't be copied (though in that case it's dependent on what the application code does), e.g.:
std::string_view get_view(int x);
std::string g() { return std::string(get_view(1)); }
There's "prior art" here too as existing SWIG out typemaps for const std::string&
all copy the returned value, but in C++ it's only copied if application code does something to cause that to happen.
I wonder if it's actually best overall to just add a new std_string_view.i
for each target language based on the current std_string.i
. It's certainly appealing to share code, but not getting this implemented because that seems hard in practice isn't a good outcome. This does require per-target-language work, and likely means this won't get implemented for them all at once, but that's already the case for many newer C++ library features (even some that were in C++98 - e.g. go doesn't support std::set
yet: #2253)
I've started to work on this and have opened #2540 with my progress so far, which is support for Lua and PHP.
My personal target is to get this working for all the languages Xapian has bindings for so we can start to use std::string_view
in the Xapian C++ API and still be able to generate bindings (that means also: python3 java ruby tcl8 csharp perl). I started with Lua and PHP as the others either use the UTL or are statically typed (which means there are extra typemaps to worry about).
If anyone wants to help out, that'd be great - both with the languages we need for Xapian (I don't know all those well) and for the other SWIG target languages.
I've implemented this for C#, Java, Lua and PHP now.
My 2c: I'd try to refactor the common part of string and string_view typemaps in a macro that would be used to define them for both classes. Of course, there would also be an additional part for all the stuff in string not present in string_view, but it looks like the separation could be done purely mechanically and so have a decent chance for not breaking something.
I've done four languages now (though SWIG's Java and C# support are structured very similar, so maybe it's effectively 3 and a bit), and based on that I can see some obstacles to trying to share a single set of typemaps for string and string_view:
std::string_view::assign()
method - you have to write s = std::string_view(ptr, len);
or s = std::string_view(ptr);
instead.std::string_view::c_str()
method (and std::string_view::data()
doesn't give you a nul-terminated C string, which it does for std::string
as of C++11) - for C# and Java we need a nul-terminated string (for Java there doesn't seem to be a built-in way to convert UTF-8 data with a length; for C# our SWIG_csharp_string_callback()
mechanism currently requires it) which I've solved via a temporary std::string
object, e.g. passing std::string($1).c_str()
instead of $1.c_str()
where a nul-terminated string is needed.std::string_view
requires the data constructed from to remain valid - in the C# and Java director typemaps this requires additional static
variables compared to std::string
.Agreed, copy,paste,tweak of the std::string typemaps is the best option here.
I've implemented this for C#, Java, Lua and PHP now.
Now merged.
I've been looking at the UTL languages, and specifically Tcl and Python (somewhat arbitrary choice).
I'm struggling to hook in to the existing framework. Tcl seems to work OK (I think because its native string representation is UTF-8 (mostly, they encode \0
as a two byte overlong sequence IIRC) whereas with Python it goes wrong, I think because the in
typemap converts the Python string to UTF-8, puts the result in a std::string_view
, and then releases the UTF-8 copy before passing the (now invalid) std::string_view
to the wrapped function.
Python 3.3 added PyUnicode_AsUTF8AndSize()
which supports keeping a UTF-8 version of the string in the object, so that would probably work better (and it seems we now require >= 3.3 for Python 3). I guess maybe I need to convert SWIG to use that (which would be more efficient for workloads which involve passing the same python string to SWIG-wrapped functions multiple times). Currently it's only used in one place, SWIG_Python_str_AsChar()
, though I see there we have a fallback for the !defined(Py_LIMITED_API) || Py_LIMITED_API+0 >= 0x030A0000
case, so Py_LIMITED_API
would mean we can't rely on the nice stuff anyway...
Perhaps I need to ignore the UTL and just write some per-language typemaps that actually work. I still struggle to get my head around the maze-like way the UTL hangs together after all these years.
I managed to get the UTL machinery to work for Tcl and Ruby as their internal representation isn't wide character.
One fun wrinkle to be aware of for anyone working on this for additional languages is that std::string_view::data()
returns NULL
for a default-constructed (empty) string_view (and size() also returns zero), but SWIG_FromCharPtrAndSize()
implementations (at least for the target languages I checked) interpret that as an invalid value and map it to None
/Null
/undef
or similar in the target language. I solved this with for a special version of the macro for string_view, which does SWIG_FromCharPtrAndSize(s.data() ? s.data() : "", s.size())
to ensure the pointer is not NULL.
I've not looked at Perl yet, but that doesn't use wide characters internally IIUC, so probably can be done via the UTL too.
I think Python is just going to have to be non-UTL typemaps. Otherwise it's going to need a lot of custom UTL machinery which probably nothing else will use.
I've merged support for Perl. The directorout typemap doesn't seem to work fully - I've added a TODO to Examples/test-suite/perl5/director_string_view_runme.pl
for now and will investigate and try to resolve. Everything else seems to work.
I dug into the Perl directorout issue and have some idea what's going on, but I don't really know what should be going on. I slightly wonder if the issue is Perl decides to upcall when it doesn't need to, though if that is the problem it probably means the handling is flawed when such an upcall really is needed. Maybe i need to compare the control flow with a different target language.
Working on Python now. I'm going to punt on Python2 support as it's not something I need. If anyone really wants it they can contribute a patch for it.
One fun wrinkle to be aware of for anyone working on this for additional languages is that
std::string_view::data()
returnsNULL
for a default-constructed (empty) string_view (and size() also returns zero), butSWIG_FromCharPtrAndSize()
implementations (at least for the target languages I checked) interpret that as an invalid value and map it toNone
/Null
/undef
or similar in the target language. I solved this with for a special version of the macro for string_view, which doesSWIG_FromCharPtrAndSize(s.data() ? s.data() : "", s.size())
to ensure the pointer is not NULL.
I've been thinking about this and I think SWIG_FromCharPtrAndSize
is correct in converting NULL into a target language None
/Null
/undef
as this is the best mapping for C's char *
. Also usage for std::string
as follows:
SWIG_From_std_string (const std::string& s) {
return SWIG_FromCharPtrAndSize(s.data(), s.size());
}
is a good mapping for empty std strings as it will map to a non-null target language empty string as std::string::data()
is always non-null. I also think std::string_view
should also never map to None
/Null
/undef
as the string is held by value and is a view into an existing string, so SWIG_FromCharPtrAndSize(s.data() ? s.data() : "", s.size())
is the right approach.
I've been thinking about this and I think
SWIG_FromCharPtrAndSize
is correct in converting NULL into a target languageNone
/Null
/undef
as this is the best mapping for C'schar *
.
I'd certainly agree for SWIG_FromCharPtr(NULL)
(and also that empty std::string
and std::string_view
should map to an empty string in the target language).
I think it's much less clear for SWIG_FromCharPtr(NULL, 0)
. I was genuinely initially surprised by the current behaviour., as C/C++ APIs taking (pointer, length) often do allow (NULL, 0) for an empty string/container (e.g. in C++ std::string(NULL, 0)
seems to be valid, at least if the discussion here is correct about what a "valid range" is: https://groups.google.com/a/isocpp.org/g/std-discussion/c/aR5mlaBtDj4 - the standard seems to fail to define that term).
I mostly didn't suggest changing it as it would require checking all callers of SWIG_FromCharPtrAndSize()
to see if any actually make use of the current behaviour, which was a yak I really didn't fancy shaving at this point. However we can only really check and change our own uses of it, but user typemaps may use it and expect the current behaviour so I think overall it's best to leave SWIG_FromCharPtrAndSize(NULL, 0)
as-is, and if we find the SWIG_FromCharPtrAndSize(s.data() ? s.data() : "", s.size())
pattern proliferating add a new helper.
I've merged support for Python3 too now, and expanded the test coverage for the already supported languages a bit.
My next step is to try to actually use this new support for Xapian's SWIG-generated bindings which may uncover some problems. I'd also like to resolve the Perl director issue (and conclude if it's also an issue for any other languages that just doesn't manifest) but maybe trying to use this to wrap a real library will give useful insights about that.
If anyone wants to add support for other target languages I'd suggest reading through the notes in my comments above as a start, and if you have further questions I can try to answer them.
I found and fixed a bug in the Lua std::string_view support, which turns out to also be present for std::string (aa2a7f258ea8b9673a352d47f7fed22e70141288).
Is there any updates ? There any workaround for c# ?
I've implemented typemaps for C# (and some other target languages) but there's not been a release with them yet.
Obvious workarounds until 4.2.0 is released are (a) to just use SWIG git master or (b) to copy the new typemaps into your project - you need Lib/csharp/std_string_view.i
and to define this before you %include
in your interface file:
%define SWIG_TYPECHECK_STRINGVIEW 134 %enddef
ccf4f6ec72262623fb45663f87a6c0fc08152e76 (and changes from @wsfulton for Python on the python-stable-abi branch) adjust handling for directorout
which needs extra care to use safely as std::string_view
has the same issues that returning a pointer via directorout
does.
You now get a warning (which can be suppressed) - either SWIGWARN_TYPEMAP_THREAD_UNSAFE
(for C# and Java where a C++ static
is used) or SWIGWARN_TYPEMAP_DIRECTOROUT_PTR
for other languages, except I couldn't see where the directorout
typemap actually gets defined for Perl5, Ruby or Tcl where we currently reuse UTL machinery so I don't know where to add the warning=SWIGWARN_TYPEMAP_DIRECTOROUT_PTR_MSG
.
@wsfulton One remaining point - you deliberately leak in Python:
// Py_XDECREF(bytes); // Avoid undefined behaviour ($input will be pointing to a temporary if bytes is not NULL), for now we just leak by not calling Py_XDECREF
This seems to make it impossible to use directorout
for std::string_view
in Python without leaking in some setups (limited abi without requiring Python >= 3.10 IIUC), whereas if we decref here you can use this provided you do something like returning a cached object like the testcases do.
It also means an unnecessary difference between target languages. We could potentially deliberately leak in other languages, but we don't currently and it would need a way to manually adjust reference counts which I'm not sure how to do for all target languages (I'm not sure all even have ref counted objects). And for Perl, Ruby and Tcl I don't even know where the directorout typemap gets defined.
You say "for now" above suggesting this is a temporary thing so I think it would be more helpful to fix this before 4.2.0 so users don't have to deal with a later change of behaviour between release versions.
I also still need to add a note to the std::string_view
docs in Library.html about this issue.
Adding to the 4.2 milestone for resolving the points above.
Oh hang on, I think I see - with the limited API and Python < 3.10 we can't get at the cached UTF-8 representation so we can't just not leak because then the data we need to point to will go away.
We could do what C# and Java do and use a static std::string
in this case which doesn't leak but is not thread safe (and so gives warning 470:Thread/reentrant unsafe wrapping, consider returning by value instead."
), though that's a bit fiddly as we need to embed knowledge of when the returned bytes
might not be NULL
. Or maybe just drop the "for now" from the comment as realistically we're going to be stuck with this leak until we require Python 3.10 (or require it for the limited abi case at least) at which point the problem just goes away.
I also thought to try -E
for the perl case and the directorout typemap is defined by /*@SWIG:../../../Lib/typemaps/ptrtypes.swg,68,%ptr_directorout_typemap@*/
. I guess we could add a way to flag "type wraps a pointer?" to that macro so it adds warning 473 to the non-pointer directorout as well.
I think it is really hard to get any sensible, reliable directorout typemaps, hence it would be easier to just not support them as I suggested earlier. They can be left in but with a bunch of caveats and inconsistencies which are language dependent. For Python, if a cached string is returned, it will work reliably (caveat is not for stable abi < python 3.10). Other languages don't necessarily provide a non-undefined behaviour way to handle effectively a pointer to a string in directors.
The most consistent approach is to not provide directorout typemaps. Otherwise, provide a non-optimal consistent approach: one that is not thread-safe or one that leaks. Neither are good options. The alternative is a non-consistent approach. It's not a good space to be in!
I hadn't initially spotted that this leak was actually required for the case where it actually happens, but those same concerns don't apply to languages where the string data is held as UTF-8 natively and we can get a pointer to that buffer so that part of my questioning is resolved.
I agree it's a problematic area. It doesn't seem a great API pattern to have a virtual method returning std::string_view
even just considering working in C++, but sometimes you need to work with (or wrap) a poorly designed API you can't just change and we do have support for the similar cases of returning a pointer or reference from a virtual method which are likely to be similarly problematic.
I'll add a note to the docs and try to add the warning for Perl5, Ruby and Tcl now I've figured out where the typemap is actually getting defined.
I'll add a note to the docs
Done.
and try to add the warning for Perl5, Ruby and Tcl now I've figured out where the typemap is actually getting defined.
We don't support directors for Tcl so there's no directorout typemap there.
Trying to sort this out for Perl5 and Ruby still.
I notice there isn't a clear list of target languages for which I didn't implement this, so here we go:
New c++17 feature.