swig / swig

SWIG is a software development tool that connects programs written in C and C++ with a variety of high-level programming languages.
http://www.swig.org
Other
5.75k stars 1.23k forks source link

Support for std::string_view for more target languages #1567

Open barreyra opened 5 years ago

barreyra commented 5 years ago

New c++17 feature.

fflexo commented 5 years ago

Superficially this is pretty easy to get the data exposed via SWIG in both directions.

The tricky bit is keeping the semantics right - if you used a string_view it's probably because you didn't want to make a copy. For input argument this is easy enough. For return values it's basically not possible to get something that feels like the target language's String class and has 0 copies in most languages.

So I think in the first instance I'd probably aim to support just two cases of function arguments - value and const reference because those are easy to make exactly right.

barreyra commented 5 years ago

I agree the return value is tricky. The safest solution here is to make a copy of the value, unless we can be smart about the object's lifetime.

fflexo commented 5 years ago

I've been experimenting with this a little, but not got much useful to push yet as I've not actually decided which approach to run with. We can solve this in essentially 3 ways:

  1. Hack something that builds on the existing std::string typemaps. This turns out to fairly rapidly get quite messy though - I ended up making a private sub-class of std::string_view that let the existing std::string typemaps work and then deliberately using object slicing to get a real, plain std::string_view. It's "ok" because we wanted that to happen, but it's rather more hacky than I'd hoped
  2. Write something that's substantially similar to the std::string typemaps, including per-language support. There are quite a lot of languages though and I only know 5 or 6 well enough to be confident of doing a decent job.
  3. Refactor std::string code to make each language give a nicer, pure C++ abstraction around reading/creating strings. This feels pretty intrusive though.

For the output case making a copy, but adding a warning is probably the most useful, least bad option.

vadz commented 5 years ago

My 2c: I'd try to refactor the common part of string and string_view typemaps in a macro that would be used to define them for both classes. Of course, there would also be an additional part for all the stuff in string not present in string_view, but it looks like the separation could be done purely mechanically and so have a decent chance for not breaking something.

As for the output direction, I'd just make a copy. A warning would be annoying and it's not like you could do anything about it anyhow (other than defining a function just for SWIG returning a string instead of string_view but what would we gain from forcing people to do this?).

DeeeeLAN commented 3 years ago

Any updates or workarounds for this? I have a library I am trying to use swig with and it uses string_view quite a bit.

fflexo commented 3 years ago

What language are you targeting? I've not touched this in a long time as it's not obvious what the right solution really is here - doing something locally that works for me is simple enough; however what semantics would you expect to see ? Zero copy? Or usable like a native string in your target language? (Those two are pretty much mutually exclusive).

ojwb commented 1 year ago

We're not realistically going to achieve returning as a target language string without copying in general (probably some target languages allow wrapping existing data in a string without copying, but most don't). Returning as a SWIG-specific std::string_view wrapper is going to be clumsy to use in most target languages.

I think it's best to think of the "zero copy" feature as something that's made possible by using std::string_view rather than guaranteed by using it. Even in C++, returning a std::string_view doesn't guarantee the result won't be copied (though in that case it's dependent on what the application code does), e.g.:

std::string_view get_view(int x);

std::string g() { return std::string(get_view(1)); }

There's "prior art" here too as existing SWIG out typemaps for const std::string& all copy the returned value, but in C++ it's only copied if application code does something to cause that to happen.

I wonder if it's actually best overall to just add a new std_string_view.i for each target language based on the current std_string.i. It's certainly appealing to share code, but not getting this implemented because that seems hard in practice isn't a good outcome. This does require per-target-language work, and likely means this won't get implemented for them all at once, but that's already the case for many newer C++ library features (even some that were in C++98 - e.g. go doesn't support std::set yet: #2253)

ojwb commented 1 year ago

I've started to work on this and have opened #2540 with my progress so far, which is support for Lua and PHP.

My personal target is to get this working for all the languages Xapian has bindings for so we can start to use std::string_view in the Xapian C++ API and still be able to generate bindings (that means also: python3 java ruby tcl8 csharp perl). I started with Lua and PHP as the others either use the UTL or are statically typed (which means there are extra typemaps to worry about).

If anyone wants to help out, that'd be great - both with the languages we need for Xapian (I don't know all those well) and for the other SWIG target languages.

ojwb commented 1 year ago

I've implemented this for C#, Java, Lua and PHP now.

My 2c: I'd try to refactor the common part of string and string_view typemaps in a macro that would be used to define them for both classes. Of course, there would also be an additional part for all the stuff in string not present in string_view, but it looks like the separation could be done purely mechanically and so have a decent chance for not breaking something.

I've done four languages now (though SWIG's Java and C# support are structured very similar, so maybe it's effectively 3 and a bit), and based on that I can see some obstacles to trying to share a single set of typemaps for string and string_view:

wsfulton commented 1 year ago

Agreed, copy,paste,tweak of the std::string typemaps is the best option here.

ojwb commented 1 year ago

I've implemented this for C#, Java, Lua and PHP now.

Now merged.

ojwb commented 1 year ago

I've been looking at the UTL languages, and specifically Tcl and Python (somewhat arbitrary choice).

I'm struggling to hook in to the existing framework. Tcl seems to work OK (I think because its native string representation is UTF-8 (mostly, they encode \0 as a two byte overlong sequence IIRC) whereas with Python it goes wrong, I think because the in typemap converts the Python string to UTF-8, puts the result in a std::string_view, and then releases the UTF-8 copy before passing the (now invalid) std::string_view to the wrapped function.

Python 3.3 added PyUnicode_AsUTF8AndSize() which supports keeping a UTF-8 version of the string in the object, so that would probably work better (and it seems we now require >= 3.3 for Python 3). I guess maybe I need to convert SWIG to use that (which would be more efficient for workloads which involve passing the same python string to SWIG-wrapped functions multiple times). Currently it's only used in one place, SWIG_Python_str_AsChar(), though I see there we have a fallback for the !defined(Py_LIMITED_API) || Py_LIMITED_API+0 >= 0x030A0000 case, so Py_LIMITED_API would mean we can't rely on the nice stuff anyway...

Perhaps I need to ignore the UTL and just write some per-language typemaps that actually work. I still struggle to get my head around the maze-like way the UTL hangs together after all these years.

ojwb commented 1 year ago

I managed to get the UTL machinery to work for Tcl and Ruby as their internal representation isn't wide character.

One fun wrinkle to be aware of for anyone working on this for additional languages is that std::string_view::data() returns NULL for a default-constructed (empty) string_view (and size() also returns zero), but SWIG_FromCharPtrAndSize() implementations (at least for the target languages I checked) interpret that as an invalid value and map it to None/Null/undef or similar in the target language. I solved this with for a special version of the macro for string_view, which does SWIG_FromCharPtrAndSize(s.data() ? s.data() : "", s.size()) to ensure the pointer is not NULL.

I've not looked at Perl yet, but that doesn't use wide characters internally IIUC, so probably can be done via the UTL too.

I think Python is just going to have to be non-UTL typemaps. Otherwise it's going to need a lot of custom UTL machinery which probably nothing else will use.

ojwb commented 1 year ago

I've merged support for Perl. The directorout typemap doesn't seem to work fully - I've added a TODO to Examples/test-suite/perl5/director_string_view_runme.pl for now and will investigate and try to resolve. Everything else seems to work.

ojwb commented 1 year ago

I dug into the Perl directorout issue and have some idea what's going on, but I don't really know what should be going on. I slightly wonder if the issue is Perl decides to upcall when it doesn't need to, though if that is the problem it probably means the handling is flawed when such an upcall really is needed. Maybe i need to compare the control flow with a different target language.

Working on Python now. I'm going to punt on Python2 support as it's not something I need. If anyone really wants it they can contribute a patch for it.

wsfulton commented 1 year ago

One fun wrinkle to be aware of for anyone working on this for additional languages is that std::string_view::data() returns NULL for a default-constructed (empty) string_view (and size() also returns zero), but SWIG_FromCharPtrAndSize() implementations (at least for the target languages I checked) interpret that as an invalid value and map it to None/Null/undef or similar in the target language. I solved this with for a special version of the macro for string_view, which does SWIG_FromCharPtrAndSize(s.data() ? s.data() : "", s.size()) to ensure the pointer is not NULL.

I've been thinking about this and I think SWIG_FromCharPtrAndSize is correct in converting NULL into a target language None/Null/undef as this is the best mapping for C's char *. Also usage for std::string as follows:

SWIG_From_std_string  (const std::string& s) {
  return SWIG_FromCharPtrAndSize(s.data(), s.size());
}

is a good mapping for empty std strings as it will map to a non-null target language empty string as std::string::data() is always non-null. I also think std::string_view should also never map to None/Null/undef as the string is held by value and is a view into an existing string, so SWIG_FromCharPtrAndSize(s.data() ? s.data() : "", s.size()) is the right approach.

ojwb commented 1 year ago

I've been thinking about this and I think SWIG_FromCharPtrAndSize is correct in converting NULL into a target language None/Null/undef as this is the best mapping for C's char *.

I'd certainly agree for SWIG_FromCharPtr(NULL) (and also that empty std::string and std::string_view should map to an empty string in the target language).

I think it's much less clear for SWIG_FromCharPtr(NULL, 0). I was genuinely initially surprised by the current behaviour., as C/C++ APIs taking (pointer, length) often do allow (NULL, 0) for an empty string/container (e.g. in C++ std::string(NULL, 0) seems to be valid, at least if the discussion here is correct about what a "valid range" is: https://groups.google.com/a/isocpp.org/g/std-discussion/c/aR5mlaBtDj4 - the standard seems to fail to define that term).

I mostly didn't suggest changing it as it would require checking all callers of SWIG_FromCharPtrAndSize() to see if any actually make use of the current behaviour, which was a yak I really didn't fancy shaving at this point. However we can only really check and change our own uses of it, but user typemaps may use it and expect the current behaviour so I think overall it's best to leave SWIG_FromCharPtrAndSize(NULL, 0) as-is, and if we find the SWIG_FromCharPtrAndSize(s.data() ? s.data() : "", s.size()) pattern proliferating add a new helper.

ojwb commented 1 year ago

I've merged support for Python3 too now, and expanded the test coverage for the already supported languages a bit.

My next step is to try to actually use this new support for Xapian's SWIG-generated bindings which may uncover some problems. I'd also like to resolve the Perl director issue (and conclude if it's also an issue for any other languages that just doesn't manifest) but maybe trying to use this to wrap a real library will give useful insights about that.

If anyone wants to add support for other target languages I'd suggest reading through the notes in my comments above as a start, and if you have further questions I can try to answer them.

ojwb commented 1 year ago

I found and fixed a bug in the Lua std::string_view support, which turns out to also be present for std::string (aa2a7f258ea8b9673a352d47f7fed22e70141288).

CorrM commented 10 months ago

Is there any updates ? There any workaround for c# ?

ojwb commented 10 months ago

I've implemented typemaps for C# (and some other target languages) but there's not been a release with them yet.

Obvious workarounds until 4.2.0 is released are (a) to just use SWIG git master or (b) to copy the new typemaps into your project - you need Lib/csharp/std_string_view.i and to define this before you %include in your interface file:

%define SWIG_TYPECHECK_STRINGVIEW   134    %enddef
ojwb commented 10 months ago

ccf4f6ec72262623fb45663f87a6c0fc08152e76 (and changes from @wsfulton for Python on the python-stable-abi branch) adjust handling for directorout which needs extra care to use safely as std::string_view has the same issues that returning a pointer via directorout does.

You now get a warning (which can be suppressed) - either SWIGWARN_TYPEMAP_THREAD_UNSAFE (for C# and Java where a C++ static is used) or SWIGWARN_TYPEMAP_DIRECTOROUT_PTR for other languages, except I couldn't see where the directorout typemap actually gets defined for Perl5, Ruby or Tcl where we currently reuse UTL machinery so I don't know where to add the warning=SWIGWARN_TYPEMAP_DIRECTOROUT_PTR_MSG.

ojwb commented 10 months ago

@wsfulton One remaining point - you deliberately leak in Python:

          // Py_XDECREF(bytes); // Avoid undefined behaviour ($input will be pointing to a temporary if bytes is not NULL), for now we just leak by not calling Py_XDECREF

This seems to make it impossible to use directorout for std::string_view in Python without leaking in some setups (limited abi without requiring Python >= 3.10 IIUC), whereas if we decref here you can use this provided you do something like returning a cached object like the testcases do.

It also means an unnecessary difference between target languages. We could potentially deliberately leak in other languages, but we don't currently and it would need a way to manually adjust reference counts which I'm not sure how to do for all target languages (I'm not sure all even have ref counted objects). And for Perl, Ruby and Tcl I don't even know where the directorout typemap gets defined.

You say "for now" above suggesting this is a temporary thing so I think it would be more helpful to fix this before 4.2.0 so users don't have to deal with a later change of behaviour between release versions.

I also still need to add a note to the std::string_view docs in Library.html about this issue.

Adding to the 4.2 milestone for resolving the points above.

ojwb commented 10 months ago

Oh hang on, I think I see - with the limited API and Python < 3.10 we can't get at the cached UTF-8 representation so we can't just not leak because then the data we need to point to will go away.

We could do what C# and Java do and use a static std::string in this case which doesn't leak but is not thread safe (and so gives warning 470:Thread/reentrant unsafe wrapping, consider returning by value instead."), though that's a bit fiddly as we need to embed knowledge of when the returned bytes might not be NULL. Or maybe just drop the "for now" from the comment as realistically we're going to be stuck with this leak until we require Python 3.10 (or require it for the limited abi case at least) at which point the problem just goes away.

I also thought to try -E for the perl case and the directorout typemap is defined by /*@SWIG:../../../Lib/typemaps/ptrtypes.swg,68,%ptr_directorout_typemap@*/. I guess we could add a way to flag "type wraps a pointer?" to that macro so it adds warning 473 to the non-pointer directorout as well.

wsfulton commented 10 months ago

I think it is really hard to get any sensible, reliable directorout typemaps, hence it would be easier to just not support them as I suggested earlier. They can be left in but with a bunch of caveats and inconsistencies which are language dependent. For Python, if a cached string is returned, it will work reliably (caveat is not for stable abi < python 3.10). Other languages don't necessarily provide a non-undefined behaviour way to handle effectively a pointer to a string in directors.

The most consistent approach is to not provide directorout typemaps. Otherwise, provide a non-optimal consistent approach: one that is not thread-safe or one that leaks. Neither are good options. The alternative is a non-consistent approach. It's not a good space to be in!

ojwb commented 10 months ago

I hadn't initially spotted that this leak was actually required for the case where it actually happens, but those same concerns don't apply to languages where the string data is held as UTF-8 natively and we can get a pointer to that buffer so that part of my questioning is resolved.

I agree it's a problematic area. It doesn't seem a great API pattern to have a virtual method returning std::string_view even just considering working in C++, but sometimes you need to work with (or wrap) a poorly designed API you can't just change and we do have support for the similar cases of returning a pointer or reference from a virtual method which are likely to be similarly problematic.

I'll add a note to the docs and try to add the warning for Perl5, Ruby and Tcl now I've figured out where the typemap is actually getting defined.

ojwb commented 10 months ago

I'll add a note to the docs

Done.

and try to add the warning for Perl5, Ruby and Tcl now I've figured out where the typemap is actually getting defined.

We don't support directors for Tcl so there's no directorout typemap there.

Trying to sort this out for Perl5 and Ruby still.

ojwb commented 1 month ago

I notice there isn't a clear list of target languages for which I didn't implement this, so here we go: