swig / swig

SWIG is a software development tool that connects programs written in C and C++ with a variety of high-level programming languages.
http://www.swig.org
Other
5.97k stars 1.26k forks source link

Missing terminating character " in the generation of the string that represents a type's name #3127

Closed lluisalemanypuig closed 1 month ago

lluisalemanypuig commented 2 months ago

Context

I'm using SWIG version 4.3.0 to wrap a C++23 library that I'm writing. This library contains many enumerations inside. One function returns a std::array of as many values as values there are in one specific enumeration. A minimal example is the following of this context is this:

enum class my_enum {
    A, B, C, D,
    unknown
};
constexpr std::size_t __my_enum_size = 1 + static_cast<std::size_t>(my_enum::unknown);

[[nodiscard]] std::array<bool, __my_enum_size> do_something() {
    return std::array<bool, __my_enum_size>{true,true,true,true,false};
}

Now, this is wrapped all right into the enumeration.i module:

%module enumeration
%{
#include "enumeration.hpp"
%}
%include "enumeration.hpp"

Bug

However, if I split the declaration of __my_enum_size into multiple lines, like so:

enum class my_enum {
    A, B, C, D,
    unknown
};
constexpr std::size_t __my_enum_size =
    1 + static_cast<std::size_t>(
            my_enum::unknown
        );

[[nodiscard]] std::array<bool, __my_enum_size> do_something() {
    return std::array<bool, __my_enum_size>{true,true,true,true,true};
}

swig fails to produce a compilable .cxx. Swig generates an unterminated string:

static swig_type_info _swigt__p_char = {"_p_char", "char *", 0, 0, (void*)0, 0};
static swig_type_info _swigt__p_std__arrayT_bool_1_static_castT_std__size_t_tf____my_enum__unknown___F_t = {"_p_std__arrayT_bool_1_static_castT_std__size_t_tf____my_enum__unknown___F_t", "std::array< bool,1+static_cast< std::size_t >(
            my_enum::unknown
        ) > *", 0, 0, (void*)0, 0};

Summary

To summarize, swigs wraps fine the module when the declaration of __my_enum_size is written in a single line, but it fails to wrap it correctly, when the declaration is written over multiple lines. (note: the solution is not to write the declaration of __my_enum_size in a single line in my actual code since .clang-format won't stop splitting it. Yes, I can write // clang-format off but this does not fix the potential bug in swig).

Steps to reproduce

To reproduce the steps to find the bug, download the files attached in this issue and run the commands

$ mkdir library
$ make

In the library/enumeration.cxx generated by swig, search for the unterminated string written above.

No extra files are needed to reproduce the steps.

files.zip

ojwb commented 2 months ago

I don't think it's really a missing terminating ", but rather than literal newlines are being emitted in this string constant - the terminating " is there on the last line of the output you quote - the compiler presumably complains because it's not on the same line as the opening ".

Looks like newline should probably be normalised and get treated like a space somewhere in the handling of this.

lluisalemanypuig commented 2 months ago

Yes, I like your interpretation more. I thought that the issue could be fixed by adding more ", like this:

static swig_type_info _swigt__p_char = {"_p_char", "char *", 0, 0, (void*)0, 0};
static swig_type_info _swigt__p_std__arrayT_bool_1_static_castT_std__size_t_tf____my_enum__unknown___F_t = {"_p_std__arrayT_bool_1_static_castT_std__size_t_tf____my_enum__unknown___F_t", "std::array< bool,1+static_cast< std::size_t >("
            "my_enum::unknown"
        ") > *", 0, 0, (void*)0, 0};

An interesting feature of C++ is that strings can be written in different lines, in chunks, like this:

const std::string s = "this is"
    "a"
"long"
           "string";

so I simply thought that some " were missing...

ojwb commented 2 months ago

Just adding " like that could change the meaning - type long long (with a newline in the middle) would produce source:

"long"
"long"

which would be longlong.

We could escape as \n and include in the output, but it seems cleaner and simpler to just replace newlines with spaces.

Normally we rebuild expression values with consistent (and fairly minimal) use of spaces and no other whitespace, but in this case we get the value via parser error recovery so we get it with whitespace as in the source.

This seems to fix the generated code:

diff --git a/Source/CParse/parser.y b/Source/CParse/parser.y
index f3ed4040..63194755 100644
--- a/Source/CParse/parser.y
+++ b/Source/CParse/parser.y
@@ -7262,6 +7262,7 @@ exprcompound   : expr[lhs] PLUS expr[rhs] {
           qty = nstr;
         }
         $$.val = NewStringf("%s%s",qty,scanner_ccode);
+        Replaceall($$.val, "\n", " ");
         Clear(scanner_ccode);
         /* Try to deduce the type - this could be a C++ "constructor
          * cast" such as `double(4)` or a function call such as

With this applied I get:

static swig_type_info _swigt__p_std__arrayT_bool_1_static_castT_std__size_t_tf____my_enum__unknown___F_t = {"_p_std__arrayT_bool_1_static_castT_std__size_t_tf____my_enum__unknown___F_t", "std::array< bool,1+static_cast< std::size_t >(          my_enum::unknown        ) > *", 0, 0, (void*)0, 0};

I think it'd be better to have a helper function which properly normalises whitespace in expression code obtained in this way so it looks more like expressions we can properly parse. It could at least handle other whitespace (especially tabs) and collapse runs of whitespace to a single space, or perhaps eliminate in simple cases.

Also need to check other places that use parser error recovery like this (e.g. an expression with subscripting and newlines within the [...] is probably similarly problematic in this context; some member access expressions; also some decltype, sizeof, alignof, noexcept expressions).

And I need to add regression tests.

lluisalemanypuig commented 1 month ago

If this is a lot of work, (and I think it is) maybe I could try pitching in by adding some test cases. I tried to locate the tests but I couldn't find them. If you could point me to them, maybe I can work out how to add some (I would open a PR, of course).

ojwb commented 1 month ago

I don't think it's a lot of work, mostly I've just not found the time to sort it out.

I'm also not sure how best to clean up/normalise these expressions obtained via parser error recovery, but a simple replacement of \n with a space would probably do the job for an initial fix.

I can take a look now. If it gets more involved, you could usefully come up with examples which currently fail for each of the cases I noted above (decltype, etc). Just a small .i pasted in a comment here is fine, but FYI the test suite is in Examples/test-suite.

ojwb commented 1 month ago

I noticed a related case (which is a different problem) - we generate bad code for this:

constexpr std::size_t __my_enum_size = 1 + sizeof("literal");

The expression value is used in a string literal with no escaping so the " in it break things. I think I've seen a similar problem with sizeof("literal") before. Not sure if there's an open issue for it - the github search doesn't seem able to search for something like sizeof(".

ojwb commented 1 month ago

Some CI failures, but we're pretty close I think. Javascript and R look unrelated AFAICS.

SWIG/Ocaml seems to generate bad code trying to wrap the class enum members. I think simplest to just adjust the testcase so it doesn't use a class enum as that's not really relevant to the bug (fixing the bug would be good but SWIG/Ocaml is fairly unloved and marked "experimental" and I don't have time or energy to get sucked into fixing it).

We just need a testcase where an expression parse uses the rule:

| type LPAREN {

Looks like replacing my_enum::unknown with 4 still provides a regression test and then we can get rid of the enum entirely.

Also these cases aren't actually working via parser error recovery - they actually all grab program text using skip_balanced(). I'll adjust the commit message and change log description.

ojwb commented 1 month ago

R look unrelated AFAICS.

Actually R is failing on the modified testcase - I was thrown off because a different testcase name was shown just above the failure but that's due to testcases getting run in parallel.

I can reproduce locally at least, but sadly I'm no R expert.

ojwb commented 1 month ago

Oh, it's easy actually - you shouldn't be using the identifier __my_enum_size in C++ code as it contains a double underscore: If the programmer uses such identifiers, the program is ill-formed, no diagnostic required.:

https://en.cppreference.com/w/cpp/language/identifiers

In C code the restriction is only that it can't start with a double underscore:

https://en.cppreference.com/w/c/language/identifier

If I rename to just my_enum_size then R is happy too.

ojwb commented 1 month ago

I went for normalising runs of whitespace to a single space. The normalising code doesn't do anything smart about string literals though as I found any string literal in this context breaks the type_info anyway (because SWIG just substitutes the expression with double quotes into a double quoted string).

However it's just occurred to me that this can break other cases, e.g. this change stops the patched testcase from passing but shouldn't:

diff --git a/Examples/test-suite/constant_expr.i b/Examples/test-suite/constant_expr.i
index 78d9b5e3..546e5bc5 100644
--- a/Examples/test-suite/constant_expr.i
+++ b/Examples/test-suite/constant_expr.i
@@ -65,5 +65,5 @@ public:
         test_alignof_too;
 };

-%constant int WSTRING_LIT_LEN1 = (sizeof(L"1234")/sizeof(wchar_t) - 1);
-%constant int WSTRING_LIT_LEN2 = (sizeof(L"12" L"34")/sizeof(wchar_t) - 1);
+%constant int WSTRING_LIT_LEN1 = (sizeof(L"  34")/sizeof(wchar_t) - 1);
+%constant int WSTRING_LIT_LEN2 = (sizeof(L"  " L"34")/sizeof(wchar_t) - 1);

We could add a state machine to handle whether we're in a string or not, but the case for normalising the whitespace seems quite weak so I think we probably should just change newline to a space.

Except that breaks this patched version:

diff --git a/Examples/test-suite/constant_expr.i b/Examples/test-suite/constant_expr.i
index 78d9b5e3..3184fc0f 100644
--- a/Examples/test-suite/constant_expr.i
+++ b/Examples/test-suite/constant_expr.i
@@ -66,4 +66,5 @@ public:
 };

 %constant int WSTRING_LIT_LEN1 = (sizeof(L"1234")/sizeof(wchar_t) - 1);
-%constant int WSTRING_LIT_LEN2 = (sizeof(L"12" L"34")/sizeof(wchar_t) - 1);
+%constant int WSTRING_LIT_LEN2 = (sizeof(L"12\
+" "34")/sizeof(wchar_t) - 1);

With GCC this gives a warning at compile time which seems a bit confusing but I think it doesn't like the \ escape sequence:

constant_expr_wrap.cxx:2368:137: warning: unknown escape sequence: '\040'
 2368 |   zend_declare_class_constant_long(SWIG_Php_ce_constant_expr, "WSTRING_LIT_LEN2", sizeof("WSTRING_LIT_LEN2") - 1, (int)((sizeof(L"12\ " "34")/sizeof(wchar_t)-1)));
      |                                                                                                                                         ^~~~

Worse though is that this literal is 5 characters long whereas with the escaped newline it was 4.

Reopening until I sort these problems out.

ojwb commented 1 month ago

I went for normalising runs of whitespace to a single space. The normalising code doesn't do anything smart about string literals though as I found any string literal in this context breaks the type_info anyway (because SWIG just substitutes the expression with double quotes into a double quoted string).

I've adjusted this to replace each whitespace with one space, and to skip this and just use the expression text as-is if the expression text contains one or more double quotes. I believe this works for all cases which SWIG currently handles.

If anyone can find a case that worked before but doesn't now, please open a new issue with a self-contained reproducer. Similarly for related cases this doesn't fix.