microsoft / ifc-spec

IFC format specification
Creative Commons Attribution 4.0 International
71 stars 7 forks source link

String literal structure (i.e., "const.str") should forbid interior null characters #80

Closed DarkArc closed 11 months ago

DarkArc commented 1 year ago

MSVC produced IFCs currently can contain interior null characters in represented strings.

These null characters can be a source of bugs as implementations may not recognize the need to remove them when reading text from the IFC.

Additionally, their presence cause a repeated performance burden on the consumer as the consumer must always traverse the entire string, and discard null characters. If an implementation requires a copy of the text for its purposes, optimized copies (via memcpy) are made inaccessible via the presence of null characters (assuming the implementation does not handle interior null charters itself). Similarly, interior null characters unnecessarily inflate the size of the IFC file.

It would be ideal if the IFC documented the presence of a terminating null character (i.e., #79) and (both for easy of implementation in consumers, and performance) forbids interior null character.

GabrielDosReis commented 1 year ago

MSVC produced IFCs currently can contain interior null characters in represented strings.

That looks like an MSVC bug. Do you have a testcase for this?

GabrielDosReis commented 11 months ago

Do you have a testcase for this?

@DarkArc - ping.

GabrielDosReis commented 11 months ago

Is the suggestion that the IFC spec should ban a string literal from an input C++ source code like "foo\0bar"?

DarkArc commented 11 months ago

I'm going to close this for now. I think this is actually an EDG bug (improper conversion of IFC wide character literals).

I remember realizing (likely in the time that's passed since this was reported) that I'd misunderstood 3.1 Index type:

The representation of identifiers referenced by TextOffset indices uses UTF-8 encoding and are NUL-terminated. For string literals, see §10.1.

as specifying that all const.str entries were normalized to a UTF-8 encoding. As has become obvious as the implementation has matured, that's not correct.

I've opened https://github.com/microsoft/ifc-spec/issues/141 for further discussion.