sg16-unicode / sg16

Alias barriers; a replacement for the ICU hack #67

Open tahonermann opened 3 years ago

tahonermann commented 3 years ago

ICU defines a U_ALIASING_BARRIER macro (https://github.com/unicode-org/icu/blob/master/icu4c/source/common/unicode/char16ptr.h#L30-L36) that allows ICU to use char16_t internally while also providing interfaces that work with text stored in wchar_t (when it is a 16-bit type) or uint16_t (when available) without having to copy the text to and from char16_t-based storage. This is important for efficient operation on Windows and with other libraries that use UTF-16 internally but do not use char16_t as their UTF-16 character type.

For most compilers, the U_ALIASING_BARRIER macro is a no-op; ICU relies on those compilers not taking advantage of char16_t being a type distinct from, and thus non-aliasing with, the other UTF-16 character types ICU supports.

For Clang and gcc, ICU defines the macro as follows and invokes it immediately before using reinterpret_cast to convert between pointers to char16_t and pointers to the other supported UTF-16 character types. The volatile inline assembly prevents the optimizer from reordering loads and stores across it, and the "memory" clobber informs the compiler that any value read from memory before the inline assembly must be re-read afterward; together these form a read/write memory barrier. See the gcc documentation (https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html) for more details.

#define U_ALIASING_BARRIER(ptr) asm volatile("" : : "rm"(ptr) : "memory")
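
ICU applies the barrier to the pointer operand immediately before each such cast. The following is a simplified sketch of that pattern (modeled on the helpers in ICU's char16ptr.h, with hypothetical function names; it is not the exact ICU source):

#include <cstdint>

inline const char16_t* toChar16Ptr(const std::uint16_t* p) {
#ifdef U_ALIASING_BARRIER
    U_ALIASING_BARRIER(p);  // tell the optimizer p's pointee may now alias
#endif
    return reinterpret_cast<const char16_t*>(p);
}

inline const std::uint16_t* toUInt16Ptr(const char16_t* p) {
#ifdef U_ALIASING_BARRIER
    U_ALIASING_BARRIER(p);
#endif
    return reinterpret_cast<const std::uint16_t*>(p);
}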

The introduction of char8_t as a non-aliasing type in C++20 creates a similar need for some form of alias barrier that allows limited interchange between libraries that use char8_t for UTF-8 data internally and those that use char or unsigned char. Though the same problem applies in principle for char8_t as it does for char16_t, in practice it is less of a concern because char and unsigned char are permitted to alias objects of any type, including char8_t; the remaining hazard is accessing char or unsigned char storage through a char8_t pointer.

Converting a pointer to one type into a pointer to an unrelated type requires reinterpret_cast, which may not appear in a constant expression, and subsequent access through the converted pointer likely introduces undefined behavior. An alias barrier could potentially allow such conversions in constant expressions between types that meet certain compatibility requirements; for example, a common underlying type.
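
For example (a minimal illustration, not code from the issue), the following is ill-formed because reinterpret_cast may not appear in a constant expression:

constexpr char hi[] = "Hi";
constexpr const char8_t* p =
    reinterpret_cast<const char8_t*>(hi);  // error: reinterpret_cast is not
                                           // permitted in a constant expression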

Recent exploration of this area has uncovered some simple test cases that demonstrate that an alias barrier is needed in practice for some compilers. The following links contain code that does not perform as intended. In each case, three attempts are made to "fix" the example using various approaches. Of the approaches tried, ICU's volatile inline assembly trick is the only one that works in all cases. In each case, the intended behavior is that the program output "Hi" when run.
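
The linked code is not reproduced here, but a minimal program in the same spirit (my reconstruction, not the code behind those links) shows the shape of the problem:

#include <cstdio>

// Writes "Hi" through a char8_t pointer. Under type-based alias analysis,
// stores through char8_t lvalues are presumed not to modify objects of type
// char, so using this function on char storage is undefined behavior.
void set(char8_t* s) {
    s[0] = u8'H';
    s[1] = u8'i';
}

int main() {
    char buf[] = {'x', 'x', '\0'};
    set(reinterpret_cast<char8_t*>(buf));
    // Intended to print "Hi"; an optimizer that exploits char8_t's
    // non-aliasing status may forward the original 'x' values and
    // print "xx" instead.
    std::printf("%c%c\n", buf[0], buf[1]);
}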

Though ICU's inline assembly trick does seem to work for all of these cases, it has the downside of pessimizing the optimizer more than is necessary; a more targeted solution is therefore desirable.

jensmaurer commented 3 years ago

On 06/03/2021 07.29, Tom Honermann wrote:

> For most compilers, the U_ALIASING_BARRIER macro is a no-op; ICU relies on those compilers not taking advantage of char16_t being a type distinct from, and thus non-aliasing with, the other UTF-16 character types ICU supports.

That is a daring approach, and I'm flabbergasted that it appears to work for "most compilers".

> Converting a pointer to one type into a pointer to an unrelated type requires reinterpret_cast, which may not appear in a constant expression, and subsequent access through the converted pointer likely introduces undefined behavior. An alias barrier could potentially allow such conversions in constant expressions between types that meet certain compatibility requirements; for example, a common underlying type.

The proper approach is not to have an alias barrier and a reinterpret_cast as independent things, but to have an underlying_sibling_cast<T>(x) (or similar) that tells the compiler, in a more targeted manner, that x and T may now alias. The problem is whether/when the scope of such aliasing ends; if the pointer escapes, we'd poison the entire program.

Jens
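
A minimal strawman of what such an underlying_sibling_cast might look like if assembled from today's tools (purely hypothetical; this is not an existing or proposed facility, and a real design would need to bound the scope of the aliasing rather than clobber all of memory):

template <typename To, typename From>
To* underlying_sibling_cast(From* p) {
    static_assert(sizeof(To) == sizeof(From) && alignof(To) == alignof(From),
                  "sibling types must have the same size and alignment");
#if defined(__GNUC__) || defined(__clang__)
    asm volatile("" : : "rm"(p) : "memory");  // ICU-style whole-memory barrier
#endif
    return reinterpret_cast<To*>(p);
}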

tahonermann commented 3 years ago

> That is a daring approach, and I'm flabbergasted that it appears to work for "most compilers".

It is, and I suspect it does not actually suffice for "most compilers". My "most compilers" statement was derived from the fact that ICU defines the U_ALIASING_BARRIER macro as empty for all compilers other than gcc and Clang. However, someone building the library can define U_ALIASING_BARRIER as needed for the compiler being used.

ICU's platform support is documented here.

Markus Scherer has reported discussing aliasing concerns with Microsoft engineers and being assured that Visual C++ will never treat wchar_t, char16_t, and unsigned short as non-aliasing types. I suspect other compilers that target Windows, like Intel's icc, follow suit.

Outside of Windows, gcc and Clang probably cover most of the real world use of ICU these days.

I suspect that, even where ICU is using the U_ALIASING_BARRIER macro today, there may be a fair amount of "getting lucky" going on.

jensmaurer commented 3 years ago

With Richard's example code,

// If T and U are presumed not to alias, the compiler may treat *p = 1 as a
// dead store and load *q without regard to the surrounding stores to *p.
template<typename T, typename U> U f(T *p, U *q) {
  *p = 1;   // dead store if *q cannot alias *p
  U u = *q;
  *p = 2;
  return u;
}

only clang optimizes the unsigned short / char16_t case; icc, MSVC, and gcc do not. That failure to optimize feels like a bug. MSVC doesn't even optimize the long / int combination (but icc does). I've used "/Ox /Og /O2" for MSVC, knowing nothing about that compiler.

https://www.godbolt.org/z/Mxbf6W

tahonermann commented 3 years ago

I've found reports online that MSVC doesn't perform TBAA (type-based alias analysis) at all.

I agree the gcc behavior feels like a bug. The relevant code is here and here.

tahonermann commented 3 years ago

In off-list discussion, Richard Smith noted that P0593R6 discusses a std::start_lifetime_as() function template that could be used to address this issue. This would require one call to produce an object of the alias type from the storage of an existing object, and then another call to transition the (possibly modified) storage back to the original object type.
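
A rough sketch of that two-call pattern, using the std::start_lifetime_as_array facility that was ultimately standardized in C++23 (assuming an implementation that provides it; process() is a hypothetical char8_t-based API):

#include <cstddef>
#include <memory>

void process(char8_t* s, std::size_t n);  // hypothetical char8_t-based API

void example(char* buf, std::size_t n) {
    // First call: produce char8_t objects occupying the existing char storage.
    char8_t* u8 = std::start_lifetime_as_array<char8_t>(buf, n);
    process(u8, n);
    // Second call: transition the (possibly modified) storage back to char.
    std::start_lifetime_as_array<char>(buf, n);
}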

tahonermann commented 4 months ago

This issue was discussed in the context of P2626R0 (charN_t incremental adoption: Casting pointers of UTF character types) during the 2024-05-22 SG16 meeting.

No polls were taken, but it is clear that we need to get a better understanding of core language limitations to make further progress on this issue.

pinskia commented 4 months ago

> With Richard's example code, ... only clang optimizes the unsigned short / char16_t case; icc, MSVC, and gcc do not. That failure to optimize feels like a bug. MSVC doesn't even optimize the long / int combination (but icc does). I've used "/Ox /Og /O2" for MSVC, knowing nothing about that compiler.

So, looking again at the GCC code, I see char8_t was handled here: https://github.com/gcc-mirror/gcc/commit/2d91f79dc990f81dcea89a5087cad566238b2456

But when char16_t was added (https://github.com/gcc-mirror/gcc/commit/c466b2cd136139e0e9fef6019fa6f136e23c7a4c), the same was not done. The current behavior is conservatively correct, but it misses the optimization. Let me file a bug: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115658