The 32-bit assembly code for UNIX-like systems in machdep.h is not generated when -march=native is passed to gcc while compiling for a 32-bit target system. Hercules normally includes -march=native. The fallback is C code that does not appear to include locking.
Background
The 32-bit assembly code in machdep.h is generated for UNIX-like systems only when any of the following processor-specific preprocessor macros are defined.
The code was added to machdep.h about ten years ago and has not been changed much.
When Hercules is compiled for a 32-bit target UNIX-like system (-m32 specified or defaulted) and with -march=native, none of the needed preprocessor macros are defined by gcc, or by clang for that matter. The option -march=native became a default for 32-bit and 64-bit builds of Hercules about five years ago. The CMake scripts continue this default.
Reading gcc documentation, -march=native sounds like an excellent default. And having read the gcc code that probes the processor for -march=native, gcc does a very thorough job of identifying x86 processor capabilities. The same gcc probe code is used for -mtune=native, but fewer of the results are used by gcc to generate machine code.
The coding in machdep.h for 64-bit UNIX-like targets and for all Windows targets is not affected by this issue.
Proposed Change
Use gcc/clang atomic intrinsics in machdep.h where such intrinsics are supported by the compiler in use. These were first documented in gcc 4.1.2 and should be included in clang, which reports itself as compatible with gcc 4.2.1. The intrinsics are documented here (gcc 4.1.2):
https://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Atomic-Builtins.html#Atomic-Builtins
Advantages:
The compiler generates the assembler code, not the Hercules developer. Code generated by gcc for processors that support atomic operations is comparable to the in-line assembly code currently in machdep.h.
The same coding is used for 32-bit (ia32) and 64-bit (amd64) x86 systems, and can likely be used for PowerPC and ARM.
The issue is addressed: efficient locking code is generated for modern 32-bit processors.
Disadvantages:
If intrinsics are used in a compilation targeting a processor that does not support atomic operations, gcc emits a function call instead. It is not known whether that function includes any form of locking, nor how efficient that code is.
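One way to detect this case at compile time is sketched here. gcc documents the predefined macros __GCC_HAVE_SYNC_COMPARE_AND_SWAP_1, _2, _4, _8 and _16, which are defined when the target processor supports atomic compare and swap on operands of that size; clang defines them as well, but gcc releases as old as 4.1.2 may not. The HAVE_INLINE_CAS4 and HAVE_INLINE_CAS8 names and the cas_support helper are hypothetical and do not come from Hercules.

```c
#include <stdint.h>

/*
 * When __GCC_HAVE_SYNC_COMPARE_AND_SWAP_n is defined, the n-byte __sync
 * compare-and-swap builtin is supported natively by the selected target.
 * When it is absent, the compiler is expected to emit a call to an
 * external __sync_* routine instead (assumption for older compilers that
 * predate these macros: absence then proves nothing).
 */
#if defined(__GCC_HAVE_SYNC_COMPARE_AND_SWAP_4)
#define HAVE_INLINE_CAS4 1   /* hypothetical name, not from Hercules */
#else
#define HAVE_INLINE_CAS4 0
#endif

#if defined(__GCC_HAVE_SYNC_COMPARE_AND_SWAP_8)
#define HAVE_INLINE_CAS8 1   /* hypothetical name, not from Hercules */
#else
#define HAVE_INLINE_CAS8 0
#endif

/* Short description suitable for a build-time diagnostic message */
static const char *cas_support(int have_inline)
{
    return have_inline ? "inline" : "library call";
}
```

With a check like this, machdep.h could decline to use the intrinsic for a given operand size and select another implementation instead of relying on a function whose locking behavior is unknown.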
One Clang Oddity
When clang compiles with -march=i386, clang still generates locking assembly code. I do not understand this; cmpxchg did not appear until the i486, and cmpxchg8b did not appear until the Pentium.
Changes to CMake build scripts:
1) Test for the ability of the C compiler to use the atomic intrinsic for compare and swap. If the compiler accepts the atomic intrinsic, set a preprocessor macro to indicate this.
2) If the atomic intrinsic is rejected, run a compile that uses the current in-line assembly code appropriate to the bitness of the target system. If this is accepted, set a preprocessor macro to indicate this.
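The source for the first test might look like the following sketch. It assumes the check both compiles and links the probe: when the target lacks a native 8-byte compare and swap, gcc can accept the intrinsic and emit a call to an external routine, which may only fail at link time. The function name probe_atomic_cas is hypothetical; a build-system check would call it from main().

```c
#include <stdint.h>

/*
 * Hypothetical probe for a build-system check.
 * Returns 0 when 4-byte and 8-byte compare-and-swap behave as expected.
 */
int probe_atomic_cas(void)
{
    volatile uint32_t word = 1;
    volatile uint64_t dword = 1;

    /* Both swaps should succeed because the expected values match */
    if (!__sync_bool_compare_and_swap(&word, 1u, 2u))
        return 1;
    if (!__sync_bool_compare_and_swap(&dword, (uint64_t)1, (uint64_t)2))
        return 1;

    /* This swap should fail: word now holds 2, not 1 */
    if (__sync_bool_compare_and_swap(&word, 1u, 3u))
        return 1;

    return (word == 2 && dword == 2) ? 0 : 1;
}
```

When cross-compiling, the probe cannot be run on the build host, so a successful link is probably the strongest result the check can report.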
Changes to GNU-Autotools scripts (configure.ac)
1) Test the gcc compiler version for 4.1.2 or later. If true, then set a preprocessor macro to indicate the availability of atomic intrinsics.
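The version test can also be expressed with the preprocessor, as in this sketch. It relies on __GNUC__, __GNUC_MINOR__ and __GNUC_PATCHLEVEL__, which gcc defines and which clang also defines (as 4, 2 and 1). VERSION_NUM, GCC_VERSION_NUM and HAVE_ATOMIC_INTRINSICS are hypothetical names.

```c
/* Fold major.minor.patch into one comparable integer, e.g. 4.1.2 -> 40102 */
#define VERSION_NUM(major, minor, patch) \
    ((major) * 10000 + (minor) * 100 + (patch))

#if defined(__GNUC__)
#define GCC_VERSION_NUM \
    VERSION_NUM(__GNUC__, __GNUC_MINOR__, __GNUC_PATCHLEVEL__)
#else
#define GCC_VERSION_NUM 0    /* not a gcc-compatible compiler */
#endif

/* Hypothetical macro: set when the __sync builtins should be accepted */
#if GCC_VERSION_NUM >= VERSION_NUM(4, 1, 2)
#define HAVE_ATOMIC_INTRINSICS 1
#else
#define HAVE_ATOMIC_INTRINSICS 0
#endif
```

A version test only shows that the compiler accepts the builtins. It does not show that the selected target implements them inline, which is what the compile test in the CMake scripts would establish.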
Changes to machdep.h coding for UNIX-like targets
1) Test for the availability of atomic intrinsics. If available, use atomic intrinsics for both 32-bit and 64-bit targets. It may be possible to replace the current static inline functions with a #define for each type requiring locked access.
2) If atomic intrinsics are not available but the compiler accepted the in-line assembler code, use the assembler code.
3) Fall back to C code.
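The three-step selection could be laid out roughly as follows. This is a sketch under stated assumptions, not the actual machdep.h code: HAVE_ATOMIC_INTRINSICS and HAVE_INLINE_ASM_CAS stand in for whatever macros the build scripts set, cmpxchg4_sketch is a hypothetical name, and the return convention (0 when the swap happened, 1 otherwise, with *old updated on failure) is assumed and should be checked against the existing functions.

```c
#include <stdint.h>

#if defined(HAVE_ATOMIC_INTRINSICS)

/* 1) Compiler intrinsics: same source for 32-bit and 64-bit targets */
static inline int cmpxchg4_sketch(uint32_t *old, uint32_t new_val,
                                  volatile uint32_t *ptr)
{
    uint32_t seen = __sync_val_compare_and_swap(ptr, *old, new_val);
    if (seen == *old)
        return 0;          /* swap performed */
    *old = seen;           /* report the value actually found */
    return 1;
}

#elif defined(HAVE_INLINE_ASM_CAS)

/* 2) The existing in-line assembler would be selected here */
#error "in-line assembler variant omitted from this sketch"

#else

/* 3) Plain C fallback: NOT atomic, shown only to complete the chain */
static inline int cmpxchg4_sketch(uint32_t *old, uint32_t new_val,
                                  volatile uint32_t *ptr)
{
    if (*ptr == *old) {
        *ptr = new_val;
        return 0;
    }
    *old = *ptr;
    return 1;
}

#endif
```

Compiling with -DHAVE_ATOMIC_INTRINSICS selects the first branch for a quick experiment. Whether the static inline functions can become plain #define macros depends on how callers use the updated *old value.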
Additional information
To view the preprocessor macros defined by gcc or clang that are relevant to this discussion, use the following command line. If you don't wish to create foo.h, use /dev/null as the input.
Adjust -m32 and -march= as you see fit.
Sample program to view generated code
The following program can be used to verify that the assembly code generated by the atomic intrinsics is very similar to that included in machdep.h.
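A minimal sketch of such a program, assuming a gcc-compatible compiler, is shown here. The function names cas4, cas8 and or_byte and the KEEP macro are illustrative and are not taken from Hercules. The noinline attribute keeps each function as a separate symbol so it is easy to locate in the assembler listing.

```c
#include <stdint.h>

/* Keep each function out of line so it appears under its own label */
#define KEEP __attribute__((noinline))

/* 4-byte compare and swap; returns the value found at *ptr */
KEEP uint32_t cas4(volatile uint32_t *ptr, uint32_t expected, uint32_t desired)
{
    return __sync_val_compare_and_swap(ptr, expected, desired);
}

/* 8-byte compare and swap; on 32-bit x86 this needs cmpxchg8b */
KEEP uint64_t cas8(volatile uint64_t *ptr, uint64_t expected, uint64_t desired)
{
    return __sync_val_compare_and_swap(ptr, expected, desired);
}

/* Atomic OR into a byte; returns the value held before the OR */
KEEP uint8_t or_byte(volatile uint8_t *ptr, uint8_t bits)
{
    return __sync_fetch_and_or(ptr, bits);
}
```

Compiling with something like gcc -m32 -march=native -O2 -S and reading the resulting .s file should show lock-prefixed instructions in each function when the target supports them, or calls to external __sync_* routines when it does not.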