Open · wks opened this issue 9 years ago
Some hardware instructions mandate alignment of memory accesses. For example:

- `movaps` in SSE requires an address aligned to 128 bits (16 bytes); otherwise a hardware exception is raised (observed as a segmentation fault in the application).
- `VLDM` in ARMv7 Advanced SIMD and Floating-point instructions raises an Alignment fault when the address is not aligned to a Word (32 bits). This is the same as the size of `float`, but other instructions like `VADD` can work on units smaller than `int32` or `float`, e.g. `VADD.I16`.

Other instructions allow unaligned accesses, but with lower performance than aligned accesses. For example:

- `movups` in SSE has no alignment requirement, but has inferior performance compared to `movaps`.
- `VLD1`, `VLD2` and `VLD3` in ARMv7; all instructions in ARMv8 that are not LL/SC and are not accessing device memory.

Alignment challenges the µVM type system. The copying garbage collector also imposes a challenge.
Current µVM array types have the same alignment requirement as their elements. When loading a vector from an array of scalars (e.g. loading a `vector<float 4>` from an `array<float 100>`), it is uncertain whether the beginning of the array is aligned to 16 bytes. Even if it is, the copying garbage collector may move it to an unaligned address.
Things are even worse when an array can be moved during an iteration. Consider the following C program:
```c
float *f_ary = new float[100];
for (float *cur = f_ary; cur < f_ary + 100; cur += 4) {
    __m128 vec = _mm_load_ps(cur); // alignment required
    doSomething(vec);
    gc_yield_point(); // may move f_ary
}
```
The `gc_yield_point` may trigger a GC, and the GC can move the `f_ary` object. Even if `cur` points to an aligned location at the beginning, it may point to an unaligned location in the next iteration. So the code must be written as follows, which is ugly:
```c
float *f_ary = new float[100];
float *cur = f_ary;
while (cur < f_ary + 100) {
    while (!isaligned(cur) && cur < f_ary + 100) {
        doScalarOperationOn(cur);
        cur += 1;
    }
    if (cur + 4 <= f_ary + 100) {
        __m128 vec = _mm_load_ps(cur); // alignment required
        doSomething(vec);
        cur += 4;
        gc_yield_point(); // may move f_ary
    } else {
        break;
    }
}
while (cur < f_ary + 100) {
    doScalarOperationOn(cur);
    cur += 1;
}
```
1. If the client wants aligned access to an array, it should use an array of vectors.
The vector type, written as `vector<T len>`, will always be aligned to the requirement of the platform. For example, on x86_64, `vector<float 4>` will be 16-byte aligned. An array or a hybrid of such vectors, i.e. `array<vector<float 4> 100>` (`__m128[100]` in C notation), has the same alignment.

Using the proposed "prefix rule", a vector array can be used as a scalar array, and compatibility with traditional scalar array accesses can be maintained.
The following code is always aligned for SSE loads and stores:
```
.typedef @LengthType = int<32>
.typedef @JavaArrayHeader = struct <@TIBRef @Lock @LengthType>
.typedef @PackedFloat = vector <float 4>
.typedef @JavaFloatArray = hybrid <@JavaArrayHeader @PackedFloat>

// float[] fAry = new float[98]
%fAry = allocHybrid <@JavaFloatArray> 25 // allocate 25 packed float vectors; the last two elements are redundant
%fAry.length = ... // GETFIXEDPARTIREF followed by GETFIELDIREF
STORE <@LengthType> %fAry.length 98

// fAry[10] = 3.14f
%fAry.var = GETVARPARTIREF <@JavaFloatArray> %fAry // iref<@PackedFloat>
%fAry.0 = REFCAST <iref<@PackedFloat> iref<float>> %fAry.var // cast a reference to vector into a reference to scalar float
%fAry.10 = SHIFTIREF <float> %fAry.0 10
STORE <float> %fAry.10 3.14f

// __m128 firstFourElems = _mm_load_ps(fAry)
%fAry.var = GETVARPARTIREF <@JavaFloatArray> %fAry // iref<@PackedFloat>
%firstFourElems = LOAD <@PackedFloat> %fAry.var // always aligned
```
Note the last line: a LOAD from a `@PackedFloat` internal reference is guaranteed to be aligned.
2. Unaligned accesses are allowed, but aligned accesses are encouraged.
The LOAD and STORE instructions optionally take an `UNALIGNED` flag, in which case unaligned accesses are allowed. So there are:

- `LOAD ALIGNED <vector<float 4>> %someLocation`: must be aligned, i.e. load from an actual `vector<float 4>` vector. Otherwise it is undefined behaviour.
- `LOAD UNALIGNED <vector<float 4>> %someLocation`: may be unaligned. Can load from a plain `array<float SIZE>` array.
- `LOAD <vector<float 4>> %someLocation`: this implies `ALIGNED`.

If a platform only has unaligned vector load/store, then `ALIGNED` will be implemented as `UNALIGNED`; but if there is no unaligned vector load/store, it will be an error to use `UNALIGNED`.
Here is a C program that may cause a segmentation fault because of an unaligned vector access:
```c
#include <stdio.h>
#include <stdlib.h>
#include <immintrin.h>

struct MyArray { // When allocated, the beginning is 16-byte aligned
    int length;
    float elems[]; // Not 16-byte aligned
};

int main(int argc, char **argv) {
    __m128 a = _mm_set_ps(1.0f, 2.0f, 3.0f, 4.0f);
    __m128 b = _mm_set_ps(5.0f, 6.0f, 7.0f, 8.0f);
    __m128 c = _mm_add_ps(a, b);
    struct MyArray *ary = (struct MyArray*)malloc(sizeof(struct MyArray) + 42 * sizeof(float));
    printf("&ary->elems == %p\n", (void*)ary->elems);
    // _mm_store_ps(ary->elems, c); // segmentation fault
    _mm_storeu_ps(ary->elems, c);
    int i;
    for (i = 0; i < 4; i++) {
        printf("elems[%d] = %f\n", i, ary->elems[i]);
    }
    free(ary);
    return 0;
}
```
The following program is safe.
```c
#include <stdio.h>
#include <stdlib.h>
#include <immintrin.h>

struct MyArray { // When allocated, the beginning is 16-byte aligned
    int length;
    __m128 elems[]; // Also aligned
};

int main(int argc, char **argv) {
    __m128 a = _mm_set_ps(1.0f, 2.0f, 3.0f, 4.0f);
    __m128 b = _mm_set_ps(5.0f, 6.0f, 7.0f, 8.0f);
    __m128 c = _mm_add_ps(a, b);
    struct MyArray *ary = (struct MyArray*)malloc(sizeof(struct MyArray) + 11 * sizeof(__m128));
    printf("&ary->elems == %p\n", (void*)ary->elems);
    _mm_store_ps((float*)ary->elems, c); // safe
    int i;
    for (i = 0; i < 4; i++) {
        printf("elems[%d] = %f\n", i, ((float*)ary->elems)[i]); // load as scalar
    }
    free(ary);
    return 0;
}
```
The current BinOp instructions (ADD, SUB, MUL, UDIV, SDIV, UREM, SREM, SHL, LSHR, ASHR, AND, OR, XOR) cover operations that are common in CPUs, but hardware may provide extra operations (vector or scalar) not covered by the above. Extra intrinsic functions should be provided to use the operations provided by the hardware. Examples include reciprocal, exp, log, abs, ...
The new ICALL instruction takes a more general form:
```
ICALL ifuncname <type_params> (variable_params) EXC nor_dest exc_dest KEEPALIVE (keep_alive_variables)
```
The type parameters, the variable parameters, the exceptional clause (`EXC nor_dest exc_dest`) and the keep-alive clause are all optional. In the simplest form, only the intrinsic function name is needed.
```
ICALL @uvm.thread_exit // perhaps the simplest intrinsic function: no type or variable parameters
ICALL @uvm.new_thread (%some_stack) // takes only a variable parameter
%result = ICALL @uvm.math.sqrt <float> (%some_num) // the type parameter indicates that this is a scalar operation
%result2 = ICALL @uvm.math.sqrt <vector<float 4>> (%some_vec) // the type parameter indicates that this is a vector operation

// Assume integer division is an intrinsic function (theoretically any instruction can be an intrinsic function).
// The %exc destination will handle the divide-by-zero error.
%result3 = ICALL @uvm.math.udiv <int<64>> (%lhs %rhs) EXC %nor %exc
```
Some modern processors provide SIMD instructions. Using them properly can greatly increase the performance of some computations. The µVM should expose them to the user.

LLVM's approach:

Things to be done in the µVM:

- The type system and the main instruction set are aware of vectors, as documented in microvm-spec wiki revision e6d22bfd282e4afba5478b8deaa006517533580a.
- More "common instructions" (previously called "intrinsic functions") are needed for extra math operations not covered by the `BinOp` µVM instruction. Examples: reciprocal, exp, log, abs, ...