Open · wks opened this issue 9 years ago
Some hardware instructions mandate alignment of memory accesses. For example:

- `movaps` in SSE requires an address aligned to 128 bits (16 bytes); otherwise a hardware exception is raised (observed as a segmentation fault in the application).
- `VLDM` in ARMv7 Advanced SIMD and Floating-point instructions raises an Alignment fault when the address is not aligned to a Word (32 bits). This is the same as the size of `float`, but other instructions like `VADD` can work on units smaller than `int32` or `float`, e.g. `VADD.I16`.

Other instructions allow unaligned accesses, but with lower performance than aligned accesses. For example:

- `movups` in SSE has no alignment requirement, but has inferior performance compared to `movaps`.
- `VLD1`, `VLD2` and `VLD3` in ARMv7; all instructions in ARMv8 that are not LL/SC and are not accessing device memory.

Alignment challenges the µVM type system. The copying garbage collector also imposes a challenge.
Current µVM array types have the same alignment requirement as their elements. When loading a vector from an array of scalars (e.g. loading a `vector<float 4>` from an `array<float 100>`), it is uncertain whether the beginning of the array is aligned to 16 bytes. Even if it is, the copying garbage collector may move it to an unaligned address.
Things are even worse when an array can be moved during an iteration. Consider the following C program:
```c
float *f_ary = new float[100];
for (float *cur = f_ary; cur < f_ary + 100; cur += 4) {
    __m128 vec = _mm_load_ps(cur); // alignment required
    doSomething(vec);
    gc_yield_point(); // may move f_ary
}
```
The `gc_yield_point` may trigger a GC, and the GC can move the `f_ary` object. Even if `cur` points to an aligned location at the beginning, it may point to an unaligned location in the next iteration. So the code must be written as follows, which is ugly:
```c
float *f_ary = new float[100];
float *cur = f_ary;
while (cur < f_ary + 100) {
    while (!isaligned(cur) && cur < f_ary + 100) {
        doScalarOperationOn(cur);
        cur += 1;
    }
    if (cur + 4 <= f_ary + 100) {
        __m128 vec = _mm_load_ps(cur); // alignment required
        doSomething(vec);
        cur += 4;
        gc_yield_point(); // may move f_ary
    } else {
        break;
    }
}
while (cur < f_ary + 100) {
    doScalarOperationOn(cur);
    cur += 1;
}
```
1. If the client wants aligned access to an array, it should use an array of vectors.
The vector type, written as `vector<T len>`, will always be aligned to the requirement of the platform. For example, on x86_64, `vector<float 4>` will be 16-byte aligned. An array or a hybrid of such vectors, i.e. `array<vector<float 4> 100>` (`__m128[100]` in C notation), has the same alignment.

Using the proposed "prefix rule", a vector array can be used as a scalar array, and compatibility with traditional scalar array accesses can be maintained.
The following code is always aligned for SSE loads and stores:
```
.typedef @LengthType = int<32>
.typedef @JavaArrayHeader = struct <@TIBRef @Lock @LengthType>
.typedef @PackedFloat = vector <float 4>
.typedef @JavaFloatArray = hybrid <@JavaArrayHeader @PackedFloat>

// float[] fAry = new float[98]
%fAry = allocHybrid <@JavaFloatArray> 25 // allocate 25 packed float vectors; the last two elements are redundant
%fAry.length = ... // GETFIXEDPARTIREF followed by GETFIELDIREF
STORE <@LengthType> %fAry.length 98

// fAry[10] = 3.14f
%fAry.var = GETVARPARTIREF <@JavaFloatArray> %fAry // iref<@PackedFloat>
%fAry.0 = REFCAST <iref<@PackedFloat> iref<float>> %fAry.var // cast a reference to vector into a reference to scalar float
%fAry.10 = SHIFTIREF <float> %fAry.0 10
STORE <float> %fAry.10 3.14f

// __m128 firstFourElems = _mm_load_ps(fAry)
%fAry.var = GETVARPARTIREF <@JavaFloatArray> %fAry // iref<@PackedFloat>
%firstFourElems = LOAD <@PackedFloat> %fAry.var // always aligned
```
Note the last line: a LOAD from a `@PackedFloat` internal reference is guaranteed to be aligned.
2. Unaligned accesses are allowed, but aligned accesses are encouraged.
The LOAD and STORE instructions optionally take an `UNALIGNED` flag, in which case unaligned accesses are allowed. So there are:

- `LOAD ALIGNED <vector<float 4>> %someLocation`: must be aligned, i.e. load from an actual `vector<float 4>` vector. Otherwise it is undefined behaviour.
- `LOAD UNALIGNED <vector<float 4>> %someLocation`: may be unaligned. Can load from a plain `array<float SIZE>` array.
- `LOAD <vector<float 4>> %someLocation`: this implies `ALIGNED`.

If a platform only has unaligned vector load/store, then `ALIGNED` will be implemented as `UNALIGNED`; but if there is no unaligned vector load/store, it will be an error to use `UNALIGNED`.
Here is a C program that may cause a segmentation fault because of an unaligned vector access:
```c
#include <stdio.h>
#include <stdlib.h>
#include <immintrin.h>

struct MyArray { // When allocated, the beginning is 16-byte aligned
    int length;
    float elems[]; // Not 16-byte aligned
};

int main(int argc, char **argv) {
    __m128 a = _mm_set_ps(1.0f, 2.0f, 3.0f, 4.0f);
    __m128 b = _mm_set_ps(5.0f, 6.0f, 7.0f, 8.0f);
    __m128 c = _mm_add_ps(a, b);
    struct MyArray *ary = (struct MyArray*)malloc(sizeof(struct MyArray) + 42 * sizeof(float));
    printf("&ary->elems == %p\n", (void*)ary->elems);
    // _mm_store_ps(ary->elems, c); // segmentation fault
    _mm_storeu_ps(ary->elems, c);
    int i;
    for (i = 0; i < 4; i++) {
        printf("elems[%d] = %f\n", i, ary->elems[i]);
    }
    free(ary);
    return 0;
}
```
The following program is safe.
```c
#include <stdio.h>
#include <stdlib.h>
#include <immintrin.h>

struct MyArray { // When allocated, the beginning is 16-byte aligned
    int length;
    __m128 elems[]; // Also aligned
};

int main(int argc, char **argv) {
    __m128 a = _mm_set_ps(1.0f, 2.0f, 3.0f, 4.0f);
    __m128 b = _mm_set_ps(5.0f, 6.0f, 7.0f, 8.0f);
    __m128 c = _mm_add_ps(a, b);
    struct MyArray *ary = (struct MyArray*)malloc(sizeof(struct MyArray) + 11 * sizeof(__m128));
    printf("&ary->elems == %p\n", (void*)ary->elems);
    _mm_store_ps((float*)ary->elems, c); // safe
    int i;
    for (i = 0; i < 4; i++) {
        printf("elems[%d] = %f\n", i, ((float*)ary->elems)[i]); // load as scalar
    }
    free(ary);
    return 0;
}
```
The current BinOp instructions (ADD, SUB, MUL, UDIV, SDIV, UREM, SREM, SHL, LSHR, ASHR, AND, OR, XOR) cover operations that are common in CPUs, but hardware may provide extra operations (vector or scalar) not covered by the above. Extra intrinsic functions should be provided to use the operations provided by the hardware. Examples include reciprocal, exp, log, abs, ...
The new ICALL instruction takes a more general form:
```
ICALL ifuncname <type_params> (variable_params) EXC nor_dest exc_dest KEEPALIVE (keep_alive_variables)
```
The type parameters, the variable parameters, the exceptional clause (`EXC nor_dest exc_dest`) and the keep-alive clause are all optional. In the simplest form, only the intrinsic function name is needed.
```
ICALL @uvm.thread_exit // perhaps the simplest intrinsic function: no type or variable parameters
ICALL @uvm.new_thread (%some_stack) // takes only a variable parameter
%result = ICALL @uvm.math.sqrt <float> (%some_num) // the type parameter indicates that this is a scalar operation
%result2 = ICALL @uvm.math.sqrt <vector<float 4>> (%some_vec) // the type parameter indicates that this is a vector operation

// Assume integer division is an intrinsic function (theoretically any instruction can be an intrinsic function).
// The %exc destination will handle the divide-by-zero error.
%result3 = ICALL @uvm.math.udiv <int<64>> (%lhs %rhs) EXC %nor %exc
```
Some modern processors provide SIMD instructions. Using them properly can greatly increase the performance of some computations. The µVM should expose them to the user.

LLVM's approach:

Things to be done in the µVM:

- The type system and the main instruction set are aware of vectors, as documented in microvm-spec wiki revision e6d22bfd282e4afba5478b8deaa006517533580a.
- More "common instructions" (previously called "intrinsic functions") are needed for extra math operations not covered by the `BinOp` µVM instruction. Examples: reciprocal, exp, log, abs, ...