microvm / microvm-meta

We have moved: https://gitlab.anu.edu.au/mu/general-issue-tracker

Vector operations #9

Open wks opened 9 years ago

wks commented 9 years ago

Some modern processors provide SIMD instructions. Using them properly can greatly increase the performance of some computations. The µVM should expose them to the user.
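
For concreteness, here is a minimal C sketch (not part of the original issue; the function names are illustrative) of the kind of computation that benefits: the scalar version adds four floats one at a time, while the SSE version adds all four lanes with a single instruction.

#include <xmmintrin.h>    // SSE intrinsics

void add4_scalar(float *dst, const float *a, const float *b) {
    for (int i = 0; i < 4; i++)
        dst[i] = a[i] + b[i];                 // four separate additions
}

void add4_simd(float *dst, const float *a, const float *b) {
    __m128 va = _mm_loadu_ps(a);              // load four floats (no alignment requirement)
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(dst, _mm_add_ps(va, vb));   // one ADDPS adds all four lanes
}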

LLVM's approach:

Things to be done in the µVM

wks commented 9 years ago

Alignment

Abstract

  1. The copying GC and the µVM type system challenge vector memory accesses that have alignment requirements.
  2. If the Client cares about performance, it should use "array of vectors".
  3. Unaligned vector memory accesses are provided for convenience at the cost of performance.

Hardware instructions

Some hardware instructions mandate alignment of memory access. For example:

Some hardware instructions allow unaligned accesses, but they perform worse than their aligned counterparts. For example:
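
As a concrete illustration (not necessarily the examples the original lists contained): on x86 with SSE, the aligned load (MOVAPS) faults if the address is not 16-byte aligned, while the unaligned load (MOVUPS) accepts any address at a potential performance cost.

#include <xmmintrin.h>

__m128 load_demo(const float *p) {
    __m128 a = _mm_load_ps(p);     // MOVAPS: requires p to be 16-byte aligned, faults otherwise
    __m128 u = _mm_loadu_ps(p);    // MOVUPS: accepts any p, but may be slower than an aligned load
    return _mm_add_ps(a, u);
}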

µVM and the copying GC

Alignment challenges the µVM type system. The copying garbage collector also imposes a challenge.

Current µVM array types have the same alignment requirement as their elements. When loading a vector from an array of scalars (e.g. loading a vector <float 4> from an array<float 100>), it is uncertain whether the beginning of the array is aligned to 16 bytes. Even if it is, the copying garbage collector may move it to an unaligned address.

The situation is worse when an array can be moved during an iteration. Consider the following C++-like pseudocode:

float *f_ary = new float[100];

for(float *cur = f_ary; cur < f_ary + 100; cur += 4) {
    __m128 vec = _mm_load_ps(cur);    // alignment required
    doSomething(vec);
    gc_yield_point();    // may move f_ary
}

The gc_yield_point may trigger a GC, and the GC can move the f_ary object. Even if cur points to an aligned location at the beginning, it may point to an unaligned location in the next iteration. So the code must be written as follows, which is ugly:

float *f_ary = new float[100];

float *cur = f_ary;
while (cur < f_ary + 100) {
    // Process scalars until cur reaches a 16-byte aligned address
    // (the alignment may change after every GC).
    while (cur < f_ary + 100 && !isaligned(cur)) {
        doScalarOperationOn(cur);
        cur += 1;
    }
    // Stop vectorising when fewer than four elements remain.
    if (cur > f_ary + 100 - 4) {
        break;
    }
    __m128 vec = _mm_load_ps(cur);    // alignment required
    doSomething(vec);
    cur += 4;
    gc_yield_point();    // may move f_ary and change the alignment of cur
}
// Finish the remaining elements as scalars.
while (cur < f_ary + 100) {
    doScalarOperationOn(cur);
    cur += 1;
}

The proposal

1. If the Client wants aligned access to an array, it should use an array of vectors.

The vector type, written as vector <T len>, will always be aligned as the platform requires. For example, on x86_64, vector <float 4> will be 16-byte aligned. An array or a hybrid of such vectors, e.g. array < vector <float 4> 100> (__m128[100] in C notation), has the same alignment.
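
The same property can be observed in C (a quick check, not from the original comment): __m128 is 16-byte aligned, and an array of __m128 inherits that alignment.

#include <stdio.h>
#include <xmmintrin.h>

int main(void) {
    printf("alignof(__m128)      = %zu\n", _Alignof(__m128));        // 16 on x86_64
    printf("alignof(__m128[100]) = %zu\n", _Alignof(__m128[100]));   // also 16: an array is as aligned as its elements
    return 0;
}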

Using the proposed "prefix rule", a vector array can be used as a scalar array and compatibility with traditional scalar array accesses can be maintained.

The following code is always aligned for SSE loads and stores:

.typedef @LengthType = int<32>
.typedef @JavaArrayHeader = struct <@TIBRef @Lock @LengthType>
.typedef @PackedFloat = vector <float 4>
.typedef @JavaFloatArray = hybrid <@JavaArrayHeader @PackedFloat>

// float[] fAry = new float[98]
%fAry = allocHybrid <@JavaFloatArray> 25    // allocate 25 packed float vectors. It allocates two redundant elements.
%fAry.length = ... // GETFIXEDPARTIREF followed by GETFIELDIREF
STORE <@LengthType> %fAry.length 98

// fAry[10] = 3.14f
%fAry.var = GETVARPARTIREF <@JavaFloatArray> %fAry    // iref<@PackedFloat>
%fAry.0 = REFCAST <iref<@PackedFloat> iref<float>> %fAry.var // cast a reference to vector to a reference to scalar float
%fAry.10 = SHIFTIREF <float> %fAry.0 10
STORE <float> %fAry.10 3.14f

// __m128 firstFourElems = _mm_load_ps(fAry)
%fAry.var = GETVARPARTIREF <@JavaFloatArray> %fAry    // iref<@PackedFloat>
%firstFourElems = LOAD <@PackedFloat> %fAry.var     // always aligned

Note the last line. The LOAD from a @PackedFloat internal reference is guaranteed to be aligned.

2. Unaligned accesses are allowed, but aligned accesses are encouraged.

The LOAD and STORE instructions optionally take an UNALIGNED flag, in which case unaligned accesses are allowed. So there are:

If a platform only has unaligned vector load/store, then ALIGNED will be implemented as UNALIGNED; but if there is no unaligned vector load/store, it will be an error to use UNALIGNED.
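
A rough sketch (hypothetical helper names, not part of the proposal text) of how an x86/SSE backend might lower the two flavours of a vector <float 4> LOAD:

#include <xmmintrin.h>

// ALIGNED LOAD: the address is guaranteed to be 16-byte aligned, so MOVAPS can be emitted.
__m128 lower_load_aligned(const float *addr) {
    return _mm_load_ps(addr);
}

// UNALIGNED LOAD: MOVUPS accepts any address. On a platform with only unaligned
// vector loads, the ALIGNED case would also be lowered to this.
__m128 lower_load_unaligned(const float *addr) {
    return _mm_loadu_ps(addr);
}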

wks commented 9 years ago

Here is a C program that may cause a segmentation fault because of an unaligned vector access:

#include <stdio.h>
#include <stdlib.h>
#include <immintrin.h>

struct MyArray {    // When allocated, the beginning is 16 bytes aligned
    int length;
    float elems[];  // Not 16 bytes aligned
};

int main(int argc, char **argv) {
    __m128 a = _mm_set_ps(1.0f, 2.0f, 3.0f, 4.0f);
    __m128 b = _mm_set_ps(5.0f, 6.0f, 7.0f, 8.0f);

    __m128 c = _mm_add_ps(a, b);

    struct MyArray *ary = (struct MyArray*)malloc(sizeof(struct MyArray) + 42 * sizeof(float));

    printf("&ary->elems == %p\n", ary->elems);

    // _mm_store_ps(ary->elems, c); // segmentation fault
    _mm_storeu_ps(ary->elems, c);

    int i;

    for(i=0; i<4; i++) {
        printf("elems[%d] = %f\n", i, ary->elems[i]);
    }

    free(ary);

    return 0;
}

The following program is safe.

#include <stdio.h>
#include <stdlib.h>
#include <immintrin.h>

struct MyArray {    // When allocated, the beginning is 16 bytes aligned
    int length;
    __m128 elems[];  // Also aligned
};

int main(int argc, char **argv) {
    __m128 a = _mm_set_ps(1.0f, 2.0f, 3.0f, 4.0f);
    __m128 b = _mm_set_ps(5.0f, 6.0f, 7.0f, 8.0f);

    __m128 c = _mm_add_ps(a, b);

    struct MyArray *ary = (struct MyArray*)malloc(sizeof(struct MyArray) + 11 * sizeof(__m128));

    printf("&ary->elems == %p\n", (float*)ary->elems);

    _mm_store_ps((float*)ary->elems, c); // safe

    int i;

    for(i=0; i<4; i++) {
        printf("elems[%d] = %f\n", i, ((float*)ary->elems)[i]); // load as scalar
    }

    free(ary);

    return 0;
}

wks commented 9 years ago

Intrinsic functions

Extra intrinsic functions

The current BinOp instructions (ADD, SUB, MUL, UDIV, SDIV, UREM, SREM, SHL, LSHR, ASHR, AND, OR and XOR) cover operations that are common across CPUs, but hardware may provide extra operations (vector or scalar) not covered by them. Extra intrinsic functions should be provided to expose the operations the hardware offers. Examples include:
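
As one illustration (the original list of examples is not reproduced here): square root is a single hardware instruction on SSE (SQRTPS) but is not expressible with the BinOp set above, so it would need an intrinsic such as the @uvm.math.sqrt used later in this thread.

#include <xmmintrin.h>

// One SQRTPS instruction computes four square roots; no BinOp covers this.
__m128 sqrt4(__m128 v) {
    return _mm_sqrt_ps(v);
}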

Modified intrinsic function call instruction

The new ICALL instruction will take a more general form:

ICALL ifunc_name <type_params> (variable_params) EXC nor_dest exc_dest KEEPALIVE (keep_alive_variables)

All of type_params, variable_params, the exceptional clause (EXC nor_dest exc_dest) and the keep-alive clause are optional. In the simplest form, only the intrinsic function name is needed.

ICALL @uvm.thread_exit        // perhaps the simplest intrinsic function. No type or variable parameters.

ICALL @uvm.new_thread(%some_stack)    // takes only a variable parameter

%result = ICALL @uvm.math.sqrt <float> (%some_num)    // The type parameter indicates that this is a scalar operation
%result2 = ICALL @uvm.math.sqrt <vector<float 4>> (%some_vec)    // The type parameter indicates that this is a vector operation

// Assume integer division is an intrinsic function (theoretically any instruction can be an intrinsic function).
// The %exc destination will handle the divide-by-zero error.
%result = ICALL @uvm.math.udiv <int<64>> (%lhs %rhs) EXC %nor %exc

wks commented 9 years ago

The type system and the main instruction set are aware of vectors, as documented in microvm-spec wiki revision e6d22bfd282e4afba5478b8deaa006517533580a.

More "common instructions" (previously called "intrinsic functions") are needed for extra math operations.