oracle / graal

GraalVM compiles Java applications into native executables that start instantly, scale fast, and use fewer compute resources 🚀
https://www.graalvm.org
Other
20.32k stars 1.63k forks source link

Feature request: low level math api #1807

Open pekd opened 4 years ago

pekd commented 4 years ago

Right now there is the ExactMath class which contains methods for signed/unsigned long multiplication. It would be extremely useful to have similar methods for long signed/unsigned division as well as for add/subtract with carry, carry/overflow computations, and rotate left/right (with and without carry). Float computations with the 5 different IEEE rounding modes, handling of subnormals, float exceptions (overflow/underflow/inexact/invalid/…), handling of qNaN vs sNaN, … should also be in that API.

Of course it is possible to implement those functions in plain Java in every language interpreter that needs them, but it's quite tedious work, it's easy to get it wrong, and chances are that performance is suboptimal. Since a CPU implements those computations in hardware, Truffle could/should provide an API to directly use them.

Various low level language interpreters (e.g. Sulong, truffle86) also have to deal with vectors of fixed length (e.g. 128bit and 256bit vectors). Although vector instructions differ between architectures, it would be quite useful to have classes for the most common vector types (e.g. 128bit and 256bit vectors) with the most common operations (e.g. packed int/float add/sub/mul/div/…) in the Truffle API. That would also ensure that those operations really use the underlying vector instructions.

boris-spas commented 4 years ago

Tracking internally as Issue GR-20934.

chumer commented 4 years ago

Thank you for your suggestion!

I think we could add some more scalar signed/unsigned operations, if we have a concrete use-case for them. Could you describe exactly which new operations you would expect from ExactMath? Note that we typically only add new instrinsics if there is a benefit from it, that cannot be easily inferred automatically from the Java bytecodes. Also note that we intrinsify the java.lang.Math class. Some of the operations you mentioned can be found there.

A vector instructions API is a significantly bigger effort, and a controversial one. cc @dougxc For now we rely on automatic vectorization provided by the compiler. Again we would need to see which use-cases cannot be tracked automatically, but need to be informed by intrinsics. If you those, please share.

dougxc commented 4 years ago

Regarding a vector API, it's something we'd be very reluctant to provide as it breaks abstractions badly and locks code (performance) into a given CPU generation. For example, if you write 256bit vector code, what happens when 512bit or 1024bit vector instructions become available?

We might in the future provide some high-level API that does not tie you to concrete CPUs, but which uses code patterns that are recognized by our loop vectorizer and so should result in vectorized code.

pekd commented 4 years ago

This is roughly the IEEE float API that is not available in the JDK/Truffle right now:

public class IEEEException extends Exception { public final double result; /* … */ }
public class IEEEOverflowException extends IEEEException { /* … */ }
public class IEEEUnderflowException extends IEEEException { /* … */ }
public class IEEEInexactException extends IEEEException { /* … */ }
public class IEEEInvalidException extends IEEEException { /* … */ }
public class IEEEZeroDivideException extends IEEEException { /* … */ }

public enum IEEERoundingMode {
    NEAREST_TO_EVEN,
    NEAREST_AWAY_FROM_ZERO,
    TOWARD_ZERO,
    TOWARD_INF,
    TOWARD_MINUS_INF;
}

public interface IEEEMath {
    // IEEE 754 math operations

    // single precision
    public float add(IEEERoundingMode mode, float x, float y) throws IEEEException;
    public float sub(IEEERoundingMode mode, float x, float y) throws IEEEException;
    public float mul(IEEERoundingMode mode, float x, float y) throws IEEEException;
    public float div(IEEERoundingMode mode, float x, float y) throws IEEEException;
    public float rem(IEEERoundingMode mode, float x, float y) throws IEEEException;
    public float sqrt(IEEERoundingMode mode, float x, float y) throws IEEEException;
    public float fma(IEEERoundingMode mode, float x, float y, float z) throws IEEEException;
    public int cmp(float x, float y); // ordered
    public int cmpu(float x, float y); // unordered
    public boolean isQuietNaN(float x);
    public boolean isSignalingNaN(float x);

    // double precision
    public double add(IEEERoundingMode mode, double x, double y) throws IEEEException;
    public double sub(IEEERoundingMode mode, double x, double y) throws IEEEException;
    public double mul(IEEERoundingMode mode, double x, double y) throws IEEEException;
    public double div(IEEERoundingMode mode, double x, double y) throws IEEEException;
    public double rem(IEEERoundingMode mode, double x, double y) throws IEEEException;
    public double sqrt(IEEERoundingMode mode, double x, double y) throws IEEEException;
    public double fma(IEEERoundingMode mode, double x, double y, double z) throws IEEEException;
    public int cmp(double x, double y); // ordered
    public int cmpu(double x, double y); // unordered
    public boolean isQuietNaN(double x);
    public boolean isSignalingNaN(double x);
}

The problem here is that there is no way to specify rounding modes without using BigDecimal in the JDK, and you cannot get the different floating point exceptions either. For comparisons, I'm not sure if ordered vs unordered is optimized to the corresponding instructions if I add the necessary additional checks in a plain Java implementation; even then it might be much easier to just provide ordered/unordered comparison in Truffle so not everyone has to reinvent the wheel. Everyone who needs a complete implementation of IEEE754 floating point math needs them anyway (right now that means at least Sulong and trufflex86).

sNaN vs qNaN: right now every NaN is handled as if it was a qNaN. However, sometimes you need to introduce sNaNs and then a floating point operation should signal an invalid operation exception; the only question here is how to denote an sNaN in an architecture independent way, but probably sticking to what x86/Power/SPARC/ARM does (= also what IEEE 754-2008 recommends) should be good enough.

It might be useful to provide a few more things in that float API, like e.g. data types for 80bit floats and 128bit floats where the compiler decides how to implement/emulate them efficiently on a specific architecture (e.g. 80bit floats on non-x86 using 128bit floats + rounding). Although at least Sulong and trufflex86 need 80bit floats, and there is absolutely no sane way to implement them in an interpreter right now (no, JNI + long double in native code is not a sane way), I'm not sure if it's worth it to support that in Truffle.

For integer math, this is what you can't easily do in plain Java:

public class IntegerMath {
    // integer math

    // 128 bit / 64 bit division / remainder operation (signed)
    public long div128by64(long hi, long lo, long div) throws OverflowException, ArithmeticException;
    public long rem128by64(long hi, long lo, long div) throws OverflowException, ArithmeticException;

    // 128 bit / 64 bit division / remainder operation (unsigned)
    public long div128by64unsigned(long hi, long lo, long div) throws OverflowException, ArithmeticException;
    public long rem128by64unsigned(long hi, long lo, long div) throws OverflowException, ArithmeticException;

    // not sure if optimized versions of this can be derived from the bytecode already
    public boolean getCarry(byte a, byte b, boolean carryIn);
    public boolean getCarry(short a, short b, boolean carryIn);
    public boolean getCarry(int a, int b, boolean carryIn);
    public boolean getCarry(long a, long b, boolean carryIn);

    // addition
    public byte getOverflow(byte a, byte b, boolean carryIn);
    public short getOverflow(short a, short b, boolean carryIn);
    public int getOverflow(int a, int b, boolean carryIn);
    public long getOverflow(long a, long b, boolean carryIn);

    // subtraction
    public byte getOverflowSub(byte a, byte b, boolean carryIn);
    public short getOverflowSub(short a, short b, boolean carryIn);
    public int getOverflowSub(int a, int b, boolean carryIn);
    public long getOverflowSub(long a, long b, boolean carryIn);
}

This is essentially the division version of what already exists for multiplication; the carry/overflow is usually a by-product of addition / addition with carry / subtraction / subtraction with carry or borrow, and I'm not sure if this is already recognized by the optimizer. But it's something low level languages always need.

After thinking about it for a while, I think for rotate with/without carry it's probably easiest to detect the (rather simple) bytecode pattern (unlike carry/overflow, which can be formulated in many different ways and it's easy to miss a corner case when implementing it).

Since you asked for concrete use cases: look at trufflex86 and its implementation of floating point math (and lack of FPU control word support) and div/rem. Then look at Sulong and its implementation of floating point math and inline assembly / lack of various instructions related to the FPU control word and rounding modes.


I don't really see the problem with a Truffle vector API / what would be controversial here. It's not about implementing some generic vectors that might become longer in the future, it's about providing vectors like e.g. 128bit vectors that exist on every relevant CPU architecture and which are used by existing programs for a long time already. It's also not about writing 128bit vector code by hand, you already have that 128bit vector code. It's about writing an interpreter which is supposed to provide those 128bit vector instructions / built-ins to the guest program which uses them. The guest program doesn't care about any future 512bit or 1024bit vector instructions. Look at e.g. the shootout benchmarks. They use e.g. 128bit vectors, nothing else. They wouldn't gain anything from longer vectors since the code was written specifically for 128bit vectors. And the shootout benchmarks are just one example, the same is true for almost every other program which uses hand-written vectorized code. Every additional abstraction is quite counter productive for an interpreter developer.

This is different from "normal" Java and the vector API discussion there, because you don't write your algorithms using that "Truffle vector API". You already have an implementation of the algorithm using vector data types and built-ins, and now you write an interpreter with that "Truffle vector API" which exposes those operations via built-ins.

I asked for a Truffle implementation of 128bit vectors because that's essentially all you need: they would directly support (most of) the 128bit vector operations, and if you need longer vectors like 256bit / 512bit / … vectors for a particular interpreter, you could just combine multiple 128bit vectors and some compiler pass could combine them into longer hardware vectors again (and even then those longer vector types should be available in the Truffle API, even if they don't correspond to intrinsics; there is really no need to re-invent the wheel every time you need vectors in an interpreter). Only some "select" / "broadcast" instructions might suffer if it's implemented in that way. And of course there is no need to use a quite costly auto vectorizer (which isn't even available in Graal CE last time I checked it) to vectorize implementations of vector instructions where it's exactly known how they should be translated to (roughly) a single assembler instruction. This could also ensure that data from vector types stay in vector registers, are passed to functions using vector registers, and the NFI could also support those types (because you can pass vectors in native code, but you can't pass them through the Truffle NFI right now).

Concrete use case: SSE implementation in trufflex86, vector operations in Sulong (implementation of built-ins like e.g. _mm_min_pd; look at the operations on data types derived from LLVMVector), and native calls via Truffle NFI to native functions with vector arguments.

helloguo commented 4 years ago

We might in the future provide some high-level API that does not tie you to concrete CPUs, but which uses code patterns that are recognized by our loop vectorizer and so should result in vectorized code.

@dougxc Can you elaborate more about this? This sounds like an exciting feature. What kind of APIs will look like? Those APIs will only go with EE, or both CE and EE?

dougxc commented 4 years ago

This was more a "thinking out loud" statement than being representative of any concrete plans or more detailed designs that we currently have.

adinn commented 4 years ago

Never think aloud! Marketing people might be listening and you'll get asked to deliver the product next week ;-)