m-labs / artiq

A leading-edge control system for quantum information experiments
https://m-labs.hk/artiq
GNU Lesser General Public License v3.0

RFC: advanced math/matrix operations on Zynq #1485

Closed dhslichter closed 4 years ago

dhslichter commented 4 years ago

Here is the contract specification for implementing more advanced math and matrix capabilities on the Zynq core device. @jordens apparently would like to discuss modifications? @sbourdeauducq @cjbe @dnadlinger @dtcallcock @hartytp

3.2 – Implement advanced math capabilities for ARTIQ on Zynq. The contractor must implement core device runtime methods, callable from ARTIQ 4.0 kernels, and using the ARM hard floating point unit as appropriate, to calculate the following mathematical and matrix functions:

3.2.1 natural logarithm (base e) of a floating point number
3.2.2 logarithm (base 10) of a floating point number
3.2.3 base e exponentiation: e^x, where x is a floating point number
3.2.4 exponentiation: x^y, where x is a floating point number, for both the case where y is an integer and the case where y is a floating point number
3.2.5 sine of a floating point number
3.2.6 cosine of a floating point number
3.2.7 tangent of a floating point number
3.2.8 arcsine of a floating point number
3.2.9 arccosine of a floating point number
3.2.10 arctangent of a floating point number
3.2.11 matrix addition for 1D or 2D matrices of the same dimensions
3.2.12 matrix subtraction for 1D or 2D matrices of the same dimensions
3.2.13 matrix multiplication of an N x M matrix with an M x P matrix to produce an N x P matrix, where N, M, and P are positive integers (can be 1). This covers the case of multiplication of matrices, vectors, and combinations of matrices and vectors.
3.2.14 multiplication of a 1D or 2D matrix by a scalar of the same type as the matrix
3.2.15 transpose of a 1D or 2D matrix
3.2.16 determinant of a square 2D matrix
3.2.17 perform any of the mathematical functions 3.2.1 – 3.2.10 on a 1D or 2D matrix of floating point numbers, element-wise. For operation 3.2.4, the matrix elements can be either the base (x) or the exponent (y) in the operation.
3.2.18 cast all elements of a matrix to floating point numbers
3.2.19 cast all elements of a matrix to integers
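
As an illustration (mine, not part of the contract text), item 3.2.17 amounts to mapping a scalar function over every element of a matrix; a minimal Rust sketch, with no particular library assumed:

```rust
// Illustrative sketch of item 3.2.17: element-wise application of a
// scalar function to a 2D matrix, using plain nested containers.
fn apply_elementwise(m: &mut [Vec<f64>], f: impl Fn(f64) -> f64) {
    for row in m.iter_mut() {
        for x in row.iter_mut() {
            *x = f(*x); // e.g. f = f64::ln for item 3.2.1
        }
    }
}
```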

jordens commented 4 years ago

Tl;dr: I'd take a step back and compare robust long-term solutions based on existing projects with what you actually need in the lab.

It's unclear to me how that list came about and whether it is a useful/practical/robust specification. What's the use case that this list specifically covers, and which use cases are excluded? Is it just a brain dump of well-known float/vector math operations, or was there some analysis of cost/benefit? For multiple reasons I suspect it is the former: reshaping, atan2, sqrt, random, convolve, rounding, clip/max/min, and complex math seem forgotten; others are suspiciously detailed/redundant, like the log/exp family; some algorithms are trivial while others are extremely tricky to implement decently; and there is no assessment of what a floating point number actually is, or of whether we need numerically accurate algorithms or "sufficiently good approximations".

Implementing that specification as such risks producing an unmaintained, ad-hoc, partial, and sub-standard solution. I'd propose that instead of demanding micro-features, we should demand a solid long-term perspective for maintainability, performance, and sustained development.

In this case we'd start with a review of existing technologies for ARM math support from the compiler and the runtime. Then extend that to scalar math kernels from some libm (or replacement) and from there to existing vector math libs and their prospects for being used by ARTIQ kernels. With that in hand the list of supported ops should be compared and mapped to the actual use case in QT. And then we can judge what route the implementation should take. The resulting features will likely be very different from the ones above. On top of that we have an idea of what external projects we are tied to and what internal projects we have to maintain long term.

dhslichter commented 4 years ago

tl;dr - I think a holistic, big-picture examination sounds good and sensible. I think having this discussion be informed by existing libraries and the like is wise. We want to be careful not to ask for something overly ambitious up front, or bite off more than we can chew (part of the reason that the spec was "weird" was an attempt to limit the scope of this particular contract).

Here's a bit of the rationale/story behind how this list was devised, by way of explanation. First off, this list was explicitly NOT intended to be a "complete" implementation of all potentially desirable mathematical or matrix capabilities. Our discussions at the time were for a more limited set of things. It would of course be great to make the computational power of the core device be as full-featured as possible, but since that is a very open-ended and poorly defined goal, this was an attempt to pick somewhere to start, at least.

The goal of doing this kind of math on the core device is to make it faster to perform on-the-fly calculations/parameter updates/feedforward for various experimental tasks. There are various workarounds currently in use to carry this out with only the current kernel math capabilities. A major goal of the Zynq development was to enable a richer set of calculations to be performed on the core device, mainly for latency and compile time reasons (no waiting for an RPC round trip, and it may be possible to run kernels that don't pull parameters from the parameter database, but rather keep them on the core device, perform all calculations on the fly there, and push results back by async RPC so that the PC knows what has happened). This can include things like drift tracking and prediction (involving digital filtering and/or curve fitting), branching/conditionals based on more complicated analysis of the data (e.g. Bayesian readout based on photon arrival times), clock-related tasks (for example, implementing a software digital servo to update an error signal for a cavity), and so forth. Our clock folks (@dleibrandt) are worried about compile times (and resulting dead times), and are thus also hoping to run a single infinitely looping kernel, where all calculations/updates are performed by the core device, and results just get pushed back to the PC as async RPCs while the system runs.

Of course, if all these calculations can be performed easily on the core device, then it may make sense for many users to offload some of the precompilation from the PC and instead do that work on the core device. Certainly as one goes to ever-larger ARTIQ systems, the ability to do more powerful calculations in a distributed way, rather than relying on a central PC and RPCs (going up and down a DRTIO chain as well), seems sensible to implement in some appropriate way.

OK, now to the spec: there was substantial pushback on making this too broad when we were quoting, so that informed some of the choices made (i.e. don't say "implement all methods in XXX library"). This was intended to be a super-basic "starter set" of operations, from which some more complex ones could be constructed by users if desired. Clearly it has limitations.

Items 3.2.1 through 3.2.10 (the specific math functions) were chosen just because Rust already has methods for them for f32 and f64 types, so I assumed it would be essentially "free" to write the runtime for these by just invoking the appropriate Rust method (thus the duplication of logs, for example). If this is not correct, please let me know. The reason for not choosing all of the methods you have mentioned above was just that I didn't want to make an extensive laundry list, and some (e.g. sqrt) can be implemented using the ones listed above. Of course, if writing the runtime is really just as easy as a single line of Rust calling the Rust method, then it might make sense to implement all of the Rust methods for the various types. One also wants to take clutter into account, and not make a bunch of methods that nobody wants or needs.
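
For illustration, what I had in mind was something like the following one-line wrapper (symbol name made up), which works on hosted targets where f64 has these methods:

```rust
// Hypothetical one-line runtime binding (name invented). On a hosted
// (std) target, f64::ln exists and forwards to the platform's libm;
// as the replies below note, it is absent on bare-metal no_std builds.
#[no_mangle]
pub extern "C" fn kernel_ln(x: f64) -> f64 {
    x.ln()
}
```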

Items 3.2.11 to 3.2.19 seemed like a reasonable set of very basic matrix operations. The idea would be to then allow users to construct more complicated operations (matrix inversion, solving systems of linear equations, linear regressions, etc) from these basic operations. Down the line, those more complicated operations would be written into the runtime and could be called directly (potentially with more efficient implementations), but this would get us started. I didn't think about reshape, because my own (limited) ideas for applications all involved situations where the dimensions of the desired matrix would be known and specified at compile time and would not need to be changed.

I didn't think about things like random, complex math, etc because those seemed like they would present more challenge for implementation, and could be left for a later time. However, if we are thinking about implementing all that kind of functionality on the core device eventually, then it makes sense to build a roadmap now that includes them. Personally, I don't have a use case for needing randomness on the core device right now. Complex numbers make trig math easier and so it would be nice to have those, but I imagine they would take more work to implement.

dhslichter commented 4 years ago

In this case we'd start with a review of existing technologies for ARM math support from the compiler and the runtime. Then extend that to scalar math kernels from some libm (or replacement) and from there to existing vector math libs and their prospects for being used by ARTIQ kernels. With that in hand the list of supported ops should be compared and mapped to the actual use case in QT. And then we can judge what route the implementation should take. The resulting features will likely be very different from the ones above. On top of that we have an idea of what external projects we are tied to and what internal projects we have to maintain long term.

I think this seems reasonable. @jordens will you make a proposal along these lines that we can discuss? As I said, I am not married to the exact feature set in the above (although they would need to be part of a larger vision), and I agree that we shouldn't be reinventing the wheel if we can help it. My worry previously was taking on such a large set of math and matrix operations that it would need a giant contract (and a lot of time) to implement all of them. If we make a large-scale plan, but implement in targeted stages, I think that addresses the issue.

sbourdeauducq commented 4 years ago

just because Rust already has methods for them for f32 and f64 types, so I assumed it would be essentially "free" to write the runtime for these by just invoking the appropriate Rust method

That assumption is incorrect on embedded (no_std) systems... What Rust typically does for these is invoke the system libm, which is not present on embedded unless manually added.

sbourdeauducq commented 4 years ago

There are several options:

  1. https://docs.rs/micromath/1.0.1/micromath/ (f32 only, not sure about quality).
  2. https://docs.rs/libm/0.2.1/libm/ (should be good)
  3. compile one of the C libms
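
A sketch of what option 2 looks like in use (the libm crate is a pure-Rust port of MUSL's libm, so it builds without std):

```rust
// Option 2 sketch: libm provides free functions rather than f64 methods,
// implemented in pure Rust, so they compile on no_std targets.
fn examples(x: f64, y: f64) -> (f64, f64, f64) {
    (libm::log(x), libm::exp(x), libm::atan2(y, x))
}
```
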
sbourdeauducq commented 4 years ago

https://gitissue.com/issues/5dea2f23f80c21170ed2cf6e so much for "free" stuff...

dnadlinger commented 4 years ago

https://gitissue.com/issues/5dea2f23f80c21170ed2cf6e so much for "free" stuff...

I'm not quite sure how this is relevant for ARTIQ Python. Without manually implementing emission of LLVM intrinsics/… in the compiler, all the implementations will involve at least a function call, but that's fine, and can be addressed with LTO if really necessary.

At that point, it doesn't matter where those symbols come from. If there is no Rust version of sufficient quality available, just link in any of the battle-tested C implementations?

That being said, math libraries are surprisingly tricky if you want to be precise about things. For instance, many libc implementations actually aren't as accurate as they claim. Cephes (an ancient implementation from Netlib) is pretty good, but also not very efficient, depending on the platform. CPU instructions are often quite imprecise (e.g. outside the primary domain) or don't handle edge cases as per the standard, but then, that's not a problem for many applications where the extra speed is appreciated, etc.

dhslichter commented 4 years ago

At that point, it doesn't matter where those symbols come from. If there is no Rust version of sufficient quality available, just link in any of the battle-tested C implementations?

For something like this, it does seem that using C implementations would perhaps be the best way to go? On that note, would it be possible to include (in the future plan, not right now) things like BLAS/LAPACK on the core device? Would having already figured out how to compile C implementations of other math for the runtime help with doing BLAS/LAPACK too?

I have no experience with different libc implementations (nor with compiling them for embedded devices), and so am relying on the expertise of others (thanks @dnadlinger for your comments). My feeling is that as long as we document exactly what the kernel math functions do, and they behave exactly as documented, it is potentially tolerable if they do not adhere exactly to e.g. the relevant IEEE standard for certain edge cases, or if the efficiency is not perfectly optimized. The stuff being calculated on the core device will necessarily involve noisy measurement data, and thus would be dominated by experimental uncertainty rather than any small edge-case misbehavior of the math functions, I would think. However, this would need to be prominently documented so that users are aware. Ideally we would have functions that are well-behaved and efficiently implemented, but I am OK keeping that as a longer-term goal and not being fully compliant in the short term, if the technical hurdles are too great for some reason.

At the risk of being flamed to death, can I ask if Xilinx provides suitable libraries for this instance? Or would they require an OS to be running on the ARM processor?

sbourdeauducq commented 4 years ago

At the risk of being flamed to death, can I ask if Xilinx provides suitable libraries for this instance? Or would they require an OS to be running on the ARM processor?

They're using the open source C libm implementations, both with OS and baremetal.

dhslichter commented 4 years ago

They're using the open source C libm implementations, both with OS and baremetal.

Do we have an understanding of the quality of these implementations relative to others that exist? Are there specific preferred implementations (as mentioned above by @dnadlinger)? Are there ways in which using one implementation versus another (ignoring for the moment the quality of the implementations, which I know is a nontrivial consideration) makes the task easier or harder for us, or are they all about the same?

sbourdeauducq commented 4 years ago

https://docs.rs/libm/0.2.1/libm/ (should be good)

Let's give this one a try. It might not be the fastest on ARM, but it is straightforward to use (this compiles: https://git.m-labs.hk/M-Labs/artiq-zynq/commit/e0560a2db9b7a1a29a47e7ebe4de668763e36b2d) and reduces the amount of C code we need.
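
Presumably the bindings look roughly like this (a sketch with invented symbol names; see the linked commit for the actual code):

```rust
// Sketch only (symbol names invented): expose libm functions as C-ABI
// symbols that compiled kernels can call directly.
#[no_mangle]
pub extern "C" fn __math_sin(x: f64) -> f64 {
    libm::sin(x)
}

#[no_mangle]
pub extern "C" fn __math_pow(x: f64, y: f64) -> f64 {
    libm::pow(x, y)
}
```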

dhslichter commented 4 years ago

OK, so implementing this version of libm will cover the "math functions" part of things, at least for now. We will have to see how efficient it ends up being, but the bar is very low compared to the current soft-FPU-on-soft-CPU design.

What about for vector/matrix operations? How should we proceed there?

dhslichter commented 4 years ago

this compiles: https://git.m-labs.hk/M-Labs/artiq-zynq/commit/e0560a2db9b7a1a29a47e7ebe4de668763e36b2d

If this is compiling ok, can we just implement all of the functions in https://docs.rs/libm/0.2.1/libm/ ? It seems like it would be minimal extra effort to add them. If some of the additional ones break the compile for some reason, we can discuss what to do.

An unrelated question: is f32 math meaningfully faster than f64 math here? If not, I would propose that we not implement any of the f32 (single precision) functions and just do the f64 (double precision) functions.

dhslichter commented 4 years ago

@jordens @dnadlinger @sbourdeauducq a brief dip into Rust linear algebra world suggests that the most mature potential solution for the linear algebra tasks (thinking about overarching picture here) seems to be the Rust ndarray crate. One could then combine with the ndarray-linalg crate, which interfaces to LAPACK (ndarray can use BLAS functions, but not LAPACK, AFAICT). Just in terms of an overarching plan for linear algebra on the core device, it seems that these packages would allow us to do all the sorts of tasks one might want to carry out, even in the longer term.
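
To give a flavor (a sketch only, assuming ndarray's default pure-Rust path with no BLAS backend):

```rust
use ndarray::arr2;

// Sketch of a 3.2.13-style matrix product with ndarray:
// a 2x2 matrix times a 2x1 vector gives a 2x1 result.
fn demo() {
    let a = arr2(&[[1.0, 2.0], [3.0, 4.0]]);
    let b = arr2(&[[5.0], [6.0]]);
    let c = a.dot(&b);
    assert_eq!(c, arr2(&[[17.0], [39.0]]));
}
```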

Maturity-wise, it seems that they are both being actively developed (which seems promising to me), and have pretty good functionality, but are still definitely "under development". My understanding is that there is a substantial "machine learning/data science in Rust" community that is helping drive the development, so it seems that the risk that they will be abandoned/left to rot is pretty low.

Do you see roadblocks/issues for being able to compile these for the core device? Other concerns?

In terms of ARTIQ core device matrix math, I would say that in the fullness of time it would probably be good to expose all of the functionality of these packages to the user as runtime methods. One can start with subsets at the beginning if that makes things easier. Does this seem sensible/reasonable?

dnadlinger commented 4 years ago

Is there a proposal for how to handle allocations for these arrays on the core device? Currently, everything is allocated on the stack, which is fine for element-wise loops over arrays, but an issue for temporary results in array calculations (assuming the idea is to implement a NumPy-like interface).

As for accuracy concerns, this was probably somewhat off-topic here in the first place. If done half-decently, Rust libm will be fine in that department (at which point it is just a question of performance). Those details, like the difference between 0.5 ULP or 1.5 ULP accuracy, or proper handling of signed infinities/zeros, NaNs, etc. aren't going to be very important here.

dnadlinger commented 4 years ago

(If support for slices and other views is not required, which indeed doesn't seem to be the case according to the contract excerpt posted above, this should become considerably easier.)

dhslichter commented 4 years ago

If support for slices and other views is not required

In the spirit of @jordens' comment earlier, I think we want to look beyond the immediate terms of the contract (and its shortcomings) and think about the bigger picture of what broad-strokes functionality we want to enable on the core device in the longer term. While the funding in the current contract might not be sufficient to implement things like slices and other views, they are clearly an important part of a long-term plan, so we should choose a solution where implementing them later will be possible and not unreasonably difficult. We also want to do the near-term implementation in such a way that we don't cut corners that would require reworking once it comes time to implement a fuller set of features.

dhslichter commented 4 years ago

Is there a proposal for how to handle allocations for these arrays on the core device? Currently, everything is allocated on the stack, which is fine for element-wise loops on arrays, but an issue for temporary results in array calculations (assuming the idea is to implement an NumPy like interface).

I am not expert enough on the nitty-gritty to make a proposal here -- @sbourdeauducq, @jordens, or you, @dnadlinger, should make a proposal for how best to do this.

dnadlinger commented 4 years ago

The easiest solution is to handle allocations exactly like for lists. With the dimensionality part of the type (i.e. the number of axes), that's then exactly as usable (and sometimes annoying) as lists currently are.
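
A toy illustration of that idea (hypothetical, not ARTIQ's actual representation): the number of axes becomes a compile-time parameter while the shape stays a runtime value, just as a list's length is.

```rust
// Toy sketch only: dimensionality carried in the type as a const
// generic; the concrete shape remains a runtime value, and the storage
// is allocated the same way list storage is today.
struct KernelArray<T, const NDIM: usize> {
    shape: [usize; NDIM],
    data: *mut T,
}
```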

sbourdeauducq commented 4 years ago

3.2.16 determinant of a square 2D matrix

@dhslichter How large will the matrices be? There is a significant difference in complexity between computing determinants of small matrices (e.g. ~3x3 are trivial) vs. large ones.
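
For example, the fixed-size case is a one-liner by cofactor expansion (sketch):

```rust
// Sketch: determinant of a 3x3 matrix by cofactor expansion along the
// first row; trivial compared to the factorization-based methods needed
// for general n x n matrices.
fn det3(m: &[[f64; 3]; 3]) -> f64 {
    m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
        - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
        + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0])
}
```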

sbourdeauducq commented 4 years ago

The easiest solution is to handle allocations exactly like for lists.

I think this is what we should be doing.

dhslichter commented 4 years ago

@dhslichter How large will the matrices be? There is a significant difference in complexity between computing determinants of small matrices (e.g. ~3x3 are trivial) vs. large ones.

The reason I included determinants was to enable normalization of matrices used to rotate coordinates, and for simplistic matrix inversion (I was not thinking carefully about efficient inversion algorithms), with the idea of being able to solve linear systems of equations in a hacky way. I think @jordens has shown that this is not a smart way to think about the problem. Mea culpa.

It seems like the best thing might be to implement LU decomposition instead, which allows then for the efficient calculation of determinants, as well as solving systems of equations, and so forth. Is it possible to use the implementation in ndarray_linalg for this in the runtime? It seems like, for the long run, one would want to use ndarray_linalg anyway (for various relevant tasks as described above), rather than rewriting things ourselves?
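
For instance (a sketch of the technique, not the ndarray_linalg API), Gaussian elimination with partial pivoting gives the determinant as the product of the pivots, with a sign flip per row swap, and the same factorization is what solves linear systems:

```rust
// Sketch: LU-style elimination with partial pivoting. det(A) is the
// product of the pivots, negated once per row swap.
fn det_via_lu(mut a: Vec<Vec<f64>>) -> f64 {
    let n = a.len();
    let mut det = 1.0;
    for k in 0..n {
        // Partial pivoting: pick the row with the largest |entry| in column k.
        let p = (k..n)
            .max_by(|&i, &j| a[i][k].abs().partial_cmp(&a[j][k].abs()).unwrap())
            .unwrap();
        if a[p][k] == 0.0 {
            return 0.0; // singular matrix
        }
        if p != k {
            a.swap(k, p);
            det = -det; // each row swap flips the determinant's sign
        }
        det *= a[k][k];
        // Eliminate entries below the pivot.
        for i in (k + 1)..n {
            let factor = a[i][k] / a[k][k];
            for j in k..n {
                a[i][j] -= factor * a[k][j];
            }
        }
    }
    det
}
```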

dhslichter commented 4 years ago

One question I have (others likely know the answer already) is whether or not the underlying BLAS and LAPACK implementations used by default by ndarray and ndarray_linalg will compile for us with our bare metal ZC706 target. AFAICT there is (notionally) the ability to plug in different BLAS and LAPACK implementations when compiling ndarray and ndarray_linalg, respectively, but I don't know if anything that already exists out there will work for us out of the box (can we use C implementations with a minimal Rust wrapper?)

sbourdeauducq commented 4 years ago

Contractual items done.

sbourdeauducq commented 4 years ago

Most of them also work on OR1K by the way, though they are obviously slow there.