m-labs / artiq

A leading-edge control system for quantum information experiments
https://m-labs.hk/artiq
GNU Lesser General Public License v3.0

New Compiler Scoping Issue #1542

Closed hartytp closed 2 years ago

hartytp commented 3 years ago

There have been a few discussions about the next iteration of the ARTIQ compiler. I wanted to start a thread to consolidate ideas. Here are some initial thoughts after a conversation with @dnadlinger...

This is not something I consider myself particularly expert on, so expect mistakes/misconceptions in the below... I'm mainly trying to start a conversation here.

What Problems Are We Trying To Solve?

Why bother rewriting the compiler? I'm aware of a few types of issue at the moment:

  1. Speed: moderately complex experiments can easily take ~10s to compile at the moment, which can easily become the experimental bottleneck
  2. Bugs: there are a variety of bugs in the current compiler which have been known about for a long time
  3. Corner cases: there are a variety of surprising things that can happen in the current ARTIQ python and regularly catch users out (particularly around types and attributes)
  4. Maintainability: both in terms of tooling/dependencies and codebase. I don't have a feeling for how bad this is right now or how a redesign would/would not affect it, but I'm including it for completeness.

At least some of these could be solved without a major rewrite (e.g. by documentation/incremental code improvements/etc), but some of these look hard to resolve without a more major upheaval.

To do: link to some issues to give illustrative examples of the above points

Priorities

There is a lot of code out there at this point. Most of the ways I can see us rewriting the compiler would cause at least some level of breakage and force users to learn new rules. This may be unavoidable and a price we decide is worth paying, but we should be realistic about how much pain it will generate and how big a deterrent it will be for much of the community/how large the potential support burden will be.

Design Choices

There are three major design choices we need to make:

  1. Language choice: do we stick with a custom python-based language or move to something like straight c++/rust/D/whatever (NB we could still potentially use python for host code and, say, Rust for Kernels, but it might get a bit messy):

    • The main argument I see for switching away from "ARTIQ python" is that it's already quite far from actual python. Other changes we might introduce (e.g. to improve the typing situation) will likely widen this gap. IME the differences between "ARTIQ python" and actual python (and the motivations for those differences) are not well understood by many users (who mainly don't really understand how python works anyway) and to some extent undermine the benefits of using python
    • Related to the above: by using our own language (albeit, based closely on a common language) we have to write custom tooling and, perhaps more importantly, write our own language docs rather than being able to rely on extensive documentation already out there
    • Another argument that's easy to overstate, but still relevant: while it's true that a big plus of python is that students with little programming experience can get up to speed quickly and write code that does simple things, people pretty quickly hit subtle issues that waste a lot of time. It might be worth the trade-off of having a language with a steeper initial learning curve but simpler overall rules
    • Advantage of python: it's widely used and many people have at least a basic familiarity with it. It's also what people tend to do their data analysis in.
    • Advantage of python: IME even very smart physics students often struggle with languages like c (pointers) and rust (borrow checker)
  2. Execution model

    • The current execution model is based around the idea of a Kernel as a method call, which receives a lot of state from its environment at compile time and returns to the host at completion.
    • There are various other execution models that might be more beneficial. e.g. a Kernel being a class that provides methods that can be called from the host, but which persists until killed (so multiple methods can be called without reloading the kernel)
    • Other execution models can be emulated (with some drawbacks) using the current framework (e.g. there are plans to implement something like the above in a general way as part of ndscan)
  3. Implementation

    • What language is the compiler written in (rust/python)? This would be mainly about compile time (would the compiler toolchain be easier to maintain with rust?)
    • What is our type system? This is both about compile speed (avoiding global type inference) and usability (e.g. making it easier to follow the precision of numeric values in a calculation)
    • Do we have generics? If not, then stronger typing would make code reuse/composability challenging
    • Can we use caching to avoid recompiling kernels? One big thing to consider here is the interface between the kernel and its environment. At present this is quite messy with the kernel having access to a lot of global state, which makes it hard to cache kernels. Providing a clearer boundary between the kernel and the external data it contains would make caching easier. This may be affected by changes to the execution model.
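
To make that last point about the kernel/environment boundary concrete, here is a minimal sketch (device, argument, and attribute names are placeholders, not a proposal for the actual API):

```python
from artiq.experiment import *


class BoundarySketch(EnvExperiment):
    def build(self):
        self.setattr_device("core")
        self.setattr_argument("t_probe", NumberValue(10e-6))

    @kernel
    def probe_implicit(self):
        # Today: the kernel reads arbitrary host object state (self.t_probe
        # here), so a cache key would have to cover every attribute of every
        # object the kernel can reach.
        delay(self.t_probe)

    @kernel
    def probe_explicit(self, t_probe: TFloat):
        # Narrower boundary: run-to-run data arrives as an explicit argument,
        # so the compiled artifact depends only on the code and device layout.
        delay(t_probe)
```

The explicit variant is what would make a cache key tractable: it only has to cover the code and the device layout, not arbitrary mutable host state.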

Discussion

Realistically, I don't think the arguments in favour of dropping python are strong enough to outweigh the disadvantages. I think the same is true of switching to other execution models. So, (3) is where the best return on investment is likely to be for now.

From the data that @dnadlinger posted (NB this is still a provisional number so let's not get into a detailed discussion about this until we've posted more data), it looks like we're not likely to get more than a factor of ~5 (best case) improvement in compile times without some kind of caching (about 1/7 of the time spent compiling the experiments he looked at was spent in LLVM).

To do: go through all the compiler/language issues and decide which ones should/should not be fixed by the improvements. It would be good to also include a list of things we need to document; this is basically proper documentation of how/why ARTIQ Python differs from normal Python, including common gotchas (e.g. writing to attributes in an RPC will not affect the running kernel). Once that's done, we can agree on a list of tests that should pass in the new compiler and what it will/won't do.
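
To illustrate that last gotcha, a minimal sketch (class and attribute names are purely illustrative):

```python
from artiq.experiment import *


class RPCAttributeGotcha(EnvExperiment):
    def build(self):
        self.setattr_device("core")
        self.threshold = 100

    def bump_threshold(self):
        # Not a kernel, so this runs on the host as an RPC; it updates the
        # host copy of self.threshold only.
        self.threshold = 200

    @kernel
    def run(self):
        self.core.reset()
        self.bump_threshold()
        # The kernel still sees 100 here: it works on the value embedded when
        # the kernel was compiled/started, not the value the RPC just wrote
        # on the host.
        if self.threshold > 150:
            pass
```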

dnadlinger commented 3 years ago

From the data that @dnadlinger posted (NB this is still a provisional number so let's not get into a detailed discussion about this until we've posted more data), it looks like we're not likely to get more than a factor of ~5 (best case) improvement in compile times without some kind of caching (about 1/7 of the time spent compiling the experiments he looked at was spent in LLVM).

I ran some actual numbers for this, by running one fairly representative experiment (not very complex, but not simple either) from the remote entanglement setup using artiq_run in py-spy, after modifying Coredevice.run() to stop after compile(), i.e. before actually uploading/running the kernel. Of the ~8s spent in run(), about 8%, or 0.63 s, were spent in LLVM C/C++ code. About a third of this is spent parsing IR in string form, so there is scope for reducing that by doing the sane thing and using the C++ API to directly create the LLVM IR in memory. On the other hand, our current optimizer pass pipeline is a bit silly, so we might want to sacrifice a bit more compile-time performance here to handle complex code better than we currently do (especially for more involved processing on Zynq).

This is better than the guesstimates Tom mentioned above, but on the other hand, even if a new compiler had a lightning-fast frontend, we'd still be looking at, let's say, 0.6 s of latency for this kind of experiment. An order of magnitude in improvement, but still not negligibly quick (i.e. you'd still want to minimise kernel compilation count).

dhslichter commented 3 years ago

To me this speaks to the value of being able to cache compiled experiments, if we can do so in a reasonable way. There are a lot of experiments that we run repeatedly where only a small set of parameters are changed. Of course, there are issues of loop unrolling or things like that (pardon my ignorance, is that still being done with the current runtime?) where the compiled code might depend on such parameters. However, there are certainly a lot of instances where we perform the same experiment over and over with the same overall timing, but perhaps different values of a frequency or trap voltage (for example, a clock probe experiment, or a micromotion compensation experiment, or a transition frequency calibration experiment), where it seems from my naive point of view that it should be possible to cache a compiled version and just pull new values of specific parameters when resubmitting.

sbourdeauducq commented 3 years ago

doing the sane thing and using the C++ API

Did you mean the C API? The C++ API of LLVM is complicated and has poor forward compatibility, and is better avoided. AFAIK the C API (which is what Inkwell uses internally) should be sufficient.

dnadlinger commented 3 years ago

Did you mean the C API?

Tastes differ. Yes, for a compiler where the codegen layer isn't written in C++, the C API is the obvious choice. It isn't very comprehensive, though (you typically end up extending the C API with your own wrapper functions consuming the C++ API here and there), and from C++, the C++ API can be quite ergonomic to use. The lack of backward compatibility isn't a huge issue in practice; it is entirely possible to ship a compiler with support for a few major versions back without much manpower. Either way, not relevant here, as the point was just about avoiding constructing a giant string only to then parse it again.

ljstephenson commented 3 years ago

What is the ultimate goal here? If it's responsiveness to user input, then IMO anything less than ~1 s is totally tolerable. Faster than ~0.5 s seems completely unnecessary - I'm guessing for most non-trivial experiments, a single data point is ~100 repetitions of a sequence taking of order ~1-5 ms, so we're waiting a couple of hundred milliseconds for the first data points anyway.

I've probably missed prior discussion on this but a cursory search didn't yield anything: what is the status on precompiling experiments i.e. while another experiment is running? Even that would be hugely useful.

dhslichter commented 3 years ago

Agreed that precompilation (i.e. moving compile into prepare(), so that things are all compiled when ready to run()) would be nice, but the challenge is that then if the previous experiment updates some dataset values after it finishes running, the new experiment will not have been compiled using those new values. There is potential for some frustrating corner cases here.
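
As a rough illustration of the idea (the precompile helper here is hypothetical, and the stale-dataset hazard is exactly the one described above):

```python
from artiq.experiment import *


class PrecompiledProbe(EnvExperiment):
    def build(self):
        self.setattr_device("core")

    def prepare(self):
        # Anything another experiment writes to this dataset after prepare()
        # has run will not be reflected in the compiled kernel.
        self.probe_freq = self.get_dataset("clock.probe_freq")
        # Hypothetical helper: compile now (while the previous experiment is
        # still running) and get back a callable.
        self.compiled = self.core.precompile(self.probe, self.probe_freq)

    def run(self):
        self.compiled()

    @kernel
    def probe(self, freq: TFloat):
        self.core.reset()
        # ... pulse sequence using freq ...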

@sbourdeauducq would the new compiler design still unroll all pulses at compile time, or is there a way that loops/parallels/sequentials could be determined at run time? Sorry for my complete ignorance on this subject.

Regarding dead times -- this will depend on the type of experiment being run, but for example for clock applications, half a second (and, by the way, a mean of half a second with a tail for whatever hiccups might occur) is an unacceptable dead time. In such instances one probably has to resort to other modes of operation (some kind of everlasting kernel that RPCs for all of its wants and needs). I think the point of @dnadlinger's comment above is that if ~1 second is the compile time for a medium-complexity experiment, it could be substantially longer for a high-complexity experiment. Just something we should be aware of, and perhaps have some test cases in mind (e.g. sweeping two parameters in a sequence with many pulses while recording timestamps of photon arrivals) for bigger/more complex experiments.

sbourdeauducq commented 3 years ago

@sbourdeauducq would the new compiler design still unroll all pulses at compile time,

This is with interleave, which never really worked well and isn't used in practice even with the current compiler. There's just the regular loop unrolling as found in other compilers.

pca006132 commented 3 years ago

Two Questions:

dnadlinger commented 3 years ago

Yes, we do want exceptions. They aren't per se "really slow" in any meaningful way, as long as you don't build them into timing-critical paths. If we wanted to avoid exceptions, we'd pretty much need to switch to another language, as asking people to use monadic result types everywhere without language support just doesn't work – especially not in Python, with its very exception-focused design. (RTIOUnderflows are only one source of exceptional state during execution; we also need to handle other weird hardware conditions – lasers losing lock/dropping in power, someone just yanking a cable, some issue on other nodes participating in a quantum networking experiment, user killing the experiment, …).

You only need to symbolize the backtrace when you actually want to print the exception, so that shouldn't be an issue performance-wise, as you don't need to do that when handling the exception to perform somewhat time-sensitive recovery tasks. Allocations are indeed a tricky topic, however.


What do you mean by "pure kernel" (especially if you then mention attributes)? Kernels obviously aren't pure in the FP sense, as their job is to interact with hardware. It doesn't seem, however, that this is the main difficulty with caching. Rather, what makes caching a bit annoying to implement is that a priori every attribute of every object in the host Python program can somehow end up influencing kernel compilation, and further, there is currently no way for the user to distinguish between attributes that should be assumed to rarely change and hence compiled into the kernel for performance optimisation reasons (e.g. the fine timestamp resolution or some other hardware parameters), and those that might (e.g. a scanned parameter, or a calibration value pulled from a dataset that might be changed by another, higher-priority servo experiment).
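
For reference, the first half of that distinction roughly corresponds to the existing kernel_invariants mechanism; it is the second half that has no marker today. A minimal sketch with placeholder attribute names:

```python
from artiq.experiment import *


class InvariantSketch(EnvExperiment):
    def build(self):
        self.setattr_device("core")
        # Hardware-like parameter: effectively constant, so it is reasonable
        # to compile it into the kernel as an immutable value.
        self.ref_period = 1e-9
        self.kernel_invariants = {"ref_period"}
        # Calibration/scan parameter: today there is no corresponding marker
        # telling the compiler this should remain substitutable after
        # compilation; it is simply embedded like any other attribute.
        self.probe_freq = 100e6
```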

dnadlinger commented 3 years ago

@ljstephenson: Completely agreed that a 10x improvement would effectively make compiler latency largely irrelevant given our current code base. However, a) as Dan pointed out, we are not running anything particularly critical in terms of duty cycle, and b) the code base was specifically written with somewhat long compile times in mind – if you ran your scan loops on the host for interactive experiments, you'd still be in trouble. I don't think "fixing" the latter is on our agenda, but it's still worth pointing out that "10x" isn't equivalent to "infinitely fast" in practice.

dtcallcock commented 3 years ago

there are certainly a lot of instances where we perform the same experiment over and over with the same overall timing, but perhaps different values of a frequency or trap voltage (for example, a clock probe experiment, or a micromotion compensation experiment, or a transition frequency calibration experiment), where it seems from my naive point of view that it should be possible to cache a compiled version and just pull new values of specific parameters when resubmitting.

further, there is currently no way for the user to distinguish between attribute that should be assumed to rarely change and hence compiled into the kernel for performance optimisation reasons (e.g. the fine timestamp resolution or some other hardware parameters), and those that might (e.g. a scanned parameter, or a calibration value pulled from a dataset that might be changed by another, higher-priority servo experiment).

Could there just be two dataset_dbs? One that works exactly like now (requires recompilation if anything changes) and a special one for those few parameters that are allowed to change post-compilation. This db would need to be able to provide guarantees that the compiler can rely on (like the range/type of possible values). We sort of already do this in drift tracking experiments, where the trap frequency is stored and tracked on the core device, right @dhslichter?
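
To make that concrete, a very rough sketch of what the split might look like from the experiment side (get_late_dataset is entirely hypothetical):

```python
from artiq.experiment import *


class TwoTierSketch(EnvExperiment):
    def build(self):
        self.setattr_device("core")
        # Ordinary dataset: changing it would invalidate a cached kernel,
        # exactly as recompilation is required today.
        self.t_probe = self.get_dataset("clock.t_probe")
        # Hypothetical "late-bound" dataset: the compiler only gets a type
        # (and perhaps range) guarantee, and the actual value is fetched at
        # submission time without recompiling.
        self.probe_freq = self.get_late_dataset("clock.probe_freq", type=float)
```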

pca006132 commented 3 years ago

Yes, we do want exceptions. They aren't per se "really slow" in any meaningful way, as long as you don't build them into timing-critical paths. If we wanted to avoid exceptions, we'd pretty much need to switch to another language, as asking people to use monadic result types everywhere without language support just doesn't work – especially not in Python, with its very exception-focused design. (RTIOUnderflows are only one source of exceptional state during execution; we also need to handle other weird hardware conditions – lasers losing lock/dropping in power, someone just yanking a cable, some issue on other nodes participating in a quantum networking experiment, user killing the experiment, …).

You only need to symbolize the backtrace when you actually want to print the exception, so that shouldn't be an issue performance-wise, as you don't need to do that when handling the exception to perform somewhat time-sensitive recovery tasks. Allocations are indeed a tricky topic, however.

What do you mean by "pure kernel" (especially if you then mention attributes)? Kernels obviously aren't pure in the FP sense, as their job is to interact with hardware. It doesn't seem, however, that this is the main difficulty with caching. Rather, what makes caching a bit annoying to implement is that a priori every attribute of every object in the host Python program can somehow end up influencing kernel compilation, and further, there is currently no way for the user to distinguish between attribute that should be assumed to rarely change and hence compiled into the kernel for performance optimisation reasons (e.g. the fine timestamp resolution or some other hardware parameters), and those that might (e.g. a scanned parameter, or a calibration value pulled from a dataset that might be changed by another, higher-priority servo experiment).

Yes, monadic result types without language support and a good type system would probably be painful, perhaps similar to the C style of checking return values everywhere...

What I mean by pure kernel is that we can disallow the user from referencing outer python variables within the kernel and require the user to pass them in explicitly. For distinguishing kernel parameters (static) and other data, we can allow the user to have unbound variables within the kernel code, but require them to supply the unbound parameters explicitly when calling prepare() or something to compile before running the kernel. Any unbound parameters not specified in the prepare() call would just cause an error. For other parameters, we would pass through attributes.
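
Roughly, something like the following (the compile_kernel helper and all names are hypothetical, just to make the shape of the proposal concrete):

```python
from artiq.experiment import *


class PureKernelSketch(EnvExperiment):
    def build(self):
        self.setattr_device("core")

    @kernel
    def probe(self, n_shots: TInt32, freq: TFloat):
        # The kernel references no free host variables: everything it needs
        # is either a device or an explicit, typed argument.
        for _ in range(n_shots):
            pass  # pulse sequence using freq would go here

    def prepare(self):
        # Hypothetical compile step: the static parameter (n_shots) is bound
        # and baked in; forgetting to bind it would be a compile-time error.
        self.compiled_probe = self.compile_kernel(self.probe, n_shots=100)

    def run(self):
        # Per-run values (freq) are supplied without forcing a recompile.
        self.compiled_probe(freq=101.5e6)
```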

lriesebos commented 3 years ago

At this moment I do not have a lot to add to this issue, but I did want to mention two things:

  1. As mentioned above, something like caching (sub-)kernels could probably already save a lot of compilation time without changing the whole compiler. It would basically boil down to "smarter" usage of the existing compiler. Not sure how easy it would be to implement in the current design though.
  2. Personally I am a big fan of the Python subset for kernels. The constant re-inventing of domain-specific languages for kernels is a pain for tools and users. Taking a subset of an existing language is a great idea. In addition, it makes it much easier to write a functional simulator for kernel code.

dnadlinger commented 3 years ago

  1. As mentioned above, something like caching (sub-)kernels could probably already save a lot of compilation time without changing the whole compiler. It would basically boil down to "smarter" usage of the existing compiler. Not sure how easy it would be to implement in the current design though.

I think we all agree that caching is a good idea; the question is just how easy it is to implement in the current architecture (in terms of implementation effort, as well as ease of understanding for users).

  2. Personally I am a big fan of the Python subset for kernels. The constant re-inventing of domain-specific languages for kernels is a pain for tools and users. Taking a subset of an existing language is a great idea.

I don't think anybody has suggested inventing a new DSL for kernels; rather, the question is whether it might make sense to switch to a language better suited to real-time code on a fairly resource-constrained system than Python, given Python's lack of facilities for deterministic lifetime management, its lack of templates/generics (in view of static typing), and its strong reliance on exceptions for error management (with those not being the most natural choice for an environment without dynamic memory management).

Also, note that ARTIQ Python isn't a strict subset of Python. We do try to keep the semantics as close to host Python as possible for code that compiles/runs in either, but to me, the question is at what point our variant starts looking sufficiently different from regular Python that it is easier to teach another, existing language (which people might already be familiar with, and for which there is already ample documentation). We probably aren't at that point quite yet, but I wonder whether e.g. the introduction of templates to allow us to do away with global type inference would push us past it.
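
To make the global-type-inference point concrete, compare the two styles below (a sketch only, not a syntax proposal; names are illustrative):

```python
from artiq.experiment import *


class TypingSketch(EnvExperiment):
    def build(self):
        self.setattr_device("core")

    @kernel
    def pulse_inferred(self, t):
        # Today: the type of t is inferred globally from its call sites, so
        # the compiler must see the whole program at once and a change in one
        # caller can retype (and break) unrelated code.
        delay(t)

    @kernel
    def pulse_annotated(self, t: TFloat):
        # Locally checkable: the signature alone fixes the type. Dropping
        # global type inference would push code towards this style, and
        # templates/generics would be needed to keep it reusable.
        delay(t)
```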

sbourdeauducq commented 2 years ago

Doesn't sound like there is anything actionable here - closing. The NAC3 repo is at https://git.m-labs.hk/M-Labs/nac3