m-labs / artiq

A leading-edge control system for quantum information experiments
https://m-labs.hk/artiq
GNU Lesser General Public License v3.0

RTIO loop execution time #298

Closed cjbe closed 8 years ago

cjbe commented 8 years ago

Using the current host and gateware (nist_qc2) on a KC705, the following code underflows when tPulse < 36us. This seems surprisingly slow compared to the example in the manual. Is this a regression in the new compiler?

class UnderflowTest(EnvExperiment):
    def build(self):
        self.setattr_device("core")
        self.out = self.get_device("ttl0")

    def run(self):
        self.runKernel(36*us) # This does not underflow 
        self.runKernel(35*us) # This underflows

    @kernel
    def runKernel(self, tPulse):
        self.core.break_realtime()
        for i in range(10000):
            self.out.pulse(tPulse)
            delay(1*us)

jordens commented 8 years ago

It needs to do soft-float arithmetic if you call it that way. If you convert to machine units, you can go down to 3 µs on Pipistrello and probably 2 µs on KC705.

    @kernel
    def runKernel(self, tPulse):
        t = seconds_to_mu(tPulse)
        self.core.break_realtime()
        for i in range(10000):
            self.out.pulse_mu(t)
            delay(1*us)
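
To see why the conversion helps, here is a minimal host-side sketch in plain Python (not kernel code). It assumes a 1 ns machine-unit period purely for illustration, and `seconds_to_mu_sketch` is a hypothetical stand-in for `seconds_to_mu`, not the ARTIQ implementation. The floating-point division happens once, before the loop, and the loop body is left with integer arithmetic only.

```python
REF_PERIOD = 1e-9  # assumed machine-unit period in seconds (illustrative value)
us = 1e-6


def seconds_to_mu_sketch(seconds, ref_period=REF_PERIOD):
    """Hypothetical stand-in for seconds_to_mu: seconds -> integer machine units."""
    return round(seconds / ref_period)


t = seconds_to_mu_sketch(36 * us)    # one floating-point division, outside the loop
gap = seconds_to_mu_sketch(1 * us)   # likewise converted once

cursor = 0                           # models the timeline cursor in machine units
for _ in range(10000):
    cursor += t                      # integer addition only
    cursor += gap

print(t, gap, cursor)
```

On a soft-float target every floating-point operation removed from the loop body saves an emulated library call, which is where the speedup would come from.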

cjbe commented 8 years ago

Aha! With the conversion done this way, the following sequence is on the edge of underflow with t = seconds_to_mu(940*ns), which gives a total loop time of 1.04 us.

        for i in range(10000):
            self.out.pulse_mu(t)
            delay(100*ns)
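
The 1.04 us figure is simply the pulse width plus the delay. A quick check in plain Python, working in integer nanoseconds (the variable names are illustrative):

```python
pulse_ns = 940   # pulse width at the edge of underflow, from the comment above
delay_ns = 100   # delay between pulses

loop_ns = pulse_ns + delay_ns   # time budget per loop iteration, in ns
loop_us = loop_ns / 1000        # same budget in microseconds
rate_khz = 1e6 / loop_ns        # loop iterations per millisecond

print(f"{loop_us:.2f} us per iteration, about {rate_khz:.0f} kHz")
```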

jordens commented 8 years ago

This would make a nice FAQ entry.

jordens commented 8 years ago

And it is a bit surprising to me that it does soft-fp in every iteration of the loop. @whitequark, shouldn't LLVM be able to figure out that it can move a large part of the timestamp calculation outside of the loop?

whitequark commented 8 years ago

@jordens Remember when I said LLVM's default pass pipeline is a poor fit for ARTIQ? Well, this is why. Try dumping LLVM and then running something like... opt -sroa -inline -licm -gvn -instcombine -dce. There's a world of difference, and in much more than just hoisting FP out of the loop.

whitequark commented 8 years ago

Reopening since this is still slow.

jordens commented 8 years ago

It's floating point. This will always be "too slow".

whitequark commented 8 years ago

@jordens Actually, no. LLVM ought to inline all of the functions in this experiment and hoist the entire calculation except for one addition out of the loop. But it doesn't currently.

whitequark commented 8 years ago

Oh, and once it does that, it should lower all floating point operations to integers, since RTIO timeline is in machine units, so it doesn't even have to stay floating point.
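
A rough illustration of that transformation, written as plain Python rather than LLVM IR. Both functions are hypothetical models of the loop, and the 1 ns machine-unit period is an assumption. The first redoes the floating-point conversion on every iteration; the second hoists it and keeps a single integer addition per iteration.

```python
REF_PERIOD = 1e-9  # assumed machine-unit period in seconds (illustrative value)


def timeline_naive(t_pulse, t_delay, n):
    """Model of the unoptimized loop: float conversion inside every iteration."""
    now = 0
    for _ in range(n):
        now += round(t_pulse / REF_PERIOD)   # float divide on each pass
        now += round(t_delay / REF_PERIOD)   # float divide on each pass
    return now


def timeline_hoisted(t_pulse, t_delay, n):
    """Model of the optimized loop: conversion hoisted, one integer add per pass."""
    step = round(t_pulse / REF_PERIOD) + round(t_delay / REF_PERIOD)
    now = 0
    for _ in range(n):
        now += step                          # the only work left in the loop
    return now


naive = timeline_naive(35e-6, 1e-6, 10000)
hoisted = timeline_hoisted(35e-6, 1e-6, 10000)
print(naive, hoisted, naive == hoisted)
```

Both versions reach the same final timeline position, so the hoisted form is a valid replacement whenever the pulse and delay values do not change inside the loop.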

jordens commented 8 years ago

Still, here the bug is with the user. It could have been kept benign here, as you describe, but no LLVM optimization will prevent all floating point emulation. In many cases the user needs to be aware that floating point can become expensive. There is floating point in many more places than just the timeline.

whitequark commented 8 years ago

The optimizations that should have happened here would benefit vast amounts of code, and the reasons they are not currently engaged (chiefly lack of inlining) are even more troubling. So regardless of whether code like this should be written, this particular snippet ought to be optimized well, if only as a representative sample.

whitequark commented 8 years ago

With the latest compiler this doesn't underflow with tPulse as low as 310ns, and there is no more low-hanging fruit.