m-labs / artiq

A leading-edge control system for quantum information experiments
https://m-labs.hk/artiq
GNU Lesser General Public License v3.0

Marking an array as kernel invariant doesn't allow loop to be optimized #560

Open philipkent opened 8 years ago

philipkent commented 8 years ago

We're running a few benchmarks of pulse sequences to try to resolve underflow errors that we're getting in one of our cooling experiments. We've found that loops over arrays marked as kernel invariant aren't optimized in the kernel code.

The first benchmark we ran uses as simple a loop as possible.

from artiq.experiment import *

class PulseTTLOptimized(EnvExperiment):
    """ Pulse TTL: Optimized """
    def build(self):
        self.setattr_device('core')
        self.setattr_device('ttl0')
        self.setattr_argument("num_pulses", NumberValue(scale=1, ndecimals=0, default=1000, step=1))
        self.setattr_argument("pulse_time", NumberValue(scale=1*ns, unit="ns", default=7*ns, ndecimals=0, step=1*ns))
        self.setattr_argument("delay", NumberValue(scale=1*ns, unit="ns", default=1*ns, ndecimals=0))

    @kernel
    def run(self):
        self.core.break_realtime()
        for _ in range(self.num_pulses):
            delay(self.delay)
            self.ttl0.pulse(self.pulse_time)

This can handle a maximum of about 1 pulse per microsecond. Any pulse time less than about 1 us starts to throw underflow exceptions.

The other benchmark we ran measures looping over an array of pulse times.

class PulseTTLArray(EnvExperiment):
    """ Pulse TTL: Times stored in array """
    kernel_invariants = {"pulse_times"}

    def build(self):
        self.setattr_device('core')
        self.setattr_device('ttl0')
        self.setattr_argument("num_pulses", NumberValue(scale=1, ndecimals=0, default=36, step=1))
        self.setattr_argument("pulse_time", NumberValue(scale=1*us, unit="us", default=1*us, ndecimals=0, step=1*us))
        self.setattr_argument("delay", NumberValue(scale=1*us, unit="us", default=2*us, ndecimals=0))

        self.ds = DatasetHandler(parent=self, prefix="benchmarks.pulse_ttl_array.")

    def prepare(self):
        self.pulse_times = [self.pulse_time for _ in range(self.num_pulses)]

    @kernel
    def run(self):
        self.core.break_realtime()
        for t in self.pulse_times:
            delay(self.delay)
            self.ttl0.pulse(t)

This can only handle about 1 pulse every 30 microseconds; any shorter pulse time starts to throw underflow exceptions. Marking pulse_times as kernel invariant made no difference.

My hunch is that the optimization rules won't unroll a loop that iterates over an array due to corner cases, but it seems like you should be able to mark an array as kernel invariant and explicitly allow the optimization when you know that the array won't change on the core.

The issue is that our cooling experiment needs to precalculate an array of pulse times on the host and then loop over that array on the core device. The times start at about 100 us and end at about 1 us. Once the times reach around 30 us we still have about 300 pulses left in the sequence, which is too much for the core according to our benchmark plots (below): they show a best case of roughly 300 pulses at 27 us.
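For concreteness, here is a hedged sketch of the kind of host-side schedule we mean; the exponential ramp shape and the numbers are illustrative only, not our actual cooling sequence:

    # assumes "import numpy as np" at the top of the file
    def prepare(self):
        # Illustrative only: ramp the pulse time from ~100 us down to ~1 us
        # over num_pulses steps. The whole schedule is known on the host
        # before the kernel runs, so in principle nothing about it can
        # change on the core device.
        self.pulse_times = [float(t) for t in np.logspace(
            np.log10(100*us), np.log10(1*us), int(self.num_pulses))]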

Our benchmark plots: the y-axis is the number of pulses we can generate without an underflow, plotted against the pulse time on the x-axis.

[Figure: benchmark_pulse_ttl_optimized] Optimized: there doesn't seem to be any limit on the number of pulses when the pulse time is at least about 1.2 us.

[Figure: benchmark_pulse_ttl_array] Not optimized: the best the core can do is about 1 pulse every 30 us before underflow errors start being thrown.

whitequark commented 8 years ago

My hunch is that the optimization rules won't unroll a loop that iterates over an array due to corner cases

That's not how kernel invariants work. Specifically, the field that points to the array is indeed marked as invariant. The array itself is not, since it can be aliased in other places, e.g. you could reference it from a field and also pass it to some other function on the core device. Only one copy of every instance of an array is ever serialized.
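For illustration, a hedged sketch of the aliasing being described (the method name rescale is made up):

    @kernel
    def rescale(self, times):
        # Lists are mutable in kernel code, so this assignment is legal...
        times[0] = 0.0

    @kernel
    def run(self):
        # ...and because the list is passed here, the compiler cannot treat
        # its elements as constant, even though the field self.pulse_times
        # (the pointer to the list) is marked kernel invariant.
        self.rescale(self.pulse_times)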

A solution I see is introducing an immutablelist type that is the equivalent of marking every element of an array as kernel invariant. (This is distinct from tuple since tuple can contain elements of any type and thus cannot be iterated over in ARTIQ Python.)
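A hypothetical sketch of how that could look (immutablelist does not exist at this point; the name and API are illustrative):

    def prepare(self):
        # Hypothetical: promise the compiler that the elements, not just
        # the field pointing at them, will never change on the core device.
        self.pulse_times = immutablelist(
            [self.pulse_time] * int(self.num_pulses))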

sbourdeauducq commented 8 years ago

Instead of adding another type (and moving further from Python), how about allowing iteration on tuples whose elements are all the same type?
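As a hedged sketch, that proposal would make something like the following legal in kernel code (it is not supported at the time of writing):

    def prepare(self):
        # Every element is a float, so the tuple is homogeneous.
        self.pulse_times = tuple(
            self.pulse_time for _ in range(int(self.num_pulses)))

    @kernel
    def run(self):
        self.core.break_realtime()
        # Proposed: iterate over a homogeneous tuple as an ordinary loop.
        for t in self.pulse_times:
            delay(self.delay)
            self.ttl0.pulse(t)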

whitequark commented 8 years ago

Instead of adding another type (and moving further from Python), how about allowing iteration on tuples whose elements are all the same type?

We were planning to always unroll loops that iterate over tuples; unrolling is what would let a loop body see elements of different types. These two behaviors conflict: a loop over a tuple cannot both be always unrolled and compiled as an ordinary non-unrolled loop over homogeneous elements.

(This also makes generic functions more fragile, though it is a minor concern in comparison.)

sbourdeauducq commented 8 years ago

I would do it the other way around, in order to stay closer to Python semantics: support optimized iteration over homogeneous tuples without unrolling, and introduce a HeterogeneousCollection (or similarly named) type for the more special case of iterating (with unrolling) over disparate objects.
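A hypothetical sketch of that split (HeterogeneousCollection is a name under discussion, not an existing type, and process stands in for any per-type operation):

    # Homogeneous tuple: compiled as an ordinary loop, no unrolling.
    for t in self.pulse_times:
        self.ttl0.pulse(t)

    # HeterogeneousCollection: the loop is unrolled at compile time, so
    # each iteration may operate on an element of a different type.
    for x in HeterogeneousCollection((1, 2.0, True)):
        process(x)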

sbourdeauducq commented 8 years ago

inhomogeneous kernel lists #519

whitequark commented 7 years ago

I would do it the other way around, in order to stay closer to Python semantics: support optimized iteration over homogeneous tuples without unrolling, and introduce a HeterogeneousCollection (or similarly named) type for the more special case of iterating (with unrolling) over disparate objects.

I don't think this really works out nicely. Specifically, we already do "unrolling" when doing x, y = tup, and if we made tuples homogeneous-only they would become much less useful, essentially breaking the pattern of multiple return values.
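The pattern in question, sketched with an illustrative helper (window is made up; its return type would be inferred):

    @kernel
    def window(self):
        # Two return values of different types, packed into a tuple.
        return 100, 1.5*us

    @kernel
    def run(self):
        # x, y = tup is effectively compile-time "unrolling" of the tuple,
        # and it only works because tuples may be heterogeneous.
        n, t = self.window()
        for _ in range(n):
            delay(t)
            self.ttl0.pulse(t)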

sbourdeauducq commented 7 years ago

OK, point taken.