m-labs / artiq

A leading-edge control system for quantum information experiments
https://m-labs.hk/artiq
GNU Lesser General Public License v3.0

core device execution time variation #1988

Open jbqubit opened 1 year ago

jbqubit commented 1 year ago

Bug Report

Issue Details

I observe large variations in the execution time of kernels on the core device (Kasli v1.x). While the standard deviation of the execution time may be small, a small number of cases show anomalously long processing delays. That is, the execution time has a long tail, implying non-Gaussian statistics. This can be observed in the code examples below.

In my lab's experiment codebase we make heavy use of abstraction. We write modular kernel methods with sufficient built-in slack to reliably run without RTIOUnderflow (e.g. Doppler cooling, gates, detection). We combine these modular elements to produce large monolithic kernels. Unfortunately, the slack that works for the atomic modules often proves insufficient when many modules are called in succession. We suspect what's happening is a random walk in slack; with a sufficient number of reps we hit RTIOUnderflow.
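
Schematically, the pattern looks like this (a minimal sketch; module names, device names, and slack values are hypothetical, not our actual code):

from artiq.experiment import *

class SlackChainSketch(EnvExperiment):
    def build(self):
        self.setattr_device("core")
        self.setattr_device("ttl0")

    @kernel
    def cool(self):
        # built-in slack tuned so this module alone never underflows
        delay(20*us)
        self.ttl0.pulse(5*us)

    @kernel
    def detect(self):
        delay(20*us)
        self.ttl0.pulse(5*us)

    @kernel
    def run(self):
        self.core.reset()
        # Chained calls: whenever a module's CPU time exceeds the slack it adds,
        # the deficit carries over to the next module; over many reps the
        # accumulated random walk can exhaust the slack and raise RTIOUnderflow.
        for _ in range(1000):
            self.cool()
            self.detect()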

Actual (undesired) Behavior

Below are tables showing the execution time in ns for 80 successive kernel calls, for a variety of kernel methods. At the end of each table are summary statistics that exclude the first two reps; these are excluded because they are known to be reproducibly slower than subsequent reps.

The example code relies on get_rtio_counter_mu() to put a bound on kernel execution time. I note that the non-determinism of the get_rtio_counter_mu() method is documented; test_dummy() puts a bound on this contribution. The balance of the uncertainty seems attributable to the core device microprocessor.
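
The core of the timing loop is simply (excerpted from the full listing under Steps to Reproduce below):

t1 = self.core.get_rtio_counter_mu()
td = kernel_method()
delay(td)  # default td=self.mindelay, keeps kernel_method() from being optimized away
t2 = self.core.get_rtio_counter_mu()
self.durations[i] = t2 - t1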

Another manifestation of this is in https://github.com/m-labs/artiq/issues/1987.

$ artiq_run repository/core_device_process_durations_jwb.py
### test_dummy ###
  0:    1296,   1:      312,   2:       320,   3:       320,   4:       320,   5:       320,   6:       320,   7:       320,
  8:    320,   9:       320,  10:       320,  11:       312,  12:       320,  13:       320,  14:       312,  15:       320,
 16:    320,  17:       320,  18:       320,  19:       320,  20:       320,  21:       320,  22:       320,  23:       312,
 24:    320,  25:       320,  26:       320,  27:       320,  28:       320,  29:       320,  30:       312,  31:       320,
 32:    320,  33:       312,  34:       320,  35:       320,  36:       320,  37:       320,  38:       320,  39:       312,
 40:    312,  41:       320,  42:       320,  43:       320,  44:       320,  45:       312,  46:       320,  47:       320,
 48:    320,  49:       320,  50:       320,  51:       320,  52:       320,  53:       320,  54:       320,  55:       320,
 56:    320,  57:       320,  58:       320,  59:       320,  60:       320,  61:       320,  62:       312,  63:       320,
 64:    320,  65:       320,  66:       320,  67:       320,  68:       320,  69:       320,  70:       320,  71:       320,
 72:    320,  73:       320,  74:       320,  75:       312,  76:       320,  77:       320,  78:       312,  79:       320,

mean:       319, std:       3, max deviation:       8, max frac deviation: 0.025

### test_ttl_pulse ###
  0:    4320,   1:      920,   2:       904,   3:       896,   4:       896,   5:       896,   6:       904,   7:       904,
  8:    904,   9:       896,  10:       904,  11:       896,  12:       896,  13:       904,  14:       896,  15:       896,
 16:    896,  17:       904,  18:       896,  19:       896,  20:       904,  21:       896,  22:       896,  23:       904,
 24:    896,  25:       904,  26:       904,  27:       896,  28:       896,  29:       904,  30:       904,  31:       896,
 32:    904,  33:       904,  34:       904,  35:       896,  36:       896,  37:       904,  38:       904,  39:       896,
 40:    896,  41:       896,  42:       896,  43:       896,  44:       896,  45:       904,  46:       904,  47:       896,
 48:    896,  49:       904,  50:       904,  51:       896,  52:       896,  53:       896,  54:       904,  55:       896,
 56:    896,  57:       904,  58:       904,  59:       896,  60:       904,  61:       904,  62:       904,  63:       896,
 64:    896,  65:       904,  66:       904,  67:       904,  68:       896,  69:       896,  70:       896,  71:       904,
 72:    904,  73:       896,  74:       896,  75:       904,  76:       904,  77:       896,  78:       896,  79:       896,

mean:       900, std:       4, max deviation:       8, max frac deviation: 0.009

### test_reset ###
  0:   5814680,   1:   4844920,   2:   4875376,   3:   4876640,   4:   4929560,   5:   5809232,   6:   4876648,   7:   4878136,
  8:   4876536,   9:   5859448,  10:   4915488,  11:   4825176,  12:   4876984,  13:   4879168,  14:   5910032,  15:   4876536,
 16:   4825952,  17:   4877464,  18:   4930624,  19:   5858080,  20:   4878960,  21:   4875568,  22:   4826144,  23:   5911616,
 24:   4863920,  25:   4872912,  26:   4882088,  27:   4876640,  28:   5859728,  29:   4876536,  30:   4878256,  31:   4876456,
 32:   4930040,  33:   5807816,  34:   4876368,  35:   4878208,  36:   4877904,  37:   5859864,  38:   4863872,  39:   4877456,
 40:   4877024,  41:   4876944,  42:   5859152,  43:   4879648,  44:   4876496,  45:   4877056,  46:   4928392,  47:   5808752,
 48:   4878328,  49:   4876344,  50:   4877624,  51:   5858760,  52:   4913240,  53:   4879072,  54:   4876544,  55:   4826096,
 56:   5911192,  57:   4876440,  58:   4878592,  59:   4877904,  60:   4876592,  61:   5859632,  62:   4876592,  63:   4877936,
 64:   4861432,  65:   4945680,  66:   5807392,  67:   4876568,  68:   4877800,  69:   4878480,  70:   5857784,  71:   4862752,
 72:   4878096,  73:   4877064,  74:   4878816,  75:   5858208,  76:   4878056,  77:   4876640,  78:   4877704,  79:   4930336,

mean:   5079931, std:   395087, max deviation:   1086440, max frac deviation: 0.214

### test_simple_float ###
  0:    1704,   1:      336,   2:       320,   3:       320,   4:       320,   5:       320,   6:       320,   7:       320,
  8:    320,   9:       320,  10:       312,  11:       320,  12:       320,  13:       312,  14:       312,  15:       320,
 16:    320,  17:       312,  18:       312,  19:       320,  20:       320,  21:       320,  22:       320,  23:       312,
 24:    320,  25:       320,  26:       320,  27:       320,  28:       320,  29:       320,  30:       320,  31:       320,
 32:    320,  33:       320,  34:       320,  35:       320,  36:       312,  37:       320,  38:       312,  39:       312,
 40:    312,  41:       320,  42:       320,  43:       320,  44:       320,  45:       320,  46:       312,  47:       312,
 48:    320,  49:       312,  50:       312,  51:       320,  52:       320,  53:       312,  54:       312,  55:       320,
 56:    320,  57:       312,  58:       312,  59:       320,  60:       312,  61:       312,  62:       320,  63:       312,
 64:    312,  65:       320,  66:       320,  67:       320,  68:       320,  69:       320,  70:       320,  71:       320,
 72:    320,  73:       320,  74:       312,  75:       320,  76:       320,  77:       312,  78:       312,  79:       320,

mean:       317, std:       4, max deviation:       8, max frac deviation: 0.025

### test_long_float ###
  0:   5027304,   1:   4846088,   2:   4843176,   3:   4857208,   4:   4851056,   5:   4840888,   6:   4848184,   7:   4847440,
  8:   4845248,   9:   4852192,  10:   4846528,  11:   4843688,  12:   4853344,  13:   4848872,  14:   4845480,  15:   4847336,
 16:   4845144,  17:   4858360,  18:   4848984,  19:   4845440,  20:   4847328,  21:   4845112,  22:   4848144,  23:   4843696,
 24:   4850608,  25:   4956864,  26:   4847328,  27:   4845336,  28:   4848136,  29:   4845808,  30:   4848472,  31:   4841272,
 32:   4841992,  33:   4846704,  34:   4851592,  35:   4846552,  36:   4843512,  37:   4858800,  38:   4855152,  39:   4842928,
 40:   4849728,  41:   4847440,  42:   4845256,  43:   4852216,  44:   4846520,  45:   4843552,  46:   4853344,  47:   4848984,
 48:   4842704,  49:   4849600,  50:   4845136,  51:   4858352,  52:   4848976,  53:   4845440,  54:   4847280,  55:   4845328,
 56:   4848144,  57:   4843832,  58:   4850608,  59:   4841728,  60:   4842024,  61:   4846704,  62:   4851632,  63:   4846552,
 64:   4843640,  65:   4858816,  66:   4854456,  67:   4852744,  68:   4846424,  69:   4857208,  70:   4851056,  71:   4840880,
 72:   4848184,  73:   4847368,  74:   4851296,  75:   4843832,  76:   4850736,  77:   4841720,  78:   4842232,  79:   4850928,

mean:   4849314, std:   13044, max deviation:   115984, max frac deviation: 0.024

### test_urukul ###
  0:    24992,   1:     5424,   2:      5328,   3:      5976,   4:      5352,   5:      5304,   6:      5368,   7:      5264,
  8:    5344,   9:      5304,  10:      5224,  11:      5336,  12:      5608,  13:      5992,  14:      5288,  15:      5312,
 16:    5608,  17:      5328,  18:      5336,  19:      5360,  20:      5512,  21:      5328,  22:      5400,  23:      5176,
 24:    5416,  25:      5336,  26:      5640,  27:      5184,  28:      5224,  29:      5288,  30:      5384,  31:      5400,
 32:    5456,  33:      5392,  34:      5336,  35:      5376,  36:      5256,  37:      5384,  38:      5280,  39:      5472,
 40:    5192,  41:      5336,  42:      5208,  43:      5368,  44:      5536,  45:      5352,  46:      5208,  47:      5240,
 48:    5248,  49:      5592,  50:      5296,  51:      5600,  52:      5192,  53:      5272,  54:      5320,  55:      5464,
 56:    5296,  57:      5288,  58:      5368,  59:      5376,  60:      5360,  61:      5360,  62:      5456,  63:      5208,
 64:    5376,  65:      5248,  66:      5616,  67:      5208,  68:      5344,  69:      5288,  70:      5272,  71:      5336,
 72:    5360,  73:      5208,  74:      5584,  75:      5248,  76:      5824,  77:      5368,  78:      5256,  79:      5264,

mean:   5369, std:      158, max deviation:     816, max frac deviation: 0.152

### test_dataset ###
  0:    116560,   1:    72136,   2:     47848,   3:     45720,   4:     44864,   5:     44584,   6:     43184,   7:     43272,
  8:    44168,   9:     44856,  10:     43952,  11:     45192,  12:     44880,  13:     45272,  14:     45744,  15:     44544,
 16:    43680,  17:     45656,  18:     44592,  19:     44016,  20:     44368,  21:     44520,  22:     43920,  23:     44840,
 24:    45016,  25:     44096,  26:     44592,  27:     45736,  28:     44520,  29:     45104,  30:     43736,  31:     43952,
 32:    45632,  33:     44816,  34:     44376,  35:     45416,  36:     44520,  37:     46088,  38:     43912,  39:     44576,
 40:    44456,  41:     44248,  42:     44072,  43:     44136,  44:     44640,  45:     46440,  46:     45312,  47:     44840,
 48:    44936,  49:     44624,  50:     44640,  51:     43704,  52:     43896,  53:     44416,  54:     45656,  55:     44840,
 56:    45256,  57:     44448,  58:     45616,  59:     43960,  60:     43696,  61:     44544,  62:     44520,  63:     44064,
 64:    44056,  65:     74952,  66:     44480,  67:     44624,  68:     43856,  69:     43984,  70:     43624,  71:     43496,
 72:    44632,  73:     44696,  74:     44776,  75:     45136,  76:     45856,  77:     45104,  78:     43992,  79:     45376,

mean:   45043, std:     3493, max deviation:    31768, max frac deviation: 0.705

Repeating these tests 800 times per loop and printing only the summary statistics:

$ artiq_run repository/core_device_process_durations_jwb.py
### test_dummy ###
mean:       318, std:       3, max deviation:       8, max frac deviation: 0.025

### test_ttl_pulse ###
mean:       901, std:       4, max deviation:       8, max frac deviation: 0.009

### test_reset ###
mean:   5088269, std:   398924, max deviation:   1284992, max frac deviation: 0.253

### test_simple_float ###
mean:       318, std:       4, max deviation:       8, max frac deviation: 0.025

### test_long_float ###
mean:   4844143, std:   10057, max deviation:   122656, max frac deviation: 0.025

### test_urukul ###
mean:   5365, std:      151, max deviation:     1016, max frac deviation: 0.189

### test_dataset ###
mean:   44621, std:     836, max deviation:     10920, max frac deviation: 0.245

It's common to observe that tests with small variability show larger variability a minute later. Compare the maximum fractional deviation above for test_dataset with 80 reps versus 800 reps.

Steps to Reproduce

from artiq.experiment import *
import numpy as np

class CoreDeviceProcessDurationsJWB(EnvExperiment):

    kernel_invariants = {"ftw0", "asf0", "mindelay"}

    def build(self):
        self.trials = 800
        self.setattr_device("core")
        self.setattr_device("ttl7")
        self.ttl = self.ttl7
        self.urukul_channel = self.get_device('urukul0_ch0')
        self.urukul_switch = self.get_device('ttl_urukul0_sw0')
        #self.zotino = self.get_device('zotino0')

        self.ftw0 = self.urukul_channel.frequency_to_ftw(100*MHz)
        self.asf0 = self.urukul_channel.amplitude_to_asf(0.85)
        self.mindelay = 1*us

    def prepare(self):
        self.durations = np.full(self.trials, 0, dtype=np.int64)

    def print_results(self, test_name, durations):
        print("### {} ### ".format(test_name))
        first = 2
        mean = np.mean(durations[first:])
        std = np.std(durations[first:])
        maxdev = np.max(durations[first:]) - np.min(durations[first:])
        maxfdev = maxdev / np.mean(durations[first:])
        deviations_found = 0
        s = ""
        for i in range(len(durations)):
            s += "{:3}: {:9.0f}, ".format(i, durations[i])
            if (i+1)%8 == 0:
                s += '\n'
        #print(s)
        s = "mean: {:9.0f}, std: {:9.0f}, max deviation: {:9.0f}, max frac deviation: {:.3f}\n".format(
            mean, std, maxdev, maxfdev)
        print(s)

    def run(self):
        tests = ['test_dummy', 'test_ttl_pulse', 'test_reset', 'test_simple_float', 'test_long_float']
        tests = tests + ['test_urukul', 'test_dataset']
        for test in tests:
            testf = getattr(self, test)
            durations = self.timer(testf)
            self.print_results(test, durations)

    @kernel
    def timer(self, kernel_method):
        self.core.reset()
        delay(1*ms)
        for i in range(self.trials):
            delay(1*ms)
            t1 = self.core.get_rtio_counter_mu()
            td = kernel_method()
            delay(td)  # default td=self.mindelay, required so kernel_method() is not optimized away
            t2 = self.core.get_rtio_counter_mu()
            self.durations[i] = t2 - t1
            self.core.wait_until_mu(now_mu())
        return self.durations

    @kernel
    def test_dummy(self):
        return self.mindelay

    @kernel
    def test_ttl_pulse(self):
        self.ttl.pulse(self.mindelay)
        return self.mindelay

    @kernel
    def test_reset(self):
        self.core.reset()
        return self.mindelay

    @kernel
    def test_simple_float(self):
        td = np.cos(3.14159/2) + np.sin(0.0001) + np.log(1.001)  # ~0
        return self.mindelay + td*ns

    @kernel
    def test_long_float(self):
        td = 0.0
        for _ in range(10):
            td -= np.cos(3.14159 / 2) + np.sin(td) + np.log(1.00001)  # ~0
            td += np.cos(3.14159 / 2) + np.sin(td) + np.log(1.00001)  # ~0
        return self.mindelay + td*ns

    @kernel
    def test_urukul(self):
        delay(1*ms)
        self.urukul_channel.set_mu(ftw=self.ftw0, asf=self.asf0)
        self.urukul_switch.pulse(10*us)
        return self.mindelay

    @kernel
    def test_dataset(self):
        # Note that set_dataset is decorated with @rpc(flags={"async"})
        self.set_dataset('ds0', 1.1, broadcast=True)
        return self.mindelay

Expected Behavior

Questions

System

sbourdeauducq commented 1 year ago

The only way to guarantee fully deterministic CPU execution times is to use SRAM only; as soon as you use SDRAM and caches, you get this sort of issue. Zynq is even worse, though its much higher speed makes up for it.

SRAM is a very limited resource on FPGAs and it tends to be slow for large blocks. Can you use ARTIQ_DUMP_ELF to get the binary for the largest kernel in your experiment and see if it could fit?

jbqubit commented 1 year ago

I assume you mean BRAM, not SRAM.

Can you use ARTIQ_DUMP_ELF to get the binary for the largest kernel in your experiment and see if it could fit?

The constraints for code compiled to ELF are extreme.

I can see this is a problem inherent in SDRAM (link). Have you looked into using RLDRAM? The arXiv paper describes a memory controller design that predictably manages accesses to the RLDRAM.

sbourdeauducq commented 1 year ago

BRAM is a particular form of SRAM. The constraints you mention are specific to artiq_compile, not ARTIQ_DUMP_ELF; the latter has no restrictions. The problem is inherent to all DRAM; the paper only mentions a reduction of the latency variation, not an elimination. And the vast majority of boards don't support RLDRAM.

jbqubit commented 1 year ago

The constraints you mention are specific to artiq_compile, not ARTIQ_DUMP_ELF; the latter has no restrictions.

I'm not familiar with this feature of ARTIQ. Please give an example of how to use it to do an ELF dump. There are a lot of options for the Target class...

The problem is inherent to all DRAM; the paper only mentions a reduction of the latency variation, not an elimination. And the vast majority of boards don't support RLDRAM.

A 10X reduction in latency variation would be a big win. Sinara builds its own boards, so that's not necessarily a roadblock. I guess it would entail writing another memory controller. @gkasprow Have you worked with RLDRAM?

dnadlinger commented 1 year ago

Please give an example of how to use it to do an ELF dump.

Just set the ARTIQ_DUMP_ELF environment variable to a string (which will become the file name), either when calling artiq_run, or when starting artiq_master (the latter IIRC will just write to the result file directory).
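
For example (the experiment path here is just a placeholder):

$ ARTIQ_DUMP_ELF=artiq_dump.elf artiq_run repository/your_experiment.py
$ ls -lh artiq_dump.elf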

sbourdeauducq commented 1 year ago

A 10X reduction in latency variation would be a big win. Sinara builds its own boards so that's not necessarily a roadblock.

Just put SRAM on the board then. But SRAM is expensive, and we need to support the existing fleet as well as new boards for which SRAM isn't necessarily a good option. Preloading the kernel, or parts of it, into existing on-chip SRAM (e.g. FPGA BRAM) might be better.

jbqubit commented 1 year ago

I followed the advice of @dnadlinger for a typical experiment in my lab. Here's the size.

$ ls -lh
total 1.8M
-rw-r--r--   1 iirabi iirabi  33K Nov 14 17:38 artiq_dump.elf

How much BRAM is typically available on a Kasli running ARTIQ? Is there an established path for fully loading a kernel into BRAM?

sbourdeauducq commented 1 year ago

33K would fit in BRAM, but this needs gateware, firmware, and perhaps (minor) compiler modifications.

jbqubit commented 1 year ago

Is there a way to do an initial test and see how the latency variation improves? Has the latency variation for Zynq been evaluated yet? It may be that both core device variants would benefit from use of BRAM for sufficiently simple kernels.

sbourdeauducq commented 1 year ago

Zynq would use OCM, not BRAM.

jbqubit commented 1 year ago

Has the latency variation for Zynq been evaluated yet? I'm curious how the test I wrote above would perform.