m-labs / artiq

A leading-edge control system for quantum information experiments
https://m-labs.hk/artiq
GNU Lesser General Public License v3.0

DMA: Increase buffer size #1552

Closed hermitdemschoenenleben closed 2 years ago

hermitdemschoenenleben commented 3 years ago

Bug Report

One-Line Summary

When submitting a few thousand commands using DMA, Kasli crashes.

Issue Details

I'm trying to program a few thousand commands using DMA. This crashes Kasli completely: it can no longer even be pinged. I suspect this happens because the DMA memory size is exceeded. If that's the case, is there any way to compile the gateware with a larger DMA memory, or is the size limited by hardware?

Steps to Reproduce

from artiq.experiment import EnvExperiment, kernel, ms

class BenTestSequence(EnvExperiment):
    def build(self):
        self.setattr_device("core")

        for u in range(3):
            self.setattr_device("urukul{}_cpld".format(u))
            for ch in range(4):
                self.setattr_device("urukul{}_ch{}".format(u, ch))

        self.setattr_device("core_dma")

    def prepare(self):
        self.my_counter = 0

    @kernel
    def run(self):
        print('start kernel')
        self.core.reset()
        self.core.break_realtime()

        self.urukul0_cpld.init()
        self.urukul0_ch0.init()
        self.urukul0_ch1.init()
        self.urukul0_ch2.init()
        self.urukul0_ch3.init()

        # turn all channels on
        self.urukul0_cpld.cfg_sw(0, 1)
        self.urukul0_cpld.cfg_sw(1, 1)
        self.urukul0_cpld.cfg_sw(2, 1)
        self.urukul0_cpld.cfg_sw(3, 1)

        with self.core_dma.record("pulses"):
            for i in range(10000):
                self.urukul0_ch0.set_mu(0)

                self.my_counter = self.my_counter + 1
                if self.my_counter % 1000 == 0:
                    print(self.my_counter)

                delay(100*ms)

Expected Behavior

No crash, or at least an error message.

Actual (undesired) Behavior

❯ time artiq_run repository/ben_minimal.py
start kernel
1000
2000
3000
4000
5000
6000
7000
8000
Traceback (most recent call last):
  File "/home/ben/.conda/envs/artiq/bin/artiq_run", line 10, in <module>
    sys.exit(main())
  File "/home/ben/.conda/envs/artiq/lib/python3.7/site-packages/artiq/frontend/artiq_run.py", line 225, in main
    return run(with_file=True)
  File "/home/ben/.conda/envs/artiq/lib/python3.7/site-packages/artiq/frontend/artiq_run.py", line 211, in run
    raise exn
  File "/home/ben/.conda/envs/artiq/lib/python3.7/site-packages/artiq/frontend/artiq_run.py", line 204, in run
    exp_inst.run()
  File "/home/ben/.conda/envs/artiq/lib/python3.7/site-packages/artiq/language/core.py", line 54, in run_on_core
    return getattr(self, arg).run(run_on_core, ((self,) + k_args), k_kwargs)
  File "/home/ben/.conda/envs/artiq/lib/python3.7/site-packages/artiq/coredevice/core.py", line 137, in run
    self.comm.serve(embedding_map, symbolizer, demangler)
  File "/home/ben/.conda/envs/artiq/lib/python3.7/site-packages/artiq/coredevice/comm_kernel.py", line 501, in serve
    self._read_header()
  File "/home/ben/.conda/envs/artiq/lib/python3.7/site-packages/artiq/coredevice/comm_kernel.py", line 116, in _read_header
    (sync_byte, ) = struct.unpack("B", self.read(1))
  File "/home/ben/.conda/envs/artiq/lib/python3.7/site-packages/artiq/coredevice/comm_kernel.py", line 97, in read
    rn = self.socket.recv(min(8192, length - len(r)))
TimeoutError: [Errno 110] Connection timed out
artiq_run repository/ben_minimal.py  1,31s user 0,06s system 11% cpu 11,690 total

Your System (omit irrelevant parts)

❯ artiq_coremgmt -D 192.168.0.200 log                  
[     0.000009s]  INFO(runtime): ARTIQ runtime starting...
[     0.003934s]  INFO(runtime): software ident 5.7136.15bb0fa9;hub
[     0.009864s]  INFO(runtime): gateware ident 5.7136.15bb0fa9;hub
[     0.015811s]  INFO(runtime): log level set to INFO by default
[     0.021531s]  INFO(runtime): UART log level set to INFO by default
[     0.027914s]  INFO(runtime::rtio_clocking): using internal RTIO clock (by default)
[     0.304791s]  INFO(board_artiq::si5324): waiting for Si5324 lock...
[     3.052754s]  INFO(board_artiq::si5324):   ...locked
[     3.081965s]  INFO(runtime): network addresses: MAC=54-10-ec-a9-d2-7b IPv4=192.168.0.200 IPv6-LL=fe80::5610:ecff:fea9:d27b IPv6=no configured address

sbourdeauducq commented 3 years ago

> I noticed that this results in kasli crashing completely such that it can't even be pinged anymore.

Anything on the UART log?

Nobody has really invested in DMA support on Kasli so far; we just threw in the code that worked on KC705 with only minimal testing (see also https://github.com/m-labs/artiq/issues/946). M-Labs could probably fund it from hardware sales if someone wants to work on these issues.

hermitdemschoenenleben commented 3 years ago

[    57.889119s]  INFO(runtime::mgmt): changing UART log level to DEBUG
[    69.124864s]  INFO(runtime::session): new connection from 192.168.0.24:37226
[    69.281113s]  INFO(runtime::kern_hwreq): resetting RTIO
panic at runtime/main.rs:278:5heap view: BUSY 0x40147000 + 0xc + 0x24 -> 0x40147030
BUSY 0x40147030 + 0xc + 0xc -> 0x40147048
BUSY 0x40147048 + 0xc + 0xc -> 0x40147060
BUSY 0x40147060 + 0xc + 0xc -> 0x40147078
BUSY 0x40147078 + 0xc + 0x18 -> 0x4014709c
BUSY 0x4014709c + 0xc + 0x24 -> 0x401470cc
BUSY 0x401470cc + 0xc + 0x1008 -> 0x401480e0
BUSY 0x401480e0 + 0xc + 0x3c -> 0x40148128
IDLE 0x40148128 + 0xc + 0xc -> 0x40148140
BUSY 0x40148140 + 0xc + 0x1008 -> 0x40149154
BUSY 0x40149154 + 0xc + 0x3c -> 0x4014919c
BUSY 0x4014919c + 0xc + 0xc -> 0x401491b4
BUSY 0x401491b4 + 0xc + 0x4008 -> 0x4014d1c8
BUSY 0x4014d1c8 + 0xc + 0x3c -> 0x4014d210
IDLE 0x4014d210 + 0xc + 0x18 -> 0x4014d234
BUSY 0x4014d234 + 0xc + 0x1008 -> 0x4014e248
BUSY 0x4014e248 + 0xc + 0x3c -> 0x4014e290
BUSY 0x4014e290 + 0xc + 0x1008 -> 0x4014f2a4
BUSY 0x4014f2a4 + 0xc + 0x3c -> 0x4014f2ec
BUSY 0x4014f2ec + 0xc + 0x24 -> 0x4014f31c
BUSY 0x4014f31c + 0xc + 0x3c -> 0x4014f364
IDLE 0x4014f364 + 0xc + 0xc0 -> 0x4014f430
BUSY 0x4014f430 + 0xc + 0x108 -> 0x4014f544
BUSY 0x4014f544 + 0xc + 0x198 -> 0x4014f6e8
BUSY 0x4014f6e8 + 0xc + 0x2004 -> 0x401516f8
IDLE 0x401516f8 + 0xc + 0x108 -> 0x4015180c
BUSY 0x4015180c + 0xc + 0x108 -> 0x40151920
BUSY 0x40151920 + 0xc + 0x108 -> 0x40151a34
IDLE 0x40151a34 + 0xc + 0x18fc -> 0x4015333c
BUSY 0x4015333c + 0xc + 0x48 -> 0x40153390
IDLE 0x40153390 + 0xc + 0x3c -> 0x401533d8
BUSY 0x401533d8 + 0xc + 0x30 -> 0x40153414
BUSY 0x40153414 + 0xc + 0x2004 -> 0x40155424
BUSY 0x40155424 + 0xc + 0x10008 -> 0x40165438
BUSY 0x40165438 + 0xc + 0x4008 -> 0x4016944c
IDLE 0x4016944c + 0xc + 0xa23c -> 0x40173694
BUSY 0x40173694 + 0xc + 0x108 -> 0x401737a8
IDLE 0x401737a8 + 0xc + 0x420c -> 0x401779c0
BUSY 0x401779c0 + 0xc + 0x804 -> 0x401781d0
BUSY 0x401781d0 + 0xc + 0x804 -> 0x401789e0
IDLE 0x401789e0 + 0xc + 0x3e4 -> 0x40178dd0
BUSY 0x40178dd0 + 0xc + 0x10008 -> 0x40188de4
BUSY 0x40188de4 + 0xc + 0x10008 -> 0x40198df8
BUSY 0x40198df8 + 0xc + 0x108 -> 0x40198f0c
BUSY 0x40198f0c + 0xc + 0x174 -> 0x4019908c
IDLE 0x4019908c + 0xc + 0x4014 -> 0x4019d0ac
BUSY 0x4019d0ac + 0xc + 0x7c8 -> 0x4019d880
BUSY 0x4019d880 + 0xc + 0x10008 -> 0x401ad894
BUSY 0x401ad894 + 0xc + 0x10008 -> 0x401bd8a8
IDLE 0x401bd8a8 + 0xc + 0x3528 -> 0x401c0ddc
BUSY 0x401c0ddc + 0xc + 0x10008 -> 0x401d0df0
IDLE 0x401d0df0 + 0xc + 0xf0024 -> 0x402c0e20
BUSY 0x402c0e20 + 0xc + 0xffff0 -> 0x403c0e1c
IDLE 0x403c0e1c + 0xc + 0x3f1d8 -> 0x0
 === busy: 0x1722f0 idle: 0x146a88 meta: 0x288 total: 0x2b9000

cannot allocate layout: Layout { size_: 2097120, align_: 1 }
backtrace for software version 5.7136.15bb0fa9;hub:
0x40043ff8
0x4002118c
0x40043dfc
0x4001b4ac
0x4001b0e8
0x4001b53c
0x40032d88
0x40025b3c
0x40024610
halting.
use `artiq_coremgmt config write -s panic_reset 1` to restart instead
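
The heap view in the panic log above can be checked with a few lines of host-side Python. This is only a sketch: the field names (`header`, `size`) are my reading of the `STATE address + 0xc + size -> next` lines, and the excerpt below contains just the IDLE entries plus the one large BUSY entry copied from the log. It sums the idle space, finds the largest idle block, and compares both with the failed request of 2097120 bytes.

```python
import re

# Entries copied verbatim from the heap view in the panic log:
# every IDLE block, plus the single largest BUSY block.
HEAP_VIEW_EXCERPT = """\
IDLE 0x40148128 + 0xc + 0xc -> 0x40148140
IDLE 0x4014d210 + 0xc + 0x18 -> 0x4014d234
IDLE 0x4014f364 + 0xc + 0xc0 -> 0x4014f430
IDLE 0x401516f8 + 0xc + 0x108 -> 0x4015180c
IDLE 0x40151a34 + 0xc + 0x18fc -> 0x4015333c
IDLE 0x40153390 + 0xc + 0x3c -> 0x401533d8
IDLE 0x4016944c + 0xc + 0xa23c -> 0x40173694
IDLE 0x401737a8 + 0xc + 0x420c -> 0x401779c0
IDLE 0x401789e0 + 0xc + 0x3e4 -> 0x40178dd0
IDLE 0x4019908c + 0xc + 0x4014 -> 0x4019d0ac
IDLE 0x401bd8a8 + 0xc + 0x3528 -> 0x401c0ddc
IDLE 0x401d0df0 + 0xc + 0xf0024 -> 0x402c0e20
BUSY 0x402c0e20 + 0xc + 0xffff0 -> 0x403c0e1c
IDLE 0x403c0e1c + 0xc + 0x3f1d8 -> 0x0
"""

# From "cannot allocate layout: Layout { size_: 2097120, align_: 1 }".
REQUESTED = 2097120

# Assumed line layout: STATE address + header + size -> next
LINE_RE = re.compile(
    r"(BUSY|IDLE) (0x[0-9a-f]+) \+ (0x[0-9a-f]+) \+ (0x[0-9a-f]+) -> (0x[0-9a-f]+)"
)


def parse_heap_view(text):
    """Return a list of (state, address, size) tuples, one per heap block."""
    blocks = []
    for line in text.splitlines():
        match = LINE_RE.search(line)
        if match:
            state, addr, _header, size, _next = match.groups()
            blocks.append((state, int(addr, 16), int(size, 16)))
    return blocks


blocks = parse_heap_view(HEAP_VIEW_EXCERPT)
idle_sizes = [size for state, _, size in blocks if state == "IDLE"]
busy_sizes = [size for state, _, size in blocks if state == "BUSY"]

total_idle = sum(idle_sizes)
largest_idle = max(idle_sizes)
largest_busy = max(busy_sizes)

print("requested bytes:    ", REQUESTED)
print("total idle bytes:   ", total_idle)
print("largest idle block: ", largest_idle)
print("largest busy block: ", largest_busy)
print("request fits in largest idle block:", REQUESTED <= largest_idle)
print("request is twice the largest busy block:", REQUESTED == 2 * largest_busy)
```

The idle total computed here matches the `idle: 0x146a88` figure in the log's summary line, and it is smaller than the requested size, so the allocation could not succeed even without fragmentation. The request is also exactly twice the size of the largest BUSY block (0xffff0 bytes). That is consistent with a growable buffer trying to double its capacity while the recording was in progress, though this is an inference from the numbers and not something stated in the thread.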

sbourdeauducq commented 3 years ago

Looks like a simple out-of-memory error. The memory layout is currently less than optimal for the amount of SDRAM we have (too much of it is allocated to the kernel stack), so we should be able to get much larger DMA buffers. A number of things would need tweaking, including:

sbourdeauducq commented 2 years ago

Should also have been fixed by https://github.com/m-labs/artiq/commit/92fd705990d785dcfee3b2283eeafee091c6ce90