m-labs / artiq

A leading-edge control system for quantum information experiments
https://m-labs.hk/artiq
GNU Lesser General Public License v3.0
434 stars 201 forks source link

Memory corruption in lwIP #637

Closed whitequark closed 7 years ago

whitequark commented 7 years ago

We currently have a major issue with lwIP: the internals of the stack get corrupted. The symptoms are as follows:

I do not see anything wrong with our usage of lwIP API, and I don't think the Rust code is overwriting lwIP stuff via either its heap or its stack (but we could really use the trivial MPU here...)

whitequark commented 7 years ago

For now the roadmap is as follows:

whitequark commented 7 years ago

The reproduction steps we have:

  1. artiq_run examples/phaser/repository/test_ad9154_status.py (needs phaser gateware)
  2. python3.5 -m unittest artiq.test.coredevice.test_rtio.CoredeviceTest.test_pulse_rate_dds
  3. rapidly schedule and cancel (through dashboard) an experiment that uses an async RPC many times, such as:
    
    from artiq.experiment import *

class Core_Test(EnvExperiment): def build(self): self.setattr_device("core")

@kernel
def run(self):
    self.core.reset()
    print('test')
dnadlinger commented 7 years ago

From working with embedded lwIP on the Zynq before, I'd be inclined to triple-check that the assumptions made by its API are followed (e.g. w.r.t. send buffer lifetimes and the copying flags). The lifetime mechanics are sometimes poorly documented to the point where they could be called buggy under a less charitable interpretation.

I also found the ability to compile an emulator binary for Linux to be invaluable (rigging up a tap device for networking and stubs for the various Xilinx platform methods), because then Valgrind and all the usual tools are available if invalid memory accesses are not outright turned into segfaults. No idea if that is currently feasible for ARTIQ, though.

whitequark commented 7 years ago

I'd be inclined to triple-check that the assumptions made by its API are followed (e.g. w.r.t. send buffer lifetimes and the copying flags).

Absolutely, and I've spent a great deal of time reading lwIP docs and sources and comparing them to our use of the API (which btw is rather minor, and contained in the runtime.rs/liblwip crate). Do you think you could take a look also? It could use a fresh set of eyes.

No idea if that is currently feasible for ARTIQ, though.

Not easily. The Rust runtime is somewhat designed with that in mind (hence libstd_artiq, for easy switching to libstd proper--though for now runtime::clock and friends make that nontrivial anyhow), but running kernels seems complicated; even ignoring our RTIO peripherals, just getting the emulator to cooperate and to export basic MiSoC peripherals like the timer and mailbox is a major work.

whitequark commented 7 years ago

I've fixed a lifetime issue (interestingly, on the Rust side exclusively) in 668928a, let's see if that helps.

whitequark commented 7 years ago

668928a has not fixed anything (more or less as I expected; the code in sched.rs was mildly suspicious but ultimately it did the same thing).

whitequark commented 7 years ago

Integrate the trivial MPU and become really sure that Rust code is not clobbering lwIP by accident with its stacks (as the method I used to check before now is unreliable).

Unfortunately this cannot be integrated so easily: https://github.com/m-labs/artiq/issues/544#issuecomment-264904553. But, I've enlarged the stacks significantly, and that didn't appear to affect the bug in any way, so this is probably not the culprit.

whitequark commented 7 years ago

I've tried always copying input data in the call to tcp_write. This did not improve anything.

sbourdeauducq commented 7 years ago

just getting the emulator to cooperate and to export basic MiSoC peripherals like the timer and mailbox is a major work.

This isn't worse than for the Zynq work @klickverbot mentioned. And a repro might not actually need to execute kernels.

whitequark commented 7 years ago

With smoltcp merged this is issue is not present any more.

sbourdeauducq commented 7 years ago

Does smoltcp support gateways? NIST sometimes requires one.

whitequark commented 7 years ago

@sbourdeauducq ARTIQ never initiates connections, why would it ever need explicit support for gateways?

hossamfadeel commented 5 years ago

I know it is too late to respond, however, maybe someone looking for solution of this issue and come through this. So, what I have done and it solved my issue was if the buffer out of memory the wait for the acknowledge to come (By receiving the Ack it means there will be a free space in your buffer for sending further data)

Hope it helps.

Regards,