m-labs / artiq

A leading-edge control system for quantum information experiments
https://m-labs.hk/artiq
GNU Lesser General Public License v3.0
430 stars 200 forks source link

Worker error during repository scan #1625

Closed lriesebos closed 3 years ago

lriesebos commented 3 years ago

Bug Report

One-Line Summary

During a repository scan we get WARNING:master:artiq.master.worker:worker finished with status code -11 (RID scan)

Issue Details

Steps to Reproduce

It seems to not be dependent on the file it is scanning. If I comment out the content of the file that causes the error, I will just get the same error at the next file.

It seems to depend on the duration of the scan as the error appears if the repo scan takes more than 10 seconds. If I delete files to reduce the scan time to <11 seconds, then the error does not appear. I was also able to reproduce that behavior in a different repository by just duplicating experiments until the scan took >=11 seconds.

Expected Behavior

No errors, with ARTIQ 5 we have not experienced this. This happened after switching to ARTIQ 6.

Actual (undesired) Behavior

DEBUG:master:asyncio:Using selector: EpollSelector
DEBUG:master:artiq.master.rid_counter:Using last RID from cache
DEBUG:master:artiq.master.rid_counter:Next RID is 17
ARTIQ master is now ready.
DEBUG:master:artiq.master.experiments:processing file /home/leon/projects/staq/repository spectroscopy/yb_ions/fluorescence_freq_scan.py
...
DEBUG:master:artiq.master.experiments:processing file /home/leon/projects/staq/repository trap_characterization/ion_lifetime.py
DEBUG:master:artiq.master.experiments:processing file /home/leon/projects/staq/repository trap_characterization/microwave/mw_coherence_time.py
DEBUG:master:sipyco.logging_tools:stopped log forwarding of stream <StreamReader eof transport=<_UnixReadPipeTransport closed fd=16 closed>> of worker(scan,mw_coherence_time.py)
DEBUG:master:sipyco.logging_tools:stopped log forwarding of stream <StreamReader eof transport=<_UnixReadPipeTransport closed fd=14 closed>> of worker(scan,mw_coherence_time.py)
ERROR:master:artiq.master.worker:worker exception details
Traceback (most recent call last):
  File "/nix/store/83xl8z602cfzwssg6hmcvf4c3c9sw0dg-python3-3.8.8-env/lib/python3.8/site-packages/artiq/master/experiments.py", line 26, in process_file
    description = await self.worker.examine(
  File "/nix/store/83xl8z602cfzwssg6hmcvf4c3c9sw0dg-python3-3.8.8-env/lib/python3.8/site-packages/artiq/master/worker.py", line 306, in examine
    await self._worker_action({"action": "examine", "file": file},
  File "/nix/store/83xl8z602cfzwssg6hmcvf4c3c9sw0dg-python3-3.8.8-env/lib/python3.8/site-packages/artiq/master/worker.py", line 252, in _worker_action
    completed = await self._handle_worker_requests()
  File "/nix/store/83xl8z602cfzwssg6hmcvf4c3c9sw0dg-python3-3.8.8-env/lib/python3.8/site-packages/artiq/master/worker.py", line 208, in _handle_worker_requests
    obj = await self._recv(self.watchdog_time())
  File "/nix/store/83xl8z602cfzwssg6hmcvf4c3c9sw0dg-python3-3.8.8-env/lib/python3.8/site-packages/artiq/master/worker.py", line 193, in _recv
    raise WorkerError(
artiq.master.worker.WorkerError: Worker ended while attempting to receive data (RID scan)
WARNING:master:artiq.master.experiments:Skipping file 'trap_characterization/microwave/mw_coherence_time.py'
Traceback (most recent call last):
  File "/nix/store/83xl8z602cfzwssg6hmcvf4c3c9sw0dg-python3-3.8.8-env/lib/python3.8/site-packages/artiq/master/experiments.py", line 63, in _scan
    await self.process_file(entry_dict, root, filename)
  File "/nix/store/83xl8z602cfzwssg6hmcvf4c3c9sw0dg-python3-3.8.8-env/lib/python3.8/site-packages/artiq/master/experiments.py", line 26, in process_file
    description = await self.worker.examine(
  File "/nix/store/83xl8z602cfzwssg6hmcvf4c3c9sw0dg-python3-3.8.8-env/lib/python3.8/site-packages/artiq/master/worker.py", line 306, in examine
    await self._worker_action({"action": "examine", "file": file},
  File "/nix/store/83xl8z602cfzwssg6hmcvf4c3c9sw0dg-python3-3.8.8-env/lib/python3.8/site-packages/artiq/master/worker.py", line 252, in _worker_action
    completed = await self._handle_worker_requests()
  File "/nix/store/83xl8z602cfzwssg6hmcvf4c3c9sw0dg-python3-3.8.8-env/lib/python3.8/site-packages/artiq/master/worker.py", line 208, in _handle_worker_requests
    obj = await self._recv(self.watchdog_time())
  File "/nix/store/83xl8z602cfzwssg6hmcvf4c3c9sw0dg-python3-3.8.8-env/lib/python3.8/site-packages/artiq/master/worker.py", line 193, in _recv
    raise WorkerError(
artiq.master.worker.WorkerError: Worker ended while attempting to receive data (RID scan)
DEBUG:master:artiq.master.worker:worker already terminated (RID scan)
WARNING:master:artiq.master.worker:worker finished with status code -11 (RID scan)
DEBUG:master:artiq.master.experiments:processing file /home/leon/projects/staq/repository trap_characterization/fluorescence_saturation_curve.py
...
DEBUG:master:artiq.master.experiments:processing file /home/leon/projects/staq/repository testers/trap_rf_tester.py
DEBUG:master:sipyco.logging_tools:stopped log forwarding of stream <StreamReader eof transport=<_UnixReadPipeTransport closed fd=22 closed>> of worker(scan,dds_tester_2.py)
DEBUG:master:sipyco.logging_tools:stopped log forwarding of stream <StreamReader eof transport=<_UnixReadPipeTransport closed fd=20 closed>> of worker(scan,dds_tester_2.py)
DEBUG:master:artiq.master.worker:worker exited on request (RID scan)
INFO:master:artiq.master.experiments:repository scan took 13 seconds

@sbourdeauducq I tried to track down the source of this issue in the ARTIQ master code, but I was not able to find a bug. Do you have an idea?

Your System (omit irrelevant parts)

sbourdeauducq commented 3 years ago

Return code -11 is segmentation fault:

>>> import subprocess
>>> p = subprocess.run('python -c "from ctypes import string_at; print(string_at(0))"', shell=True)
>>> p.returncode
-11

Are you importing any special libraries in your experiments that could cause segfaults?

lriesebos commented 3 years ago

You are right, I did not dig deep enough. But you are not going to believe this. It had nothing to do with special libraries, the segfault was caused by a type annotation...

I tracked it down to the following type annotation. ShuttleEdge is a simple class defined before ShuttlingGraph, so this should not cause any problems assuming the correct imports from typing.

class ShuttlingGraph(list):
    def __init__(self, shuttling_edges: Optional[Sequence[ShuttleEdge]] = None):
        ...

There seems to be some ways to mitigate this:

  1. Use a string type hint for ShuttleEdge which is normally only required for forward references: Optional[Sequence['ShuttleEdge']]
  2. Enable postponed evaluation of annotations for this file
  3. Remove ShuttleEdge from the type annotation or remove the type annotation as a whole

Also very strange that this only happened if the repo scan takes more than 10 seconds. Anyway, this is solved for now. Tnx for the input!

sbourdeauducq commented 3 years ago

Sounds like a bug in Python. How exactly is ShuttleEdge defined?

lriesebos commented 3 years ago

I can give you an other example I found today. https://gitlab.com/duke-artiq/dax/-/blob/master/dax/modules/time_resolved_context.py#L645 . Changing that to the code below solves it:

    def __init__(self, source: typing.Union[str, h5py.File],
                 state_detection_threshold: typing.Optional[int] = None, *,
                 hdf5_group: typing.Optional[str] = None):
        ...

Though what is weird is that there seems to be some temporal component in this bug. Practically all our experiments import this file, but only one segfault seems to occur. All other files work fine. And even the file seems to be irrelevant because if I remove the file it segfaults on, then it will just segfault on the next file it scans. The time it takes to segfault seems to depend on the number of files and their content. If there are only a few files (scan takes less than 10 seconds), no problems occur. So I have not been able to find a minimum example to reproduce this issue. Neither was I able to find a related issue on https://bugs.python.org/ .