Closed lriesebos closed 3 years ago
Return code -11 means the process was killed by a segmentation fault (signal 11, SIGSEGV):
>>> import subprocess
>>> p = subprocess.run('python -c "from ctypes import string_at; print(string_at(0))"', shell=True)
>>> p.returncode
-11
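As a side note, a negative return code from subprocess on POSIX means the child was terminated by that signal number, so it can be decoded rather than looked up by hand. A minimal sketch (the helper name describe_returncode is made up for illustration and is not part of ARTIQ):

```python
import signal


def describe_returncode(returncode: int) -> str:
    """Turn a subprocess return code into a readable description.

    Hypothetical helper for illustration only.
    """
    if returncode < 0:
        # On POSIX, subprocess reports "killed by signal N" as -N.
        try:
            return f"killed by {signal.Signals(-returncode).name}"
        except ValueError:
            # Signal number not known on this platform.
            return f"killed by unknown signal {-returncode}"
    return f"exited with status {returncode}"


print(describe_returncode(-11))
```

This only names the signal; it says nothing about what caused the segfault.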
Are you importing any special libraries in your experiments that could cause segfaults?
You are right, I did not dig deep enough. But you are not going to believe this: it had nothing to do with special libraries. The segfault was caused by a type annotation...
I tracked it down to the following type annotation. ShuttleEdge is a simple class defined before ShuttlingGraph, so this should not cause any problems assuming the correct imports from typing.
class ShuttlingGraph(list):
    def __init__(self, shuttling_edges: Optional[Sequence[ShuttleEdge]] = None):
        ...
There seem to be some ways to mitigate this:
- Put quotes around ShuttleEdge, which is normally only required for forward references: Optional[Sequence['ShuttleEdge']]
- Remove ShuttleEdge from the type annotation, or remove the type annotation as a whole
It is also very strange that this only happened if the repo scan takes more than 10 seconds. Anyway, this is solved for now. Thanks for the input!
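For reference, the two mitigations described above (quoting the class name, or dropping the annotation) could look roughly like this. This is only a sketch: the ShuttleEdge stand-in, the method bodies, and the name ShuttlingGraphUnannotated are assumptions for illustration, since the real definitions are not shown in this thread. It also does not reproduce the segfault by itself.

```python
from typing import Optional, Sequence


class ShuttleEdge:
    """Hypothetical stand-in; the real ShuttleEdge is not shown in this thread."""


class ShuttlingGraph(list):
    # Mitigation 1: quote the class name, as one would for a forward reference.
    def __init__(self, shuttling_edges: Optional[Sequence['ShuttleEdge']] = None):
        # Placeholder body; the original body is elided in the thread.
        super().__init__(shuttling_edges if shuttling_edges is not None else [])


class ShuttlingGraphUnannotated(list):
    # Mitigation 2: drop the type annotation entirely.
    def __init__(self, shuttling_edges=None):
        super().__init__(shuttling_edges if shuttling_edges is not None else [])
```

Both variants behave the same at runtime; the only difference is whether the annotation expression is evaluated when the function is defined.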
Sounds like a bug in Python. How exactly is ShuttleEdge defined?
I can give you another example I found today: https://gitlab.com/duke-artiq/dax/-/blob/master/dax/modules/time_resolved_context.py#L645. Changing that to the code below solves it:
def __init__(self, source: typing.Union[str, h5py.File],
             state_detection_threshold: typing.Optional[int] = None, *,
             hdf5_group: typing.Optional[str] = None):
    ...
What is weird, though, is that there seems to be a temporal component to this bug. Practically all our experiments import this file, but only one segfault seems to occur; all other files work fine. Even the file itself seems to be irrelevant: if I remove the file it segfaults on, it will just segfault on the next file it scans. The time it takes to segfault seems to depend on the number of files and their content. If there are only a few files (scan takes less than 10 seconds), no problems occur. So I have not been able to find a minimal example to reproduce this issue, nor was I able to find a related issue on https://bugs.python.org/.
Bug Report
One-Line Summary
During a repository scan we get
WARNING:master:artiq.master.worker:worker finished with status code -11 (RID scan)
Issue Details
Steps to Reproduce
It seems to not be dependent on the file it is scanning. If I comment out the content of the file that causes the error, I will just get the same error at the next file.
It seems to depend on the duration of the scan as the error appears if the repo scan takes more than 10 seconds. If I delete files to reduce the scan time to <11 seconds, then the error does not appear. I was also able to reproduce that behavior in a different repository by just duplicating experiments until the scan took >=11 seconds.
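The "duplicate experiments until the scan is slow enough" step could be scripted along these lines. This is only a sketch: the helper name duplicate_experiments and the _copyN naming scheme are made up for illustration, and it assumes the experiment files sit directly in the repository directory.

```python
import shutil
from pathlib import Path


def duplicate_experiments(repo_dir, copies):
    """Copy every top-level .py file in repo_dir `copies` times.

    Hypothetical helper for inflating the repository scan time.
    Returns the list of files that were created.
    """
    repo = Path(repo_dir)
    created = []
    # Materialize the file list first so new copies are not copied again.
    for src in sorted(repo.glob("*.py")):
        for i in range(copies):
            dst = src.with_name(f"{src.stem}_copy{i}{src.suffix}")
            shutil.copyfile(src, dst)
            created.append(dst)
    return created
```

Run it on a throwaway copy of the repository, trigger a scan, and increase the number of copies until the scan takes 11 seconds or more.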
Expected Behavior
No errors. We did not experience this with ARTIQ 5; it started after switching to ARTIQ 6.
Actual (undesired) Behavior
@sbourdeauducq I tried to track down the source of this issue in the ARTIQ master code, but I was not able to find a bug. Do you have an idea?
Your System (omit irrelevant parts)