microsoft / iqsharp

Microsoft's IQ# Server.
https://docs.microsoft.com/quantum
MIT License
127 stars 58 forks source link

Rare race condition can cause Python host programs to hang #510

Closed cgranade closed 2 years ago

cgranade commented 3 years ago

Describe the bug When running many small cells (e.g.: %simulate DoFoo for a short operation DoFoo), messages sent by the IQ# kernel can be silently dropped by ZeroMQ polling internal to jupyter_client, causing the client to block forever waiting for messages.

To Reproduce

Screenshots Example IQ# debug log showing messages are correctly sent:

2021-09-01T13:59:11.2453868-07:00  [DBG] Received new message:
    {"msg_id":"c9167d57-c1e02da681857e1a7eec6f93_102640_2403","username":"username","session":"c9167d57-c1e02da681857e1a7eec6f93","msg_type":"execute_request","version":"5.3"}
    {"msg_id":"7be84757-0e63-4143-96c2-bd0e34875a1f","username":"cgran","session":null,"msg_type":null,"version":"5.2.0"}
    {}
    {"code":"%simulate DrawTomographyMeasurement {\"prep\": 1, \"meas\": 3}","silent":false,"store_history":true,"user_expressions":{},"allow_stdin":true,"stop_on_error":true} (2e905567)
2021-09-01T13:59:11.2454057-07:00  [DBG] Handling message of type "execute_request". (cbb20ba6)
2021-09-01T13:59:11.2454125-07:00  [DBG] Handing "execute_request" with ordered shell handler. (e6ef4f49)
2021-09-01T13:59:11.2454342-07:00  [DBG] Asked to execute code:
%simulate DrawTomographyMeasurement {"prep": 1, "meas": 3} (31d9ba3a)
2021-09-01T13:59:11.2455601-07:00  [DBG] Sending iopub message:
    {"ZmqIdentities":["YzkxNjdkNTctYzFlMDJkYTY4MTg1N2UxYTdlZWM2Zjkz"],"Signature":null,"Header":{"msg_id":"b76ad47a-e59e-475d-9e4c-bc3a7b829489","username":"cgran","session":"c9167d57-c1e02da681857e1a7eec6f93","msg_type":"status","version":"5.2.0"},"ParentHeader":{"msg_id":"c9167d57-c1e02da681857e1a7eec6f93_102640_2403","username":"username","session":"c9167d57-c1e02da681857e1a7eec6f93","msg_type":"execute_request","version":"5.3"},"Metadata":null,"Content":{"execution_state":"busy"}} (a9a26c78)
2021-09-01T13:59:11.2457356-07:00  [DBG] Sending iopub message:
    {"ZmqIdentities":["YzkxNjdkNTctYzFlMDJkYTY4MTg1N2UxYTdlZWM2Zjkz"],"Signature":null,"Header":{"msg_id":"fba80871-a769-4099-9e01-e88ba3f40de4","username":"cgran","session":"c9167d57-c1e02da681857e1a7eec6f93","msg_type":"execute_input","version":"5.2.0"},"ParentHeader":{"msg_id":"c9167d57-c1e02da681857e1a7eec6f93_102640_2403","username":"username","session":"c9167d57-c1e02da681857e1a7eec6f93","msg_type":"execute_request","version":"5.3"},"Metadata":null,"Content":{"code":"%simulate DrawTomographyMeasurement {\"prep\": 1, \"meas\": 3}","execution_count":2403} (32db5d1d)
2021-09-01T13:59:11.2460310-07:00  [DBG] Looking for magic %simulate (51b66c1e)
2021-09-01T13:59:11.2463072-07:00  [DBG] Using magic %simulate (9429eb87)
2021-09-01T13:59:11.2469593-07:00  [DBG] Sending iopub message:
    {"ZmqIdentities":["YzkxNjdkNTctYzFlMDJkYTY4MTg1N2UxYTdlZWM2Zjkz"],"Signature":null,"Header":{"msg_id":"2d46ae11-dbb4-45f1-99ee-11dd19f6f5df","username":"cgran","session":"c9167d57-c1e02da681857e1a7eec6f93","msg_type":"execute_result","version":"5.2.0"},"ParentHeader":{"msg_id":"c9167d57-c1e02da681857e1a7eec6f93_102640_2403","username":"username","session":"c9167d57-c1e02da681857e1a7eec6f93","msg_type":"execute_request","version":"5.3"},"Metadata":null,"Content":{"execution_count":2403,"data":{"text/plain":"Zero","application/x-qsharp-data":"0"},"metadata":{}} (f0ec6a5d)
2021-09-01T13:59:11.2471142-07:00  [DBG] Sending shell message:
    {"ZmqIdentities":["YzkxNjdkNTctYzFlMDJkYTY4MTg1N2UxYTdlZWM2Zjkz"],"Signature":null,"Header":{"msg_id":"b82ce983-ec05-4313-aafb-232401d9c117","username":"cgran","session":"c9167d57-c1e02da681857e1a7eec6f93","msg_type":"execute_reply","version":"5.2.0"},"ParentHeader":{"msg_id":"c9167d57-c1e02da681857e1a7eec6f93_102640_2403","username":"username","session":"c9167d57-c1e02da681857e1a7eec6f93","msg_type":"execute_request","version":"5.3"},"Metadata":null,"Content":{"status":"ok","execution_count":2403,"payload":[],"user_expressions":null}} (0b76f6d9)
2021-09-01T13:59:11.2472695-07:00  [DBG] Sending iopub message:
    {"ZmqIdentities":["YzkxNjdkNTctYzFlMDJkYTY4MTg1N2UxYTdlZWM2Zjkz"],"Signature":null,"Header":{"msg_id":"4356e3c9-4680-4fc5-97f0-0dd8a7eb005e","username":"cgran","session":"c9167d57-c1e02da681857e1a7eec6f93","msg_type":"status","version":"5.2.0"},"ParentHeader":{"msg_id":"c9167d57-c1e02da681857e1a7eec6f93_102640_2403","username":"username","session":"c9167d57-c1e02da681857e1a7eec6f93","msg_type":"execute_request","version":"5.3"},"Metadata":null,"Content":{"execution_state":"idle"}} (b41328dc)

System information Running against local build of IQ# using main, jupyter_client v7.0.2, pyzmq 22.2.1.

Additional context Add any other context about the problem here.

cgranade commented 3 years ago

Confirmed that the same issue occurs for the following environment.yml:

channels:
- conda-forge
- quantum-engineering
dependencies:
- python>=3.6
- numpy
- scipy
- future
- qutip
- matplotlib<=2.1.2
- ipyparallel
- py
- notebook
- qsharp==0.18.2108.160999
- pyzmq<22.2.1
- pip
- pip:
    - qinfer
    - duecredit
    - mpltools
cgranade commented 3 years ago

Narrowing down a bit further, I was unable to reproduce for the following environment.yml (downgrading IQ# conda package to June):

##
# environment.yml: Specification of a conda environment for use with
#      Q# interoperability samples.
##
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.
##

name: process-tomography
channels:
- conda-forge
- quantum-engineering
dependencies:
- python>=3.6
- numpy
- scipy
- future
- qutip
- matplotlib<=2.1.2
- ipyparallel
- py
- notebook
- qsharp==0.18.2106.148911
- jupyter_client==6.1.7
- pip
- pip:
    - qinfer
    - duecredit
    - mpltools

To check if this was due to a regression in IQ#, I manually overwrote the IQ# kernel in this new environment with a dotnet-installed version of the kernel from the August release:

➜  dotnet tool install -g Microsoft.Quantum.IQSharp  
You can invoke the tool using the following command: dotnet-iqsharp
Tool 'microsoft.quantum.iqsharp' (version '0.18.2108160999') was successfully installed.

via 🅒 python-qsharp  took 3s 
➜  etenv process-tomography

via 🅒 process-tomography  
➜  dotnet iqsharp install --sys-prefix
[InstallKernelSpec] Removing existing kernelspec in C:\...\Anaconda3\envs\process-tomography\share\jupyter\kernels\iqsharp
[InstallKernelSpec] Installed kernelspec iqsharp in C:\...\Anaconda3\envs\process-tomography\share\jupyter\kernels\iqsharp

Using the August version of IQ#, I was again unable to reproduce the issue. Checking the full versions of all packages installed in the working environment:

name: process-tomography
channels:
  - quantum-engineering
  - conda-forge
  - defaults
dependencies:
  - argcomplete=1.12.3=pyhd8ed1ab_2
  - argon2-cffi=20.1.0=py37hcc03f2d_2
  - async_generator=1.10=py_0
  - attrs=21.2.0=pyhd8ed1ab_0
  - backcall=0.2.0=pyh9f0ad1d_0
  - backports=1.0=py_2
  - backports.functools_lru_cache=1.6.4=pyhd8ed1ab_0
  - backports.weakref=1.0.post1=py37hc8dfbb8_1002
  - bleach=4.1.0=pyhd8ed1ab_0
  - ca-certificates=2021.5.30=h5b45459_0
  - cffi=1.14.6=py37hd8e9650_0
  - colorama=0.4.4=pyh9f0ad1d_0
  - cycler=0.10.0=py_2
  - debugpy=1.4.1=py37hf2a7229_0
  - decorator=5.0.9=pyhd8ed1ab_0
  - defusedxml=0.7.1=pyhd8ed1ab_0
  - entrypoints=0.3=pyhd8ed1ab_1003
  - freetype=2.10.4=h546665d_1
  - future=0.18.2=py37h03978a9_3
  - icu=64.2=he025d50_1
  - importlib-metadata=4.8.1=py37h03978a9_0
  - importlib_metadata=4.8.1=hd8ed1ab_0
  - intel-openmp=2021.3.0=h57928b3_3372
  - ipykernel=6.3.1=py37h4038f58_0
  - ipyparallel=6.3.0=pyhd8ed1ab_3
  - ipython=7.27.0=py37h4038f58_0
  - ipython_genutils=0.2.0=py_1
  - iqsharp=0.18.2106.148911=py37_0
  - jedi=0.18.0=py37h03978a9_2
  - jinja2=3.0.1=pyhd8ed1ab_0
  - jpeg=9d=h8ffe710_0
  - jsonschema=3.2.0=pyhd8ed1ab_3
  - jupyter_client=6.1.7=py_0
  - jupyter_core=4.7.1=py37h03978a9_0
  - jupyterlab_pygments=0.1.2=pyh9f0ad1d_0
  - kiwisolver=1.3.2=py37h8c56517_0
  - libblas=3.9.0=11_win64_mkl
  - libcblas=3.9.0=11_win64_mkl
  - liblapack=3.9.0=11_win64_mkl
  - libpng=1.6.37=h1d00b33_2
  - libsodium=1.0.18=h8d14728_1
  - m2w64-gcc-libgfortran=5.3.0=6
  - m2w64-gcc-libs=5.3.0=7
  - m2w64-gcc-libs-core=5.3.0=7
  - m2w64-gmp=6.1.0=2
  - m2w64-libwinpthread-git=5.0.0.4634.697f757=2
  - markupsafe=2.0.1=py37hcc03f2d_0
  - matplotlib=2.1.2=py37_1
  - matplotlib-base=2.1.2=py37h2981e6d_1
  - matplotlib-inline=0.1.2=pyhd8ed1ab_2
  - mistune=0.8.4=py37hcc03f2d_1004
  - mkl=2021.3.0=hb70f87d_564
  - msys2-conda-epoch=20160418=1
  - nbclient=0.5.4=pyhd8ed1ab_0
  - nbconvert=6.1.0=py37h03978a9_0
  - nbformat=5.1.3=pyhd8ed1ab_0
  - nest-asyncio=1.5.1=pyhd8ed1ab_0
  - notebook=6.4.3=pyha770c72_0
  - numpy=1.21.2=py37h940b05c_0
  - openssl=1.1.1k=h8ffe710_1
  - packaging=21.0=pyhd8ed1ab_0
  - pandoc=2.14.2=h8ffe710_0
  - pandocfilters=1.4.2=py_1
  - parso=0.8.2=pyhd8ed1ab_0
  - pickleshare=0.7.5=py_1003
  - pip=21.2.4=pyhd8ed1ab_0
  - prometheus_client=0.11.0=pyhd8ed1ab_0
  - prompt-toolkit=3.0.20=pyha770c72_0
  - py=1.10.0=pyhd3deb0d_0
  - pycparser=2.20=pyh9f0ad1d_2
  - pygments=2.10.0=pyhd8ed1ab_0
  - pyparsing=2.4.7=pyh9f0ad1d_0
  - pyqt=5.9.2=py37h6538335_4
  - pyrsistent=0.17.3=py37hcc03f2d_2
  - python=3.7.10=h7840368_100_cpython
  - python-dateutil=2.8.2=pyhd8ed1ab_0
  - python_abi=3.7=2_cp37m
  - pytz=2021.1=pyhd8ed1ab_0
  - pywin32=301=py37hcc03f2d_0
  - pywinpty=1.1.3=py37h7f67f24_0
  - pyzmq=19.0.2=py37h453f00a_2
  - qt=5.9.7=h506e8af_3
  - qutip=4.6.2=py37h0a3528d_0
  - scipy=1.7.1=py37hb6553fb_0
  - send2trash=1.8.0=pyhd8ed1ab_0
  - setuptools=57.4.0=py37h03978a9_0
  - sip=4.19.8=py37h6538335_1000
  - six=1.16.0=pyh6c4a22f_0
  - sqlite=3.36.0=h8ffe710_0
  - tbb=2021.3.0=h2d74725_0
  - terminado=0.11.1=py37h03978a9_0
  - testpath=0.5.0=pyhd8ed1ab_0
  - tornado=6.1=py37hcc03f2d_1
  - traitlets=5.1.0=pyhd8ed1ab_0
  - typing_extensions=3.10.0.0=pyha770c72_0
  - ucrt=10.0.20348.0=h57928b3_0
  - vc=14.2=hb210afc_5
  - vs2015_runtime=14.29.30037=h902a5da_5
  - wcwidth=0.2.5=pyh9f0ad1d_2
  - webencodings=0.5.1=py_1
  - wheel=0.37.0=pyhd8ed1ab_1
  - winpty=0.4.3=4
  - zeromq=4.3.2=h0e60522_5
  - zipp=3.5.0=pyhd8ed1ab_0
  - zlib=1.2.11=h62dcd97_1010
  - pip:
    - azure-quantum==0.18.2106.148911
    - certifi==2021.5.30
    - charset-normalizer==2.0.4
    - citeproc-py==0.6.0
    - configobj==5.0.6
    - duecredit==0.9.1
    - idna==3.2
    - lxml==4.6.3
    - mpltools==0.2.0
    - qinfer==1.0
    - qsharp==0.18.2106.148911
    - qsharp-chemistry==0.18.2106.148911
    - qsharp-core==0.18.2106.148911
    - requests==2.26.0
    - urllib3==1.26.6
prefix: C:\...\Anaconda3\envs\process-tomography

Using the latest IQ# conda package, but with pyzmq pinned to 19.0.2 seems to avoid the issue as well:

name: process-tomography
channels:
  - quantum-engineering
  - conda-forge
  - defaults
dependencies:
  - argcomplete=1.12.3=pyhd8ed1ab_2
  - argon2-cffi=20.1.0=py37hcc03f2d_2
  - async_generator=1.10=py_0
  - attrs=21.2.0=pyhd8ed1ab_0
  - backcall=0.2.0=pyh9f0ad1d_0
  - backports=1.0=py_2
  - backports.functools_lru_cache=1.6.4=pyhd8ed1ab_0
  - bleach=4.1.0=pyhd8ed1ab_0
  - ca-certificates=2021.5.30=h5b45459_0
  - cffi=1.14.6=py37hd8e9650_0
  - colorama=0.4.4=pyh9f0ad1d_0
  - cycler=0.10.0=py_2
  - debugpy=1.4.1=py37hf2a7229_0
  - decorator=5.0.9=pyhd8ed1ab_0
  - defusedxml=0.7.1=pyhd8ed1ab_0
  - entrypoints=0.3=pyhd8ed1ab_1003
  - freetype=2.10.4=h546665d_1
  - future=0.18.2=py37h03978a9_3
  - icu=64.2=he025d50_1
  - importlib-metadata=4.8.1=py37h03978a9_0
  - importlib_metadata=4.8.1=hd8ed1ab_0
  - intel-openmp=2021.3.0=h57928b3_3372
  - ipykernel=6.3.1=py37h4038f58_0
  - ipyparallel=6.3.0=pyhd8ed1ab_3
  - ipython=7.27.0=py37h4038f58_0
  - ipython_genutils=0.2.0=py_1
  - iqsharp=0.18.2108.160999=py37_0
  - jedi=0.18.0=py37h03978a9_2
  - jinja2=3.0.1=pyhd8ed1ab_0
  - jpeg=9d=h8ffe710_0
  - jsonschema=3.2.0=pyhd8ed1ab_3
  - jupyter_client=6.1.7=py_0
  - jupyter_core=4.7.1=py37h03978a9_0
  - jupyterlab_pygments=0.1.2=pyh9f0ad1d_0
  - kiwisolver=1.3.2=py37h8c56517_0
  - libblas=3.9.0=11_win64_mkl
  - libcblas=3.9.0=11_win64_mkl
  - liblapack=3.9.0=11_win64_mkl
  - libpng=1.6.37=h1d00b33_2
  - libsodium=1.0.18=h8d14728_1
  - m2w64-gcc-libgfortran=5.3.0=6
  - m2w64-gcc-libs=5.3.0=7
  - m2w64-gcc-libs-core=5.3.0=7
  - m2w64-gmp=6.1.0=2
  - m2w64-libwinpthread-git=5.0.0.4634.697f757=2
  - markupsafe=2.0.1=py37hcc03f2d_0
  - matplotlib=2.1.2=py37_1
  - matplotlib-base=2.1.2=py37h2981e6d_1
  - matplotlib-inline=0.1.2=pyhd8ed1ab_2
  - mistune=0.8.4=py37hcc03f2d_1004
  - mkl=2021.3.0=hb70f87d_564
  - msys2-conda-epoch=20160418=1
  - nbclient=0.5.4=pyhd8ed1ab_0
  - nbconvert=6.1.0=py37h03978a9_0
  - nbformat=5.1.3=pyhd8ed1ab_0
  - nest-asyncio=1.5.1=pyhd8ed1ab_0
  - notebook=6.4.3=pyha770c72_0
  - numpy=1.21.2=py37h940b05c_0
  - openssl=1.1.1k=h8ffe710_1
  - packaging=21.0=pyhd8ed1ab_0
  - pandoc=2.14.2=h8ffe710_0
  - pandocfilters=1.4.2=py_1
  - parso=0.8.2=pyhd8ed1ab_0
  - pickleshare=0.7.5=py_1003
  - pip=21.2.4=pyhd8ed1ab_0
  - prometheus_client=0.11.0=pyhd8ed1ab_0
  - prompt-toolkit=3.0.20=pyha770c72_0
  - py=1.10.0=pyhd3deb0d_0
  - pycparser=2.20=pyh9f0ad1d_2
  - pygments=2.10.0=pyhd8ed1ab_0
  - pyparsing=2.4.7=pyh9f0ad1d_0
  - pyqt=5.9.2=py37h6538335_4
  - pyrsistent=0.17.3=py37hcc03f2d_2
  - python=3.7.10=h7840368_100_cpython
  - python-dateutil=2.8.2=pyhd8ed1ab_0
  - python_abi=3.7=2_cp37m
  - pytz=2021.1=pyhd8ed1ab_0
  - pywin32=301=py37hcc03f2d_0
  - pywinpty=1.1.3=py37h7f67f24_0
  - pyzmq=19.0.2=py37h453f00a_2
  - qt=5.9.7=h506e8af_3
  - qutip=4.6.2=py37h0a3528d_0
  - scipy=1.7.1=py37hb6553fb_0
  - send2trash=1.8.0=pyhd8ed1ab_0
  - setuptools=57.4.0=py37h03978a9_0
  - sip=4.19.8=py37h6538335_1000
  - six=1.16.0=pyh6c4a22f_0
  - sqlite=3.36.0=h8ffe710_0
  - tbb=2021.3.0=h2d74725_0
  - terminado=0.11.1=py37h03978a9_0
  - testpath=0.5.0=pyhd8ed1ab_0
  - tornado=6.1=py37hcc03f2d_1
  - traitlets=5.1.0=pyhd8ed1ab_0
  - typing_extensions=3.10.0.0=pyha770c72_0
  - ucrt=10.0.20348.0=h57928b3_0
  - vc=14.2=hb210afc_5
  - vs2015_runtime=14.29.30037=h902a5da_5
  - wcwidth=0.2.5=pyh9f0ad1d_2
  - webencodings=0.5.1=py_1
  - wheel=0.37.0=pyhd8ed1ab_1
  - winpty=0.4.3=4
  - zeromq=4.3.2=h0e60522_5
  - zipp=3.5.0=pyhd8ed1ab_0
  - zlib=1.2.11=h62dcd97_1010
  - pip:
    - azure-quantum==0.18.2108.160999
    - certifi==2021.5.30
    - charset-normalizer==2.0.4
    - citeproc-py==0.6.0
    - configobj==5.0.6
    - duecredit==0.9.1
    - idna==3.2
    - lxml==4.6.3
    - mpltools==0.2.0
    - qinfer==1.0
    - qsharp==0.18.2108.160999
    - qsharp-chemistry==0.18.2108.160999
    - qsharp-core==0.18.2108.160999
    - requests==2.26.0
    - urllib3==1.26.6
prefix: C:\...\Anaconda3\envs\process-tomography
alexeib2 commented 2 years ago

I am experiencing the python/iqsharp stall issue that is likely to be caused by this bug. The attached contentsList.txt is a concatenation of pip list and conda list in the referenced VM/environment. The demoLogDump and the demoLogDumpStall files are the IQSHARP_LOG_LEVEL=Debug logs of two back-to back runs. The fist one terminated success fully after execution_count:210 and output as expected. The second one stalled after execution_count:3 componentsList.txt componentsList.txt demo4logDump.txt demo4logDumpStall.txt

0

anpaz commented 2 years ago

This issue is now happening on our Docker image as well. Looks like the latest version of nbconvert (0.6.4) and ipython (7.34.0) are causing timeouts on IOPub messages.

anpaz commented 2 years ago

This should be fixed with #708, which is now released as part of the QDK 0.25.228311+