Not all expected performance gains from compiling a system image are seen

This occurs with qiskit_alt v0.1.6 and v0.1.7 (and probably earlier).

We need to verify that compiling a system image actually delivers all of the expected benefits, and recover any gains that turn out to be missing. Failing that, we should at least understand the tradeoff, that is, which feature or limitation makes the gain difficult to obtain. The investigation could be ad hoc: a developer running scripts or working at the CLI. A more ambitious project would be to set up benchmarking that can generate a report and/or warn about regressions.
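As a starting point for the more ambitious option, here is a minimal sketch of a regression check. It assumes timings have already been collected into dictionaries that map a step name to a wall time in seconds. The function name `find_regressions`, the step labels, and the 25% tolerance are hypothetical and not part of qiskit_alt.

```python
def find_regressions(baseline, current, tolerance=0.25):
    """Return warnings for steps that got slower than the baseline allows.

    `baseline` and `current` map a step name to a wall time in seconds.
    Baseline times are assumed to be positive. `tolerance` is the allowed
    fractional slowdown (0.25 means 25%), an arbitrary illustrative default.
    """
    warnings = []
    for step, old in baseline.items():
        new = current.get(step)
        if new is None:
            warnings.append(f"{step}: missing from current run")
        elif new > old * (1 + tolerance):
            warnings.append(
                f"{step}: {old:.3g} s -> {new:.3g} s ({new / old:.2f}x slower)"
            )
    return warnings


def format_report(baseline, current, tolerance=0.25):
    """Build a plain-text report: one line per step, then any warnings."""
    lines = []
    for step, old in baseline.items():
        new = current.get(step)
        shown = "n/a" if new is None else f"{new:.3g} s"
        lines.append(f"{step}: baseline {old:.3g} s, current {shown}")
    lines.extend("WARNING " + w for w in find_regressions(baseline, current, tolerance))
    return "\n".join(lines)
```

A CI job could store the baseline as a JSON file and fail when `find_regressions` returns a non-empty list. The tolerance would need tuning, since startup times vary between machines.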

Compiling a system image should improve performance of qiskit_alt in three ways.

Reduces startup time. That is, qiskit_alt.project.ensure_init() runs faster. This is mostly because PyCall.jl and/or PythonCall.jl are loaded by ensure_init() and they will be in the system image.
Reduces Julia package loading time. For example, many functions use QuantumOps.jl. If a custom system image is used, the time to load it should be greatly reduced.
Reduces the time to run Julia code the first time it is used. Even if QuantumOps is in the system image, a function may still require compilation the first time it is called, unless we do something to precompile it. For example, QuantumOps.rand_op_sum(...) will run more slowly the first time it is called. However, we have included instructions to compile rand_op_sum and other functions into the system image, so they should run at full speed the first time they are called.
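Each of these three effects can only be observed once per process, so a script that measures them has to start a new interpreter for every run. The sketch below does that with `subprocess`. The helper name `time_in_fresh_interpreter` and the step labels are hypothetical; the qiskit_alt calls in `QISKIT_ALT_STEPS` are the ones used elsewhere in this issue.

```python
import json
import subprocess
import sys
import textwrap

# Program run in the child interpreter: times each statement with perf_counter.
_CHILD = textwrap.dedent("""
    import json, sys, time
    steps = json.loads(sys.argv[1])
    namespace = {}
    timings = {}
    for label, statement in steps:
        start = time.perf_counter()
        exec(statement, namespace)
        timings[label] = time.perf_counter() - start
    # Marker prefix so that other output printed by the steps is ignored.
    print("TIMINGS " + json.dumps(timings))
""")


def time_in_fresh_interpreter(steps):
    """Run (label, statement) pairs in a new Python process.

    Returns a dict mapping each label to its wall time in seconds.
    """
    proc = subprocess.run(
        [sys.executable, "-c", _CHILD, json.dumps(steps)],
        capture_output=True, text=True, check=True,
    )
    for line in reversed(proc.stdout.splitlines()):
        if line.startswith("TIMINGS "):
            return json.loads(line[len("TIMINGS "):])
    raise RuntimeError("child process did not report timings")


# Steps matching the three effects listed above (requires qiskit_alt installed).
QISKIT_ALT_STEPS = [
    ("import", "import qiskit_alt"),
    ("ensure_init",
     'qiskit_alt.project.ensure_init(calljulia="juliacall", use_sys_image=True)'),
    ("simple_import", 'QuantumOps = qiskit_alt.project.simple_import("QuantumOps")'),
    ("first call", "QuantumOps.rand_op_sum(QuantumOps.Pauli, 3, 10)"),
    ("second call", "QuantumOps.rand_op_sum(QuantumOps.Pauli, 3, 10)"),
]
```

Calling `time_in_fresh_interpreter(QISKIT_ALT_STEPS)` once with `use_sys_image=True` and once with `use_sys_image=False` in the `ensure_init` step would give the with/without comparison for all three effects.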

In fact things are a bit more complicated. We need to look at the following timings carefully and understand their causes.

HINT: We "exercise" QuantumOps when compiling a system image, but we do not exercise calling QuantumOps from Python when compiling. Doing so would probably increase performance.
HINT: juliacall and pyjulia translate objects differently when passing them between Julia and Python. These translations may be inherently more or less efficient. It is possible to control them somewhat, which may be useful if there is a bottleneck.
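To judge whether the object translation is a bottleneck, the per-call cost after warm-up needs to be measured more carefully than a single %time run allows, since timings in the microsecond range are noisy. A small sketch using the standard library `timeit` module follows. The helper name `steady_state_seconds` and its default counts are hypothetical.

```python
import statistics
import timeit


def steady_state_seconds(func, warmup=1, number=100, repeat=5):
    """Return (best, median) time per call in seconds after warm-up.

    `func` is called `warmup` times first so that one-time compilation is
    excluded, then timed in `repeat` batches of `number` calls each.
    """
    for _ in range(warmup):
        func()
    totals = timeit.repeat(func, number=number, repeat=repeat)
    per_call = [total / number for total in totals]
    return min(per_call), statistics.median(per_call)
```

With QuantumOps imported as in the sessions in this issue, passing `lambda: QuantumOps.rand_op_sum(QuantumOps.Pauli, 3, 10)` under each of juliacall and pyjulia would give comparable steady-state numbers.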

Here is what we see with juliacall.

In [1]: import qiskit_alt

In [2]: %time qiskit_alt.project.ensure_init(calljulia="juliacall", use_sys_image=False)
CPU times: user 3.38 s, sys: 551 ms, total: 3.93 s
Wall time: 3.46 s

In [3]: %time QuantumOps = qiskit_alt.project.simple_import("QuantumOps")
CPU times: user 3.19 s, sys: 156 ms, total: 3.35 s
Wall time: 3.35 s

In [4]: %time QuantumOps.rand_op_sum(QuantumOps.Pauli, 3, 10 ); None
CPU times: user 1.84 s, sys: 8.7 ms, total: 1.85 s
Wall time: 1.85 s

In [5]: %time QuantumOps.rand_op_sum(QuantumOps.Pauli, 3, 10 ); None
CPU times: user 75 µs, sys: 0 ns, total: 75 µs
Wall time: 76.3 µs

In [6]: %time QuantumOps.rand_op_sum(QuantumOps.Pauli, 3, 10 ); None
CPU times: user 71 µs, sys: 0 ns, total: 71 µs
Wall time: 73 µs

$ ipython

In [1]: import qiskit_alt

In [2]: %time qiskit_alt.project.ensure_init(calljulia="juliacall", use_sys_image=True)
CPU times: user 1.42 s, sys: 514 ms, total: 1.93 s
Wall time: 1.47 s

In [3]: %time QuantumOps = qiskit_alt.project.simple_import("QuantumOps")
CPU times: user 5.59 ms, sys: 0 ns, total: 5.59 ms
Wall time: 5.74 ms

In [4]: %time QuantumOps.rand_op_sum(QuantumOps.Pauli, 3, 10 ); None
CPU times: user 1.17 s, sys: 6.21 ms, total: 1.17 s
Wall time: 1.17 s

In [5]: %time QuantumOps.rand_op_sum(QuantumOps.Pauli, 3, 10 ); None
CPU times: user 76 µs, sys: 14 µs, total: 90 µs
Wall time: 92.5 µs

In [6]: %time QuantumOps.rand_op_sum(QuantumOps.Pauli, 3, 10 ); None
CPU times: user 55 µs, sys: 10 µs, total: 65 µs
Wall time: 68.4 µs

And here it is with pyjulia.

In [1]: import qiskit_alt

In [2]: %time qiskit_alt.project.ensure_init(calljulia="pyjulia", use_sys_image=False)
CPU times: user 2.08 s, sys: 607 ms, total: 2.69 s
Wall time: 2.51 s

In [3]: %time QuantumOps = qiskit_alt.project.simple_import("QuantumOps")
CPU times: user 2.95 s, sys: 152 ms, total: 3.1 s
Wall time: 3.1 s

In [4]: %time QuantumOps.rand_op_sum(QuantumOps.Pauli, 3, 10 ); None
CPU times: user 513 ms, sys: 29.5 ms, total: 543 ms
Wall time: 543 ms

In [5]: %time QuantumOps.rand_op_sum(QuantumOps.Pauli, 3, 10 ); None
CPU times: user 764 µs, sys: 0 ns, total: 764 µs
Wall time: 772 µs

$ ipython

In [1]: import qiskit_alt

In [2]: %time qiskit_alt.project.ensure_init(calljulia="pyjulia", use_sys_image=True)
CPU times: user 814 ms, sys: 520 ms, total: 1.33 s
Wall time: 1.16 s

In [3]: %time QuantumOps = qiskit_alt.project.simple_import("QuantumOps")
CPU times: user 606 µs, sys: 291 µs, total: 897 µs
Wall time: 890 µs

In [4]: %time QuantumOps.rand_op_sum(QuantumOps.Pauli, 3, 10 ); None
CPU times: user 75.9 ms, sys: 0 ns, total: 75.9 ms
Wall time: 76.1 ms

In [5]: %time QuantumOps.rand_op_sum(QuantumOps.Pauli, 3, 10 ); None
CPU times: user 1.34 ms, sys: 0 ns, total: 1.34 ms
Wall time: 1.34 ms
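To make the comparison easier to read, the wall times from the four sessions above can be turned into speedup factors (time without the system image divided by time with it). The numbers below are copied from the sessions; the dictionary layout and names are only for illustration.

```python
# Wall times in seconds, copied from the sessions above, as
# (without system image, with system image).
WALL = {
    ("juliacall", "ensure_init"): (3.46, 1.47),
    ("juliacall", "simple_import"): (3.35, 5.74e-3),
    ("juliacall", "first rand_op_sum"): (1.85, 1.17),
    ("juliacall", "second rand_op_sum"): (76.3e-6, 92.5e-6),
    ("pyjulia", "ensure_init"): (2.51, 1.16),
    ("pyjulia", "simple_import"): (3.1, 890e-6),
    ("pyjulia", "first rand_op_sum"): (543e-3, 76.1e-3),
    ("pyjulia", "second rand_op_sum"): (772e-6, 1.34e-3),
}


def speedups(wall):
    """Map each (backend, step) to time without image / time with image."""
    return {key: without / with_image for key, (without, with_image) in wall.items()}


SPEEDUPS = speedups(WALL)

for (backend, step), factor in SPEEDUPS.items():
    print(f"{backend:10s} {step:20s} {factor:10.2f}x")
```

With these numbers, simple_import improves by more than two orders of magnitude under both backends, and ensure_init by a bit more than a factor of 2. The first rand_op_sum call improves by a factor of about 7 under pyjulia but only about 1.6 under juliacall, where it still takes 1.17 s, so that is where the expected gain is most clearly missing. The second-call rows are single measurements at the microsecond to millisecond scale and should be treated as noisy.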

qiskit-community / qiskit-alt, issue #11