migueldiascosta opened this issue 10 months ago:

data and some plots at https://gist.github.com/migueldiascosta/0a0dbe061982bc4cc2bc7171785a4b86, as requested by @scarrazza
Hi @migueldiascosta, thank you so much for those benchmarks, this architecture looks interesting. Do you also have numbers/plots for the A100? (cc @andrea-pasquale)
added A100 data (ran at NSCC) to https://gist.github.com/migueldiascosta/0a0dbe061982bc4cc2bc7171785a4b86
Thanks a lot, these are quite interesting performance results for GH200.
@migueldiascosta @scarrazza I know this is not directly related here, but I think it could be interesting to run benchmarks on the Clifford simulator that @BrunoLiegiBastonLiegi is integrating with qibojit right now. He already ran benchmarks on the cluster's A6000.
I think this is a good idea, so we could have some numbers for A6000, A100 and GH200.
Will look into that - btw, are those A6000 benchmarks with `library_benchmarks` or with `circuit_benchmarks`? The plots currently in the gist, where I mix your data with mine, may not be an apples-to-apples comparison.
I don't know what those names mean, which I guess means it's with neither
see e.g. https://github.com/qiboteam/qibojit-benchmarks/issues/45
the current GH200 data in the gist was obtained with qibojit-benchmarks' `main.py` / `circuit_benchmarks`, not with `compare.py` / `library_benchmarks` (because I had seen the latter spend most of its time on the single-CPU-thread conversion of the final state vector to a numpy array, and that was not what I was interested in benchmarking...)
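For context, the cost in question is just the device-to-host copy of the final state vector. A minimal sketch of timing that copy in isolation, using cupy directly rather than either benchmark script (the 28-qubit size is an arbitrary example):

```python
import time

import cupy as cp

# Stand-in for a final state vector living on the GPU:
# 2**nqubits complex64 amplitudes (~2 GiB for 28 qubits).
nqubits = 28
state = cp.zeros(2**nqubits, dtype=cp.complex64)

cp.cuda.Stream.null.synchronize()   # let pending GPU work finish before timing
t0 = time.perf_counter()
host_state = cp.asnumpy(state)      # the device-to-host copy into a numpy array
t1 = time.perf_counter()

print(f"transfer to host: {t1 - t0:.3f} s "
      f"for {host_state.nbytes / 2**30:.2f} GiB")
```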
I believe the numbers quoted there were obtained with `compare.py`. @stavros11, could you please confirm?
Indeed, all the numbers in the qibojit paper were obtained with `compare.py`. Looking at the bash scripts in the benchmark repository, and also at the numbers used to generate the plots, the data keys agree with `compare.py` (`library_benchmark`).

Therefore @migueldiascosta is right: if `main.py` was used for the new benchmarks, the GPU comparison is not apples-to-apples, because the transfer-to-host (numpy) time is logged separately in that script. For CPUs (numba) it shouldn't make a difference, because a numpy array is used throughout the simulation.
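If re-plotting is easier than re-running, one could also fold the separately logged transfer time back into the GPU totals before comparing. A hypothetical sketch: "total_simulation_time" is the key mentioned in this thread, while "transfer_time" and the file layout are guesses to be checked against the actual `main.py` logs:

```python
import json

def adjusted_total(record):
    # "total_simulation_time" appears in this thread; "transfer_time" is a
    # hypothetical key name, check the actual main.py logs for the real one.
    return record["total_simulation_time"] + record.get("transfer_time", 0.0)

with open("gh200.dat") as f:        # hypothetical path to a main.py log
    records = json.load(f)

print([adjusted_total(r) for r in records])
```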
Indeed, but maybe there are other differences between `library_benchmark` and `circuit_benchmark`? e.g., the huge difference in my plots between EPYC and Grace for smaller circuits is suspicious (and, in the paper data in general, there seems to be a constant time that dominates for smaller circuits: the curves always start out basically flat at around one second until about 20 qubits, while mine don't)
I also noticed that and I am not sure how to explain it. One thing that could have changed, other than the scripts, is the library versions. It has been two years since publication, so qibo, qibojit and probably their dependencies may have changed during that time - unless you are using the older versions.
Given that we still have access to most of the hardware we did the benchmarks on, we could retry the benchmarks from our side using the same versions and script you used. This way we will have a much more accurate comparison.
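To make such a rerun easier to compare, it might also help to record the relevant package versions alongside each run; a small sketch (the package list is an assumption, e.g. the cupy distribution name varies per CUDA release):

```python
from importlib import metadata

# Log the versions of the packages most likely to affect the results.
for pkg in ("qibo", "qibojit", "numba", "cupy", "numpy"):
    try:
        print(pkg, metadata.version(pkg))
    except metadata.PackageNotFoundError:
        print(pkg, "not installed")
```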
Yes, there could also be differences there, but now I'm thinking the ~1s constant time in your data is simply the import time, which is added to the "total_simulation_time" in `load_data` for the plots - it's also added to mine, but my import time is much shorter, which could simply be down to disk I/O (the system I'm using has NVMe drives and I'm loading from them, not from a network filesystem) and/or caching.
Actually, that's mentioned in the paper: "Furthermore, a constant of about one second is required to import the library, which can be relevant (comparable or larger than execution time) for simulation of small circuits. This is unlikely to impede practical usage as it is only a small constant overhead that is independent of the total simulation load."
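For what it's worth, a hypothetical sketch of correcting for this at plot time, i.e. a `load_data` variant that subtracts the import time from each total ("total_simulation_time" is quoted above; "import_time" and the JSON-lines log layout are assumptions about the actual files):

```python
import json

def load_data(path):
    """Subtract the one-off import time from each run's total before plotting.

    "total_simulation_time" is the key quoted in this thread; "import_time"
    and the JSON-lines layout are guesses about the actual benchmark logs.
    """
    with open(path) as f:
        runs = [json.loads(line) for line in f]
    for run in runs:
        run["total_simulation_time"] -= run.get("import_time", 0.0)
    return runs
```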
indeed, if I remove the import time the comparison looks more reasonable, e.g.:

[updated comparison plots]