migueldiascosta opened this issue 10 months ago:

data and some plots at https://gist.github.com/migueldiascosta/0a0dbe061982bc4cc2bc7171785a4b86, as requested by @scarrazza
Hi @migueldiascosta, thank you so much for those benchmarks, this architecture looks interesting. Do you also have numbers/plots for the A100? (cc @andrea-pasquale)
added A100 data (ran at NSCC) to https://gist.github.com/migueldiascosta/0a0dbe061982bc4cc2bc7171785a4b86
Thanks a lot, these are quite interesting performance results for GH200.
@migueldiascosta @scarrazza I know this is not directly related here, but I think it could be interesting to run benchmarks on the Clifford simulator that @BrunoLiegiBastonLiegi is integrating with qibojit right now. He already ran benchmarks on the cluster's A6000.
I think this is a good idea, so we could have some numbers for A6000, A100 and GH200.
Will look into that - btw, are those A6000 benchmarks with `library_benchmarks` or with `circuit_benchmarks`? The plots currently in the gist, where I mix your data with mine, may not be an apples-to-apples comparison.
I don't know what those names mean, which I guess means it's with neither
see e.g. https://github.com/qiboteam/qibojit-benchmarks/issues/45
the current GH200 data in the gist was obtained with qibojit-benchmarks' `main.py` / `circuit_benchmarks`, not with `compare.py` / `library_benchmarks` (because I had seen the latter spend most of its time on the single-CPU-thread conversion of the final state vector to a numpy array, and that was not what I was interested in benchmarking...)
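For context, the cost in question is just the device-to-host copy of the final state vector. A minimal sketch of timing that copy in isolation, using cupy directly rather than either benchmark script (the 28-qubit size is an arbitrary example):

```python
import time

import cupy as cp

# Stand-in for a final state vector living on the GPU:
# 2**nqubits complex64 amplitudes (~2 GiB for 28 qubits).
nqubits = 28
state = cp.zeros(2**nqubits, dtype=cp.complex64)

cp.cuda.Stream.null.synchronize()   # let pending GPU work finish before timing
t0 = time.perf_counter()
host_state = cp.asnumpy(state)      # the device-to-host copy into a numpy array
t1 = time.perf_counter()

print(f"transfer to host: {t1 - t0:.3f} s "
      f"for {host_state.nbytes / 2**30:.2f} GiB")
```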
I believe the numbers quoted there were obtained with `compare.py`. @stavros11, could you please confirm?
Indeed, all the numbers in the qibojit paper were obtained with `compare.py`. Looking at the bash scripts in the benchmark repository, and also at the numbers used to generate the plots, the data keys agree with `compare.py` (`library_benchmark`).

Therefore @migueldiascosta is right: if `main.py` was used for the new benchmarks, the GPU comparison is not apples-to-apples, because the transfer-to-host (numpy) time is logged separately in that script. For CPUs (numba) it shouldn't make a difference, because a numpy array is used throughout the simulation.
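If re-plotting is easier than re-running, one could also fold the separately logged transfer time back into the GPU totals before comparing. A hypothetical sketch: "total_simulation_time" is the key mentioned in this thread, while "transfer_time" and the file layout are guesses to be checked against the actual `main.py` logs:

```python
import json

def adjusted_total(record):
    # "total_simulation_time" appears in this thread; "transfer_time" is a
    # hypothetical key name, check the actual main.py logs for the real one.
    return record["total_simulation_time"] + record.get("transfer_time", 0.0)

with open("gh200.dat") as f:        # hypothetical path to a main.py log
    records = json.load(f)

print([adjusted_total(r) for r in records])
```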
Indeed, but maybe there are other differences between `library_benchmark` and `circuit_benchmark`? e.g., the huge difference in my plots between EPYC and Grace for smaller circuits is suspicious (and, in the paper data in general, there seems to be a constant time that dominates for smaller circuits: the curves always start out basically flat at around one second until about 20 qubits, while mine don't)
I also noticed that and I am not sure how to explain it. One thing that could have changed, other than the scripts, is the library versions. It has been two years since publication, so qibo, qibojit and probably their dependencies may have changed during that time - unless you are using the older versions.
Given that we still have access to most of the hardware we did the benchmarks on, we could retry the benchmarks from our side using the same versions and script you used. This way we will have a much more accurate comparison.
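To make such a rerun easier to compare, it might also help to record the relevant package versions alongside each run; a small sketch (the package list is an assumption, e.g. the cupy distribution name varies per CUDA release):

```python
from importlib import metadata

# Log the versions of the packages most likely to affect the results.
for pkg in ("qibo", "qibojit", "numba", "cupy", "numpy"):
    try:
        print(pkg, metadata.version(pkg))
    except metadata.PackageNotFoundError:
        print(pkg, "not installed")
```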
Yes, there could also be differences there, but now I'm thinking the ~1s constant time in your data is simply the import time, which is added to the "total_simulation_time" in `load_data` for the plots - it's also added to mine, but my import time is much shorter, which could simply be down to disk I/O (the system I'm using has NVMe drives and I'm loading from them, not from a network filesystem) and/or caching.
Actually, that's mentioned in the paper: "Furthermore, a constant of about one second is required to import the library, which can be relevant (comparable or larger than execution time) for simulation of small circuits. This is unlikely to impede practical usage as it is only a small constant overhead that is independent of the total simulation load."
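For what it's worth, a hypothetical sketch of correcting for this at plot time, i.e. a `load_data` variant that subtracts the import time from each total ("total_simulation_time" is quoted above; "import_time" and the JSON-lines log layout are assumptions about the actual files):

```python
import json

def load_data(path):
    """Subtract the one-off import time from each run's total before plotting.

    "total_simulation_time" is the key quoted in this thread; "import_time"
    and the JSON-lines layout are guesses about the actual benchmark logs.
    """
    with open(path) as f:
        runs = [json.loads(line) for line in f]
    for run in runs:
        run["total_simulation_time"] -= run.get("import_time", 0.0)
    return runs
```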
indeed, if I remove the import time the comparison looks more reasonable, e.g.:

[updated comparison plots]