Closed by mlazzarin 2 years ago
Looks good to me. The second family of plots, which I believe is interesting to show, is histograms where each group of bins shows the performance for a fixed number of qubits and circuit.
Looks good to me. The second family of plots, which I believe is interesting to show, is histograms where each group of bins shows the performance for a fixed number of qubits and circuit.
Do you mean something like this (maybe a bit fancier)?
Another issue I'd point out is related to the data files used to generate the plots. We need to decide whether we will make them public on GitHub or keep them locally. I believe the general practice is not to put data files in the repository (even though ours are not that large in size). The disadvantage is that one can reproduce the plots (even just to change a font size or color) only if the data are regenerated by running the benchmarks or accessed in some other way, which may be more inconvenient than git. I do not have a strong opinion as to where the data are kept.
Another option would be to upload the logs to a different repository / to a gist.
We could also improve the logging a bit to include some details on the machine (OS, versions, hardware, drivers, etc.) given that we and possibly other people may try to run these scripts on different platforms and get different results.
Yes, it may be useful, we should decide exactly what to add.
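A stdlib-only sketch of what such machine logging could look like (GPU model and driver versions would need external tools such as nvidia-smi; the function name and the exact set of fields are just a proposal):

```python
import json
import platform
import sys

def machine_info():
    """Collect basic machine details to attach to each benchmark log.

    Only stdlib data is gathered here; GPU and driver details would
    need external tools (e.g. nvidia-smi).
    """
    return {
        "os": platform.platform(),
        "python": sys.version.split()[0],
        "machine": platform.machine(),
        "processor": platform.processor(),
    }

print(json.dumps(machine_info(), indent=2))
```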
Finally, as part of this cleanup I would remove the data.ipynb notebook and the data/ folder we currently have, as they are outdated by the new scripts and plots, and include a README regarding the new scripts.
Ok, I'll do it.
@scarrazza If I understood, you'd like something like this to compare the different qibojit backends, right?
Yes, exactly.
This one should address the breakdown in import time (with compilation), creation time and run time.
Yes, but what happens if you show the absolute time? If it is difficult to read, we could include a label with the time at the top of one of the histogram bars.
I tried to remove the normalization, actually it's not that bad. We should see what happens with different circuits (I'm still waiting for the benchmarks). EDIT: The only problem is that creation time is nearly invisible.
I've prepared some drafts with only small circuits and a few backends. Here they are:
Qibojit: comparison between import time, creation time and run time for different circuits and different number of qubits. qibojit_breakdownsingle[10, 20, 23].pdf
Qibo: comparison between different backends. I've prepared both line plots and histograms. qibo_scaling_total_dry_time_single.pdf qibo_bars_double_23.pdf
Comparison between multi-gpu: work in progress, we may add it to point 4 with small effort
Qibojit: comparison between different GPUs: qibojit_gpu_total_simulation_time_double.pdf
Libraries: comparison between different libraries. I've prepared both line plots and histograms. qibo_scaling_total_simulation_time_single.pdf libraries_bars_single_24.pdf
Please let me know your opinion. Is there something missing? If they are fine (at least the idea, we can polish everything later, fix the labels or change some quantities) I can run the full suite of benchmarks this weekend.
Thanks @mlazzarin , looks good to me and it should be complete. Please go ahead with the data collection campaign.
I tried to make a complete run, but it was taking too long (more than one week) and I stopped it. I can tweak the scripts to save time; anyway, I have some plots to show.
We can see that qibojit is always the best performing option, whether on CPU or on GPU:
I also have bar plots, where we can see the same thing:
Finally, in the following plots we can see that the import and creation time is negligible with a high number of qubits:
qibojit_breakdowndouble[18, 24, 28].pdf
I have also the comparison between different GPUs:
qibojit_gpu_total_simulation_time_double.pdf
Then, we have the plots with all the different libraries. For time reasons, I only have single precision data. Still, we can have scaling plots:
And bar plots:
Please let me know your opinion. Of course some things still need to be polished, but I'd like to raise a point. The bar plots are way faster to generate than scaling plots, because you don't need to sweep over all the possible configurations. I think that the scaling plots with the qibo backends are cool, while those with all the libraries are difficult to read and less interesting than the bar plots. What do you think?
Yes, I don't think it makes much sense to run benchmarks for a very long time (definitely not one week). It's a waste of resources, given that our space in the paper is limited and we can support all our points with shorter runs. Some comments on each of the plots (in different order from your post):
And bar plots:
This bar plot looks very nice. I agree with keeping this instead of scaling plots for library comparison and leave the scaling for qibo plots only. It is expected that all libraries will show the same scaling up to different constants anyway. We could generate bar plots for two different qubit numbers, one mid-range, where CPU and GPU are close and one large, where GPU clearly wins. Perhaps 20 and 30 are good choices. Some other suggestions regarding this plot:
Btw, I find interesting that for the quantum-volume circuit qsim-CPU is faster than everything else (even qsim-GPU).
I have also the comparison between different GPUs:
I would add a CPU line here for reference. In fact this plot could prove a relatively strong point that non-high-end GPUs with up to 4GB of memory are not very useful for quantum simulation (at least with naive state vector and without some distribution mechanism), since the GPU speed-up is visible for >25 qubits which do not fit in their memory.
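As a rough check of the memory argument: a dense state vector of n qubits stores 2**n complex amplitudes, so the footprint can be worked out directly (a sketch; double precision means complex128, i.e. 16 bytes per amplitude):

```python
def statevector_bytes(nqubits, precision="double"):
    """Memory for a dense state vector: 2**n complex amplitudes."""
    itemsize = 16 if precision == "double" else 8  # complex128 vs complex64
    return (2 ** nqubits) * itemsize

# 25 qubits -> 512 MiB, 28 qubits -> 4 GiB in double precision, so a
# 4 GB card can barely hold the bare state at 28 qubits, and any extra
# working buffers push the practical limit below that.
assert statevector_bytes(25) == 512 * 1024**2
assert statevector_bytes(28) == 4 * 1024**3
```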
Qibo - scaling plots qibo_scaling_total_dry_time_double.pdf
These are good too. We just need to extend all lines up to the same qubit number, probably 30 or 31 which is the maximum we can run with qibojit GPU. Tensorflow GPU will go OOM earlier which is fine, but I think the CPU should be extended to get a clear scaling. This should not be very time consuming, if numpy takes ages, we can end its line earlier. Other points:
Qibo - bar plots Finally, in the following plots we can see that the import and creation time is negligible with a high number of qubits:
I understand the point we are trying to make with this plot but I don't like the big discrepancy in bar heights, which results in a big white space. I thought we discussed this in the past, but would it make sense to normalize with the total time so that all bars go to one? It would end up looking like three pie charts, which is not necessarily bad. I also think there is an issue with the y-axis label, as we are not plotting percentages, right?
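A minimal sketch of the normalization idea, with made-up breakdown numbers (only the data shaping is shown; the normalized dicts can then feed stacked matplotlib bars via the bottom keyword):

```python
# Hypothetical breakdown times in seconds for three qubit counts.
breakdown = {
    18: {"import": 1.2, "creation": 0.01, "simulation": 0.05},
    24: {"import": 1.2, "creation": 0.02, "simulation": 1.9},
    28: {"import": 1.2, "creation": 0.03, "simulation": 35.0},
}

def normalize(parts):
    """Scale each component by the total so the stacked bar sums to 1."""
    total = sum(parts.values())
    return {name: value / total for name, value in parts.items()}

normalized = {n: normalize(parts) for n, parts in breakdown.items()}
# Every bar now reaches 1.0, so the y-axis genuinely shows fractions
# (and could legitimately be labelled as a percentage).
```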
Thank you for your comments. I'll address all of them. Just some additional comments:
Btw, I find interesting that for the quantum-volume circuit qsim-CPU is faster than everything else (even qsim-GPU).
Could it be linked to qsim fusion being always on? (but it doesn't explain why it's slower than GPU)
Qibo - scaling plots qibo_scaling_total_dry_time_double.pdf qibo_scaling_total_simulation_time_double.pdf
These are good too. We just need to extend all lines up to the same qubit number, probably 30 or 31 which is the maximum we can run with qibojit GPU. Tensorflow GPU will go OOM earlier which is fine, but I think the CPU should be extended to get a clear scaling. This should not be very time consuming, if numpy takes ages, we can end its line earlier. Other points:
I'm worried about tensorflow and numpy taking hours, especially if we want to run all circuits with both single and double precision. Shall we restrict ourselves to a specific circuit and precision?
3. Would it be useful to add a single thread line for qibojit and/or qibotf to see how the scaling compares with numpy?
Can do, hope the plot doesn't get too crowded.
Qibo - bar plots Finally, in the following plots we can see that the import and creation time is negligible with a high number of qubits: qibojit_breakdowndouble[18, 24, 28].pdf
I understand the point we are trying to make with this plot but I don't like the big discrepancy in bar heights, which results in a big white space. I thought we discussed this in the past, but would it make sense to normalize with the total time so that all bars go to one? It would end up looking like three pie charts, which is not necessarily bad. I also think there is an issue with the y-axis label, as we are not plotting percentages, right?
We can do something like this: qibo_breakdowndouble[18, 24, 28].pdf
Could it be linked to qsim fusion being always on? (but it doesn't explain why it's slower than GPU)
Right, I totally forgot about this. It also explains why qsim is generally faster than everything else. We should be cautious when using qsim in plots where fusion is disabled for all other libraries. Indeed it doesn't explain why qsim-CPU is faster than qsim-GPU for that particular circuit, perhaps it has to do with the fused gates and how they're implemented.
I'm worried about tensorflow and numpy taking hours, especially if we want to run all circuits with both single and double precision. Shall we restrict ourselves to a specific circuit and precision?
If I understand correctly, for this plot we use a specific circuit and precision, otherwise we would need many of these plots for which we would not have space. We cannot plot all of them together, right?
I'm mostly concerned with the plot looking complete and not having lines that are cut abruptly. The numpy line should be okay to leave as it is. I would add one or two points in tensorflow CPU and complete the tensorflow GPU, qibotf CPU and qibojit CPU lines. Extrapolating from this plot, tensorflow CPU should take ~5 hours per point including the dry run, but all the rest should take less than one hour (it seems they will stay <10^3 sec). Tensorflow GPU will go OOM at least one qubit earlier than other GPU backends because it copies the state.
We should write in text that GPU lines end due to memory, while CPU lines end due to time constraints.
We can do something like this: qibo_breakdowndouble[18, 24, 28].pdf
This is okay, it looks a bit weird but I guess there'll be some variation when each bar corresponds to a different circuit and it will look better. Any format, normalized or not, is fine for me.
Regarding libraries, we have the following situation:
| Single Precision | Double Precision |
|---|---|
| Qibo (CPU/GPU) | Qibo (CPU/GPU) |
| Qiskit (CPU/GPU) | Qiskit (CPU/GPU) |
| Qsim (CPU/GPU) | |
| Qulacs (CPU/GPU) | |
| QCGPU (only GPU) | |
| HybridQ (CPU/GPU) | HybridQ (CPU/GPU) |
| ProjectQ (only CPU) | |
@stavros11 @scarrazza @andrea-pasquale Shall we run benchmarks for all of the available libraries, in both single precision and double precision, or drop some of them and focus only on a specific precision?
Shall we run benchmarks for all of the available libraries, in both single precision and double precision, or drop some of them and focus only on a specific precision?
Maybe keeping all the libraries for single precision can become quite messy. If I understand correctly it would be a bar plot with 10+ entries for each circuit. At the same time I don't know if there is a good criterion for which we can exclude some libraries.
I have no strong preference for the double precision. Surely it is good to have both plots but I don't know if we will be able to fit both in the paper.
@mlazzarin for the plots in the proceedings we can select just one of the two precisions, e.g. double, but in the repository we can keep the mechanism to generate the single precision numbers and plot.
Here's the next round of plots. The machines that I used were also used by someone else at the same time, so the results are a bit noisy. However, the main goal now is to decide what to plot and how. qibo_scaling_total_dry_time_double.pdf qibojit_breakdowndouble[18, 24, 28].pdf qibojit_gpu_total_simulation_time_double.pdf libraries_bars_double_30.pdf
The scripts should be more or less ready. The only thing I see is that the logs (compare.py) don't check if the environment is properly configured, e.g. if the user asks for backend=qibojit,platform=cuquantum, the logs will contain that statement even if cuquantum or qibojit are not installed. I guess it would be the responsibility of the user to check that everything works properly.
The plots are also ready. Of course they will require some modifications for the specific setup of each run, it's more of a starter code, but it should be enough.
The scripts should be more or less ready. The only thing I see is that the logs (compare.py) don't check if the environment is properly configured, e.g. if the user asks for backend=qibojit,platform=cuquantum, the logs will contain that statement even if cuquantum or qibojit are not installed.
Indeed, this is an issue with qibojit cupy/cuquantum, because it falls back to numpy automatically if cupy/cuquantum is not installed or a GPU is not available, but this fallback is not properly logged in the benchmark logs. We could fix it by generating the library-options inside the backend after it is initialized; however, I am not sure this is very important, as it does not happen with other libraries, right? For example, if you do --library hybridq and hybridq is not installed, it will just fail.
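A sketch of the idea of logging the platform actually in use rather than the one requested (the real qibojit internals differ; the function name and the importability check are only illustrative):

```python
import importlib.util

def resolve_platform(requested):
    """Return the platform that will actually run, not the one requested.

    Mirrors the fallback discussed above: if the requested GPU platform
    (e.g. cupy or cuquantum) is not importable, the backend silently
    falls back to numpy, so the log entry should be generated after
    this check rather than from the raw CLI arguments.
    """
    if requested != "numpy" and importlib.util.find_spec(requested) is None:
        return "numpy"  # make the silent fallback explicit in the logs
    return requested
```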
The plots are also ready, of course they will require some modifications for the specific setup of each run, it's more of a starter code, but it should be enough.
I will have a more detailed look but would it be easy to share the .dat files and/or the plots that are generated through the notebooks somehow (not necessarily via GitHub), so that everyone can reproduce the plots locally?
I will have a more detailed look but would it be easy to share the .dat files and/or the plots that are generated through the notebooks somehow (not necessarily via GitHub), so that everyone can reproduce the plots locally?
I uploaded some logs here https://gist.github.com/mlazzarin/67e37fa9db148e9e3ae70d254730eb99 but remember that the results are a bit noisy.
I added an environment file with pinned versions. I updated the README accordingly.
I would suggest creating a new environment with conda env create -f environment.yml, then installing qulacs-gpu, cupy and cuquantum. Concerning HybridQ, I think it's easier to install it in a separate environment.
I had a look at the notebooks using the data from the gist. I believe it is helpful to have a progress summary in terms of the target plots listed here.
I believe we are missing this from the notebooks, right? I'm not sure exactly how the latest breakdown barplot looks, because I believe the data are not in the gist, but I think there is no dry-run vs execution comparison. Perhaps we could expand these barplots to have two bars for each circuit, one with total dry time and one with total simulation time, and both bars can be broken down into (import + creation + simulation). This way we could show that both the import and dry run overheads become less significant when increasing the number of qubits. Let me know what you think.
We could also consider removing the creation time from the breakdown plot (or incorporating in simulation), since it is invisible in all cases.
This is the first plot in the qibo.ipynb notebook and, except for the noisy data which will change anyway, it looks good to me (including the colors).
This is missing but I will add it in #28 once we solve the multigpu problems.
I don't have the data, but I believe this is done in qibojit.ipynb. I would add a CPU line as baseline. We could also consider using a barplot with a fixed qubit number instead of scaling here, and also compare dry run with simulation (two bars for each GPU model).
Done in libraries.ipynb and looks good to me. The only way I see to incorporate all libraries is to generate two such plots, one for single and one for double precision, like we did in the original qibo paper. The same plotting code can be used for both (changing just the library names). Btw, it may be more convenient to define the labels, colors etc. using a dictionary instead of if statements, or alternatively using match-case.
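For instance, the if/elif chain could be replaced by a mapping along these lines (library names and colors here are just examples, not the notebook's actual choices):

```python
# Map each library name to its plot style in one place.
STYLES = {
    "qibo": {"label": "Qibo", "color": "tab:blue"},
    "qiskit": {"label": "Qiskit", "color": "tab:orange"},
    "qsim": {"label": "qsim", "color": "tab:green"},
    "qulacs": {"label": "Qulacs", "color": "tab:red"},
}

def style_for(library):
    """Look up the style, falling back to a neutral default."""
    return STYLES.get(library, {"label": library, "color": "gray"})
```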
Regarding the proceeding/paper: Given that the proceedings will be focused on qibojit, we could only include plots that are connected to this and the effects of just-in-time compilation. These would be: 1-breakdown including dry run vs simulation comparison, 4-compare different hardware, 3(if there is space)-advantages in multigpu and CPU-GPU communication when compared to qibotf and finally end with 2-general scaling and comparison of all qibo backends. We could leave the library comparisons for the paper which will be more extensive and we could do single and double to include multiple libraries. Let me know what you think.
I believe we are missing this from the notebooks, right? I'm not sure exactly how the latest breakdown barplot looks, because I believe the data are not in the gist, but I think there is no dry-run vs execution comparison. Perhaps we could expand these barplots to have two bars for each circuit, one with total dry time and one with total simulation time, and both bars can be broken down into (import + creation + simulation). This way we could show that both the import and dry run overheads become less significant when increasing the number of qubits. Let me know what you think.
I like the idea of having two bars with total simulation time and total dry time. Then, I guess we need to focus on a single circuit, otherwise we would have too many bars. Concerning the data, if we focus on a single circuit e.g. qft we can use the same data as the qibo scaling plot.
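One way to shape the data for the paired dry-run vs simulation bars (illustrative numbers only; each dict maps a breakdown component to its per-qubit-count values, and the offsets feed matplotlib's bottom keyword):

```python
qubits = [18, 24, 28]
# Illustrative times in seconds; the dry run carries the compilation cost.
dry = {"import": [1.2, 1.2, 1.2],
       "creation": [0.01, 0.02, 0.03],
       "simulation": [0.8, 3.5, 40.0]}
simulation = {"import": [0.0, 0.0, 0.0],
              "creation": [0.01, 0.02, 0.03],
              "simulation": [0.05, 1.9, 35.0]}

def stack_offsets(parts):
    """Compute the bottom offsets needed to stack the bar segments."""
    bottoms, running = {}, [0.0] * len(qubits)
    for name, values in parts.items():
        bottoms[name] = list(running)
        running = [r + v for r, v in zip(running, values)]
    return bottoms

# Each segment is then drawn as two side-by-side stacked bars, e.g.
#   ax.bar(x - 0.2, dry[name], width=0.4, bottom=stack_offsets(dry)[name])
# with the simulation counterpart shifted to x + 0.2.
```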
I don't have the data but I believe this is done in qibojit.ipynb. I would add a CPU line as baseline. We could also consider using a barplot with fixed qubit number instead of scaling here, and also compare dry run with simulation (two bars for each GPU model).
Yes, I agree with the CPU baseline. IMHO I prefer the scaling vs the bar plot, because I think that it's more useful, but we may add both and use the bar plot to compare dry run with simulation.
Regarding the proceeding/paper: Given that the proceedings will be focused on qibojit, we could only include plots that are connected to this and the effects of just-in-time compilation. These would be: 1-breakdown including dry run vs simulation comparison, 4-compare different hardware, 3(if there is space)-advantages in multigpu and CPU-GPU communication when compared to qibotf and finally end with 2-general scaling and comparison of all qibo backends. We could leave the library comparisons for the paper which will be more extensive and we could do single and double to include multiple libraries. Let me know what you think.
I agree with the structure of the paper. In the dry time vs simulation time, I would also add a small section where we discuss the various problems regarding the dry run overhead (e.g. it changes between different CUDA toolkit versions) and how we managed to solve them by caching the compiled kernels, maybe with a plot similar to this one:
Thanks for the reply.
I like the idea of having two bars with total simulation time and total dry time. Then, I guess we need to focus on a single circuit, otherwise we would have too many bars. Concerning the data, if we focus on a single circuit e.g. qft we can use the same data as the qibo scaling plot.
Yes, the idea was to focus on one or two circuits for the dry run vs simulation comparison.
Yes, I agree with the CPU baseline. IMHO I prefer the scaling vs the bar plot, because I think that it's more useful, but we may add both and use the bar plot to compare dry run with simulation.
Sure, if the scaling looks better we can go with that one. I'd just add both in the notebook so that we see how they look and we may not use them both in the paper.
In terms of this PR, if we add some script that generates the two dry run vs simulation barplots and the last plot you posted (scaling with different cuda toolkit versions and caching) in the notebooks, I believe it should be ready, unless you are planning to add something else. Let me know if you need any help with this. Other than that, it is just a matter of re-running all benchmarks required for the plots on an idle machine and re-executing the notebooks using these data.
Thanks for the summary @stavros11 .
I believe we are missing this from the notebooks, right? I'm not sure exactly how the latest breakdown barplot looks, because I believe the data are not in the gist, but I think there is no dry-run vs execution comparison. Perhaps we could expand these barplots to have two bars for each circuit, one with total dry time and one with total simulation time, and both bars can be broken down into (import + creation + simulation). This way we could show that both the import and dry run overheads become less significant when increasing the number of qubits. Let me know what you think.
I also like this idea. The aim of this plot would be to show that the execution is faster compared to the dry run, correct? Because then, since the two should be quite different, I don't know if we would be able to see clearly the different parts (import, creation and simulation) for the execution time. One solution would be to normalize them both (dry run and execution), but then we wouldn't be able to show the overall difference between the two. Let me know if you have a particular idea for this plot.
For the plot comparing multiple GPU models I have no strong preference between the scaling and the barplot. Once they are ready we can decide which one we should choose.
Regarding the proceeding/paper: Given that the proceedings will be focused on qibojit, we could only include plots that are connected to this and the effects of just-in-time compilation. These would be: 1-breakdown including dry run vs simulation comparison, 4-compare different hardware, 3(if there is space)-advantages in multigpu and CPU-GPU communication when compared to qibotf and finally end with 2-general scaling and comparison of all qibo backends. We could leave the library comparisons for the paper which will be more extensive and we could do single and double to include multiple libraries. Let me know what you think.
I agree with the structure of the paper. In the dry time vs simulation time, I would also add a small section where we discuss the various problems regarding the dry run overhead (e.g. it changes between different CUDA toolkit versions) and how we managed to solve them by caching the compiled kernels, maybe with a plot similar to this one.
I also agree with the structure of the proceeding/paper. I think that it should be fine if we include the library comparison for the paper.
@stavros11 were you thinking about something similar to this? qibojit_breakdowndouble[18, 20, 22, 24, 26].pdf
I repeated the qibo benchmarks on an idle qibomachine and, as expected, the noise disappears. I believe the scaling plots now look good: qibo_scaling_total_dry_time_double.pdf qibo_scaling_total_simulation_time_double.pdf and the same with nqubits fixed and changing the circuit at the x-axis:
@stavros11 were you thinking about something similar to this? qibojit_breakdowndouble[18, 20, 22, 24, 26].pdf
Yes, exactly like this. I was actually producing the same plots using my data right now qibojit_cpu_dry_vs_simulation_double_qft.pdf qibojit_gpu_dry_vs_simulation_double_qft.pdf
and the same with fixed nqubits and changing the circuit in the x-axis: qibojit_cpu_dry_vs_simulation_double_20.pdf qibojit_gpu_dry_vs_simulation_double_20.pdf
Your legend looks better, I would just remove the creation time since it is invisible anyway. I guess your plot is on GPU, right? From my results, I believe the CPU plots are a bit boring as there is not much difference between dry run and simulation, but the GPU case might be interesting, as we can see the constant overhead of compilation and that it becomes less significant with increasing number of qubits (in terms of total simulation time percentage).
Thanks, could you please include cuquantum in the first plot?
Thanks, could you please include cuquantum in the first plot?
Here are some plots including cuquantum:
Scaling qibo_scaling_total_dry_time_double.pdf qibo_scaling_total_simulation_time_double.pdf
Dry run vs simulation bar plot as a function of nqubits qibojit_gpu_dry_vs_simulation_double_qft.pdf qibojit_gpu_dry_vs_simulation_double_variational.pdf qibojit_gpu_dry_vs_simulation_double_supremacy.pdf
Dry run vs simulation bar plot for different circuits qibojit_gpu_dry_vs_simulation_double_20qubits.pdf qibojit_gpu_dry_vs_simulation_double_30qubits.pdf
Interesting that cuquantum has a much larger dry-run. I would not expect that...
Thanks for the plots.
- Dry run vs simulation bar plot for different circuits qibojit_gpu_dry_vs_simulation_double_20qubits.pdf qibojit_gpu_dry_vs_simulation_double_30qubits.pdf
It is also strange that for 30 qubits the qft of cuquantum is much worse compared to the 20 qubits case.
Btw, I don't think we have adiabatic evolution (e.g. TFIM) using qibojit. I think we should give it a try and compare with the results presented in the original Qibo paper.
We can use this PR to implement the benchmark scripts and related plots. For now, I've implemented a plot with the scaling of each qibo backend w.r.t. the number of qubits, e.g.: qibo_scaling_total_simulation_time_double.pdf Actually, I mainly recycled @stavros11's code. Logs can be generated by running scripts/qibo_scaling_cpu.sh and scripts/qibo_scaling_gpu.sh, while the plots can be generated and personalized in plots/qibo_scaling.ipynb.
EDIT: I also implemented a plot with qibojit performance on different GPUs, e.g.: qibojit_gpu_total_simulation_time_double.pdf Logs can be generated by running scripts/qibojit_gpu.sh on different machines, while the plots can be generated and personalized in plots/qibojit_gpu.ipynb.