Initial development of Amdahl's Law plotting

mikerenfro commented 2 years ago

For #13. May still need some work on this, but here's a start for the Snakefile and Python code to convert the run*.log files into a speedup graph like the one shown below. Other items of note:

Since we're not doing real work in the MPI ranks, we can expand the horizontal axis out considerably beyond 8 processors. This gets us closer to the speedup limit, and de-emphasizes the bits of experimental error in the timing results.
We may need to explicitly specify the number of processes used in mpirun, especially in an HPC environment, as mpirun may query the scheduler to know how many processes it "should" run.
My time values aren't particularly close to the theoretical values. Not sure if the conda-based mpirun is deficient or if I'm missing something else. Some runs have been better than others, but for a synthetic example, I'd have expected my values to be practically perfect on every run. Follow-up: Trevor's results were basically perfect, so it's possibly a local problem. Will check later with the current Amdahl code to verify.
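
For reference, here is a minimal sketch of the theoretical curve the measured timings are compared against. It assumes the standard form of Amdahl's Law and a parallel fraction of 0.8 (inferred from the plot name). The function name is illustrative and is not taken from the PR.

```python
def amdahl_speedup(parallel_fraction, n_procs):
    """Theoretical speedup S(n) = 1 / ((1 - p) + p / n)."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_procs)


if __name__ == "__main__":
    p = 0.8  # assumed parallel fraction, not read from the run logs
    for n in (1, 2, 4, 8, 16, 32, 64):
        print(f"{n:3d} processes: speedup {amdahl_speedup(p, n):.2f}")
    # The curve flattens toward 1 / (1 - p) as the process count grows.
    print(f"limit as n grows: {1.0 / (1.0 - p):.2f}")
```

With p = 0.8 the limit is 5, so widening the horizontal axis past 8 processes shows the curve approaching that ceiling.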

[Image: amdahl-080-percent]

bkmgit commented 2 years ago

Consider just using NumPy to keep it simple: https://matplotlib.org/stable/gallery/lines_bars_and_markers/simple_plot.html#sphx-glr-gallery-lines-bars-and-markers-simple-plot-py

mikerenfro commented 2 years ago

I can go either way on that. If the plotting code is intended to be black box like the Amdahl script is, then I don't mind making it more legible and showing the theoretical speedup line. If they're expected to code it up on their own, then a simpler plotting code is fine.

One other reason I added the extra line was because my timing results were so far off from the expected values. Not sure if that's due to conda's mpirun or something present on both my Mac and my HPC. Adding the theoretical line could lead to discussions of "yeah, and your results are right on that line because this is a somewhat synthetic problem," "yeah, your results are off the line because measurements aren't perfect," or others rather than explaining why different learners' graphs look visibly different.

tkphd commented 2 years ago

It's possible, but I believe it would be more complicated to build the NumPy arrays than it is to build the Pandas DataFrame. The input data for the plot exists as a number of JSON files written by the compute nodes to disk, essentially one "row" per file. The process of building the DataFrame, as Mike programmed it, is relatively straightforward to explain. If you can come up with a competitive solution with NumPy, I'd certainly be interested to see it.
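
A rough sketch of the "one row per file" idea, using only the standard library so it runs anywhere. This is not the code in the PR. The key names `nproc` and `execution_time` and both helper names are assumptions made for illustration, since the actual JSON fields are not shown in this thread.

```python
import json
from pathlib import Path


def load_rows(directory, pattern="*.json"):
    """Read every matching JSON file and return one dict ("row") per file."""
    rows = []
    for path in sorted(Path(directory).glob(pattern)):
        with path.open() as handle:
            rows.append(json.load(handle))
    return rows


def add_speedup(rows, procs_key="nproc", time_key="execution_time"):
    """Sort rows by process count and attach speedup relative to the
    run with the fewest processes (hypothetical key names)."""
    ordered = sorted(rows, key=lambda row: row[procs_key])
    baseline = ordered[0][time_key]
    return [dict(row, speedup=baseline / row[time_key]) for row in ordered]
```

A list of dicts like the one returned by `load_rows` can be passed directly to `pandas.DataFrame(rows)`, which is roughly the step being described here.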

The plotting script is intended to be a gray box: learners can open the file and see what's going on inside, but it is not necessary to do so, and we will certainly not spend time in a Snakemake lesson on writing a Python plotting script.

tkphd commented 2 years ago

The current version seems to nail Amdahl.

[Image: amdahl-80-pct-parallel]

bkmgit commented 2 years ago

Writing out a CSV file rather than JSON might make it easier, but if Python will be required for this anyway, then using Pandas is OK.

tkphd commented 2 years ago

@bkmgit if you would like to offer a complete solution using NumPy, we will consider it, but that is outside the scope of this PR.

tkphd commented 2 years ago

Done tinkering with this for now. Here's what the scaling plot currently looks like.

[Image: amdahl-scaling-study]

There's also a PR on amdahl that improves the random jitter code (to accept a jitter proportion on the CLI and only increase runtime, unless the proportion is negative, in which case only decrease runtime). If you're in a reviewing mood, please take a look!
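
To make the described jitter behavior concrete, here is a small sketch of one way it could work. It is a guess based on the description above, not the code in the amdahl PR. The function name, its parameters, and the choice of a uniform distribution are all assumptions.

```python
import random


def apply_jitter(runtime, proportion, rng=random):
    """Scale runtime by a random factor bounded by the jitter proportion.

    Non-negative proportion: factor in [1, 1 + proportion], runtime only grows.
    Negative proportion: factor in [1 + proportion, 1], runtime only shrinks.
    """
    if proportion >= 0:
        low, high = 0.0, proportion
    else:
        low, high = proportion, 0.0
    return runtime * (1.0 + rng.uniform(low, high))
```

For example, `apply_jitter(10.0, 0.1)` returns a value between 10 and 11 seconds, while `apply_jitter(10.0, -0.1)` returns a value between 9 and 10.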

reid-a commented 2 years ago

I think it's useful to get a version of this code into the repo. In the spirit of not letting the perfect be the enemy of the good, I favor merging.

hpc-carpentry / old-hpc-workflows

Initial development of Amdahl's Law plotting #28