Continuous benchmarking

pgrete commented 4 years ago

Regarding "continuous benchmarking" I spent some (don't ask...) time trying to figure out how we could go about it (specifically pinging @AndrewGaspar and @JoshuaSBrown but everyone else is welcome to provide input, too)

My original idea was to implement this as a two step process as (I hoped) the first one would be straightforward and the second one more involved.

Step 1: Have CI automatically "comment" on a PR uploading the performance image to the comment through the GitHub API (similar to directly pasting or drag&drop an image in the browser).
Step 2: Have a long term repo that stores the/all performance results and also have an interface to interactively plot the results, cf. Ginkgo Performance Explorer/GPE

The problem with step 1 is that uploading images to GitHub within a comment is not possible through the API. Thus, another "service" is required to host the images. One could store the images in another repo or use Imgur (or similar). On the other hand, the question is how important it is to retain those images for a long time (compared to just keeping the raw numbers). The main idea behind the image is to have an easy human-readable format that provides direct feedback on performance degradation in a PR.

I'm current leaning towards the following approach

[now] increase the artifact retention time to 7 or 10 days (which should be fine for most PRs to either get merged or updated so that a new artifact is created)
[now] have CI add a comment to the PR containing the image (hosted on GitLab until the retention time is reached)
[now] create a new repository where we start storing the results of the benchmarks in json format now
[between now and later] create a simple way to plot/visualize the results offline (e.g., using a jupyter notebook or similar)
[later] figure out a way to process the data repository automatically, e.g., by using GPE, and eventually use the automatically generated results to provide feedback on performance within PRs

The main idea behind this approach is that we start keeping track of performance now and can then figure out a way to plot the evolution over time later.

What are people's thoughts?

JoshuaSBrown commented 4 years ago

[now] increase the artifact retention time to 7 or 10 days (which should be fine for most PRs to either get merged or updated so that a new artifact is created)

I don't have a strong opinion on this, it could get annoying if you want to see the performance changes as you work on a pull request if images start disappearing but I'm not sure it's that big of a deal. So I think getting something working and then making change as needed is the way to go. So if you get something working I won't complain.

[now] have CI add a comment to the PR containing the image (hosted on GitLab until the retention time is reached)

Your saying having a link pointing to the image? If so this is the only way I could figure out how to get something working anyway. You could also host the images directly on github as I mentioned to you in one of our last discussions using orphan branches, but I leave this at your disgression.

[now] create a new repository where we start storing the results of the benchmarks in json format now

I would need to look into this to see if creating a separate repository is the best way to go as it is I don't have a strong opinion. In terms of .json format I think this is a valid option, but I think there are also a lot of tools that could benefit from having the information available in simple txt files with a time stamp on each line. This would also make it really easy for text manipulation using shell scripting.

[between now and later] create a simple way to plot/visualize the results offline (e.g., using a jupyter notebook or similar)

Sure, depending on the format of the data, data visualization might be trivial.

[later] figure out a way to process the data repository automatically, e.g., by using GPE, and eventually use the automatically generated results to provide feedback on performance within PRs

Do you mean automatically generated plots? The data will already be stored in the repository right? Or what kind of results are you referring to, as in are you thinking of adding fits etc? There also maybe other tools more suitable than Ginkgo Performance Explorer/GPE e.g. the elastic stack, I still have not looked much into what features are available through this particular api. Again, I think if we have something working we can always iterate, even a python script that runs in the ci to plot images would work.

pgrete commented 4 years ago

[now] increase the artifact retention time to 7 or 10 days (which should be fine for most PRs to either get merged or updated so that a new artifact is created)

I don't have a strong opinion on this, it could get annoying if you want to see the performance changes as you work on a pull request if images start disappearing but I'm not sure it's that big of a deal. So I think getting something working and then making change as needed is the way to go. So if you get something working I won't complain.

The disappearing part is why I suggested to increase the time to 7-10 days. And if we need the images for longer for a specific PR one can always keep the artifacts for longer for a specific job through the GitLab webinterface.

[now] have CI add a comment to the PR containing the image (hosted on GitLab until the retention time is reached)

Your saying having a link pointing to the image? If so this is the only way I could figure out how to get something working anyway. You could also host the images directly on github as I mentioned to you in one of our last discussions using orphan branches, but I leave this at your disgression.

The link would actually be embedded. The key issues is where the image is hosted. I'm generally not a big fan of a binary repository. Even the results in json (or some other format) are more suitable for a database than a repository.

[now] create a new repository where we start storing the results of the benchmarks in json format now

I would need to look into this to see if creating a separate repository is the best way to go as it is I don't have a strong opinion. In terms of .json format I think this is a valid option, but I think there are also a lot of tools that could benefit from having the information available in simple txt files with a time stamp on each line. This would also make it really easy for text manipulation using shell scripting.

We could think of a conversion tool (from json to simple one line txt). The main reason I suggested json was that the content may be extended over time and already now is quite complex (for a single line), e.g., a single performance number should be connected to the processor, the GPU, mpi yes/no, number of processes, number of threads, number of streams, mesh size, block size, problem type, and potentially more information (like build options, compiler version etc) but we can keep it more simple in the beginning.

[later] figure out a way to process the data repository automatically, e.g., by using GPE, and eventually use the automatically generated results to provide feedback on performance within PRs

Do you mean automatically generated plots? The data will already be stored in the repository right? Or what kind of results are you referring to, as in are you thinking of adding fits etc? There also maybe other tools more suitable than Ginkgo Performance Explorer/GPE e.g. the elastic stack, I still have not looked much into what features are available through this particular api. Again, I think if we have something working we can always iterate, even a python script that runs in the ci to plot images would work.

I was referring to using something like GPE (once it's setup) to create (and store or on demand) the images (versus using them from the GitLab CI with a limited retention time) based on the raw data (in json or similar) retained somewhere (e.g., in a data repo). I'm curious to hear more about the elastic stack but, as you say, I also expect that it'll be straightforward to adapt if use a flexible/easily convertible format to store the raw performance data.

JoshuaSBrown commented 4 years ago

The link would actually be embedded. The key issues is where the image is hosted. I'm generally not a big fan of a binary repository. Even the results in json (or some other format) are more suitable for a database than a repository.

Do you have ideas for this already? I guess we are back to our the discussions we had about where to store the gold standard.

We could think of a conversion tool (from json to simple one line txt). The main reason I suggested json was that the content may be extended over time and already now is quite complex (for a single line), e.g., a single performance number should be connected to the processor, the GPU, mpi yes/no, number of processes, number of threads, number of streams, mesh size, block size, problem type, and potentially more information (like build options, compiler version etc) but we can keep it more simple in the beginning.

I like this idea.

JoshuaSBrown commented 4 years ago

I'm curious to hear more about the elastic stack but, as you say, I also expect that it'll be straightforward to adapt if use a flexible/easily convertible format to store the raw performance data.

Elastic stack/ELK stack were developed for system admins for tracking large amounts of logging data. This approach might be overkill, but given that we are essentially logging the performance it is an option. It probably has more features than Ginkgo and has a very active user base, so there is that.

Yurlungur commented 4 years ago

Overall no objections. Json seems fine to me. I don't love having a second repository holding all of our performance numbers, though. I worry that distributing our project accross multiple repos is a slippery slope where someday we need a script to clone all the relevant repos to get a working install.

What about putting performance numbers in the wiki? The wiki is itself a github repo, so we should be able to put images and performance data as a json there, which the CI could automatically push to.

pgrete commented 4 years ago

Overall no objections. Json seems fine to me. I don't love having a second repository holding all of our performance numbers, though. I worry that distributing our project accross multiple repos is a slippery slope where someday we need a script to clone all the relevant repos to get a working install.

I agree that we should keep everything in one place that is required to successfully build and run Parthenon. For exactly this reason, I think the performance data should actually be in a separate place as it's not required for running or the correctness etc. itself and, thus, could be seen as a separate project.

What about putting performance numbers in the wiki? The wiki is itself a github repo, so we should be able to put images and performance data as a json there, which the CI could automatically push to.

That's an interesting idea -- I like it. Feels similarly abusing like using releases for gold standards but would do the trick and keep everything under one umbrella (while being separate from the raw code).

pgrete commented 3 years ago

Adding some additional thoughts here following recent discussion and the (more pressing) need to have an easy "continuous benchmarking" infrastructure. As a first step, to make data collection easier #378 added a lot more regions to the codebase so that we get more fine grained output based on the Kokkos profiling machinery. Following that, I asked the Kokkos team about more straightforward options to get json based output of the profiling data as this would be easier to postprocess (or remove that postprocessing step from our machinery). As a result, there's now (WIP) support for json output of the space time tool (https://github.com/kokkos/kokkos-tools/pull/114) as well as for the simple kernel/region timer (https://github.com/kokkos/kokkos-tools/pull/113). Based on those discussion, two existing tool chains were mentioned that effectively may already provide what we are looking for (collect and compare the "application profiles" [i.e., how much time is spent in potentially nested regions and kernels for a canonical test problem]).

The timemory toolkit, see also here for more specific instructions
Caliper and SPOT (should soon be available and we're in contact with the devs)
(plus the Ginkgo Performance Explorer I mentioned above, which may be able to directly process the json data in a repository based on some transformation schemes

Regarding the item we should actually test for, here's a first (incomplete) list of suggestion (always testing on both CPUs with different support for vector instructions as well as GPUs). A compute element (CE) here refers to a single core or a single GPU

Measuring overhead of static overdecomposition
- Use single CE for a fixed, static Mesh size with ever decreasing MeshBlock size
- Provides raw overhead involved in Mesh management and ghost zone handling
- Allows to compare "efficiency" on each architecture
Measuring "raw" performance
- Use all available CEs on a node (i.e., all cores or all GPUs) for both a static and AMR test
- Provides "speedup" number that cannot be absent on any good presentation ;)
- (Using all cores is important to address both issues around caches and thermal throttling)
Measure AMR performance
- Use all available CEs and run a test to reach some error bound both on a static grid and with AMR
- Measure the relative efficiency on the AMR framework on each architecture
Weak scaling (both AMR and static)
- Run on all available CEs on >0 nodes for a problem with roughly constant work per CE
- Provides information the MPI efficiency (I expect that the load balance part will be the most interesting/relevant part here)
Roofline models (nice to have as a mid to long term milestone but probably not top priority right now)

I'll open a PR with scripts to cover some cases soon (as I basically have these already on file anyway from previous tests). In addition to the "specific" goals of each test, the performance data of the individual regions and kernels will allow us to make informed decision on which piece of the framework needs additional performance improvements.

Pinging some more people who may be interested in this so that we can discuss here (or on Wednesday): @Yurlungur @jdolence @AndrewGaspar @JoshuaSBrown @cielling @forrestglines @jmstone

JoshuaSBrown commented 3 years ago

It looks like the documentation for timemory is better than Caliper in regards to kokkos support. I'm in favor with starting with this framework.

JoshuaSBrown commented 3 years ago

@pgrete what are your scripts going to do, and what kind of format are they going to be outputting as. Are you trying to generate files that can go directly in a repo that GINKO can parse (json format) or are you referring to scripts that are meant for running specific tests on certain architectures that you mentioned in our call?

JoshuaSBrown commented 3 years ago

Also if we are going to be storing the raw data then we probably aught to agree on a format. As in how are we storing the data. In a single file. How often should we be adding data. Only if the performance metrics change by certain amount or every time a merge is made to develop.

JoshuaSBrown commented 3 years ago

Possible information that we would want to add:

Meta Data

Date performance metrics were run
Commit
PR number
Architecture
All dependencies used with versions
Build configuration options
Test Specific Meta Data
Test name
Number of GPUs used
Number of MPI procs
Performance Metrics
Timings and performance metrics associated with test (memory, cycles etc as is appropriate)

JoshuaSBrown commented 3 years ago

Repository for hosting performance metrics

Yurlungur commented 3 years ago

@JoshuaSBrown is it possible to put the performance metrics into the wiki instead of a separate repo?

JoshuaSBrown commented 3 years ago

@JoshuaSBrown is it possible to put the performance metrics into the wiki instead of a separate repo?

Yes, that's right we did decide to do that.

pgrete commented 3 years ago

@pgrete what are your scripts going to do, and what kind of format are they going to be outputting as. Are you trying to generate files that can go directly in a repo that GINKO can parse (json format) or are you referring to scripts that are meant for running specific tests on certain architectures that you mentioned in our call?

I put a first sample script here #388 I think the next steps would be to

[ ] Change the profiling tools to produce json output
[ ] Then put all json output from a job in a single json file and augment with additional information (along what @JoshuaSBrown was suggesting above)
- [ ] Ideally, we would save all parameters that ended up being used in that file, i.e., the one in the parthinput file plus the ones overridden from the command line
[ ] Evaluate option on how we can process those outputs (e.g., with the GPE or other solutions and/or directly produce plots)
[ ] add more tests (along what I described above)
[ ] Expand the test to always test again a "reference" commit (which for continuous benchmarking should be develop, but may be another custom commit during development of a new feature). Having reference and new data collected in the job (i.e., on the same machine and environment) ensures that environmental effects (throttling due to a hot rack or faulty power supplies) are less pronounced.

Another reason (I didn't mention in the call yesterday) why I'd like to separate data collection and processing (visualization) is to have the opportunity to use the pipeline "offline", e.g., run the "test" script to get a json object with the raw data, then (optionally) transfer that json object to workstation, and then do additional (potentially custom) postprocessing there

JoshuaSBrown commented 3 years ago

https://github.com/lanl/parthenon/pull/395 authenticated github application for pushing files etc to repository.

jrmadsen commented 3 years ago

So I ran across this while looking at the kokkos-tools issue, not sure if y'all already decided on a path forward but from scanning the comments here, the toolkit API of timemory is basically designed for this exact sort of thing (as opposed to the pre-built kokkos-tools impl mentioned previous). The library has a kokkos-tools implementation but that is essentially the "kokkos-kernels" part of timemory. There is a whole "kokkos-core" toolkit API which will provide very easy solutions for these things:

Performance Metrics

Timings and performance metrics associated with test (memory, cycles etc as is appropriate)

There are a bunch of existing components for all these metrics and it's really easy to compose components into new components. You can use the built-in call-graph storage or manually construct the call-stack by holding onto the bundles, e.g. lightweight_tuple<wall_clock, papi_vector> essentially just holds values data for single wall-clock measurement and papi-counters and unless you call push() and pop() on that bundle object before/after start() and stop(), the data local to those component instances is not stored in the persistent call-graph data.

Since y'all work with Kokkos, y'all can probably easily digest this example (cmake -B build-timemory -DTIMEMORY_BUILD_EXAMPLES=ON /path/to/source && cmake --build build-timemory --target ex_array_of_bundles --parallel $(nproc) && ./build-timemory/ex_array_of_bundles).

In other words, y'all can create a unique bundle of components recording the exact set of metrics y'all want to collect and have full control over how the metrics get accumulated. And synchronizing the data across multiple MPI ranks is straight-forward even with manual accumulation: operation::finalize::mpi_get<ComponentT>{ std::vector<T>, T }.

In terms of .json format I think this is a valid option, I think there are also a lot of tools that could benefit from having the information available in simple txt files with a time stamp on each line

There isn't a command-line tool per-se but it would probably only require maybe 100-200 lines of code in a main. There is a serialization library built-in so you would just read back in the JSON and then write to the stream (or write the text file manually).

- Roofline models (nice to have as a mid to long term milestone but probably not top priority right now)

I work on the "roofline team" as NERSC so there are cpu_roofline and gpu_roofline components with a built-in ERT for calculating the empirical peak automatically as part of the roofline component "finalization".

- Weak scaling (both AMR and static)

Timemory has panda dataframe support via Hatchet for every component (even custom ones) and scaling studies are straight-forward.

Meta Data

...
Test Specific Meta Data

...

There is a tim::manager::add_metadata(string, Tp) which uses the serialization library to support adding basically anything into the <output-folder>/metadata.json file, which automatically includes a lot of those values, the entire set of environment variables, all the timemory settings, etc.

Caliper and SPOT (should soon be available and we're in contact with the devs)

I'm setting up a SPOT instance at NERSC for Kokkos and in the next couple months, I'll get timemory support into SPOT. Just waiting on a decision from SPOT team about whether I should write a SPOT-specific reader for timemory or whether we can re-use their internal hatchet support and just read in the necessary metadata from my metadata.json file (which is my preference).

It looks like the documentation for timemory is better than Caliper in regards to kokkos support. I'm in favor with starting with this framework.

FYI, this has been updated very recently and I fixed a pseudo-issue with kokkos-views causing unnecessary call-stack hierarchies. But, again, that's more front-end API stuff and it sounds like y'all would benefit much more from using the toolkit API.

jrmadsen commented 3 years ago

If y'all do indeed want to use the toolkit API, let me know if y'all want to set up a mini-hackathon or something. Unfortunately, I rolled out this thing out right around Feb 2020... covid basically cancelled any potential post-ISC tutorials so I haven't generated good toolkit guides.

Yurlungur commented 1 year ago

Mostly resolved. CC @jdolence Trenzo Survey

parthenon-hpc-lab / parthenon