jjellio opened this issue 4 years ago
This is going to be quite nice.
Nice!
PR #7377 has merged, so the tools used to collect the stats are in the repo.
The next step is sorting out how to get CMake to use the statistics-gathering compiler wrappers.
If anyone has comments, this is what I've outlined to go over w/Ross.
- Wrapper creation
- Wrapper preservation
- Iron out how downstream customers can interact with this (or defer this, and focus on showing capability w/Trilinos)
- Data aggregation
- Sort out what to do with aggregated data
@jjellio, I posted WIP PR #7508 that gets the basic build wrappers in place. See the task list in that PR for next steps. I think with a few hours of work, we will have a very nice Minimum Viable Product that we can deploy and start using in a bunch of builds that post to CDash.
@jjellio, I talked with Zack Galbreath at Kitware today and he mentioned that there is another option for uploading files to and downloading files from CDash: the ctest_upload() command. That command also allows you to define URLs that will be listed for the build. With that, I think we could provide the data and the hooks that are needed for your tool with the prototype at:
For example, you could define a URL like:
(the ??? part yet to be filled in).
Zack is going to add a strong automated test to CDash to make sure that you can download those files, one at a time.
To support this, I could add a hook to tribits_ctest_driver() to call ctest_upload() with a custom list of files (and URLs). We will need to do some testing to work this out, but if this works, then any build that is run with the tribits_ctest_driver() function would automatically support uploading the build stats file and the associated URL links to look at it in more detail. So we could easily support this for all of the ATDM Trilinos builds and all other builds that use the tribits_ctest_driver() function. But this would not work for Trilinos PR builds, since those don't use the tribits_ctest_driver() function, so we could not get them to post build stats. (But they could implement a call to ctest_upload().)
But even if we use ctest_upload() to upload the build_stats.csv file, I think we still want a runtime test (TrilinosBuildStats_Results) that will summarize the most important stats in text form (like shown in https://github.com/trilinos/Trilinos/pull/7508#issuecomment-642052702) so we can search them with the "Test Output" filter field on the cdash/queryTests.php page, and so we can put strong checks on these max values and fail the test if the numbers get too high. And with this latter part, you could even fail PR builds if the numbers get too high.
Anyway, I have some more work to do before PR #7508 is ready to merge, so I will get to it.
I actually tried to pass the cdash file to my github.io page:
It fails due to security policies (cross-origin restrictions) that block JavaScript from loading files from a domain other than the script's own. I'm not browser-savvy enough to know what to do about it; it would be nice if I could work around that. But I guess if all else fails, the webpage could be hosted inside SNL (maybe that would avoid the security issue).
Even if I work around that security issue, I'll still need to figure out how to decode a tarball (that should be doable; I see JavaScript libraries for it).
@jjellio, worst-case scenario, developers could just download the 'build_stats.csv' file off of CDash and then upload it to your site when they are doing deeper analysis. Otherwise, we can ask Kitware for help with the web issues.
But developers are not going to bother looking at any data unless they think there is a problem. That is what we can address by filling out the test TrilinosBuildStats_Results to run a tool that summarizes the critical build stats. I suggested that in https://github.com/trilinos/Trilinos/pull/7508#issuecomment-642052702. What I propose is to write a Python tool called summarize_build_stats.py that will read in the 'build_stats.csv' file and then produce, to STDOUT, a report like:
Full Project: max max_resident_size = <max_resident_size> (<file-name>)
Full Project: max elapsed_time = <elapsed_time> (<file-name>)
Full Project: max file_size = <file_size> (<file-name>)
Kokkos: max max_resident_size = <max_resident_size> (<file-name>)
Kokkos: max elapsed_time = <elapsed_time> (<file-name>)
Kokkos: max file_size = <file_size> (<file-name>)
Teuchos: max max_resident_size = <max_resident_size> (<file-name>)
Teuchos: max elapsed_time = <elapsed_time> (<file-name>)
Teuchos: max file_size = <file_size> (<file-name>)
...
Panzer: max max_resident_size = <max_resident_size> (<file-name>)
Panzer: max elapsed_time = <elapsed_time> (<file-name>)
Panzer: max file_size = <file_size> (<file-name>)
...
Such a tool needs to know how to map file names to TriBITS packages. There is already code in TriBITS that can do that.
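As a rough illustration of that mapping, a hypothetical helper could lean on the packages/<package-dir>/... path convention (the real TriBITS mapping code is more involved than this):

```python
# Hypothetical sketch only; TriBITS has its own (more complete) mapping code.
def package_of(file_name):
    """Map a FileName field like 'packages/panzer/...' to a package directory."""
    parts = file_name.split("/")
    if len(parts) >= 2 and parts[0] == "packages":
        return parts[1]  # e.g. 'panzer', 'kokkos', 'teuchos'
    return "Full Project"  # files outside packages/ roll up to the project
```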
Are you okay with me taking a crack at writing an initial version of summarize_build_stats.py? It would be better to write that as a TriBITS utility because then I could use MockTrilinos to write strong unit tests for it.
What do you think?
@jjellio
So it turns out that CDash does not currently support downloading files that were uploaded using the ctest_upload() command, like you see here:
with the files (and URL) viewed at:
However, it does look like CDash supports downloading files uploaded to a test using the ATTACH_FILES ctest property. For example, for the trial build and submit shown at:
if you get the JSON from:
you see (pretty printed):
{
...
test: {
id: 8313,
buildid: 5522160,
build: "Linux-gnu-openmp-shared-dbg-pt",
buildstarttime: "2020-06-09 15:45:48",
site: "crf450.srn.sandia.gov",
siteid: "187",
test: "TrilinosBuildStats_Results",
time: " 50ms",
...
measurements: [
{
name: "Pass Reason",
type: "text/string",
value: "Required regular expression found.Regex=[OVERALL FINAL RESULT: TEST PASSED .TrilinosBuildStats_Results.<br />\n]"
},
{
name: "Processors",
type: "numeric/double",
value: "1"
},
{
name: "build_stats.csv",
type: "file",
fileid: 1,
value: ""
}
]
},
generationtime: 0.04
}
So it looks like you can get that data in Python (converted to a recursive list/dict data structure), loop over the dicts in data['test']['measurements'], and find the file as:
{
name: "build_stats.csv",
type: "file",
fileid: 1,
value: ""
}
That dict is data['test']['measurements'][2] in this case.
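For example, a small Python sketch for pulling out that measurement (the testDetails endpoint name and the use of the requests library are assumptions here, not confirmed CDash API details):

```python
# Hypothetical sketch: fetch the test JSON shown above and locate the
# attached build_stats.csv measurement to get its 'fileid'.
import requests

def find_build_stats_fileid(cdash_site, buildid, testid):
    url = "%s/api/v1/testDetails.php" % cdash_site  # assumed endpoint
    data = requests.get(url, params={"buildid": buildid, "testid": testid}).json()
    for m in data["test"]["measurements"]:
        if m.get("type") == "file" and m.get("name") == "build_stats.csv":
            return m["fileid"]
    return None
```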
Given the 'fileid' field value of '1', you can then download the data using the URL:
You can find your way to this test, for example, by knowing the CDash Group, Site, Build Name, and Build Start Time and plugging those into this query:
The JSON for that is shown at:
which has the element:
{
...
builds: [
{
testname: "TrilinosBuildStats_Results",
site: "crf450.srn.sandia.gov",
buildName: "Linux-gnu-openmp-shared-dbg-pt",
buildstarttime: "2020-06-09T15:47:22 MDT",
time: 0.05,
prettyTime: " 50ms",
details: "Completed\n",
siteLink: "viewSite.php?siteid=187",
buildSummaryLink: "build/5522163",
testDetailsLink: "test/18733864",
status: "Passed",
statusclass: "normal",
nprocs: 1,
procTime: 0.05,
prettyProcTime: " 50ms"
}
],
...
}
which has testDetailsLink: "test/18733864".
So there you have it. If you know the following fields:
you can find the test TrilinosBuildStats_Results for that build and download its attached 'build_stats.csv' file (as a tarred and zipped file 'build_stats.csv.tgz').
So we can do what we need by attaching the file to a test. But it is a bit of a run-around to find what we need.
It would be more straightforward to upload with ctest_upload() and then directly download the file from CDash. But, again, CDash does not currently support that, and the Trilinos PR ctest -S driver does not support it either.
So for now, I would suggest that we just go with uploading the 'build_stats.csv' file to the test TrilinosBuildStats_Results and then downloading it from there for any automated tools.
FYI: Further discussion about CDash upload and download options should occur in the newly created issue:
so we can get some direct help/advice from Kitware.
Ross, I think the summarize tool would be better (for maintenance/extensibility) implemented as a script that CMake calls (optionally promising to generate a file if needed).
If there was a dummy script, commonTools/build_stats/summarize_build_stats.py (or wherever), then you could build the CMake stuff now, and later changes would just need to fiddle with that file.
I can conceive how to implement summarize_build_stats.py as just plain bash. Since the file is CSV, you'd head -n1 to get the header and store that as an array variable. Then search the header for the indexes of the metrics you want. Next, you'd grep the file for FileName = packages/foo/. From that subset of the file, cut -fN, where N is the number from the header array. Pipe that through awk or bc to sum it up. Additionally, you could sort the subset matching the package and select the top file for each metric. This could be a fairly simple /bin/bash script. Python + CSV makes sense if you want complex analysis, but for just package summaries perhaps it would be easier via bash.
I do think there would be value in showing package-level aggregates:
Panzer: max max_resident_size = <max_resident_size> (<file-name>)
Panzer: max elapsed_time = <elapsed_time> (<file-name>)
Panzer: max file_size = <file_size> (<file-name>)
Panzer: Total Time
Panzer: Total Memory
Panzer: Total Size
All Files: Total Time (this is effectively the total build time)
All Files: Total Memory (to be consistent)
All Files: Total Size (roughly how much storage this build required)
Size in particular is helpful, as it indicates how much space the filesystem servers need.
All of the above can be implemented via bash, I think: just a few loops plus cut/grep/awk (standard tools that will always be present on machines).
Ross, I think the summarize tool would be better (for maintenance/extensibility) implemented as a script that CMake calls (optionally promising to generate a file if needed).
@jjellio, yes, that is exactly what I was suggesting.
I can conceive how to implement summarize_build_stats.py as just plain bash
Such a tool would be very hard to write, test, and maintain in bash. Do you have something against Python?
Just to get this started, I will add a simple Python script:
Trilinos/commonTools/build_stats/summarize_build_stats.py
that will just provide project-level stats:
Full Project: sum(max_resident_size_mb) = <sum_max_resident_size_mb> (<num-entries> entries)
Full Project: max(max_resident_size_mb) = <max_max_resident_size_mb> (<file-name>)
Full Project: max(elapsed_real_time_sec) = <max_elapsed_time_sec> (<file-name>)
Full Project: sum(elapsed_real_time_sec) = <sum_elapsed_time_sec> (<num-entries> entries)
Full Project: sum(file_size_mb) = <sum_file_size_mb> (<num-entries> entries)
Full Project: max(file_size_mb) = <max_file_size_mb> (<file-name>)
That will avoid needing to deal with the package logic for now. We can always add package-level stats later when we have the time (and that will require using some TriBITS utilities to convert from file paths to package names). That way, we can turn this on for PR testing now and merge PR #7508.
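For concreteness, here is a minimal sketch of what such a script might look like (assuming only the CSV column names shown above and a FileName column; the real script will differ):

```python
#!/usr/bin/env python
# Minimal sketch of summarize_build_stats.py: read build_stats.csv and print
# project-level sum/max stats for a few columns. Python 2.7/3.x compatible.
import csv
import sys

FIELDS = ["max_resident_size_mb", "elapsed_real_time_sec", "file_size_mb"]

def summarize(csv_file):
    with open(csv_file) as f:
        rows = list(csv.DictReader(f))
    for field in FIELDS:
        vals = [(float(r[field]), r["FileName"]) for r in rows]
        print("Full Project: sum(%s) = %.2f (%d entries)"
              % (field, sum(v for v, _ in vals), len(vals)))
        max_val, max_file = max(vals)
        print("Full Project: max(%s) = %.2f (%s)" % (field, max_val, max_file))

if __name__ == "__main__":
    summarize(sys.argv[1])
```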
Okay?
I have no issues with Python other than you have to be aware of 2.x vs 3.x stuff.
I love Python's regex library. Python 3.x with 'format strings' is awesome, e.g., f'A variable in scope: {some_var}'.
Tangents below:
Another issue to consider is how to interact with developers. I'll need to improve the webpage (better explanations and styling, for sure).
Yet another issue: can we use the info here to feed back into Ninja or CMake to improve our build system? This could be an interesting question for Kitware. E.g., if we could provide a list of targets (file.o things) plus a weight, could Kitware use that to orchestrate a good Ninja file? Or perhaps we could do that ourselves (I have already done something similar). Given weights + num_parallel_procs, coerce the existing build.ninja such that a certain memory high-water mark is avoided. (It's effectively a variant of the knapsack packing problem, I believe.)
@rmmilewi (CC Reed, this may be something he'd like to keep abreast of)
I love Python's regex library. Python 3.x with 'format strings' is awesome, e.g., f'A variable in scope: {some_var}'.
But we can't use that if we keep Python 2.7 support (which I think we need to as long as that is the default Python on RHEL7). But when it comes to programming with data structures, there is no contest between Python and bash; Python is the clear winner.
FWIW, for me, bash is better if the code you are writing is going to be loading modules and mostly just running commands, with very little logic otherwise. (Python can't really improve on that, and dealing with the env from Python is very messy and hacky.) But if you have any non-trivial logic or need any sophisticated data-structure manipulation, you are crazy to write that code in bash when Python is an option. (I have come to regret writing some code in bash recently that should have been written in Python.)
CC: @jjellio
FYI: Already getting some interesting stats after getting this running in the ATDM Trilinos builds as shown here:
The max_resident_size_mb varies from as large as 6047.13 (i.e. 6.0 GB) for the Trilinos-atdm-van1-tx2_arm-20.0_openmpi-4.0.2_openmp_static_dbg build shown here:
Full Project: sum(max_resident_size_mb) = 2290064.25 (10064 entries)
Full Project: max(max_resident_size_mb) = 6047.13 (packages/panzer/adapters-stk/example/main_driver/PanzerAdaptersSTK_main_driver.exe)
Full Project: sum(elapsed_real_time_sec) = 66346.24 (10064 entries)
Full Project: max(elapsed_real_time_sec) = 100.48 (packages/panzer/adapters-stk/example/main_driver/PanzerAdaptersSTK_main_driver.exe)
Full Project: sum(file_size_mb) = 135039.36 (10064 entries)
Full Project: max(file_size_mb) = 781.02 (packages/panzer/adapters-stk/example/main_driver/PanzerAdaptersSTK_main_driver.exe)
to as little as 1776.56 (i.e. 1.8 GB) for the Trilinos-atdm-van1-tx2_arm-20.0_openmpi-4.0.2_openmp_static_opt build shown here:
Full Project: sum(max_resident_size_mb) = 1494626.1 (10059 entries)
Full Project: max(max_resident_size_mb) = 1776.56 (packages/panzer/mini-em/example/BlockPrec/CMakeFiles/PanzerMiniEM_BlockPrec.dir/main.cpp.o)
Full Project: sum(elapsed_real_time_sec) = 100029.03 (10059 entries)
Full Project: max(elapsed_real_time_sec) = 378.77 (packages/zoltan2/test/driver/CMakeFiles/Zoltan2_test_driver.dir/test_driver.cpp.o)
Full Project: sum(file_size_mb) = 24429.11 (10059 entries)
Full Project: max(file_size_mb) = 119.06 (packages/panzer/adapters-stk/example/main_driver/PanzerAdaptersSTK_main_driver.exe)
The large memory usage for debug builds on 'van1-tx2' (ASTRA platform 'stria') explains why I had to turn down the parallel build level (in PR #7298).
And note that the above does not yet include the stats for building static libraries using ar.
From: Bartlett, Roscoe A
Sent: Wednesday, September 9, 2020 2:50 PM
To: Elliott, James John <jjellio@sandia.gov>
Subject: RE: Trilinos Build Stats wrappers?
Hello James,
I would like to merge what is in:
https://github.com/trilinos/Trilinos/pull/7508
so we don't lose it and it does not get conflicts with other code.
Any objections to that?
Then we can address issues with AR, Intel Fortran, and other topics in future PRs, as listed in:
https://github.com/trilinos/Trilinos/pull/7508
-Ross
@jjellio, with the merge of #7508, we should meet to discuss what has been done up to this point and what needs to be done as listed above in the Tasks section in order for this to be able to handle all of our builds and just be more robust.
FYI: Just met with @jjellio on this. He is going to create a new branch off of 'develop' and re-type his changes to support Intel Fortran and AR there and create a new PR for that. I will then review it and help test it out with the PR builds and the ATDM Trilinos builds so we can turn on these build stats wrappers in every PR and ATDM Trilinos build. If it works in all of those builds, it should work just about anywhere.
Some good ideas that came out of my meeting with @jjellio:
- Break down the stats reported by summarize_build_stats.py into libraries (i.e. *.a or *.so.* files), executables (i.e. *.exe) and other files (mostly *.o object files). That way, we can see where the big *.o files are coming from that are creating big *.a files and therefore big *.exe files.
- Add a build target summarize-build-stats (not attached to ALL) that will call summarize_build_stats.py with the right arguments and write a file summarized_build_stats.txt (or even a JSON version of this?). That would make it super easy for developers to get summaries of build stats while doing local development.
- Change the gathering of the *.timing files into the build_stats.csv file to instead merge by the column names so it is robust to changes in the set (and location) of fields in the *.timing files written by older versions of the magic_wrapper.py tool.
- Add Trilinos_<LANG>_BUILD_STATS_COMPILER_WRAPPER to TrilinosConfig.cmake so that downstream customers could set those to CMAKE_<LANG>_COMPILER instead of the original unwrapped compilers listed in Trilinos_<LANG>_COMPILER.
- Query the TrilinosBuildStats_Results test output across builds and sort, plot, or write the numbers into a *.json file, etc. (This is a cheaper way to get build summaries for many builds than having to download the entire build_stats.csv file for each of these builds.)
- Download a build's build_stats.csv file and display the results.

Hello @jjellio, any progress on:
FYI: Just met with @jjellio on this. He is going to create a new branch off of 'develop' and re-type his changes to support Intel Fortran and AR there and create a new PR for that.
from our last meeting with notes above? Hopefully that is not too much work. Otherwise, can you point me to your other branch and I can see if I can copy over the changes to get this working? I would like to wrap up this initial story to get build stats wrappers enabled in every ATDM and PR Trilinos build. After that, we can create a new GitHub (epic) issue for the follow-on tasks listed above.
@jjellio and I just met with Elliot Ridgeway about how this build stats work might mesh with Watcher. We will meet again in about 6 weeks. Two immediate things came out of that meeting that we should fix ASAP:

- Strip the leading ./ from the FileName field in the *.timing files and when gathering up the build_stats.csv file. This is needed to make the site https://jjellio.github.io/build_stats/ work better.

@jjellio, I updated and reorganized the list of Tasks above to better reflect the current state of this Story based on my understanding of things. A couple of things that I changed:

- Removed the mention of unit tests for magic_wrapper.py, because no such tests exist in PR #8638 (at least not given the classic definition of "unit tests").
- Removed the build target summarize-build-stats, since that target does not exist (not even in PR #8638).
- Removed the task about stripping the leading ./ from the beginning of the FileName field in the *.timing files, since that is now handled by the gather_build_stats.py tool.

Hopefully you agree with these updates.
Let's get PR jjellio/Trilinos#1 merged then PR #8638 tested and merged (and hopefully turned on in all ATDM Trilinos and Trilinos PR builds) and then we can discuss next steps.
@jjellio, following up from my comment https://github.com/trilinos/Trilinos/pull/8638#issuecomment-825221154 where I said:
If these build stats wrappers were used in TriBITS ...
actually, these build stats wrappers and the supporting CMake and Python code are more generic than Trilinos or TriBITS. These should really live in their own GitHub repo, and then that repo should get snapshotted into TriBITS/tribits/build_stats and then used and tested in TriBITS, and therefore be able to be used and tested in Trilinos. That might take some work, but it seems like the right thing to do, and then we could provide very strong automated tests for this functionality in TriBITS (which would provide more confidence in these tools and make them easier to develop further and maintain).
Just something to think about.
CC: @jjellio
A glorious day. PR #8638 has finally been merged! This turns on the build stats wrappers in all of the ATDM Trilinos builds (when running the ctest -S driver) and in all of the Trilinos PR builds.
We can see 141 submissions of the test TrilinosBuildStats_Results in the ATDM Trilinos builds showing the new gather script gather_build_stats.py in this query.
And, looking at this query, we are starting to see new PRs running this.
We need to keep an eye on the PR builds for a few days.
It would be nice to break the build-stats summary reported in the TrilinosBuildStats_Results test into libraries, executables, and object files separately before we close this. But that could really be a separate story.
CC: @prwolfe, @jwillenbring
@jjellio, it occurred to me that adding begin and end time stamp fields to the *.timing file generated by magic_wrapper.py could help to debug out-of-memory problems like the one reported in #9432. If you know the start and end times for when each target gets built, and you know the max RAM usage for each target, you can compute, at any moment in time, the max possible RAM being used on the machine, and you will know which targets are involved. That will tell you where to put in effort to reduce the RAM usage for building specific targets and get around a build bottleneck that consumes all the RAM.
Having the build start and end time stamps has other uses as well. For example, when doing a rebuild with old targets lying around, if you only want to report build stats for targets that got built in the rebuild, you could add an argument summarize_build_stats.py --after=<start-time> that filters to targets built only after the start of the last rebuild <start-time> (which you know at configure time and can put into the definition of the test). This would also automatically filter out build stats for targets that no longer exist in the build system from rebuilds months (or years) old. It may serve other purposes that I don't even realize yet, but these are the obvious ones.
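A sketch of that filter (assuming a hypothetical start_time column of ISO-8601 timestamps; no such field exists in the *.timing files yet):

```python
# Hypothetical --after filter: keep only rows whose (assumed) start_time
# field is at or after the given timestamp.
from datetime import datetime

ISO_FMT = "%Y-%m-%dT%H:%M:%S"

def filter_after(rows, after_str):
    after = datetime.strptime(after_str, ISO_FMT)
    return [r for r in rows
            if datetime.strptime(r["start_time"], ISO_FMT) >= after]
```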
CC: @jjellio, @jwillenbring
So I have a Trilinos PR #9894 that is stuck in a loop of failed builds due to the compiler crashing after running out of memory. Following on from the discussion above, it occurred to me that if you store the beginning and end time stamps for each target in the *.timing file, then the summarize_build_stats.py tool can sort the build stats by start and end time and compute the memory high-water mark on the machine due to building at any given time. For example, if 10 object files are currently being built, then you just add up max_resident_size_mb for each of these targets and that gives you the high-water mark at that time for that build as the build stat:
Full Project: max_sum_over_active_targets(max_resident_size_mb)
This would show how close a build is to running out of memory on a given machine, and we could plot that number as a function of time. In fact, we could have the CTest test that runs summarize_build_stats.py create CTest test measurements for:
Full Project: max_sum_over_active_targets(max_resident_size_mb)
Full Project: sum(max_resident_size_mb)
Full Project: max(max_resident_size_mb)
Full Project: sum(elapsed_real_time_sec)
Full Project: max(elapsed_real_time_sec)
Full Project: sum(file_size_mb)
Full Project: max(file_size_mb)
using XML in the STDOUT like:
<DartMeasurement type="numeric/double" name="Full Project: sum(max_resident_size_mb)">4667989.73</DartMeasurement>
Then you could see a graph of these measurements over time right on CDash!
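A sketch of how that stat could be computed and reported (assuming hypothetical start_time/end_time epoch-seconds fields in each row; this is the standard interval sweep, not existing tool code):

```python
# Hypothetical sweep over target build intervals: the peak of the running
# sum of max_resident_size_mb over all concurrently building targets.
def max_sum_over_active_targets(rows):
    events = []
    for r in rows:
        rss = float(r["max_resident_size_mb"])
        events.append((float(r["start_time"]), rss))   # target starts building
        events.append((float(r["end_time"]), -rss))    # target finishes
    events.sort()  # ties sort ends (negative delta) before starts
    current = peak = 0.0
    for _, delta in events:
        current += delta
        peak = max(peak, current)
    return peak

def report_measurement(name, value):
    # CTest scans test STDOUT for <DartMeasurement> XML and posts it to CDash.
    print('<DartMeasurement type="numeric/double" name="%s">%s</DartMeasurement>'
          % (name, value))
```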
This issue has had no activity for 365 days and is marked for closure. It will be closed after an additional 30 days of inactivity.
If you would like to keep this issue open, please add a comment and/or remove the MARKED_FOR_CLOSURE label.
If this issue should be kept open even with no activity beyond the time limits, you can add the label DO_NOT_AUTOCLOSE.
If it is ok for this issue to be closed, feel free to go ahead and close it. Please do not add any comments or change any labels or otherwise touch this issue unless your intention is to reset the inactivity counter for an additional year.
FYI: Kitware is adding build stats support to native CMake, CTest, and CDash. See:
Therefore, I think there will be no need for a separate compiler wrapper tool to gather build stats or scripts to manage that and submit it to CDash.
This Issue is for tracking any effects from a PR I am submitting with tools for collecting very detailed build statistics, which seem to have near-zero cost. A goal of this work is to enable package/product owners to understand how their packages impact compile time, memory usage, and file size (among other metrics).
The impact of using the tool seems to be zero. That is, the tool's overhead sits entirely inside the noise of the build. I did two builds on Rzansel, both using Cuda + Serial, which are the standard ATDM Trilinos settings (I actually built EMPIRE using the tool as well).
Updated Data (for the more complicated python wrappers + NM usage)
NM ON
NM OFF
Clearly using Python has a price... but it is still pretty tiny. A pass through the code for efficiency is planned (maybe moving to in-memory files for the temporaries).
Path forward
The scripts work by wrapping '$MPICC' inside the ATDM Trilinos env. CMake then uses these 'wrapped' compilers. The wrapped compilers emit copious data in the build tree alongside the object file/library/executable that is created. After the build is complete, these 'timing' files are aggregated into one massive CSV file. On Rzansel, the CSV is about 1.8 MB, and it has one line per thing built.
To prevent the wrappers from tampering with CMake's configure phase, I've added a single line to CMakeLists.txt which sets an ENV variable, CMAKE_IS_IN_CONFIGURE_MODE; this allows the wrappers to toggle on/off based on whether a real build is happening versus the configuration phase.
One idea for making this work is to have CTest post the resulting build statistics files directly to CDash along with any testing data. For customers not posting, I can help provide a script that will aggregate the data manually.
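A minimal sketch of that toggle inside a wrapper (the REAL_COMPILER variable name and the .timing format here are illustrative, not the actual magic_wrapper.py implementation):

```python
# Hypothetical compiler wrapper: pass through untouched during configure,
# otherwise time the real compiler and drop a .timing file next to the output.
import os
import subprocess
import sys
import time

def main():
    cmd = [os.environ["REAL_COMPILER"]] + sys.argv[1:]  # illustrative env var
    if os.environ.get("CMAKE_IS_IN_CONFIGURE_MODE"):
        sys.exit(subprocess.call(cmd))  # configure phase: no stats gathering
    t0 = time.time()
    rc = subprocess.call(cmd)
    if "-o" in sys.argv and rc == 0:
        out = sys.argv[sys.argv.index("-o") + 1]
        with open(out + ".timing", "w") as f:  # real wrapper records much more
            f.write("FileName,elapsed_real_time_sec\n%s,%.2f\n"
                    % (out, time.time() - t0))
    sys.exit(rc)

if __name__ == "__main__":
    main()
```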
Once the CSV data is posted, others can develop tools for tracking this over time. I also have some tools that operate directly on the files (using JavaScript).
The tool tracks:
This issue is also tracked in CDOFA-119.
@bartlettroscoe @jwillenbring
Links
Related to
Tasks:
- Set Trilinos_ENABLE_BUILD_STATS=ON in all of the Trilinos PR builds?
- Fix the magic_wrapper.py script for the broken intel builds (see https://github.com/trilinos/Trilinos/pull/7508#issuecomment-646680938 and merged PR #8638).
- Extend the magic_wrapper.py script to wrap AR so that we can get build statistics for static libraries as well (and put in hooks in the Trilinos CMake build system to use this for static builds; see merged PR #8638).
- Add find_program commands for tools being wrapped if they are undefined (applies to AR/Ranlib/LD); see merged PR #8638.
- Change the gathering of the *.timing files into the build_stats.csv file to instead merge by the column names so it is robust to changes in the set (and location) of fields in the *.timing files written by older versions of the magic_wrapper.py tool (a sketch of this merge appears after this list). See merged PR #8638.
- export Trilinos_ENABLE_BUILD_STATS=ON for all of the remaining ATDM Trilinos builds (ctest -S driver only by default). See merged PR #8638.
- Break down the stats reported by summarize_build_stats.py into libraries (i.e. *.a or *.so.* files), executables (i.e. *.exe) and other files (mostly *.o object files). That way, we can see where the big *.o files are coming from that are creating big *.a and .so files and therefore big *.exe files (and where the greatest expense lies). Also, this will allow you to see the expense of building just libraries compared to building tests and examples to some extent.
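Below is a rough sketch of the merge-by-column-names gathering mentioned in the tasks above (the real gather_build_stats.py may differ): each row keeps its values under the right header even when older wrapper versions wrote fewer or reordered fields.

```python
# Hypothetical column-name-based gather of *.timing CSV fragments into one
# build_stats.csv: the union of all column names becomes the output header.
import csv

def gather(timing_files, out_csv):
    rows, all_fields = [], []
    for path in timing_files:
        with open(path) as f:
            for row in csv.DictReader(f):
                rows.append(row)
                for name in row:
                    if name not in all_fields:
                        all_fields.append(name)
    with open(out_csv, "w") as f:
        writer = csv.DictWriter(f, fieldnames=all_fields, restval="")
        writer.writeheader()
        writer.writerows(rows)

# Usage (paths illustrative):
# import glob
# gather(glob.glob("*.timing"), "build_stats.csv")
```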