jjellio opened this issue 4 years ago
This is going to be quite nice.
Nice!
PR #7377 has merged, so the tools used to collect the stats are in the repo.
The next step is sorting out how to get CMake to use the statistics-gathering compiler wrappers.
If anyone has comments, this is what I've outlined to go over w/Ross.
- Wrapper creation
- Wrapper preservation
- Iron out how downstream customers can interact with this (or defer this, and focus on showing capability w/Trilinos)
- Data aggregation
- Sort out what to do with aggregated data
@jjellio, I posted WIP PR #7508 that gets the basic build wrappers in place. See the task list in that PR for next steps. I think with a few hours of work, we will have a very nice Minimum Viable Product that we can deploy and start using in a bunch of builds that post to CDash.
@jjellio, I talked with Zack Galbreath at Kitware today and he mentioned that there is another option for uploading files to and downloading files from CDash: the ctest_upload() command. That command also allows you to define URLs that will be listed for the build. With that, I think we could provide the data and the hooks that are needed for your tool with the prototype at:
For example, you could define a URL like:
(the ??? part yet to be filled in).
Zack is going to add a strong automated test to CDash to make sure that you can download those files, one at a time.
To support this, I could add a hook to tribits_ctest_driver() to call ctest_upload() with a custom list of files (and URLs). We will need to do some testing to work this out, but if this works, then any build that is run with the tribits_ctest_driver() function would automatically support uploading the build stats file and the associated URL links to look at it in more detail. So we could easily support this for all of the ATDM Trilinos builds and all other builds that use the tribits_ctest_driver() function. But this would not work for Trilinos PR builds, since those don't use the tribits_ctest_driver() function, so we could not get them to post build stats. (But they could implement a call to ctest_upload().)
But even if we use ctest_upload() to upload the build_stats.csv file, I think we still want a runtime test (TrilinosBuildStats_Results) that will summarize the most important stats in text form (like shown in https://github.com/trilinos/Trilinos/pull/7508#issuecomment-642052702) so we can search them with the "Test Output" filter field on the cdash/queryTests.php page, and so we can put strong checks on these max values and fail the test if the numbers get too high. And with this latter part, you could even fail PR builds if the numbers get too high.
Anyway, I have some more work to do before PR #7508 is ready to merge, so I will get to it.
I actually tried to pass the cdash file to my github.io page:
It fails due to security policies (cross-origin restrictions) that block JavaScript from loading files from a domain other than the script's own. I'm not browser-savvy enough to know what to do about it; it would be nice if I could work around that. But I guess if all else fails, the webpage could be hosted inside SNL (maybe that would avoid the security issue).
Even if I work around that security issue, I'll still need to figure out how to decode a tarball (that should be doable; I see JavaScript libraries for it).
@jjellio, worst-case scenario, developers could just download the 'build_stats.csv' file off of CDash and then upload it to your site when they are doing deeper analysis. Otherwise, we can ask Kitware for help with the web issues.
But developers are not going to bother looking at any data unless they think there is a problem. That is what we can address by filling out the test TrilinosBuildStats_Results to run a tool that summarizes the critical build stats. I suggested that in https://github.com/trilinos/Trilinos/pull/7508#issuecomment-642052702. What I propose is to write a Python tool called summarize_build_stats.py that will read in the 'build_stats.csv' file and then produce, to STDOUT, a report like:
Full Project: max max_resident_size = <max_resident_size> (<file-name>)
Full Project: max elapsed_time = <elapsed_time> (<file-name>)
Full Project: max file_size = <file_size> (<file-name>)
Kokkos: max max_resident_size = <max_resident_size> (<file-name>)
Kokkos: max elapsed_time = <elapsed_time> (<file-name>)
Kokkos: max file_size = <file_size> (<file-name>)
Teuchos: max max_resident_size = <max_resident_size> (<file-name>)
Teuchos: max elapsed_time = <elapsed_time> (<file-name>)
Teuchos: max file_size = <file_size> (<file-name>)
...
Panzer: max max_resident_size = <max_resident_size> (<file-name>)
Panzer: max elapsed_time = <elapsed_time> (<file-name>)
Panzer: max file_size = <file_size> (<file-name>)
...
Such a tool needs to know how to map file names to TriBITS packages. There is already code in TriBITS that can do that.
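As a rough illustration of that mapping, a hypothetical helper could lean on the packages/<package-dir>/... path convention (the real TriBITS mapping code is more involved than this):

```python
# Hypothetical sketch only; TriBITS has its own (more complete) mapping code.
def package_of(file_name):
    """Map a FileName field like 'packages/panzer/...' to a package directory."""
    parts = file_name.split("/")
    if len(parts) >= 2 and parts[0] == "packages":
        return parts[1]  # e.g. 'panzer', 'kokkos', 'teuchos'
    return "Full Project"  # files outside packages/ roll up to the project
```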
Are you okay with me taking a crack at writing an initial version of summarize_build_stats.py? It would be better to write that as a TriBITS utility because then I could use MockTrilinos to write strong unit tests for it.
What do you think?
@jjellio
So it turns out that CDash does not currently support downloading files that were uploaded using the ctest_upload() command, like you see here:
with the files (and URL) viewed at:
However, it does look like CDash supports downloading files uploaded to a test using the ATTACH_FILES ctest property. For example, for the trial build and submit shown at:
if you get the JSON from:
you see (pretty printed):
{
...
test: {
id: 8313,
buildid: 5522160,
build: "Linux-gnu-openmp-shared-dbg-pt",
buildstarttime: "2020-06-09 15:45:48",
site: "crf450.srn.sandia.gov",
siteid: "187",
test: "TrilinosBuildStats_Results",
time: " 50ms",
...
measurements: [
{
name: "Pass Reason",
type: "text/string",
value: "Required regular expression found.Regex=[OVERALL FINAL RESULT: TEST PASSED .TrilinosBuildStats_Results.<br />\n]"
},
{
name: "Processors",
type: "numeric/double",
value: "1"
},
{
name: "build_stats.csv",
type: "file",
fileid: 1,
value: ""
}
]
},
generationtime: 0.04
}
So it looks like you can get that data in Python (converted to a recursive list/dict data structure), loop over the dicts in data['test']['measurements'], and find the file as:
{
name: "build_stats.csv",
type: "file",
fileid: 1,
value: ""
}
That dict is data['test']['measurements'][2] in this case.
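For example, a small Python sketch for pulling out that measurement (the testDetails endpoint name and the use of the requests library are assumptions here, not confirmed CDash API details):

```python
# Hypothetical sketch: fetch the test JSON shown above and locate the
# attached build_stats.csv measurement to get its 'fileid'.
import requests

def find_build_stats_fileid(cdash_site, buildid, testid):
    url = "%s/api/v1/testDetails.php" % cdash_site  # assumed endpoint
    data = requests.get(url, params={"buildid": buildid, "testid": testid}).json()
    for m in data["test"]["measurements"]:
        if m.get("type") == "file" and m.get("name") == "build_stats.csv":
            return m["fileid"]
    return None
```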
Given the 'fileid' field value of '1', you can then download the data using the URL:
You can find your way to this test, for example, by knowing the CDash Group, Site, Build Name, and Build Start Time and plugging those into this query:
The JSON for that is shown at:
which has the element:
{
...
builds: [
{
testname: "TrilinosBuildStats_Results",
site: "crf450.srn.sandia.gov",
buildName: "Linux-gnu-openmp-shared-dbg-pt",
buildstarttime: "2020-06-09T15:47:22 MDT",
time: 0.05,
prettyTime: " 50ms",
details: "Completed\n",
siteLink: "viewSite.php?siteid=187",
buildSummaryLink: "build/5522163",
testDetailsLink: "test/18733864",
status: "Passed",
statusclass: "normal",
nprocs: 1,
procTime: 0.05,
prettyProcTime: " 50ms"
}
],
...
}
which has testDetailsLink: "test/18733864".
So there you have it. If you know the following fields:
you can find the test TrilinosBuildStats_Results for that build and download its attached 'build_stats.csv' file (as a tarred and zipped file 'build_stats.csv.tgz').
So we can do what we need by attaching the file to a test. But it is a bit of a run-around to find what we need.
It would be more straightforward to upload with ctest_upload() and then directly download the file from CDash. But, again, CDash does not currently support that, and the Trilinos PR ctest -S driver does not support it either.
So for now, I would suggest that we just go with uploading the 'build_stats.csv' file to the test TrilinosBuildStats_Results and then downloading it from there for any automated tools.
FYI: Further discussion about CDash upload and download options should occur in the newly created issue:
so we can get some direct help/advice from Kitware.
Ross, I think the summarize tool would be better (for maintenance/extensibility) implemented as a script that CMake calls (optionally promising to generate a file if needed).
If there was a dummy script, commonTools/build_stats/summarize_build_stats.py (or wherever), then you could build the CMake stuff now, and later changes would just need to fiddle with that file.
I can conceive how to implement summarize_build_stats.py as just plain bash. Since the file is CSV, you'd head -n1 to get the header and store that as an array variable. Then search the header for the indexes of the metrics you want. Next, you'd grep the file for FileName = packages/foo/. From that subset of the file, cut -fN, where N is the number from the header array. Pipe that through awk or bc to sum it up. Additionally, you could sort the subset matching the package and select the top file for each metric. This could be a fairly simple /bin/bash script. Python + CSV makes sense if you want complex analysis, but for just package summaries perhaps it would be easier via bash.
I do think there would be value in showing package-level aggregates:
Panzer: max max_resident_size = <max_resident_size> (<file-name>)
Panzer: max elapsed_time = <elapsed_time> (<file-name>)
Panzer: max file_size = <file_size> (<file-name>)
Panzer: Total Time
Panzer: Total Memory
Panzer: Total Size
All Files: Total Time (this is effectively the total build time)
All Files: Total Memory (to be consistent)
All Files: Total Size (roughly how much storage this build required)
Size in particular is helpful, as it indicates how much space the filesystem servers need.
All of the above can be implemented via bash, I think: just a few loops plus cut/grep/awk (standard tools that will always be present on machines).
Ross, I think the summarize tool would be better (for maintenance/extensibility) implemented as a script that CMake calls (optionally promising to generate a file if needed).
@jjellio, yes, that is exactly what I was suggesting.
I can conceive how to implement summarize_build_stats.py as just plain bash
Such a tool would be very hard to write, test, and maintain in bash. Do you have something against Python?
Just to get this started, I will add a simple Python script:
Trilinos/commonTools/build_stats/summarize_build_stats.py
that will just provide project-level stats:
Full Project: sum(max_resident_size_mb) = <sum_max_resident_size_mb> (<num-entries> entries)
Full Project: max(max_resident_size_mb) = <max_max_resident_size_mb> (<file-name>)
Full Project: max(elapsed_real_time_sec) = <max_elapsed_time_sec> (<file-name>)
Full Project: sum(elapsed_real_time_sec) = <sum_elapsed_time_sec> (<num-entries> entries)
Full Project: sum(file_size_mb) = <sum_file_size_mb> (<num-entries> entries)
Full Project: max(file_size_mb) = <max_file_size_mb> (<file-name>)
That will avoid needing to deal with the package logic for now. We can always add package-level stats later when we have the time (and that will require using some TriBITS utilities to convert from file paths to package names). That way, we can turn this on for PR testing now and merge PR #7508.
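For concreteness, here is a minimal sketch of what such a script might look like (assuming only the CSV column names shown above and a FileName column; the real script will differ):

```python
#!/usr/bin/env python
# Minimal sketch of summarize_build_stats.py: read build_stats.csv and print
# project-level sum/max stats for a few columns. Python 2.7/3.x compatible.
import csv
import sys

FIELDS = ["max_resident_size_mb", "elapsed_real_time_sec", "file_size_mb"]

def summarize(csv_file):
    with open(csv_file) as f:
        rows = list(csv.DictReader(f))
    for field in FIELDS:
        vals = [(float(r[field]), r["FileName"]) for r in rows]
        print("Full Project: sum(%s) = %.2f (%d entries)"
              % (field, sum(v for v, _ in vals), len(vals)))
        max_val, max_file = max(vals)
        print("Full Project: max(%s) = %.2f (%s)" % (field, max_val, max_file))

if __name__ == "__main__":
    summarize(sys.argv[1])
```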
Okay?
I have no issues with Python other than you have to be aware of 2.x vs 3.x stuff.
I love Python's regex library. Python 3.x with 'format strings' is awesome, e.g., f'A variable in scope: {some_var}'.
Tangents below:
Another issue to consider is how to interact with developers. I'll need to improve the webpage (better explanations and styling, for sure).
Yet another issue: can we use the info here to feed back into Ninja or CMake to improve our build system? This could be an interesting question for Kitware. E.g., if we could provide a list of targets (file.o things) plus a weight, could Kitware use that to orchestrate a good Ninja file? Or perhaps we could do that ourselves (I have already done something similar). Given weights + num_parallel_procs, coerce the existing build.ninja such that a certain memory high-water mark is avoided. (It's effectively a variant of the knapsack packing problem, I believe.)
@rmmilewi (CC Reed, this may be something he'd like to keep abreast of)
I love Python's regex library. Python 3.x with 'format strings' is awesome, e.g., f'A variable in scope: {some_var}'.
But we can't use that if we keep Python 2.7 support (which I think we need to as long as that is the default Python on RHEL7). But when it comes to programming with data structures, there is no contest between Python and bash; Python is the clear winner.
FWIW, for me, bash is better if the code you are writing is going to be loading modules and mostly just running commands, with very little logic otherwise. (Python can't really improve on that, and dealing with the env from Python is very messy and hacky.) But if you have any non-trivial logic or need any sophisticated data-structure manipulation, you are crazy to write that code in bash when Python is an option. (I have come to regret writing some code in bash recently that should have been written in Python.)
CC: @jjellio
FYI: Already getting some interesting stats after getting this running in the ATDM Trilinos builds as shown here:
The max_resident_size_mb varies from as large as 6047.13 (i.e. 6.0 GB) for the Trilinos-atdm-van1-tx2_arm-20.0_openmpi-4.0.2_openmp_static_dbg build shown here:
Full Project: sum(max_resident_size_mb) = 2290064.25 (10064 entries)
Full Project: max(max_resident_size_mb) = 6047.13 (packages/panzer/adapters-stk/example/main_driver/PanzerAdaptersSTK_main_driver.exe)
Full Project: sum(elapsed_real_time_sec) = 66346.24 (10064 entries)
Full Project: max(elapsed_real_time_sec) = 100.48 (packages/panzer/adapters-stk/example/main_driver/PanzerAdaptersSTK_main_driver.exe)
Full Project: sum(file_size_mb) = 135039.36 (10064 entries)
Full Project: max(file_size_mb) = 781.02 (packages/panzer/adapters-stk/example/main_driver/PanzerAdaptersSTK_main_driver.exe)
to as little as 1776.56 (i.e. 1.8 GB) for the Trilinos-atdm-van1-tx2_arm-20.0_openmpi-4.0.2_openmp_static_opt build shown here:
Full Project: sum(max_resident_size_mb) = 1494626.1 (10059 entries)
Full Project: max(max_resident_size_mb) = 1776.56 (packages/panzer/mini-em/example/BlockPrec/CMakeFiles/PanzerMiniEM_BlockPrec.dir/main.cpp.o)
Full Project: sum(elapsed_real_time_sec) = 100029.03 (10059 entries)
Full Project: max(elapsed_real_time_sec) = 378.77 (packages/zoltan2/test/driver/CMakeFiles/Zoltan2_test_driver.dir/test_driver.cpp.o)
Full Project: sum(file_size_mb) = 24429.11 (10059 entries)
Full Project: max(file_size_mb) = 119.06 (packages/panzer/adapters-stk/example/main_driver/PanzerAdaptersSTK_main_driver.exe)
The large memory usage for debug builds on 'van1-tx2' (ASTRA platform 'stria') explains why I had to turn down the parallel build level (in PR #7298).
And note that the above does not yet include the stats for building static libraries using ar.
From: Bartlett, Roscoe A
Sent: Wednesday, September 9, 2020 2:50 PM
To: Elliott, James John <jjellio@sandia.gov>
Subject: RE: Trilinos Build Stats wrappers?
Hello James,
I would like to merge what is in:
https://github.com/trilinos/Trilinos/pull/7508
so we don't lose it and it does not get conflicts with other code.
Any objections to that?
Then we can address issues with AR, Intel Fortran, and other topics in future PRs, as listed in:
https://github.com/trilinos/Trilinos/pull/7508
-Ross
@jjellio, with the merge of #7508, we should meet to discuss what has been done up to this point and what needs to be done as listed above in the Tasks section in order for this to be able to handle all of our builds and just be more robust.
FYI: Just met with @jjellio on this. He is going to create a new branch off of 'develop' and re-type his changes to support Intel Fortran and AR there and create a new PR for that. I will then review it and help test it out with the PR builds and the ATDM Trilinos builds so we can turn on these build stats wrappers in every PR and ATDM Trilinos build. If it works in all of those builds, it should work just about anywhere.
Some good ideas that came out of my meeting with @jjellio:
- Break down the stats reported by summarize_build_stats.py into libraries (i.e. *.a or *.so.* files), executables (i.e. *.exe) and other files (mostly *.o object files). That way, we can see where the big *.o files are coming from that are creating big *.a files and therefore big *.exe files.
- Add a build target summarize-build-stats (not attached to ALL) that will call summarize_build_stats.py with the right arguments and write a file summarized_build_stats.txt (or even a JSON version of this?). That would make it super easy for developers to get summaries of build stats while doing local development.
- Change the gathering of the *.timing files into the build_stats.csv file to instead merge by the column names so it is robust to changes in the set (and location) of fields in the *.timing files written by older versions of the magic_wrapper.py tool.
- Add Trilinos_<LANG>_BUILD_STATS_COMPILER_WRAPPER to TrilinosConfig.cmake so that downstream customers could set those to CMAKE_<LANG>_COMPILER instead of the original unwrapped compilers listed in Trilinos_<LANG>_COMPILER.
- Query the TrilinosBuildStats_Results test output across builds and sort, plot, or write the numbers into a *.json file, etc. (This is a cheaper way to get build summaries for many builds than having to download the entire build_stats.csv file for each of these builds.)
- Download a build's build_stats.csv file and display the results.

Hello @jjellio, any progress on:
FYI: Just met with @jjellio on this. He is going to create a new branch off of 'develop' and re-type his changes to support Intel Fortran and AR there and create a new PR for that.
from our last meeting with notes above? Hopefully that is not too much work. Otherwise, can you point me to your other branch and I can see if I can copy over the changes to get this working? I would like to wrap up this initial story to get build stats wrappers enabled in every ATDM and PR Trilinos build. After that, we can create a new GitHub (epic) issue for the follow-on tasks listed above.
@jjellio and I just met with Elliot Ridgeway about how this build stats work might mesh with Watcher. We will meet again in about 6 weeks. Two immediate things came out of that meeting that we should fix ASAP:

- Strip the leading ./ from the FileName field in the *.timing files and when gathering up the build_stats.csv file. This is needed to make the site https://jjellio.github.io/build_stats/ work better.

@jjellio, I updated and reorganized the list of Tasks above to better reflect the current state of this Story based on my understanding of things. A couple of things that I changed:

- Removed the mention of unit tests for magic_wrapper.py, because no such tests exist in PR #8638 (at least not given the classic definition of "unit tests").
- Removed the build target summarize-build-stats, since that target does not exist (not even in PR #8638).
- Removed the task about stripping the leading ./ from the beginning of the FileName field in the *.timing files, since that is now handled by the gather_build_stats.py tool.

Hopefully you agree with these updates.
Let's get PR jjellio/Trilinos#1 merged then PR #8638 tested and merged (and hopefully turned on in all ATDM Trilinos and Trilinos PR builds) and then we can discuss next steps.
@jjellio, following up from my comment https://github.com/trilinos/Trilinos/pull/8638#issuecomment-825221154 where I said:
If these build stats wrappers were used in TriBITS ...
actually, these build stats wrappers and the supporting CMake and Python code are more generic than Trilinos or TriBITS. These should really live in their own GitHub repo, and then that repo should get snapshotted into TriBITS/tribits/build_stats and then used and tested in TriBITS, and therefore be able to be used and tested in Trilinos. That might take some work, but it seems like the right thing to do, and then we could provide very strong automated tests for this functionality in TriBITS (which would provide more confidence in these tools and make them easier to develop further and maintain).
Just something to think about.
CC: @jjellio
A glorious day. PR #8638 has finally been merged! This turns on the build stats wrappers in all of the ATDM Trilinos builds (when running the ctest -S driver) and in all of the Trilinos PR builds.
We can see 141 submissions of the test TrilinosBuildStats_Results in the ATDM Trilinos builds showing the new gather script gather_build_stats.py in this query.
And, looking at this query, we are starting to see new PRs running this.
We need to keep an eye on the PR builds for a few days.
It would be nice to break the build-stats summary reported in the TrilinosBuildStats_Results test into libraries, executables, and object files separately before we close this. But that could really be a separate story.
CC: @prwolfe, @jwillenbring
@jjellio, it occurred to me that adding begin and end time stamp fields to the *.timing file generated by magic_wrapper.py could help to debug out-of-memory problems like the one reported in #9432. If you know the start and end times for when each target gets built, and you know the max RAM usage for each target, you can compute, at any moment in time, the max possible RAM being used on the machine, and you will know which targets are involved. That will tell you where to put in effort to reduce the RAM usage for building specific targets and get around a build bottleneck that consumes all the RAM.
Having the build start and end time stamps has other uses as well. For example, when doing a rebuild with old targets lying around, if you only want to report build stats for targets that got built in the rebuild, you could add an argument summarize_build_stats.py --after=<start-time> that filters to targets built only after the start of the last rebuild <start-time> (which you know at configure time and can put into the definition of the test). This would also automatically filter out build stats for targets that no longer exist in the build system from rebuilds months (or years) old. It may serve other purposes that I don't even realize yet, but these are the obvious ones.
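A sketch of that filter (assuming a hypothetical start_time column of ISO-8601 timestamps; no such field exists in the *.timing files yet):

```python
# Hypothetical --after filter: keep only rows whose (assumed) start_time
# field is at or after the given timestamp.
from datetime import datetime

ISO_FMT = "%Y-%m-%dT%H:%M:%S"

def filter_after(rows, after_str):
    after = datetime.strptime(after_str, ISO_FMT)
    return [r for r in rows
            if datetime.strptime(r["start_time"], ISO_FMT) >= after]
```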
CC: @jjellio, @jwillenbring
So I have a Trilinos PR #9894 that is stuck in a loop of failed builds due to the compiler crashing after running out of memory. Following on from the discussion above, it occurred to me that if you store the beginning and end time stamps for each target in the *.timing file, then the summarize_build_stats.py tool can sort the build stats by start and end time and compute the memory high-water mark on the machine due to building at any given time. For example, if 10 object files are currently being built, then you just add up max_resident_size_mb for each of these targets and that gives you the high-water mark at that time for that build as the build stat:
Full Project: max_sum_over_active_targets(max_resident_size_mb)
This would show how close a build is to running out of memory on a given machine, and we could plot that number as a function of time. In fact, we could have the CTest test that runs summarize_build_stats.py create CTest test measurements for:
Full Project: max_sum_over_active_targets(max_resident_size_mb)
Full Project: sum(max_resident_size_mb)
Full Project: max(max_resident_size_mb)
Full Project: sum(elapsed_real_time_sec)
Full Project: max(elapsed_real_time_sec)
Full Project: sum(file_size_mb)
Full Project: max(file_size_mb)
using XML in the STDOUT like:
<DartMeasurement type="numeric/double" name="Full Project: sum(max_resident_size_mb)">4667989.73</DartMeasurement>
Then you could see a graph of these measurements over time right on CDash!
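A sketch of how that stat could be computed and reported (assuming hypothetical start_time/end_time epoch-seconds fields in each row; this is the standard interval sweep, not existing tool code):

```python
# Hypothetical sweep over target build intervals: the peak of the running
# sum of max_resident_size_mb over all concurrently building targets.
def max_sum_over_active_targets(rows):
    events = []
    for r in rows:
        rss = float(r["max_resident_size_mb"])
        events.append((float(r["start_time"]), rss))   # target starts building
        events.append((float(r["end_time"]), -rss))    # target finishes
    events.sort()  # ties sort ends (negative delta) before starts
    current = peak = 0.0
    for _, delta in events:
        current += delta
        peak = max(peak, current)
    return peak

def report_measurement(name, value):
    # CTest scans test STDOUT for <DartMeasurement> XML and posts it to CDash.
    print('<DartMeasurement type="numeric/double" name="%s">%s</DartMeasurement>'
          % (name, value))
```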
This issue has had no activity for 365 days and is marked for closure. It will be closed after an additional 30 days of inactivity.
If you would like to keep this issue open, please add a comment and/or remove the MARKED_FOR_CLOSURE label.
If this issue should be kept open even with no activity beyond the time limits, you can add the label DO_NOT_AUTOCLOSE.
If it is ok for this issue to be closed, feel free to go ahead and close it. Please do not add any comments or change any labels or otherwise touch this issue unless your intention is to reset the inactivity counter for an additional year.
FYI: Kitware is adding build stats support to native CMake, CTest, and CDash. See:
Therefore, I think there will be no need for a separate compiler wrapper tool to gather build stats or scripts to manage that and submit it to CDash.
This Issue is for tracking any effects from a PR I am submitting with tools for collecting very detailed build statistics, which seem to have near-zero cost. A goal of this work is to enable package/product owners to understand how their packages impact compile time, memory usage, and file size (among other metrics).
The impact of using the tool seems to be zero. That is, the tool's overhead sits entirely inside the noise of the build. I did two builds on Rzansel, both using Cuda + Serial, which are the standard ATDM Trilinos settings (I actually built EMPIRE using the tool as well).
Updated Data (for the more complicated python wrappers + NM usage)
NM ON
NM OFF
Clearly using Python has a price... but it is still pretty tiny. A pass through the code for efficiency is planned (maybe moving to in-memory files for the temporaries).
Path forward
The scripts work by wrapping '$MPICC' inside the ATDM Trilinos env. CMake then uses these 'wrapped' compilers. The wrapped compilers emit copious data in the build tree alongside the object file/library/executable that is created. After the build is complete, these 'timing' files are aggregated into one massive CSV file. On Rzansel, the CSV is about 1.8 MB, and it has one line per thing built.
To prevent the wrappers from tampering with CMake's configure phase, I've added a single line to CMakeLists.txt which sets an ENV variable, CMAKE_IS_IN_CONFIGURE_MODE; this allows the wrappers to toggle on/off based on whether a real build is happening versus the configuration phase.
One idea for making this work is to have CTest post the resulting build statistics files directly to CDash along with any testing data. For customers not posting, I can help provide a script that will aggregate the data manually.
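A minimal sketch of that toggle inside a wrapper (the REAL_COMPILER variable name and the .timing format here are illustrative, not the actual magic_wrapper.py implementation):

```python
# Hypothetical compiler wrapper: pass through untouched during configure,
# otherwise time the real compiler and drop a .timing file next to the output.
import os
import subprocess
import sys
import time

def main():
    cmd = [os.environ["REAL_COMPILER"]] + sys.argv[1:]  # illustrative env var
    if os.environ.get("CMAKE_IS_IN_CONFIGURE_MODE"):
        sys.exit(subprocess.call(cmd))  # configure phase: no stats gathering
    t0 = time.time()
    rc = subprocess.call(cmd)
    if "-o" in sys.argv and rc == 0:
        out = sys.argv[sys.argv.index("-o") + 1]
        with open(out + ".timing", "w") as f:  # real wrapper records much more
            f.write("FileName,elapsed_real_time_sec\n%s,%.2f\n"
                    % (out, time.time() - t0))
    sys.exit(rc)

if __name__ == "__main__":
    main()
```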
Once the CSV data is posted, others can develop tools for tracking this over time. I also have some tools that operate directly on the files (using JavaScript).
The tool tracks:
This issue is also tracked in CDOFA-119.
@bartlettroscoe @jwillenbring
Links
Related to
Tasks:
- Set Trilinos_ENABLE_BUILD_STATS=ON in all of the Trilinos PR builds?
- Fix the magic_wrapper.py script for the broken intel builds (see https://github.com/trilinos/Trilinos/pull/7508#issuecomment-646680938 and merged PR #8638).
- Extend the magic_wrapper.py script to wrap AR so that we can get build statistics for static libraries as well (and put in hooks in the Trilinos CMake build system to use this for static builds; see merged PR #8638).
- Add find_program commands for tools being wrapped if they are undefined (applies to AR/Ranlib/LD); see merged PR #8638.
- Change the gathering of the *.timing files into the build_stats.csv file to instead merge by the column names so it is robust to changes in the set (and location) of fields in the *.timing files written by older versions of the magic_wrapper.py tool (a sketch of this merge appears after this list). See merged PR #8638.
- export Trilinos_ENABLE_BUILD_STATS=ON for all of the remaining ATDM Trilinos builds (ctest -S driver only by default). See merged PR #8638.
- Break down the stats reported by summarize_build_stats.py into libraries (i.e. *.a or *.so.* files), executables (i.e. *.exe) and other files (mostly *.o object files). That way, we can see where the big *.o files are coming from that are creating big *.a and .so files and therefore big *.exe files (and where the greatest expense lies). Also, this will allow you to see the expense of building just libraries compared to building tests and examples to some extent.
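Below is a rough sketch of the merge-by-column-names gathering mentioned in the tasks above (the real gather_build_stats.py may differ): each row keeps its values under the right header even when older wrapper versions wrote fewer or reordered fields.

```python
# Hypothetical column-name-based gather of *.timing CSV fragments into one
# build_stats.csv: the union of all column names becomes the output header.
import csv

def gather(timing_files, out_csv):
    rows, all_fields = [], []
    for path in timing_files:
        with open(path) as f:
            for row in csv.DictReader(f):
                rows.append(row)
                for name in row:
                    if name not in all_fields:
                        all_fields.append(name)
    with open(out_csv, "w") as f:
        writer = csv.DictWriter(f, fieldnames=all_fields, restval="")
        writer.writeheader()
        writer.writerows(rows)

# Usage (paths illustrative):
# import glob
# gather(glob.glob("*.timing"), "build_stats.csv")
```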