Numaprof

What is it?

Numaprof is a NUMA memory profiler. The idea is to instrument the read/write operations in the application and check the NUMA location of the thread at access time, comparing it to the memory location of the data being accessed.

The tool is currently based on Pintool, a dynamic instrumentation framework from Intel offering roughly the same service as Valgrind, but with thread support, which makes it faster for parallel applications.

You can find more details and screenshots on the dedicated website: https://memtt.github.io/numaprof/.

NUMAPROF GUI

Metrics

Numaprof extracts the following metrics per call site and per malloc call site:

Dependencies

NUMAPROF needs:

If you use the git repo (e.g. the master branch) instead of a release archive:

If you don't have npm and pip on your server, prefer the release archive, which already contains all the required libraries and does not depend on those two commands.

Install

First download the latest version of pintool (tested: 3.24 on the x86_64 arch: https://software.intel.com/en-us/articles/pin-a-binary-instrumentation-tool-downloads) and extract it somewhere. TAKE CARE, PINTOOL IS NOT OPEN-SOURCE AND IS FREE ONLY FOR NON-COMMERCIAL USE.
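
For example, a minimal sketch assuming the downloaded archive is named pin-3.24-linux.tar.gz and is extracted under $HOME/tools (both names are hypothetical; adapt them to the file you actually downloaded):

mkdir -p $HOME/tools
tar -xf pin-3.24-linux.tar.gz -C $HOME/tools
# the extracted directory is what you pass later as PINTOOL_PATH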

Then use the configure script:

mkdir build
cd build
../configure --prefix=PREFIX --with-pintool=PINTOOL_PATH
make
make install

For those who prefer cmake: the configure script is just a wrapper providing autotools-like semantics and --help. You can of course call cmake directly instead. Note that the script prints the underlying cmake command if you use the --show option.
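
For example, to display the cmake command corresponding to the configuration above (reusing the --show option just mentioned), a quick sketch:

../configure --show --prefix=PREFIX --with-pintool=PINTOOL_PATH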

Usage

Set up your paths (you can also use absolute paths if you don't want to change your environment):

export PATH=PREFIX/bin:$PATH

Run your program using the wrapper:

numaprof ./benchmark --my-option

The numaprof GUI is based on a web server and can be viewed in the browser at http://localhost:8080. The GUI password is currently fixed to admin/admin. You can launch the web server by running:

numaprof-webview numaprof-1234.json

The first time you launch the GUI, you will need to provide a user/password to secure the interface. You can change the password or add other users by using:

numaprof-passwd {USER}

The users are stored in ~/.numaprof/htpasswd, following the htpasswd format.

If you run the webview on a remote node, you can forward the HTTP session to your local browser by using:

ssh myhost -L8080:localhost:8080

If you have Qt5-WebKit installed, you can also automatically open a browser view (over SSH X-forwarding if needed) by using:

numaprof-qt5 numaprof-1234.json

MPI Support

If you want to profile an MPI application, you will get one profile per process, so at least one per rank.

In order to name the files with the MPI rank instead of the PID, you can add the option:

mpirun -np 16 numaprof --mpi ./my_program
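
Presumably, the same renaming can be requested when calling numaprof-pintool directly by setting the mpi:useRank entry (see the options below); this is an unverified sketch based on the -o override syntax:

mpirun -np 16 numaprof-pintool -o mpi:useRank=true ./my_program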

Kcachegrind compatibility

If you want to generate a callgrind-compatible output, use:

numaprof-to-callgrind numaprof-45689.json

Then you can open the callgrind file with kcachegrind (http://kcachegrind.sourceforge.net/html/Home.html):

kcachegrind numaprof-12345.callgrind

Available options

Here is the config file which can be given to numaprof-pintool using the -c FILE option. You can also override specific entries by using -o SECTION:NAME=value,SECTION2:NAME2=value2 (an example is given after the listing below).

[output]
name=numaprof-%1-%2.%3
indent=true
json=true
dumpConfig=false
silent=false
removeSmall=false
removeRatio=0.5

[core]
skipStackAccesses=true
threadCacheEntries=512
objectCodePinned=false
skipBinaries=
accessBatchSize=0

[info]
hidden=false

[cache]
;can be 'dummy' or 'L1' or 'L1_static'
type=dummy
size=32K
associativity=8

[mpi]
useRank=false
rankVar=auto

[emulate]
numa=-1
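
For example, assuming the options above are saved in a file named my-numaprof.ini (a hypothetical name), you can combine the config file with a command-line override:

numaprof-pintool -c my-numaprof.ini -o output:indent=false ./benchmark --my-option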

On huge applications

NUMAPROF has not yet been tested on multi-million-line applications, so we expect some slowdown on such big codes, but it should still be able to work. The web GUI, however, might lag due to the amount of data. In this case, enable the filtering option at profiling time to remove all entries smaller than 0.2% from the output profile:

numaprof-pintool -o output:removeSmall=true,output:removeRatio=0.2 ./benchmark --my-option

View on another machine

If you want to view the NUMAPROF profile on another machine than the one you profiled on, you can copy the JSON file and open it there. Ideally the sources need to be placed at the same path as on the machine where you profiled.

If this is not the case, you can use the override option of the GUI to redirect some directories:

numaprof-webview -o /home/my_server_user/server_path/project:/home/my_local_user/local_path/project ./numaprof-1234.json

numactl

If you want to profile an application while using the numactl tool to set up the memory binding, you need to use the command line in the given order:

numactl {OPTIONS} numaprof-pintool ./MY_APP

Cache simulation

NUMAPROF reports all the memory accesses to account them as local/remote/MCDRAM. But this is biased compared to reality, as your processor has CPU caches which greatly reduce the number of accesses reaching the RAM. If you want to take this into account, a small cache simulation infrastructure is currently embedded into NUMAPROF. It currently only provides one L1 cache per thread (32K by default) with an LRU replacement policy. This does not match the multi-level and shared caches of current architectures, but it can be used, for example, to eliminate spinlocks and accesses to global variables from the profile, as they will for sure end up in the cache.

Caution, this is currently an experimental feature.

You can enable it with a command-line option and optionally change its size using the standard way of overriding config file options via the command line (or provide a config file):

numaprof-pintool --cache L1 -o cache:size=32K -o cache:associativity=8 {YOUR_APP}

Not having a NUMA server for dev

If you want to test the NUMA behavior of your application without having a NUMA server at hand, you can use the emulate:numa option to make numaprof run as it would on a NUMA server.

In the option, give the desired number of NUMA nodes to emulate; the cores of the machine will be distributed over the requested NUMA domains, so the core count should ideally be a multiple of this value.
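
For example, to emulate 4 NUMA nodes (an arbitrary illustrative value) using the standard config override syntax:

numaprof-pintool -o emulate:numa=4 ./benchmark --my-option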

Notice that in this case NUMAPROF provides a purely theoretical view, not fetching any NUMA information from the OS as it does for a normal run.

Pointers

If you are looking for pointers to similar tools or interesting related papers, you can refer to the docs/bibliography.md file.

License

Numaprof is distributed under the CeCILL-C license, which is LGPL-compatible. Take care: NUMAPROF currently strongly depends on Intel Pintool, which is free only for non-commercial use.

I would like to make a port to DynamoRIO to avoid this, if someone wants to help!

Discussion

You can join the Google group to exchange ideas and ask questions: https://groups.google.com/forum/#!forum/memtt-numaprof.