shacklettbp / madrona


Profiling in Madrona #20

Closed. ShawnshanksGui closed this issue 10 months ago.

ShawnshanksGui commented 10 months ago

Hi, how can I do up-front profiling in Madrona? Which script can I use?

shacklettbp commented 10 months ago

It can be done with the scripts/profile.py script in this repository. You need to write a Python script that profiles your environment (takes some number of steps with random actions, for example) and pass it to scripts/profile.py. Once complete, the script will generate a JSON file, which you then use by setting the MADRONA_MWGPU_EXEC_CONFIG_FILE=path_to_json environment variable when running the environment.
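As a rough illustration of what that driver script can look like, here is a minimal sketch that just takes random steps. It is not a script from this repo, and the module, SimManager constructor arguments, and tensor accessors below are assumptions modeled on madrona_escape_room's Python bindings, so treat them as placeholders for whatever your environment actually exposes:

# Illustrative sketch only: a minimal random-action driver to hand to scripts/profile.py.
# The names below (madrona_escape_room, SimManager, action_tensor, step) are assumptions
# and may need adjusting for your environment's bindings.
import torch
import madrona_escape_room as mer

NUM_WORLDS = 8192
NUM_STEPS = 20

sim = mer.SimManager(
    exec_mode = mer.madrona.ExecMode.CUDA,  # up-front profiling targets the GPU backend
    gpu_id = 0,
    num_worlds = NUM_WORLDS,
    rand_seed = 5,
    auto_reset = True,
)

actions = sim.action_tensor().to_torch()

for _ in range(NUM_STEPS):
    # Random actions are fine here; the goal is just to exercise every ECS system.
    actions.copy_(torch.randint_like(actions, low=0, high=2))
    sim.step()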

This process is very rough and unpolished right now, I don't really recommend using it. Making it more plug and play is (somewhere) on our TODO list.

ShawnshanksGui commented 10 months ago

Thanks, I have tried to use it on several benchmarks, but errors occurred.

For gpu_hideseek:

gf@lm1:~/projects/madrona/test/gpu_hideseek/external/madrona$ python scripts/profile.py /home/gf/miniconda3/envs/madrona/bin/python ../../scripts/benchmark.py

Error at /home/gf/projects/madrona/test/gpu_hideseek/external/madrona/src/mw/cuda_exec.cpp:1044 in madrona::GPUKernels madrona::buildKernels(const madrona::CompileConfig &, Span, madrona::ExecutorMode, int32_t, std::pair<int, int>)
CUDA_ERROR_NOT_FOUND: named symbol not found
Aborted (core dumped)
Traceback (most recent call last):
  File "/home/gf/projects/madrona/test/gpu_hideseek/external/madrona/scripts/profile.py", line 107, in <module>
    parse_traces()
  File "/home/gf/projects/madrona/test/gpu_hideseek/external/madrona/scripts/profile.py", line 51, in parse_traces
    with open(path_to_trace, 'rb') as f:
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/profile_1_block_madrona_device_tracing.bin'


For madrona_escape_room:

$ python ./external/madrona/scripts/profile.py /home/gf/miniconda3/envs/madrona/bin/python ./scripts/train.py

Error at /home/gf/projects/madrona/madrona_escape_room/external/madrona/src/mw/cuda_exec.cpp:1044 in madrona::GPUKernels madrona::buildKernels(const madrona::CompileConfig &, Span, madrona::ExecutorMode, int32_t, std::pair<int, int>)
CUDA_ERROR_NOT_FOUND: named symbol not found
Aborted (core dumped)
Traceback (most recent call last):
  File "/home/gf/projects/madrona/madrona_escape_room/./external/madrona/scripts/profile.py", line 121, in <module>
    parse_traces()
  File "/home/gf/projects/madrona/madrona_escape_room/./external/madrona/scripts/profile.py", line 65, in parse_traces
    with open(path_to_trace, 'rb') as f:
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/profile_1_block_madrona_device_tracing.bin'

For madrona_simple_example:

gf@lm1:~/projects/madrona/test/madrona_simple_example/external/madrona$ python scripts/profile.py /home/gf/miniconda3/envs/madrona/bin/python ../../scripts/run.py

Traceback (most recent call last):
  File "/home/gf/projects/madrona/test/madrona_simple_example/external/madrona/scripts/profile.py", line 121, in <module>
    parse_traces()
  File "/home/gf/projects/madrona/test/madrona_simple_example/external/madrona/scripts/profile.py", line 65, in parse_traces
    with open(path_to_trace, 'rb') as f:
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/profile_1_block_madrona_device_tracing.bin'

ShawnshanksGui commented 10 months ago

Also, if I don't use this tool, how can I know whether my self-defined system is executed on the GPU rather than the CPU, and where the bottleneck of the whole simulator is? Thanks!

ShawnshanksGui commented 10 months ago

Sorry for the disorder: the third error log above is from the madrona_simple_example simulator.

shacklettbp commented 10 months ago

Whether or not code runs on the CPU or GPU is COMPLETELY unrelated to this setting. If the simulator is executing under MWCudaExecutor (search in mgr.cpp in one of the simulators), the ENTIRE simulation (every ECS system) will run on the GPU. In other words, for hide and seek and madrona escape room, if you pass exec_mode = madrona::ExecMode::CUDA when creating the Manager, you're getting execution fully on the GPU. Madrona does not currently support heterogeneous CPU-GPU execution of the core simulation. You either run fully on the GPU or fully on the CPU.

With regard to the errors you posted: you also need to pass -DMADRONA_ENABLE_TRACING=ON to cmake when building the simulator for up-front profiling to work. Note that this option has a negative performance impact, so you don't want it enabled for normal training runs. Up-front profiling exists exclusively to make the GPU backend more performant, but as I said above I wouldn't worry about it (the backend will likely be very fast without it as well).

ShawnshanksGui commented 10 months ago

Thanks, I already ran cmake with -DMADRONA_ENABLE_TRACING=ON, but it still does not work. The errors are the following:

For madrona_simple_example:

0 events were logged in total
At the end, complete traces for -1 steps are generated
Traceback (most recent call last):
  File "/home/gf/projects/madrona/test/madrona_simple_example/./external/madrona/scripts/profile.py", line 121, in <module>
    parse_traces()
  File "/home/gf/projects/madrona/test/madrona_simple_example/./external/madrona/scripts/profile.py", line 70, in parse_traces
    assert len(log_steps) > BASE_STEP
AssertionError

"madrona_escape_room" Error at /home/guifei/projects/madrona/madrona_escape_room/external/madrona/src/mw/cuda_exec.cpp:1044 in madronels madrona::buildKernels(const madrona::CompileConfig &, Span, madrona::Exe int32_t, std::pair<int, int>) CUDA_ERROR_NOT_FOUND: named symbol not found Aborted (core dumped) 0 events were logged in total At the end, complete traces for -1 steps are generated Traceback (most recent call last): File "/home/guifei/projects/madrona/madrona_escape_room/external/madrona/scripts/profile.py", line 120, in < parse_traces() File "/home/guifei/projects/madrona/madrona_escape_room/external/madrona/scripts/profile.py", line 69, in pa assert len(log_steps) > BASE_STEP AssertionError

Can you give a concrete example that can be run normally without any modification? Thanks.

ShawnshanksGui commented 10 months ago

By the way, can users use Nsight Compute to profile their implemented simulator? If so, how?

shacklettbp commented 10 months ago

OK, I finally looked into this, and there was a change that broke this feature in Madrona a while back; sorry about that! I've just gone and fixed it, along with cleaning up the scripts/profile.py script a little bit (now you can just pass it a command with any arguments you want).

Here are step-by-step instructions; start from the latest main of madrona_escape_room with all submodules updated.

cd madrona_escape_room
mkdir build
cd build
cmake .. -DMADRONA_ENABLE_TRACING=ON
make -j # num cores
cd ..
# Collect the profiling info! This script will take a LONG time because the data analysis code is written in python :(
python external/madrona/scripts/profile.py python scripts/sim_bench.py --num-worlds 8192 --num-steps 20

At the end of this process, a file will have been created in /tmp: /tmp/profile_blocks__megakernel_events/node_blocks.json. This file contains the megakernel configuration that has been chosen based on the profiling. To use it, you need to pass a couple environment variables to train.py (or any other script where you want to run with the optimized config):

MADRONA_MWGPU_EXEC_CONFIG_FILE=/tmp/profile_blocks__megakernel_events/node_blocks.json MADRONA_MWGPU_ENABLE_PGO=1 scripts/train.py # normal train.py arguments

For maximum performance, I recommend saving the node_blocks.json file somewhere and then recompiling the entire system without MADRONA_ENABLE_TRACING=ON. Even without tracing you can still use the json config file with those environment variables to get a speedup. Be warned the total speedup is pretty low on my 3090 for escape room.

To answer your other questions:

You can use nsight just like any other application. Run the python executable under nsight with arguments set to execute the train.py script as normal (or the sim_bench.py script I refer to above and have just added to escape room to isolate only Madrona simulation). Unfortunately, nsight is not super useful because all the code gets compiled together into a single megakernel. You can get some information out of nsight by seeing where hotspots are in the resulting megakernel, but it's very hard to determine relative costs of each ECS system, unfortunately.

Because of this, the MADRONA_ENABLE_TRACING infrastructure can be used not just for profile-guided optimization but (more importantly, actually!) also to profile the GPU performance of each ECS system. This is how we created Figure 5 in the paper, for example.

Basically when MADRONA_ENABLE_TRACING is on, every time you run your simulator a file will be created in /tmp, along the lines of /tmp/386282_madrona_device_tracing.bin (the number is the PID of the process). You can pass this file to a script in the madrona repo: scripts/parse_device_tracing.py:

python scripts/sim_bench.py --num-worlds 8192 --num-steps 20
# Check in /tmp, find device_tracing file
python external/madrona/scripts/parse_device_tracing.py --trace_file /tmp/401362_madrona_device_tracing.bin

This will create a new directory, /tmp/401362_madrona_device_tracing.bin_megakernel_events/, which will contain a number of PNGs showing the profiling information for individual steps (for example, /tmp/401362_madrona_device_tracing.bin_megakernel_events/step10.png shows the execution breakdown of the 10th step):

[example step10.png image]

These PNGs have the same structure as Figure 5 in the paper, so refer there for a more involved explanation. In short, the horizontal axis is time and the vertical axis is GPU SMs. Each solid color corresponds to a single ECS system. At the bottom of each colored region there is some text that gives information on the top 16 most expensive ECS systems: taskgraph function ID, time elapsed, and percent of execution time. The important detail is the function ID, drawn as f: 31, for example, in the PNG. Unfortunately, we don't have an automated way to map from function ID to the text name of the ECS system in C++ code. To figure out the mapping between IDs and your ECS systems, you need to run Madrona with a special environment variable:

MADRONA_MWGPU_PRINT_FUNC_IDS=1 python scripts/sim_bench.py --num-worlds 8192 --num-steps 20

This will print out a bunch of text that looks like this:

uint32_t _ZN7madrona5mwGPU14UserFuncIDBaseINS_21CustomParallelForNodeINS_7ContextEXadL_ZNS_4phys6solver13setVelocitiesERS3_RKNS_4base8PositionERKNS7_8RotationERKNS5_16SubstepPrevStateERNS4_8VelocityEEELi1ELi1EJS8_SB_SE_SH_EEEXadL_ZNSJ_3runEiEEE2idE = 27;
uint32_t _ZN7madrona5mwGPU14UserFuncIDBaseINS_21CustomParallelForNodeINS_7ContextEXadL_ZNS_4phys6solver15solveVelocitiesERS3_RNS4_10SolverDataEEELi1ELi1EJS7_EEEXadL_ZNS_9TaskGraph7Builder19dynamicCountWrapperIS9_EEvPT_iEEE2idE = 28;
uint32_t _ZN7madrona5mwGPU14UserFuncIDBaseINS_21CustomParallelForNodeINS_7ContextEXadL_ZNS_4phys6solver15solveVelocitiesERS3_RNS4_10SolverDataEEELi1ELi1EJS7_EEEXadL_ZNS9_3runEiEEE2idE = 29;
uint32_t _ZN7madrona5mwGPU14UserFuncIDBaseINS_21CustomParallelForNodeINS_7ContextEXadL_ZNS_4phys11narrowphase20runNarrowphaseSystemERS3_RKNS4_18CandidateCollisionEEELi1ELi1EJS7_EEEXadL_ZNS_9TaskGraph7Builder19dynamicCountWrapperISA_EEvPT_iEEE2idE = 30;
uint32_t _ZN7madrona5mwGPU14UserFuncIDBaseINS_21CustomParallelForNodeINS_7ContextEXadL_ZNS_4phys11narrowphase20runNarrowphaseSystemERS3_RKNS4_18CandidateCollisionEEELi1ELi1EJS7_EEEXadL_ZNSA_3runEiEEE2idE = 31;
uint32_t _ZN7madrona5mwGPU14UserFuncIDBaseINS_21CustomParallelForNodeINS_7ContextEXadL_ZNS_4phys10broadphase24updateLeafPositionsEntryERS3_RKNS5_6LeafIDERKNS_4base8PositionERKNSA_8RotationERKNSA_5ScaleERKNSA_8ObjectIDERKNS4_8VelocityEEELi1ELi1EJS7_SB_SE_SH_SK_SN_EEEXadL_ZNS_9TaskGraph7Builder19dynamicCountWrapperISQ_EEvPT_iEEE2idE = 32;

You can make it a bit more readable by piping it through the c++filt program:

uint32_t madrona::mwGPU::UserFuncIDBase<madrona::CustomParallelForNode<madrona::Context, &madrona::phys::solver::setVelocities, 1, 1, madrona::base::Position, madrona::base::Rotation, madrona::phys::solver::SubstepPrevState, madrona::phys::Velocity>, &madrona::CustomParallelForNode<madrona::Context, &madrona::phys::solver::setVelocities, 1, 1, madrona::base::Position, madrona::base::Rotation, madrona::phys::solver::SubstepPrevState, madrona::phys::Velocity>::run>::id = 27;
uint32_t madrona::mwGPU::UserFuncIDBase<madrona::CustomParallelForNode<madrona::Context, &madrona::phys::solver::solveVelocities, 1, 1, madrona::phys::SolverData>, &(void madrona::TaskGraph::Builder::dynamicCountWrapper<madrona::CustomParallelForNode<madrona::Context, &madrona::phys::solver::solveVelocities, 1, 1, madrona::phys::SolverData> >(madrona::CustomParallelForNode<madrona::Context, &madrona::phys::solver::solveVelocities, 1, 1, madrona::phys::SolverData>*, int))>::id = 28;
uint32_t madrona::mwGPU::UserFuncIDBase<madrona::CustomParallelForNode<madrona::Context, &madrona::phys::solver::solveVelocities, 1, 1, madrona::phys::SolverData>, &madrona::CustomParallelForNode<madrona::Context, &madrona::phys::solver::solveVelocities, 1, 1, madrona::phys::SolverData>::run>::id = 29;
uint32_t madrona::mwGPU::UserFuncIDBase<madrona::CustomParallelForNode<madrona::Context, &madrona::phys::narrowphase::runNarrowphaseSystem, 1, 1, madrona::phys::CandidateCollision>, &(void madrona::TaskGraph::Builder::dynamicCountWrapper<madrona::CustomParallelForNode<madrona::Context, &madrona::phys::narrowphase::runNarrowphaseSystem, 1, 1, madrona::phys::CandidateCollision> >(madrona::CustomParallelForNode<madrona::Context, &madrona::phys::narrowphase::runNarrowphaseSystem, 1, 1, madrona::phys::CandidateCollision>*, int))>::id = 30;
uint32_t madrona::mwGPU::UserFuncIDBase<madrona::CustomParallelForNode<madrona::Context, &madrona::phys::narrowphase::runNarrowphaseSystem, 1, 1, madrona::phys::CandidateCollision>, &madrona::CustomParallelForNode<madrona::Context, &madrona::phys::narrowphase::runNarrowphaseSystem, 1, 1, madrona::phys::CandidateCollision>::run>::id = 31;
uint32_t madrona::mwGPU::UserFuncIDBase<madrona::CustomParallelForNode<madrona::Context, &madrona::phys::broadphase::updateLeafPositionsEntry, 1, 1, madrona::phys::broadphase::LeafID, madrona::base::Position, madrona::base::Rotation, madrona::base::Scale, madrona::base::ObjectID, madrona::phys::Velocity>, &(void madrona::TaskGraph::Builder::dynamicCountWrapper<madrona::CustomParallelForNode<madrona::Context, &madrona::phys::broadphase::updateLeafPositionsEntry, 1, 1, madrona::phys::broadphase::LeafID, madrona::base::Position, madrona::base::Rotation, madrona::base::Scale, madrona::base::ObjectID, madrona::phys::Velocity> >(madrona::CustomParallelForNode<madrona::Context, &madrona::phys::broadphase::updateLeafPositionsEntry, 1, 1, madrona::phys::broadphase::LeafID, madrona::base::Position, madrona::base::Rotation, madrona::base::Scale, madrona::base::ObjectID, madrona::phys::Velocity>*, int))>::id = 32;

The numbers at the end of each line are the IDs printed into the profiling PNGs. For example, we can see that ID 31 (the orange color in my example and the most expensive part of the step) corresponds to madrona::phys::narrowphase::runNarrowphaseSystem (narrowphase collision detection).
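If you end up doing this mapping often, a small helper script can automate it. The sketch below is not part of the repo; it just scrapes the demangled output shown above with a regex (which may need tweaking for other node types) and prints an ID-to-function table:

# id_map.py -- illustrative helper, not part of the madrona repo.
# Reads demangled MADRONA_MWGPU_PRINT_FUNC_IDS output from stdin, e.g.:
#   MADRONA_MWGPU_PRINT_FUNC_IDS=1 python scripts/sim_bench.py <args> 2>&1 | c++filt | python id_map.py
import re
import sys

# Matches the "UserFuncIDBase<...&some::function, ...>::id = N;" lines shown above.
pattern = re.compile(r"UserFuncIDBase<.*?&(?P<fn>[\w:]+),.*=\s*(?P<id>\d+);")

mapping = {}
for line in sys.stdin:
    m = pattern.search(line)
    if m:
        mapping[int(m.group("id"))] = m.group("fn")

for func_id in sorted(mapping):
    print(f"f: {func_id:>3}  ->  {mapping[func_id]}")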

Hopefully this is helpful. I apologize that the profiling infrastructure is so painful to work with. Improving it is on my TODO list, but pretty low priority unfortunately. I'm open to PRs that try to improve this workflow if you're interested :)

By the way, do you know about the MADRONA_MWGPU_KERNEL_CACHE environment variable? It's very helpful because it will cache the GPU compilation into the file the environment variable is set to, so you don't need to recompile on each training run. Be careful though, if you change the C++ code you need to delete the cache manually! There is no checking to see if the source files are newer.

ShawnshanksGui commented 10 months ago

Thanks, I will give it a try.

ShawnshanksGui commented 10 months ago

I did everything in the following order from scratch, but I still get an error:

git clone --recursive https://github.com/shacklettbp/madrona_escape_room.git
cd madrona_escape_room
mkdir build
cd build
cmake .. -DMADRONA_ENABLE_TRACING=ON
make -j 60 
cd ..
python external/madrona/scripts/profile.py python scripts/sim_bench.py --num-worlds 8192 --num-steps 20

"Error:" Error at /home/gf/projects/madrona/test/madrona_escape_room/external/madrona/src/mw/cuda_exec.cpp:1061 in GPUKernels madrona::buildKernels(const CompileConfig &, Span, ExecutorMode, int32_t, std::pair<int, int>) CUDA_ERROR_NOT_FOUND: named symbol not found Aborted (core dumped) Error at /home/gf/projects/madrona/test/madrona_escape_room/external/madrona/src/mw/cuda_exec.cpp:1061 in GPUKernels madrona::buildKernels(const CompileConfig &, Span, ExecutorMode, int32_t, std::pair<int, int>) CUDA_ERROR_NOT_FOUND: named symbol not found Aborted (core dumped) 0 events were logged in total At the end, complete traces for -1 steps are generated Traceback (most recent call last): File "/home/gf/projects/madrona/test/madrona_escape_room/external/madrona/scripts/profile.py", line 110, in parse_traces() File "/home/gf/projects/madrona/test/madrona_escape_room/external/madrona/scripts/profile.py", line 60, in parse_traces assert len(log_steps) > BASE_STEP AssertionError

ShawnshanksGui commented 10 months ago

Also, I tried it on madrona_simple_example:

$ python external/madrona/scripts/profile.py python scripts/run.py 1 --gpu

That produced no error.

But running

$ python external/madrona/scripts/parse_device_tracing.py --trace_file /tmp/1733051_madrona_device_tracing.bin

produced some errors:

1860 events were logged in total
At the end, complete traces for 19 steps are generated
metrics file: /tmp/1733051_madrona_device_tracing.bin_megakernel_events/metrics.xlsx
For each SM on average, 18.328% of the time there is at least one block running on it
the figure size will be 9528x4764
Color mapping for functions: {51: 'blue', 52: 'orange'}
Percentage of active warps is 12.50%
Traceback (most recent call last):
  File "/home/gf/projects/madrona_simple_example/external/madrona/scripts/parse_device_tracing.py", line 485, in <module>
    step_analysis(LOG_STEPS[s], dir_path + "/step{}.png".format(s),
  File "/home/gf/projects/madrona_simple_example/external/madrona/scripts/parse_device_tracing.py", line 424, in step_analysis
    plot_events(step_log, nodes_map, block_exec_time["blocks"], file_name, args)
  File "/home/gf/projects/madrona_simple_example/external/madrona/scripts/parse_device_tracing.py", line 396, in plot_events
    font = ImageFont.load_default(size=fontsize)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: load_default() got an unexpected keyword argument 'size'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/gf/projects/madrona_simple_example/external/madrona/scripts/parse_device_tracing.py", line 483, in <module>
    with pd.ExcelWriter(dir_path + "/metrics.xlsx") as writer:
  File "/home/gf/anaconda3/envs/madrona/lib/python3.11/site-packages/pandas/io/excel/_base.py", line 1370, in __exit__
    self.close()
  File "/home/gf/anaconda3/envs/madrona/lib/python3.11/site-packages/pandas/io/excel/_base.py", line 1374, in close
    self._save()
  File "/home/gf/anaconda3/envs/madrona/lib/python3.11/site-packages/pandas/io/excel/_openpyxl.py", line 110, in _save
    self.book.save(self._handles.handle)
  File "/home/gf/anaconda3/envs/madrona/lib/python3.11/site-packages/openpyxl/workbook/workbook.py", line 386, in save
    save_workbook(self, filename)
  File "/home/gf/anaconda3/envs/madrona/lib/python3.11/site-packages/openpyxl/writer/excel.py", line 294, in save_workbook
    writer.save()
  File "/home/gf/anaconda3/envs/madrona/lib/python3.11/site-packages/openpyxl/writer/excel.py", line 275, in save
    self.write_data()
  File "/home/gf/anaconda3/envs/madrona/lib/python3.11/site-packages/openpyxl/writer/excel.py", line 89, in write_data
    archive.writestr(ARC_WORKBOOK, writer.write())
                                   ^^^^^^^^^^^^^^
  File "/home/gf/anaconda3/envs/madrona/lib/python3.11/site-packages/openpyxl/workbook/_writer.py", line 150, in write
    self.write_views()
  File "/home/gf/anaconda3/envs/madrona/lib/python3.11/site-packages/openpyxl/workbook/_writer.py", line 137, in write_views
    active = get_active_sheet(self.wb)
             ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/gf/anaconda3/envs/madrona/lib/python3.11/site-packages/openpyxl/workbook/_writer.py", line 35, in get_active_sheet
    raise IndexError("At least one sheet must be visible")
IndexError: At least one sheet must be visible

shacklettbp commented 10 months ago

I wouldn't expect simple example to work right now. My fix is only on the latest version of Madrona and I've only updated escape room.

I think the problem is you need to pip install -e . in the newly cloned escape room repo. I think python is likely still pulling in an install of escape room that you have elsewhere. So it should be:

cd build
cmake .. -DMADRONA_ENABLE_TRACING=ON
make -j 60
cd ..
pip install -e .
python external/madrona/scripts/profile.py python scripts/sim_bench.py --num-worlds 8192 --num-steps 20

ShawnshanksGui commented 10 months ago

Thanks. For the first problem, I rebuilt again from scratch in the following order, and there are still errors.

git clone --recursive https://github.com/shacklettbp/madrona_escape_room.git
cd madrona_escape_room
mkdir build
cd build
cmake .. -DMADRONA_ENABLE_TRACING=ON
make -j 60
cd ..
pip install -e .
python external/madrona/scripts/profile.py python scripts/sim_bench.py --num-worlds 81 --num-steps 20

[screenshot of the error output]

ShawnshanksGui commented 10 months ago

For the other problem (the nsight-compute-like profiler), an error occurred: [screenshot of the error]

But it seems to work after I made a small modification to line 396 of parse_device_tracing.py:

#font = ImageFont.load_default(size=fontsize)
font = ImageFont.load_default()

[screenshot of the resulting output]

ShawnshanksGui commented 10 months ago

step10.png: [attached profiling image]

shacklettbp commented 10 months ago

For that Python error when running parse_device_tracing.py, you need to upgrade your installed version of Pillow: pip install --upgrade pillow

I'm very confused by those errors you're getting when running the profile script. Do you have the CUDA version of pytorch installed? I've never seen those MKL errors before and don't know why MKL would be relevant - that's typically for CPU DNN acceleration (CPU version of pytorch maybe?).

The key error that is stopping profiling from working is the symbol-not-found error. I don't know why that is happening on your machine; it doesn't happen on mine. What version of Python are you using? Can you check if the profile.py script is correctly passing the MADRONA_MWGPU_ENABLE_PGO environment variable to the benchmark script? The issue seems to be that in cuda_exec.cpp line 1756, max_megakernel_blocks_per_sm isn't being set to 6.
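One quick way to check that (a throwaway addition, not something in the repo) is to print the relevant environment variables at the top of scripts/sim_bench.py when it is launched by profile.py:

# Temporary debugging lines for the top of scripts/sim_bench.py: dump every
# MADRONA_* environment variable so you can confirm profile.py is forwarding them.
import os

for key, value in sorted(os.environ.items()):
    if key.startswith("MADRONA_"):
        print(key, "=", value)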

step10.png looks good. You're running a very low number of worlds (low batch size) so the GPU is heavily underutilized. That's why there is so much white space, a lot of the SMs don't have any work to do.

ShawnshanksGui commented 10 months ago

For the MKL error, my environment configuration is as follows:

[screenshot of the environment configuration]

For the up-front profiling problem, I will check it later.

ShawnshanksGui commented 10 months ago

By the way, I used NVIDIA Nsight Compute to make a profile:

ncu -o output_report.ncu-rep  python   scripts/train.py --num-worlds 8 --num-updates 50 --profile-report --fp16 --gpu-sim --ckpt-dir build/checkpoints/

[screenshot of the ncu output]

It seems to stop at the beginning of the megakernel. Have you ever gotten this to work? If so, how?

shacklettbp commented 10 months ago

It seems like it crashes before the megakernel even starts. Try profiling only the megakernel (I think you pass -k regex:madronaMWGPUMegakernel to ncu).

ShawnshanksGui commented 10 months ago

It works, thanks.