Closed by telegraphic 2 years ago
For performance analysis you could try rebuilding with TRACE=1 and running through nvprof:
$ nvprof -o bf_gpuspec.nvprof python bf_gpuspec.py
(or you can run it from the nvvp GUI directly).
In theory that will give you a complete timeline of the whole pipeline, though I have to admit that I've barely tested this functionality.
A call to 'reset' restores the terminal settings after a curses crash.
What's the full directory structure inside /dev/shm/bifrost?
On 6/8/17 8:28 PM, Danny Price wrote:
I finally have a bifrost version of our gpuspec code! However, it's running very slowly, and I'd like to find out why. Unfortunately, the monitoring tools haven't been working for me.
My code is here: https://github.com/telegraphic/bunyip/blob/master/bf_gpuspec.py Note: it requires a GPU with 8GB RAM to run -- really, more RAM would be better.
If I try running the monitor tools they crash:
dancpr@bldcpr:/bldata/bifrost/tools$ ./like_bmon.py
Traceback (most recent call last):
  File "./like_bmon.py", line 415, in <module>
    main(sys.argv[1:])
  File "./like_bmon.py", line 264, in main
    blockList = _getTransmitReceive()
  File "./like_bmon.py", line 102, in _getTransmitReceive
    contents = load_by_pid(pid)
  File "build/bdist.linux-x86_64/egg/bifrost/proclog.py", line 127, in load_by_pid
  File "build/bdist.linux-x86_64/egg/bifrost/proclog.py", line 79, in load_by_filename
IOError: [Errno 21] Is a directory: '/dev/shm/bifrost/17263/Pipeline_0/HdfWriteBlock_2'
The like_top script is the worst, as it fubars my command line. It needs a more graceful failure mode...
[Screenshot, 2017-06-08 7:24 PM: the like_top crash output] https://user-images.githubusercontent.com/713251/26958533-294889b8-4c80-11e7-803d-1afbf180b9be.png
Hey Jayce,
Here's the directory structure via the (newly discovered) tree command:
dancpr@bldcpr:/bldata/bifrost/tools$ tree /dev/shm/bifrost
/dev/shm/bifrost
└── 17263
└── Pipeline_0
├── AccumulateBlock_0
│ ├── bind
│ ├── in
│ ├── out
│ ├── perf
│ └── sequence0
├── BlockScope_1
│ ├── PrintHeaderBlock_0
│ │ ├── bind
│ │ ├── in
│ │ ├── out
│ │ ├── perf
│ │ └── sequence0
│ └── TransposeBlock_0
│ ├── bind
│ ├── in
│ ├── out
│ ├── perf
│ └── sequence0
├── BlockScope_13
│ ├── CopyBlock_3
│ │ ├── bind
│ │ ├── in
│ │ ├── out
│ │ ├── perf
│ │ └── sequence0
│ └── TransposeBlock_1
│ ├── bind
│ ├── in
│ ├── out
│ ├── perf
│ └── sequence0
├── BlockScope_16
│ ├── AccumulateBlock_1
│ │ ├── bind
│ │ ├── in
│ │ ├── out
│ │ ├── perf
│ │ └── sequence0
│ ├── CopyBlock_4
│ │ ├── bind
│ │ ├── in
│ │ ├── out
│ │ ├── perf
│ │ └── sequence0
│ ├── CopyBlock_5
│ │ ├── bind
│ │ ├── in
│ │ ├── out
│ │ ├── perf
│ │ └── sequence0
│ ├── DetectBlock_1
│ │ ├── bind
│ │ ├── in
│ │ ├── out
│ │ ├── perf
│ │ └── sequence0
│ ├── FftBlock_1
│ │ ├── bind
│ │ ├── in
│ │ ├── out
│ │ ├── perf
│ │ └── sequence0
│ └── FftShiftBlock_1
│ ├── bind
│ ├── in
│ ├── out
│ ├── perf
│ └── sequence0
├── BlockScope_25
│ ├── CopyBlock_6
│ │ ├── bind
│ │ ├── in
│ │ ├── out
│ │ ├── perf
│ │ └── sequence0
│ ├── PrintHeaderBlock_2
│ │ ├── bind
│ │ ├── in
│ │ ├── out
│ │ ├── perf
│ │ └── sequence0
│ └── TransposeBlock_2
│ ├── bind
│ ├── in
│ ├── out
│ ├── perf
│ └── sequence0
├── BlockScope_29
│ ├── AccumulateBlock_2
│ │ ├── bind
│ │ ├── in
│ │ ├── out
│ │ ├── perf
│ │ └── sequence0
│ ├── CopyBlock_7
│ │ ├── bind
│ │ ├── in
│ │ ├── out
│ │ ├── perf
│ │ └── sequence0
│ ├── CopyBlock_8
│ │ ├── bind
│ │ ├── in
│ │ ├── out
│ │ ├── perf
│ │ └── sequence0
│ ├── DetectBlock_2
│ │ ├── bind
│ │ ├── in
│ │ ├── out
│ │ ├── perf
│ │ └── sequence0
│ ├── FftBlock_2
│ │ ├── bind
│ │ ├── in
│ │ ├── out
│ │ ├── perf
│ │ └── sequence0
│ └── FftShiftBlock_2
│ ├── bind
│ ├── in
│ ├── out
│ ├── perf
│ └── sequence0
├── BlockScope_4
│ ├── CopyBlock_0
│ │ ├── bind
│ │ ├── in
│ │ ├── out
│ │ ├── perf
│ │ └── sequence0
│ ├── CopyBlock_1
│ │ ├── bind
│ │ ├── in
│ │ ├── out
│ │ ├── perf
│ │ └── sequence0
│ ├── DetectBlock_0
│ │ ├── bind
│ │ ├── in
│ │ ├── out
│ │ ├── perf
│ │ └── sequence0
│ ├── FftBlock_0
│ │ ├── bind
│ │ ├── in
│ │ ├── out
│ │ ├── perf
│ │ └── sequence0
│ └── FftShiftBlock_0
│ ├── bind
│ ├── in
│ ├── out
│ ├── perf
│ └── sequence0
├── CopyBlock_2
│ ├── bind
│ ├── in
│ ├── out
│ ├── perf
│ └── sequence0
├── GuppiRawSourceBlock_0
│ ├── bind
│ ├── in
│ ├── out
│ └── perf
├── HdfWriteBlock_0
│ ├── bind
│ ├── in
│ ├── out
│ └── sequence0
├── HdfWriteBlock_1
│ ├── bind
│ ├── in
│ ├── out
│ ├── perf
│ └── sequence0
├── HdfWriteBlock_2
│ ├── bind
│ ├── in
│ ├── out
│ ├── perf
│ └── sequence0
├── PrintHeaderBlock_1
│ ├── bind
│ ├── in
│ ├── out
│ ├── perf
│ └── sequence0
└── PrintHeaderBlock_3
├── bind
├── in
├── out
├── perf
└── sequence0
40 directories, 158 files
nvprof seems to work:
Ah, your directory structure is deeper than I expected. Can you try this version of proclog/like_top?
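For reference, a minimal sketch of the kind of fix needed: walk the proclog tree recursively and only read regular files, so nested scopes (e.g. BlockScope_*/CopyBlock_*) don't trigger the "Is a directory" IOError. This is an illustrative stand-in, not bifrost's actual proclog code; the function name and the assumption that proclog entries are plain-text files are mine.

```python
import os

def load_proclog_tree(root):
    """Recursively collect proclog entries under ``root``.

    Hypothetical sketch: descend through arbitrarily deep pipeline
    scopes and read only regular files, so a directory entry such as
    HdfWriteBlock_2 never raises "IOError: Is a directory".
    """
    entries = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            key = os.path.relpath(path, root)
            with open(path) as f:
                entries[key] = f.read()
    return entries
```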
@jaycedowell I am impressed.
(but unsure how to read where the bottlenecks are)
Ha ha, thanks. I usually look at the processing, reserve, and acquire times to try and figure out what is happening. You can also sort by these columns with the p, r, and a keys, respectively.
Thanks -- may I confirm what process, reserve, and acquire mean? That is, does a large reserve fraction mean the block is idling (bottleneck in another block), a large process fraction mean it's working hard (compute bound), and a large acquire fraction mean lots of reading from the ring (memory bound)?
I'm also guessing that CPU% being at 100% means the block is CPU bound. In that case, the accumulate block for low-resolution FFTs is one of my bottlenecks (from a different test), presumably due to the overhead of calling on_data().
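To illustrate why per-call overhead matters here, below is a small numpy sketch (not bifrost's actual accumulate block; both function names are mine) contrasting one accumulate call per tiny frame with accumulating whole batches at once. The results are identical, but the batched version makes far fewer Python-level calls, which is the kind of overhead a larger gulp size amortizes.

```python
import numpy as np

def accumulate_per_frame(frames):
    """Accumulate with one call per tiny frame (many Python-level calls)."""
    acc = np.zeros_like(frames[0])
    for frame in frames:
        acc += frame  # analogous to one on_data()-style call per frame
    return acc

def accumulate_batched(frames, batch=1024):
    """Amortize call overhead by summing a whole batch per call."""
    acc = np.zeros_like(frames[0])
    for i in range(0, len(frames), batch):
        acc += np.sum(frames[i:i + batch], axis=0)
    return acc
```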
Acquire is the time spent waiting for input (i.e., waiting on upstream blocks), Process is the time spent processing data, and Reserve is the time spent waiting for output space to become available in the ring (i.e., waiting for downstream blocks).
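The rule of thumb above can be captured in a few lines. This is an illustrative sketch (function name and labels are mine, not part of bifrost): given the three timings from a block's perf log, report where the block spends most of its time.

```python
def bottleneck_hint(acquire, process, reserve):
    """Rough interpretation of a block's perf timings.

    acquire : time waiting for input from upstream blocks
    process : time spent doing the block's own work
    reserve : time waiting for output space in the ring (downstream)
    """
    total = acquire + process + reserve
    if total <= 0:
        return "no time recorded"
    fractions = {
        "waiting on upstream (acquire)": acquire / total,
        "compute bound (process)": process / total,
        "waiting on downstream (reserve)": reserve / total,
    }
    # Largest fraction is the most likely bottleneck
    return max(fractions, key=fractions.get)
```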
The CPU fraction will probably be 100% on any GPU block because it's currently set to spin while waiting for the GPU.
@telegraphic, this will probably be much faster on the latest version with ctypesgen. If not, I think tuning the gulp sizes and nframes should help a lot.
If not, this tool is extremely useful for tracking down superfluous python code (if there still is any).
To what extent has ctypesgen been tested? L.
On August 4, 2017 1:25:57 PM PDT, Miles Cranmer wrote: (quoted above)
I have some changes coming that should further improve the performance of these sorts of pipelines. I expect to finish them over the weekend.
All tests (>200 now) pass with ctypesgen.
The new reduce block and transpose kernels might be helpful here. And the profiler is a good way to see what the bottlenecks are in your pipeline.
Are there any remaining problems here or can we close this out?