ledatelescope / bifrost

A stream processing framework for high-throughput applications.
BSD 3-Clause "New" or "Revised" License

like_top crashes and low performance #90

Closed: telegraphic closed this issue 2 years ago

telegraphic commented 7 years ago

I finally have a bifrost version of our gpuspec code! However, it's running very slowly, and I'd like to find out why. Unfortunately, the monitoring tools haven't been working for me.

My code is here: https://github.com/telegraphic/bunyip/blob/master/bf_gpuspec.py Note: it requires a GPU with 8GB RAM to run -- really, more RAM would be better.

If I try running the monitoring tools, they crash:

dancpr@bldcpr:/bldata/bifrost/tools$ ./like_bmon.py
Traceback (most recent call last):
  File "./like_bmon.py", line 415, in <module>
    main(sys.argv[1:])
  File "./like_bmon.py", line 264, in main
    blockList = _getTransmitReceive()
  File "./like_bmon.py", line 102, in _getTransmitReceive
    contents = load_by_pid(pid)
  File "build/bdist.linux-x86_64/egg/bifrost/proclog.py", line 127, in load_by_pid
  File "build/bdist.linux-x86_64/egg/bifrost/proclog.py", line 79, in load_by_filename
IOError: [Errno 21] Is a directory: '/dev/shm/bifrost/17263/Pipeline_0/HdfWriteBlock_2'

The like_top script is the worst, as it fubars my command line. It needs a more graceful failure mode...

screenshot: https://user-images.githubusercontent.com/713251/26958533-294889b8-4c80-11e7-803d-1afbf180b9be.png
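For reference, the graceful failure mode asked for above is essentially what Python's curses.wrapper() provides: it restores the terminal even when the drawing code raises. A minimal sketch of the idea (the draw function is a stand-in, not like_top's actual code):

import curses

def draw(stdscr):
    # Stand-in for the real like_top display loop.
    stdscr.addstr(0, 0, "like_top would render here")
    stdscr.getkey()

# curses.wrapper() initializes the screen and restores the terminal state
# (echo, cbreak, cursor visibility) even if draw() raises, so a crash
# never leaves the shell garbled.
curses.wrapper(draw)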
benbarsdell commented 7 years ago

For performance analysis you could try rebuilding with TRACE=1 and running through nvprof:

$ nvprof -o bf_gpuspec.nvprof python bf_gpuspec.py

(or you can run it from the nvvp GUI directly).

In theory that will give you a complete timeline of the whole pipeline, though I have to admit that I've barely tested this functionality.

jaycedowell commented 7 years ago

Running 'reset' restores the terminal settings after a curses crash.

What's the full directory structure inside /dev/shm/bifrost?


telegraphic commented 7 years ago

Hey Jayce,

Here's the directory structure via the (newly discovered) tree command:

dancpr@bldcpr:/bldata/bifrost/tools$ tree /dev/shm/bifrost
/dev/shm/bifrost
└── 17263
    └── Pipeline_0
        ├── AccumulateBlock_0
        │   ├── bind
        │   ├── in
        │   ├── out
        │   ├── perf
        │   └── sequence0
        ├── BlockScope_1
        │   ├── PrintHeaderBlock_0
        │   │   ├── bind
        │   │   ├── in
        │   │   ├── out
        │   │   ├── perf
        │   │   └── sequence0
        │   └── TransposeBlock_0
        │       ├── bind
        │       ├── in
        │       ├── out
        │       ├── perf
        │       └── sequence0
        ├── BlockScope_13
        │   ├── CopyBlock_3
        │   │   ├── bind
        │   │   ├── in
        │   │   ├── out
        │   │   ├── perf
        │   │   └── sequence0
        │   └── TransposeBlock_1
        │       ├── bind
        │       ├── in
        │       ├── out
        │       ├── perf
        │       └── sequence0
        ├── BlockScope_16
        │   ├── AccumulateBlock_1
        │   │   ├── bind
        │   │   ├── in
        │   │   ├── out
        │   │   ├── perf
        │   │   └── sequence0
        │   ├── CopyBlock_4
        │   │   ├── bind
        │   │   ├── in
        │   │   ├── out
        │   │   ├── perf
        │   │   └── sequence0
        │   ├── CopyBlock_5
        │   │   ├── bind
        │   │   ├── in
        │   │   ├── out
        │   │   ├── perf
        │   │   └── sequence0
        │   ├── DetectBlock_1
        │   │   ├── bind
        │   │   ├── in
        │   │   ├── out
        │   │   ├── perf
        │   │   └── sequence0
        │   ├── FftBlock_1
        │   │   ├── bind
        │   │   ├── in
        │   │   ├── out
        │   │   ├── perf
        │   │   └── sequence0
        │   └── FftShiftBlock_1
        │       ├── bind
        │       ├── in
        │       ├── out
        │       ├── perf
        │       └── sequence0
        ├── BlockScope_25
        │   ├── CopyBlock_6
        │   │   ├── bind
        │   │   ├── in
        │   │   ├── out
        │   │   ├── perf
        │   │   └── sequence0
        │   ├── PrintHeaderBlock_2
        │   │   ├── bind
        │   │   ├── in
        │   │   ├── out
        │   │   ├── perf
        │   │   └── sequence0
        │   └── TransposeBlock_2
        │       ├── bind
        │       ├── in
        │       ├── out
        │       ├── perf
        │       └── sequence0
        ├── BlockScope_29
        │   ├── AccumulateBlock_2
        │   │   ├── bind
        │   │   ├── in
        │   │   ├── out
        │   │   ├── perf
        │   │   └── sequence0
        │   ├── CopyBlock_7
        │   │   ├── bind
        │   │   ├── in
        │   │   ├── out
        │   │   ├── perf
        │   │   └── sequence0
        │   ├── CopyBlock_8
        │   │   ├── bind
        │   │   ├── in
        │   │   ├── out
        │   │   ├── perf
        │   │   └── sequence0
        │   ├── DetectBlock_2
        │   │   ├── bind
        │   │   ├── in
        │   │   ├── out
        │   │   ├── perf
        │   │   └── sequence0
        │   ├── FftBlock_2
        │   │   ├── bind
        │   │   ├── in
        │   │   ├── out
        │   │   ├── perf
        │   │   └── sequence0
        │   └── FftShiftBlock_2
        │       ├── bind
        │       ├── in
        │       ├── out
        │       ├── perf
        │       └── sequence0
        ├── BlockScope_4
        │   ├── CopyBlock_0
        │   │   ├── bind
        │   │   ├── in
        │   │   ├── out
        │   │   ├── perf
        │   │   └── sequence0
        │   ├── CopyBlock_1
        │   │   ├── bind
        │   │   ├── in
        │   │   ├── out
        │   │   ├── perf
        │   │   └── sequence0
        │   ├── DetectBlock_0
        │   │   ├── bind
        │   │   ├── in
        │   │   ├── out
        │   │   ├── perf
        │   │   └── sequence0
        │   ├── FftBlock_0
        │   │   ├── bind
        │   │   ├── in
        │   │   ├── out
        │   │   ├── perf
        │   │   └── sequence0
        │   └── FftShiftBlock_0
        │       ├── bind
        │       ├── in
        │       ├── out
        │       ├── perf
        │       └── sequence0
        ├── CopyBlock_2
        │   ├── bind
        │   ├── in
        │   ├── out
        │   ├── perf
        │   └── sequence0
        ├── GuppiRawSourceBlock_0
        │   ├── bind
        │   ├── in
        │   ├── out
        │   └── perf
        ├── HdfWriteBlock_0
        │   ├── bind
        │   ├── in
        │   ├── out
        │   └── sequence0
        ├── HdfWriteBlock_1
        │   ├── bind
        │   ├── in
        │   ├── out
        │   ├── perf
        │   └── sequence0
        ├── HdfWriteBlock_2
        │   ├── bind
        │   ├── in
        │   ├── out
        │   ├── perf
        │   └── sequence0
        ├── PrintHeaderBlock_1
        │   ├── bind
        │   ├── in
        │   ├── out
        │   ├── perf
        │   └── sequence0
        └── PrintHeaderBlock_3
            ├── bind
            ├── in
            ├── out
            ├── perf
            └── sequence0

40 directories, 158 files
telegraphic commented 7 years ago

nvprof seems to work:

(nvprof screenshot)
jaycedowell commented 7 years ago

Ah, your directory structure is deeper than I expected. Can you try this version of proclog/like_top?

proclog.zip
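(The fix presumably replaces a fixed-depth directory listing with a recursive walk. A minimal sketch of the idea, not the actual contents of proclog.zip; load_all is a hypothetical helper built on the existing load_by_filename:)

import os
from bifrost.proclog import load_by_filename

def load_all(pid, root='/dev/shm/bifrost'):
    # os.walk descends through any number of nested BlockScope_* levels
    # and only hands real files to the parser, avoiding the
    # "IOError: Is a directory" seen above.
    logs = {}
    base = os.path.join(root, str(pid))
    for dirpath, dirnames, filenames in os.walk(base):
        for name in filenames:
            block = os.path.relpath(dirpath, base)
            logs.setdefault(block, {})[name] = load_by_filename(
                os.path.join(dirpath, name))
    return logs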

telegraphic commented 7 years ago

@jaycedowell I am impressed.

(screenshot of like_top running)

(but unsure how to read where the bottlenecks are)

jaycedowell commented 7 years ago

Ha ha, thanks. I usually look at the processing, reserve, and acquire times to try and figure out what is happening. You can also sort by these columns with the p, r, and a keys, respectively.

telegraphic commented 7 years ago

Thanks -- may I confirm what process, reserve, and acquire mean? That is, does a large reserve fraction mean the block is idling (bottleneck in another block), a large processing fraction mean it's working hard (compute bound), and a large acquire fraction mean lots of reading from the ring (memory bound)?

I'm also guessing that a CPU% of 100% means the block is CPU bound. If so, the accumulate block for low-resolution FFTs is one of my bottlenecks (seen in a different test), presumably due to the overhead of calling on_data().

benbarsdell commented 7 years ago

Acquire is the time spent waiting for input (i.e., waiting on upstream blocks), Process is the time spent processing data, and Reserve is the time spent waiting for output space to become available in the ring (i.e., waiting for downstream blocks).

The CPU fraction will probably be 100% on any GPU block because it's currently set to spin while waiting for the GPU.
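(To turn those three numbers into fractions outside of like_top, something like the following works; it assumes the fixed load_by_pid returns a nested per-block dict and that each block's perf log carries acquire_time, process_time, and reserve_time keys, which is what like_top appears to display:)

from bifrost.proclog import load_by_pid

logs = load_by_pid(17263)  # PID taken from /dev/shm/bifrost/<pid>/
for block, contents in logs.items():
    perf = contents.get('perf', {})
    times = [perf.get(k, 0.0)
             for k in ('acquire_time', 'process_time', 'reserve_time')]
    total = sum(times)
    if total > 0:
        acq, proc, res = (100.0 * t / total for t in times)
        print('%-50s acquire %5.1f%%  process %5.1f%%  reserve %5.1f%%'
              % (block, acq, proc, res))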

MilesCranmer commented 7 years ago

@telegraphic, this will probably be much faster on the latest version with ctypesgen. If not, I think tuning the gulp sizes and nframes should help a lot.

Failing that, this tool is extremely useful for tracking down superfluous Python code (if there still is any).
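(For concreteness, the gulp/buffer tuning Miles mentions looks roughly like this in the high-level pipeline API; the keyword names gulp_nframe/buffer_nframe and the read_guppi_raw reader are assumptions for this sketch, so check bifrost.pipeline and bifrost.blocks for your version:)

import bifrost as bf
import bifrost.blocks as blocks

# Hypothetical source block and filename; substitute the real reader.
raw = blocks.read_guppi_raw(['guppi.0000.raw'])

# Larger gulps amortize per-gulp Python overhead; extra buffered frames
# give bursty blocks headroom. Both keyword names are assumptions here.
with bf.block_scope(gulp_nframe=8192, buffer_nframe=4):
    gpu = blocks.copy(raw, space='cuda')

bf.get_default_pipeline().run()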

ledatelescope commented 7 years ago

To what extent has ctypesgen been tested? L.


benbarsdell commented 7 years ago

I have some changes coming that should further improve perf of these sorts of pipelines. I expect to finish them on the weekend.

All tests (>200 now) pass with ctypesgen.

benbarsdell commented 7 years ago

The new reduce block and transpose kernels might be helpful here. And the profiler is a good way to see what the bottlenecks are in your pipeline.
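(For anyone trying the same thing, usage is roughly as below; the transpose and reduce signatures shown here, an axis-label list and an axis plus reduction factor, are assumptions rather than the confirmed API, so consult the block docstrings:)

import bifrost.blocks as blocks

def accumulate_on_gpu(gpu_ring, navg=16):
    # Hypothetical GPU-side accumulate built from the new blocks:
    # reorder the axes, then sum every navg frames along the time axis
    # on the GPU instead of looping over gulps in Python.
    d = blocks.transpose(gpu_ring, ['time', 'pol', 'freq'])
    return blocks.reduce(d, 'time', navg)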

jaycedowell commented 6 years ago

Are there any remaining problems here or can we close this out?