iree-org / iree

A retargetable MLIR-based machine learning compiler and runtime toolkit.
http://iree.dev/

torch-based bert-large-uncased training performance tracking #10678

Open · dan-garvey opened this issue 1 year ago

dan-garvey commented 1 year ago

Raising this issue to track performance optimization for bert training via the torch-mlir path.

Due to how torch-mlir currently lowers training workloads, the input for an iree benchmark command includes several hundred input files. For the convenience of anyone working on this, I've written a simple Python script to run the benchmark command.

If you'd like to try it out, all you have to do is:

mkdir bert_large_uncased
cd bert_large_uncased
wget https://storage.googleapis.com/shark_tank/dan/bert_large_uncased/benchmark_bert.py
wget https://storage.googleapis.com/shark_tank/dan/bert_large_uncased/inputs.npz
wget https://storage.googleapis.com/shark_tank/dan/bert_large_uncased/bert_large_uncased.mlir
python benchmark_bert.py
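For reference, here is a rough, hypothetical sketch of what a wrapper like this has to do today (the real benchmark_bert.py may differ): unpack inputs.npz into individual .npy files and pass one --function_input flag per array. The flags selecting the compiled module, entry function, and device are not shown and would depend on how the model was compiled.

import subprocess
import numpy as np

# Hypothetical sketch; the real benchmark_bert.py may do this differently.
npz = np.load("inputs.npz")
flags = []
for i, name in enumerate(npz.files):
    path = f"input_{i}.npy"
    np.save(path, npz[name])                  # one .npy file per input tensor
    flags.append(f"--function_input={path}")

# The flags that select the compiled module, entry function, and device are
# omitted here; they depend on how bert_large_uncased.mlir was compiled.
subprocess.run(["iree-benchmark-module", *flags], check=True)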

Once https://github.com/nod-ai/SHARK/pull/387 lands, bert-large will be tested and benchmarked as part of the normal shark tank nightly tests.

Current numbers for IREE on an A100 system look like this: [screenshot]

On base PyTorch (CUDA): total_time(ms)/iter: 69.1148742198944

I'll upload a trace shortly.
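For context, here is a minimal sketch (not the exact harness behind the number above) of how a per-iteration figure like that is typically measured in PyTorch; step_fn is a placeholder for one full training step:

import time
import torch

def time_per_iter_ms(step_fn, iters=100, warmup=10):
    # step_fn is assumed to run one full training step (forward + backward + optimizer).
    for _ in range(warmup):
        step_fn()
    torch.cuda.synchronize()          # make sure warmup work has finished
    start = time.perf_counter()
    for _ in range(iters):
        step_fn()
    torch.cuda.synchronize()          # wait for all queued GPU work before stopping the clock
    return (time.perf_counter() - start) * 1000.0 / iters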

benvanik commented 1 year ago

fyi you can pass a single npy file with all of the inputs - npy files can contain any number of arrays:

import numpy as np

with open('test.npy', 'wb') as f:
    np.save(f, np.array([1, 2]))
    np.save(f, np.array([1, 3]))

Then you pass that as --function_input=test.npy and it'll expand to two inputs.
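As a concrete illustration, a minimal sketch (assuming the arrays in inputs.npz are stored in the function's argument order) that repacks the archive from the first comment into one .npy file holding several arrays back to back:

import numpy as np

npz = np.load("inputs.npz")
with open("all_inputs.npy", "wb") as f:
    for name in npz.files:        # assumes archive order == argument order
        np.save(f, npz[name])     # append each array to the same file

That single file could then be passed once as --function_input=all_inputs.npy instead of hundreds of separate flags.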

the bigger thing is fixing the export such that you aren't passing in hundreds of things - that's not something we can really fix the performance of. still probably lots of other things to fix, but that'll likely become your bottleneck.

ThomasRaoux commented 1 year ago

what's the performance on TF? I don't see it on shark tank

dan-garvey commented 1 year ago

> what's the performance on TF? I don't see it on shark tank

TF training benchmarks + tracy coming soon™

> fyi you can pass a single npy file with all of the inputs - npy files can contain any number of arrays:
>
>     with open('test.npy', 'wb') as f:
>         np.save(f, np.array([1, 2]))
>         np.save(f, np.array([1, 3]))
>
> then you pass that as --function_input=test.npy and it'll expand to two inputs
>
> the bigger thing is fixing the export such that you aren't passing in hundreds of things - that's not something we can really fix the performance of. still probably lots of other things to fix, but that'll likely become your bottleneck.

Thanks! That is much more convenient. I'm not sure what the best way forward is going to be for the export. Given that it essentially becomes a recursive function, I suppose there should be a way to write a pass for that.

dan-garvey commented 1 year ago

Here's a trace: bert_profile.txt

[screenshot] @benvanik @ThomasRaoux is this what you would expect to see from a single iteration? Is the significant buffer alloc/free/dispatch time a result of the large input count?

benvanik commented 1 year ago

that's a lot of dispatches - can you compare the number with the non-torch version? at the point we're spending half as long enqueuing the dispatches as we are executing them something is pretty wrong :)

dan-garvey commented 1 year ago

It's difficult because the TensorFlow equivalent doesn't have the handicap of passing all the weights as inputs on every iteration, but I suppose at least we would hopefully see a significantly lower percentage of time spent enqueuing the dispatches?

benvanik commented 1 year ago

yeah - less about time taken and more the 4122 number - if that's 2000 in the tf version that'd indicate that fusion was better there, while if it's the same then we'd expect the times to be the same as well

dan-garvey commented 1 year ago

While I'm waiting for a bug on the TF path to get fixed, here's an nsys profile: [screenshots]

I don't see a way to identify individual iterations in Nsight, but hopefully this is still useful information

benvanik commented 1 year ago

You can filter events to a region of the timeline by drag-selecting and choosing "filter and reorder" [screenshot]; that's important for eliminating the startup times.

The biggest thing is reducing kernel launch count, after which we need to kill the synchronize time (now possible on CPU but CUDA HAL needs to be made async). When we get graphs working the launch time will go down a lot (or at least, to as low as possible for the given number of launches).

The 34k h->d copies are pretty weird - what are those?

dan-garvey commented 1 year ago

[screenshot]

It looks like those are startup only, as they disappear when filtering to a single iteration.

powderluv commented 1 year ago

How are we looking if we just add up the kernel runtimes (so we can calculate the overhead of the CUDA sync)?

@ThomasRaoux Are the top items the same kind of reductions we see on the Vulkan side?

The 34k h->d copies are for the real datasets being used, I assume? Since this is testing BERT training accuracy with real data.

dan-garvey commented 1 year ago

More info: GPU kernel info, PyTorch vs IREE:

https://docs.google.com/spreadsheets/d/1OkbohUpWcB0vwtaHeENDmsF6DYek67G5hqasadAGUNo/edit?usp=sharing
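For anyone who wants to reproduce per-kernel numbers like these on the PyTorch side, here is a self-contained sketch (not the harness behind the spreadsheet) using torch.profiler, with a matmul of the 128/1024/30522 sizes discussed below as the workload (the exact M/N/K mapping is an assumption):

import torch
from torch.profiler import profile, ProfilerActivity

a = torch.randn(128, 1024, device="cuda")
w = torch.randn(1024, 30522, device="cuda")

with profile(activities=[ProfilerActivity.CUDA]) as prof:
    for _ in range(10):
        out = a @ w               # vocab-projection-sized matmul
    torch.cuda.synchronize()      # make sure the kernels have finished before reporting

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))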

TF stuff still blocked by an upstream bug.

dan-garvey commented 1 year ago

[screenshot] This is via the most recent IREE. It seems the vast majority of the time is spent in this 128x1024x30522 matmul. @ThomasRaoux is there something I can do to knock this down?

benvanik commented 1 year ago

I'm surprised that the host side blocks there - we shouldn't be synchronizing like that - would be interested to know why that's happening. Maybe a GPU context switch? Does it reproduce at the same place all the time? At the right side of the trace you can see what's expected: host side doesn't wait for the commands to complete until it gets to the synchronize.

dan-garvey commented 1 year ago

More zoomed out: [screenshot]

benvanik commented 1 year ago

Crazy! Can you share the capture file? Wonder if there's any other hints in there as to why that's happening.

ThomasRaoux commented 1 year ago

> [screenshot] This is via the most recent IREE. It seems the vast majority of the time is spent in this 128x1024x30522 matmul. @ThomasRaoux is there something I can do to knock this down?

The recent improvements from @okkwon should help this. Once specialization also supports tensorcore there should be an even bigger boost.

dan-garvey commented 1 year ago

> Crazy! Can you share the capture file? Wonder if there's any other hints in there as to why that's happening.

https://storage.googleapis.com/shark_tank/dan/report11%20(1).nsys-rep

benvanik commented 1 year ago

Interesting - I didn't catch it from the shots but it's not running lock-step with launch/execute but instead the CPU is blocking until a prior execution completes. We may be hitting a stream buffer size limit and causing the driver to block which is unfortunate. Would love to know if there's a way we can avoid that but graphs would solve it for us so it's probably not a big deal (wouldn't help latency here but would reduce CPU time).

Capturing with more data may help hint at what's happening:

  • "Collect GPU context switch trace"
  • "Collect CPU IP/backtrace samples"
  • "Collect CPU context switch trace"

What this means is that so long as it's happening, the cuLaunchKernel times (even in aggregate) are effectively meaningless - that's unfortunate :(

dan-garvey commented 1 year ago

> Capturing with more data may help hint at what's happening:
>
>   • "Collect GPU context switch trace"
>   • "Collect CPU IP/backtrace samples"
>   • "Collect CPU context switch trace"

here's a trace where I added these flags: --gpuctxsw=true --cpuctxsw=system-wide --samples-per-backtrace=1

https://storage.googleapis.com/shark_tank/dan/report1.nsys-rep

> What this means is that so long as it's happening, the cuLaunchKernel times (even in aggregate) are effectively meaningless - that's unfortunate :(

Can you help me understand this?

benvanik commented 1 year ago

Darn, no libcuda.so symbols, but at least it confirms it's the driver waiting in a spinloop for 60ms - that's silly, but such is CUDA I suppose:

Call to _forward_dispatch_1668_generic_1024x128
Kernel launcher
Begins: 19.7277s
Ends: 19.7859s (+58.148 ms)
Return value: 0
GPU: 0000:00:04.0 - NVIDIA A100-SXM4-40GB
Stream: 13
Latency: 121.968 ms
Correlation ID: 152133

Call stack:
libc.so.6!__sched_yield
libcuda.so.515.43.04[7 Frames]
iree-benchmark-module!iree_hal_cuda_stream_command_buffer_dispatch
iree-benchmark-module!iree_hal_command_buffer_dispatch
iree-benchmark-module!iree_hal_deferred_command_buffer_apply
iree-benchmark-module!iree_hal_cuda_device_queue_execute
iree-benchmark-module!iree_hal_device_queue_execute

It means that CPU performance numbers of a CUDA run are meaningless: this shows that we are using 100% of the CPU thread for 60ms doing this launch, but we aren't really - CUDA is. If you were to sum up all the cuLaunchKernel times here or in tracy/perf, it wouldn't be measuring how much it costs us to launch kernels but how much CUDA is spinlooping inside the driver. It would be really good to find a way to avoid this, e.g. if there are CUDA options for increasing the buffer size so that it doesn't block like this, or mode changes for non-blocking. But the only things it hurts here are the planet (lots of wasted power) and multi-tenant systems (where effectively an entire CPU core is taken offline for the duration of the execution while CUDA spins) - it shouldn't change how long the GPU takes to execute its work.

When we use graphs we'll avoid the stream buffering and hopefully see just a short cudaGraphLaunch followed by the synchronize (which we can remove in async modes).

dan-garvey commented 1 year ago

I see what you meant now, thanks.

dan-garvey commented 1 year ago

@ThomasRaoux I'm working on adapting @okkwon's changes to work with tensorcore, but some of the sizes in this model aren't supported by the tensorcore pipeline. Is there anything else we can do on our side to help with the performance of this model?

This is a snapshot of the kernels from our most recent tuned model: [screenshot]

okkwon commented 1 year ago

My WIP PR currently only supports matmuls with aligned K. I will take a look at peeling to support unaligned K cases. For example, 128x1024x30522 is not supported since 30522 is not a multiple of 16.
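As a tiny illustration of that constraint (padding, shown here, is just one alternative to the peeling mentioned above, not what the PR does):

dim = 30522
print(dim % 16)                   # 10, so this dimension is not 16-aligned
padded = ((dim + 15) // 16) * 16
print(padded)                     # 30528, the nearest aligned size a padding-based approach would target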

@dan-garvey do you happen to have dispatch perf numbers before and after with my PR? I wonder if there is any improved case.

jpienaar commented 1 year ago

What work is further planned here?

ThomasRaoux commented 1 year ago

> What work is further planned here?

This falls into the unaligned matmul support plan. This is being worked on as part of the matmul/conv sprint; however, I don't think we have a clear timeline yet.

jpienaar commented 1 year ago

Sounds good. I just saw the assignments change and wasn't sure. Thanks for the clarification.

allieculp commented 1 year ago

@qcolombet @nicolasvasilache @manishucsd @mattwalsh adding here in case this is helpful for the unaligned matmul support.