harvardnlp / genbmm

CUDA kernels for generalized matrix-multiplication in PyTorch

About the performance #7

Closed speedcell4 closed 3 years ago

speedcell4 commented 3 years ago

Hi~

Could you provide some performance numbers for these functions? In my trials they are much slower (about 6 times) than a plain combination of PyTorch operations like the one below. I'm not sure whether I'm using them incorrectly, so a performance comparison would help. Thank you so much~

from torch import Tensor, jit

@jit.script
def logbmm(a: Tensor, b: Tensor) -> Tensor:
    # broadcast to (..., M, K, N), then log-sum-exp over the shared K dimension
    return (a[..., :, :, None] + b[..., None, :, :]).logsumexp(dim=-2)
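For reference, my comparison looks roughly like this (the shapes are just what I happened to test, and I'm assuming the genbmm.logbmm entry point from the README):

import torch
import genbmm

def time_fn(fn, a, b, iters: int = 100) -> float:
    # average milliseconds per call, measured with CUDA events
    fn(a, b)  # warm-up (also triggers compilation for the scripted version)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        fn(a, b)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

a = torch.randn(32, 128, 64, device="cuda")
b = torch.randn(32, 64, 128, device="cuda")
print("genbmm.logbmm:", time_fn(genbmm.logbmm, a, b), "ms")
print("jit broadcast:", time_fn(logbmm, a, b), "ms")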
srush commented 3 years ago

Yes, these functions are slower than the built-in CUDA functions, but they scale to much larger sizes because they do not materialize an intermediate tensor.

Ideally they would be faster in the small case as well, but that requires further optimization with TVM or CUTLASS, and I haven't had the time to figure that out.
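As a back-of-the-envelope illustration of the memory difference (the shape here is just an example, not a benchmark):

# memory footprint of the broadcast version for float32 inputs of shape (B, M, K) and (B, K, N)
B, M, K, N = 32, 1024, 1024, 1024
bytes_per_float = 4
intermediate = B * M * K * N * bytes_per_float  # the (B, M, K, N) tensor built before logsumexp
output = B * M * N * bytes_per_float            # the (B, M, N) result
print(f"intermediate: {intermediate / 2**30:.0f} GiB, output: {output / 2**30:.3f} GiB")
# the broadcast version needs K times the output memory; the CUDA kernel never
# materializes the (B, M, K, N) tensor, so it only needs the inputs and the output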

srush commented 3 years ago

Although, actually, I haven't tried it recently with the JIT. Does PyTorch fuse those operators now?

speedcell4 commented 3 years ago

Thank you for your quick response.

Although, actually, I haven't tried it recently with the JIT. Does PyTorch fuse those operators now?

I guess so. In my tests, @jit.script-decorated functions are faster.

srush commented 3 years ago

The test that would interest me is speed as you increase the inner dimension size. In my experiments, your code starts running out of memory once you reach medium-sized inner shapes.
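Concretely, a sweep along these lines is what I have in mind (a sketch; where the broadcast version runs out of memory depends on your GPU, and I'm again assuming the genbmm.logbmm entry point):

import torch
import genbmm

def broadcast_logbmm(a, b):
    return (a[..., :, :, None] + b[..., None, :, :]).logsumexp(dim=-2)

for k in (64, 128, 256, 512, 1024, 2048):
    a = torch.randn(32, 512, k, device="cuda")
    b = torch.randn(32, k, 512, device="cuda")
    out = genbmm.logbmm(a, b)          # only allocates the (32, 512, 512) output
    torch.cuda.reset_peak_memory_stats()
    try:
        out = broadcast_logbmm(a, b)   # materializes a (32, 512, k, 512) intermediate
        peak = torch.cuda.max_memory_allocated() / 2**30
        print(f"k={k}: broadcast peak memory {peak:.2f} GiB")
    except RuntimeError:               # typically CUDA out of memory once k gets large
        torch.cuda.empty_cache()
        print(f"k={k}: broadcast ran out of memory")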

speedcell4 commented 3 years ago

That's true, but why would the in_size become a problem? Generally logbmm is used to implement a CRF, and I think the number of target labels will not be very large (< 100)?
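For context, this is roughly where logbmm shows up in a linear-chain CRF forward pass (a simplified sketch with my own variable names); the inner dimension there is the number of labels:

import torch

def crf_log_partition(emissions, transitions):
    # emissions: (batch, seq_len, num_labels) log-potentials
    # transitions: (num_labels, num_labels) log-potentials for label -> label
    batch, seq_len, num_labels = emissions.shape
    alpha = emissions[:, 0]                                    # (batch, num_labels)
    for t in range(1, seq_len):
        # the logbmm step: sum over the previous label, i.e. the inner dimension
        scores = alpha[:, :, None] + transitions[None] + emissions[:, t, None, :]
        alpha = scores.logsumexp(dim=1)                        # (batch, num_labels)
    return alpha.logsumexp(dim=-1)                             # log partition per sequence

With fewer than 100 labels the broadcast intermediate stays small, which is why I expected the plain PyTorch version to be enough.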

srush commented 3 years ago

We use code like yours until the number of target labels gets large:

https://github.com/harvardnlp/pytorch-struct/blob/master/torch_struct/semirings/semirings.py#L117

This code is just an option for people who want to run bigger models.
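The idea is roughly a size-based dispatch, something like this sketch (not the actual pytorch-struct code; the threshold is illustrative):

import torch
try:
    import genbmm
    HAS_GENBMM = True
except ImportError:
    HAS_GENBMM = False

def log_matmul(a, b, size_threshold: int = 100):
    # use the memory-efficient kernel only when the inner dimension is large
    if HAS_GENBMM and a.is_cuda and a.shape[-1] > size_threshold:
        return genbmm.logbmm(a.contiguous(), b.contiguous())
    return (a[..., :, :, None] + b[..., None, :, :]).logsumexp(dim=-2)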