Yes, these functions are slower than the built-in CUDA functions, but they scale to much larger inputs since they don't create an intermediate tensor.
Ideally they would be faster in the small case as well, but they need further optimization with TVM or CUTLASS, and I haven't had time to figure that out.
Although, actually, I haven't tried it recently with the JIT. Does PyTorch fuse those operators now?
Thank you for your quick response.
> Although, actually, I haven't tried it recently with the JIT. Does PyTorch fuse those operators now?
I guess so. In my test, @jit.script-decorated functions are faster.
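For reference, a minimal sketch of such a scripted variant, assuming the usual broadcasting-based log-space batched matmul (the name logbmm_scripted is illustrative):

```python
import torch

@torch.jit.script
def logbmm_scripted(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # Log-space batched matmul: (batch, m, k) x (batch, k, n) -> (batch, m, n).
    # Scripting gives TorchScript a chance to fuse the broadcasted add
    # with the logsumexp reduction.
    return torch.logsumexp(a.unsqueeze(-1) + b.unsqueeze(-3), dim=-2)
```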
The test that would be interesting to me is speed as you increase the inner dimension size. In my experiments, your code starts running out of memory once you reach medium-size inner shapes.
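A quick probe of that blow-up might look like the following (the shapes, and the assumption of a CUDA device, are mine):

```python
import torch

# Peak CUDA memory of the broadcasting baseline as the inner
# dimension k grows; batch=32, m=n=64 are arbitrary choices.
for k in (64, 256, 1024, 4096):
    torch.cuda.reset_peak_memory_stats()
    a = torch.randn(32, 64, k, device="cuda")
    b = torch.randn(32, k, 64, device="cuda")
    out = torch.logsumexp(a.unsqueeze(-1) + b.unsqueeze(-3), dim=-2)
    print(k, torch.cuda.max_memory_allocated() // 2**20, "MiB")
```

The (batch, m, k, n) intermediate grows linearly in k, so memory rather than speed becomes the limit.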
That's true, but why would the in_size become a problem?
Generally logbmm is used to implement a CRF; I think the number of target labels will not be very large (<100)?
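For context, the place logbmm shows up in a CRF is the forward recursion over labels; a hypothetical sketch (not the pytorch-struct implementation):

```python
import torch

def crf_log_partition(emissions: torch.Tensor, transitions: torch.Tensor) -> torch.Tensor:
    # emissions: (batch, time, labels); transitions: (labels, labels).
    # Each step is a log-space matrix product over the label dimension,
    # i.e. the logbmm pattern with inner size = number of labels.
    alpha = emissions[:, 0]  # (batch, labels)
    for t in range(1, emissions.size(1)):
        alpha = torch.logsumexp(alpha.unsqueeze(-1) + transitions, dim=1) + emissions[:, t]
    return torch.logsumexp(alpha, dim=-1)  # log partition function per example
```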
We use your code until the target labels get large:
https://github.com/harvardnlp/pytorch-struct/blob/master/torch_struct/semirings/semirings.py#L117
This code is just an option for people who want to run bigger models.
Hi~
Could you provide some performance numbers for these functions? In my trials, they are much slower (about 6 times) than a plain combination of PyTorch functions like the one below. I'm not sure if I'm using them incorrectly; could you please share a performance comparison? Thank you so much~
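A plausible shape for such a plain-PyTorch combination (a sketch, not necessarily the poster's exact code):

```python
import torch

def logbmm_plain(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # Plain PyTorch log-space batched matmul via broadcasting:
    # a: (batch, m, k), b: (batch, k, n) -> (batch, m, n).
    # Fast for small k, but it materializes a (batch, m, k, n)
    # intermediate, so memory grows linearly with the inner size.
    return torch.logsumexp(a.unsqueeze(-1) + b.unsqueeze(-3), dim=-2)
```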