performance notes

The good:

the algorithm is designed in a way that it doesn't need to know in advance whether the 1-to-N mapping exists in the permutes.
_all_keys_used_once is no longer needed
no longer need a torch.cat before calling the old operator
no need to use _pin_and_move for the meta data (arguments), it will be handled inside the operator, it's more friendly to tracing.

The same bad:

it requires several HtoD communications (move tensor to device): a) [resolved] 3 tensors, which are permutes, input_lengths, and output_lengths. Those tensors needs to be on the device so that the cuda kernels has access to it. b) [resolved] 2 lists of (scalar_t*) pointers, input and output tensor lists. c) [resolved] Didn't find a good way to let the kernel knows the address of the lists of input/output tensors, because the lists are also need to be on the device.
tensor.contiguous for the backward function, it looks like the grad from the backward are somehow not contiguous.

benchmark

op-level results

INFO:root:size: 1024 x 57168; permute_multi_embedding: 1.5612200498580933 ms; permute_pooled_embs_auto_grad: 0.9015970826148987 ms
INFO:root:size: 1024 x 134096; permute_multi_embedding: 3.0794131755828857 ms; permute_pooled_embs_auto_grad: 2.114053726196289 ms
INFO:root:size: 1024 x 136752; permute_multi_embedding: 2.6919198036193848 ms; permute_pooled_embs_auto_grad: 2.159184455871582 ms
INFO:root:size: 1024 x 260944; permute_multi_embedding: 4.805435180664063 ms; permute_pooled_embs_auto_grad: 4.098493576049805 ms
INFO:root:size: 1024 x 538432; permute_multi_embedding: 9.359790802001953 ms; permute_pooled_embs_auto_grad: 8.504887580871582 ms
INFO:root:size: 1024 x 536592; permute_multi_embedding: 9.375926017761232 ms; permute_pooled_embs_auto_grad: 8.459586143493652 ms

fn-level results

_regroup_keyed_tenors               | B: 1024     | F: 1020     | device: cuda     | Runtime (P90):   2.8 ms | Memory (P90): 1011.0
KeyedTensor.regroup                 | B: 1024     | F: 1020     | device: cuda     | Runtime (P90):   5.0 ms | Memory (P90): 1517.0
KTRegroupAsDict                     | B: 1024     | F: 1020     | device: cuda     | Runtime (P90):   4.9 ms | Memory (P90): 1517.0
permute_multi_embs                  | B: 1024     | F: 1020     | device: cuda     | Runtime (P90):   2.2 ms | Memory (P90): 1011.0
_regroup_keyed_tenors_dup           | B: 1024     | F: 1020     | device: cuda     | Runtime (P90):   2.5 ms | Memory (P90): 1011.0
KeyedTensor.regroup_dup             | B: 1024     | F: 1020     | device: cuda     | Runtime (P90):   2.5 ms | Memory (P90): 1011.0
KTRegroupAsDict_dup                 | B: 1024     | F: 1020     | device: cuda     | Runtime (P90):   2.5 ms | Memory (P90): 1011.0
permute_multi_embs_dup              | B: 1024     | F: 1020     | device: cuda     | Runtime (P90):   3.2 ms | Memory (P90): 1011.0

traces

files

[hhy@50836.od /data/sandcastle/boxes/fbsource (ae677c240)]$ ll *.json
-rw-rw-r-- 1 hhy hhy 8062993 Jun 21 23:26 trace-KeyedTensor.regroup_dup.json
-rw-rw-r-- 1 hhy hhy  949610 Jun 21 23:26 trace-KeyedTensor.regroup.json
-rw-rw-r-- 1 hhy hhy 5140143 Jun 21 23:26 trace-KTRegroupAsDict_dup.json
-rw-rw-r-- 1 hhy hhy  350370 Jun 21 23:26 trace-KTRegroupAsDict.json
-rw-rw-r-- 1 hhy hhy  581033 Jun 21 23:26 trace-permute_multi_embs_dup.json
-rw-rw-r-- 1 hhy hhy  582607 Jun 21 23:26 trace-permute_multi_embs.json
-rw-rw-r-- 1 hhy hhy 8025337 Jun 21 23:26 trace-_regroup_keyed_tenors_dup.json
-rw-rw-r-- 1 hhy hhy 8041586 Jun 21 23:26 trace-_regroup_keyed_tenors.json

Differential Revision: D58906839

facebook-github-bot commented 3 months ago

This pull request was exported from Phabricator. Differential Revision: D58906839

netlify[bot] commented 3 months ago

Deploy Preview for pytorch-fbgemm-docs failed.

Name	Link
Latest commit	ea1ca6a1de3721543d2fae45f22d8cc341c68f59
Latest deploy log	https://app.netlify.com/sites/pytorch-fbgemm-docs/deploys/6676eeb209af39000818affb

facebook-github-bot commented 3 months ago

This pull request was exported from Phabricator. Differential Revision: D58906839

facebook-github-bot commented 2 months ago

This pull request has been merged in pytorch/FBGEMM@f8021eea2bb3da9baac31a45d16775368b876223.

pytorch / FBGEMM

benchmark of fbgemm op - permute_multi_embedding #2771

performance notes

benchmark

traces

Deploy Preview for pytorch-fbgemm-docs failed.

pytorch / FBGEMM

benchmark of fbgemm op - permute_multi_embedding #2771

performance notes

benchmark

traces

❌ Deploy Preview for pytorch-fbgemm-docs failed.

Deploy Preview for pytorch-fbgemm-docs failed.