Fix performance issues in gpt2_bpe_tokenizer

pytorch / torcharrow

High performance model preprocessing library on PyTorch

https://pytorch.org/torcharrow/beta/index.html

BSD 3-Clause "New" or "Revised" License

Fix performance issues in gpt2_bpe_tokenizer #401

Closed laithsakka closed 2 years ago

laithsakka commented 2 years ago

Summary: Complex structures in C++ should be passed by const reference instead of by value to avoid copying data. A number of functions in gpt2_bpe_tokenizer were passing such structures by value.
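To illustrate the difference, here is a minimal sketch. It is not taken from the gpt2_bpe_tokenizer source: `CountedTokens`, `g_copy_count`, `total_length_by_value`, and `total_length_by_ref` are hypothetical names made up for this example. The struct counts its own copies so the cost of pass-by-value becomes observable.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical global counter, used only to make copies observable.
int g_copy_count = 0;

// Hypothetical stand-in for a "complex structure" such as a token list.
struct CountedTokens {
    std::vector<std::string> items;

    explicit CountedTokens(std::vector<std::string> v) : items(std::move(v)) {}

    // The copy constructor bumps the counter, so every deep copy is visible.
    CountedTokens(const CountedTokens& other) : items(other.items) {
        ++g_copy_count;
    }
};

// Pass by value: the caller's object is copied (vector and all strings)
// each time this function is called with an lvalue.
std::size_t total_length_by_value(CountedTokens tokens) {
    std::size_t n = 0;
    for (const std::string& s : tokens.items) {
        n += s.size();
    }
    return n;
}

// Pass by const reference: the function reads the caller's object directly,
// so no copy is made.
std::size_t total_length_by_ref(const CountedTokens& tokens) {
    std::size_t n = 0;
    for (const std::string& s : tokens.items) {
        n += s.size();
    }
    return n;
}
```

Both functions return the same result for the same input. Only the by-value version increments `g_copy_count` when called with a named object, which is the kind of overhead this change is meant to remove.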

Differential Revision: D37423480

facebook-github-bot commented 2 years ago

This pull request was exported from Phabricator. Differential Revision: D37423480

Nayef211 commented 2 years ago

Wow. Thanks!

AFAIK these functions are copied from TorchText.

Does that mean a lot of functions in TorchText itself (e.g. https://github.com/pytorch/text/blob/88b251f9cebae86feb5edf459b978bf211b65183/torchtext/csrc/gpt2_bpe_tokenizer.cpp#L120 ) can also be optimized?

Yup, thanks for catching this, @laithsakka. We can create a follow-up PR in torchtext with these changes.

cc @abhinavarora
