Numbers for illustration: pure transformers BERT QA on an i7 laptop CPU takes 10.87 seconds for a single run (inference), versus 398.41 requests per second with the setup described below.
🚀 Feature request
I managed to achieve 1.46 ms inference for BERT QA (bert-large-uncased-whole-word-masking-finetuned-squad) on a laptop with an Intel i7-10875H, CPU-only, using RedisAI during the RedisLabs RedisConf 2021 hackathon (finished 15th May). Blog write-up. While the current deployment heavily relies on Redis, as you can imagine (it was built during the RedisConf hackathon), I believe the learnings can be incorporated into the core transformers library to speed up inference for NLP tasks.
Motivation
For NLP tasks like BERT QA inference, it is not trivial to keep GPU utilisation above 50%: tokenisation is normally CPU-intensive, while inference is GPU-intensive. Caching summarisation responses is easy, but QA depends on user input and needs more fiddling.
Proposal
In summary, as a starting point: incorporate caching as a first-class citizen for all tokenisers, probably via decorators. Redis (and Redis Cluster) is a good candidate for the cache, with built-in sharding; a Redis node in cluster configuration occupies about 20 MB of RAM. I used the additional modules RedisGears (distributed compute) and RedisAI (to store tensors and run PyTorch inference).
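To make the idea concrete, here is a minimal sketch of what a cache decorator for a tokeniser could look like, using redis-py and a SHA-256 of the input text as the cache key. This is an illustration of the proposal, not the project's implementation; the key prefix, TTL and JSON serialisation are assumptions.

```python
import functools
import hashlib
import json

import redis
from transformers import AutoTokenizer

r = redis.Redis(host="localhost", port=6379)

def redis_cached(ttl_seconds: int = 3600):
    """Cache a tokenisation function's output in Redis, keyed by its input text."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(text: str):
            key = "tok_cache:" + hashlib.sha256(text.encode("utf-8")).hexdigest()
            cached = r.get(key)
            if cached is not None:
                return json.loads(cached)
            result = fn(text)
            r.set(key, json.dumps(result), ex=ttl_seconds)
            return result
        return wrapper
    return decorator

tokenizer = AutoTokenizer.from_pretrained(
    "bert-large-uncased-whole-word-masking-finetuned-squad")

@redis_cached(ttl_seconds=3600)
def tokenise(text: str) -> dict:
    # Return plain lists so the encoding is JSON-serialisable.
    enc = tokenizer(text)
    return {"input_ids": enc["input_ids"], "attention_mask": enc["attention_mask"]}
```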
Implementation details: this is part of the QAsearch API of the "medium complexity project" The Pattern. Full code. The main steps:
1. Convert and pre-load BERT models on each shard of the Redis Cluster (code).
2. Pre-tokenise all potential answers using RedisGears and distribute them across the shards of the Redis Cluster (code for the batch run and for the event-based RedisGears function).
3. Amend the calling API to direct the question query to the shard holding the most likely answers (code). The call uses graph-based ranking and zrangebyscore to find the highest-ranked sentences for the question, then takes the relevant hash tag from the sentence key (see the routing sketch after this list).
4. Tokenise the question (code; steps 4-7 are sketched after this list). Tokenisation happens on the shard and uses the RedisGears and RedisAI integration via
   import redisAI
5. Concatenate the user question with the pre-tokenised potential answers (code).
6. Run inference using RedisAI (code). The model runs in async mode without blocking the main Redis thread, so the shard can still serve users.
7. Select the answer using the max score and convert the tokens back to words (code).
8. Cache the answer using Redis: the next API hit with the same question returns the answer in nanoseconds (code). This function uses the 'keymiss' event (see the cache sketch after this list).
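A hypothetical sketch of step 3 with redis-py: the API layer looks up pre-ranked sentence keys in a sorted set and pulls the Redis Cluster hash tag out of the best key, so the question can be routed to the shard that holds those sentences. The key layout (sentence_rank:<term>, sentence:{tag}:<id>) is invented for illustration; the write-up mentions zrangebyscore, and zrevrangebyscore is used here only to read the top of the ranking directly.

```python
import re

import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def shard_tag_for(question_terms, top_n=10):
    """Return the cluster hash tag of the shard most likely to hold the answer."""
    # Collect the highest-ranked sentence keys for the question's terms
    # from per-term sorted sets (hypothetical key layout).
    candidates = []
    for term in question_terms:
        candidates += r.zrevrangebyscore(f"sentence_rank:{term}",
                                         "+inf", "-inf", start=0, num=top_n)
    # Extract the {hash tag} from the best-ranked sentence key; the API layer
    # then sends the question to the shard owning that tag.
    for key in candidates:
        match = re.search(r"\{(.+?)\}", key)
        if match:
            return match.group(1)
    return None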
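Steps 4-7 run inside the shard via RedisGears and the redisAI module (the import redisAI line above). As a rough client-side equivalent of the same data flow, here is a sketch using the redisai-py client against a single RedisAI instance; it is not the in-shard gear code. It assumes a TorchScript model already loaded under the key "bert-qa" (step 1) that takes (input_ids, attention_mask) and returns start/end score tensors; key names are illustrative.

```python
import numpy as np
import redisai
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "bert-large-uncased-whole-word-masking-finetuned-squad")
con = redisai.Client(host="localhost", port=6379)

def answer(question, context_token_ids):
    # Steps 4/5: tokenise the question and concatenate it with a
    # pre-tokenised potential answer (already stored as token ids).
    q_ids = tokenizer.encode(question, add_special_tokens=True)
    input_ids = np.array([q_ids + list(context_token_ids)], dtype=np.int64)
    attention_mask = np.ones_like(input_ids)

    # Step 6: run inference in RedisAI; "bert-qa" is assumed to be a
    # TorchScript BERT QA model pre-loaded into RedisAI (step 1).
    con.tensorset("in:input_ids", input_ids)
    con.tensorset("in:attention_mask", attention_mask)
    con.modelrun("bert-qa",
                 inputs=["in:input_ids", "in:attention_mask"],
                 outputs=["out:start_scores", "out:end_scores"])

    # Step 7: pick the highest-scoring span and turn tokens back into words.
    start = int(np.argmax(con.tensorget("out:start_scores")))
    end = int(np.argmax(con.tensorget("out:end_scores"))) + 1
    return tokenizer.decode(input_ids[0, start:end].tolist())
```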
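And a hedged sketch of step 8 as a RedisGears registration on the 'keymiss' event: when the API asks for an answer-cache key that does not exist yet, the gear computes the answer and writes it back, so the next request for the same question is a plain GET. The key prefix, the compute_answer() helper and the exact register() parameters are assumptions here and should be checked against the linked code.

```python
# RedisGears script: 'GB' and 'execute' are builtins injected by RedisGears
# when the script is loaded with RG.PYEXECUTE, so there are no imports here.

def fill_cache(record):
    # For a keymiss event the KeysReader record carries the missing key,
    # e.g. an answer-cache key derived from the question.
    key = record['key']
    answer = compute_answer(key)  # placeholder for the QA flow sketched above
    execute('SET', key, answer)

# Register on cache-miss events for the (hypothetical) answer-cache prefix.
GB('KeysReader') \
    .foreach(fill_cache) \
    .register(prefix='answer_cache:*', eventTypes=['keymiss'], mode='async_local')
```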
The code was working in May and uses transformers==4.4.2. Even with all the casting into NumPy arrays for RedisAI inference, it was running under 2 ms on a Clevo laptop with an Intel(R) Core(TM) i7-10875H CPU @ 2.30GHz, 64 GB RAM and an SSD. When an API call hits the cache, the response time is under 600 ns.