salesforce / QueryBlazer

Official source code repository for QueryBlazer: Efficient Query Autocompletion Framework
BSD 3-Clause "New" or "Revised" License
19 stars 7 forks source link

Typo correction #1

Open rob1142 opened 3 years ago

rob1142 commented 3 years ago

Hi guys.

In the paper you compare QueryBlazer to different models, and one of them is the language model that was described in the "Realtime query completion via deep language models" paper. This paper not only implements query autocompletion but also provides an algorithm to perform query autocompletion and query autocorrection at the same time. The paper uses character-level language models. To perform query autocorrection, scores at each generation step are readjusted using edit distance.

I'm wondering if it's possible to achieve something similar using QueryBlazer at the expense of QPS (queries per second) metric. Perhaps remove 2nd FST? But maybe there is a better way...

Thanks.

youngmo-kang commented 3 years ago

Spelling correction has not been implemented in QueryBlazer, but it can certainly be extended. During the encoding stage, instead of following a single path from the user input, you could do a beam-search in conjunction with the language model to keep track of top-k paths with some maximal edit-distance threshold, over which you discard the paths.

The current implementation, however, does encoder stage and language modeling stage in series, so for the method described above, you would need to tightly integrate them. That is, for each character you read in from the user input, you would enqueue the "correct" encoder path as well as possible "misspelled" paths to the next beam, where you would also enqueue the language model score so far for each path. Repeat for all beams. Then prune the beams to leave only top-k beams with the highest scores / edit distance below threshold. Repeat for next character.

This would be very similar to how automatic speech recognition (ASR) works with WFST decoding graph. This is the simplest method I can think of, but maybe there is a more efficient way. Hope this helps.