stephbuon / hansard-shiny

Code for the "Hansard Viewer" web app (a prototype app for applying to future support).
https://shinyviz.smu.edu/shiny/public/hansard-shiny/
MIT License

Make KWIC Perform Better by Filtering the Data for the Keyword First #51

Open stephbuon opened 2 years ago

stephbuon commented 2 years ago

This preprocessing will be key, but how do I proceed? str_contains may be too slow. Could I use a binary search on a two-column data set, where the first column (the index) holds a token from a sentence and the second column holds the debate text? I'm not sure whether this data set would be prohibitively large. What about after I remove stop tokens?

The problem with the approach above is that it would only enable search by a single token.
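To make the idea concrete, here is a minimal sketch of the two-column index in Python rather than R. Everything in it is an assumption for illustration: the sample sentences, the whitespace tokenizer, and the `lookup` helper are hypothetical, and the second column stores a sentence id instead of the full text to keep the table small.

```python
from bisect import bisect_left, bisect_right

# Hypothetical sample sentences standing in for the debate text.
sentences = [
    "the corn laws were debated at length",
    "the house debated the poor laws",
    "ireland was discussed in the house",
]

# Two-column table: one (token, sentence id) row per distinct token in a
# sentence, sorted by token so that equal tokens sit next to each other.
rows = sorted(
    (token, sid)
    for sid, sentence in enumerate(sentences)
    for token in set(sentence.split())
)
tokens = [token for token, _ in rows]  # the sorted index column


def lookup(token):
    """Return the sentences containing `token` via binary search on the index."""
    lo = bisect_left(tokens, token)   # first row whose token is >= the query
    hi = bisect_right(tokens, token)  # first row whose token is > the query
    return [sentences[sid] for _, sid in rows[lo:hi]]


print(lookup("laws"))       # both sentences that mention "laws"
print(lookup("corn laws"))  # empty: the index only knows single tokens
```

The second call shows the limitation: a phrase never appears as a key in the index, so it finds nothing. In R, a keyed data.table would presumably play the role of the sorted `rows` list.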

What about data.table's %like% operator?
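For comparison, a pattern filter in the spirit of %like% (which, as far as I know, wraps a regular-expression match) handles phrases but has to scan every row. The sketch below, in Python rather than R, filters first and only then builds the KWIC windows. The `kwic` function, its `width` parameter, and the sample data are hypothetical, and it assumes whitespace-tokenized text without punctuation.

```python
import re

# Hypothetical sample sentences standing in for the debate text.
debates = [
    "The Corn Laws were debated at length",
    "Ireland was discussed in the House",
]


def kwic(sentences, keyword, width=2):
    """Filter sentences for `keyword` first, then build context windows."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    # Pre-filter: only sentences that match go on to the windowing step.
    hits = [s for s in sentences if pattern.search(s)]

    n = len(keyword.split())  # number of words in the query
    results = []
    for sentence in hits:
        words = sentence.split()
        for i in range(len(words) - n + 1):
            match = " ".join(words[i:i + n])
            if match.lower() == keyword.lower():
                left = " ".join(words[max(0, i - width):i])
                right = " ".join(words[i + n:i + n + width])
                results.append((left, match, right))
    return results


print(kwic(debates, "corn laws"))
```

The windowing loop only ever sees the filtered rows, which is the point of the issue title. The cost that remains is the linear scan in the pre-filter.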

stephbuon commented 2 years ago

I think instead of %like%, I should have a sorted index with the words mentioned in the embeddings model, and a column with the sentences. Then I should do a binary search on the index. This would rule out searches for phrases longer than one word, however.