Open stephbuon opened 2 years ago
I think instead of %like%, I should have a sorted index of the words mentioned in the embeddings model, and a column with the sentences. Then I could do a binary search on the index. This would rule out searches for phrases longer than one token, however.
This preprocessing will be key, but how do I proceed? str_contains may be too slow. Could I use a binary search on a two-column data set, where the index column holds a token from a sentence and the second column holds the debate text? I'm not sure whether this data set would be prohibitively large. What about after I remove stop tokens?
The problem with the approach above is that it would only enable search by token.
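For reference, here is a minimal sketch of the token index idea in R with data.table. The toy `debates` table, the `stop_tokens` list, and the `lookup_token` helper are all made up for illustration and are not from the actual project. Setting a key sorts the table by token, so the lookup uses data.table's binary search instead of scanning every sentence. The index stores sentence ids rather than the full text, which is one possible way to keep its size down.

```r
library(data.table)

# Hypothetical toy data; the real debate text would replace this.
debates <- data.table(
  sentence_id = 1:3,
  text = c("The house will now divide",
           "The honourable member spoke on the corn laws",
           "Corn prices rose again")
)

# Assumption: an illustrative stop list, not the project's actual one.
stop_tokens <- c("the", "on", "will", "now")

# One row per token per sentence: lowercase, then split on non-letters.
token_index <- debates[
  , .(token = unlist(strsplit(tolower(text), "[^a-z]+"))),
  by = sentence_id
]

# Drop empty strings and stop tokens, then remove duplicate rows.
token_index <- unique(token_index[nzchar(token) & !(token %in% stop_tokens)])

# Sort by token; keyed subsets use binary search.
setkey(token_index, token)

# Hypothetical helper: return the sentences that contain a single token.
lookup_token <- function(word) {
  ids <- token_index[.(tolower(word)), sentence_id, nomatch = NULL]
  debates[sentence_id %in% ids]
}

hits <- lookup_token("corn")
```

As noted, this only answers single-token queries. A phrase query would need a second step, for example intersecting the sentence ids of each token and then checking the phrase on that smaller set.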
What about data.table's %like% operator?
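For comparison, a small sketch of what %like% does, on a made-up `debates` table (the column names are assumptions). The operator wraps `grepl()`, so the right-hand side is a regular expression and every row is scanned. That handles multi-word phrases directly, but it does not use a sorted index.

```r
library(data.table)

# Hypothetical toy data standing in for the debate text.
debates <- data.table(
  sentence_id = 1:3,
  text = c("The house will now divide",
           "The honourable member spoke on the corn laws",
           "Corn prices rose again")
)

# Pattern match on the lowercased text; this is a linear scan over all rows.
phrase_hits <- debates[tolower(text) %like% "corn laws"]
```

Whether this is fast enough probably depends on the size of the debate text, so it may be worth timing against the real data before ruling it out.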