Closed by thijsfranck 1 month ago
I did a bit of digging, and it seems we could use NLTK to stem and tokenize words.
I also found Sentence Transformers, which we could use to create vector representations of each message. That would be more compact. I suppose we could also store the message ID in the hash that would contain this text and build the Discord link to the message on demand.
What do you think? @thijsfranck @isaa-ctaylor
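To make the "link on demand" idea concrete, here is a minimal sketch. One thing to note: a Discord message link needs the guild ID and channel ID as well as the message ID, so all three would have to be stored (or be known from context). The function and field names below are hypothetical, not anything that exists in the project.

```python
def message_link(guild_id: int, channel_id: int, message_id: int) -> str:
    """Build the jump URL for a Discord message from its three IDs."""
    return f"https://discord.com/channels/{guild_id}/{channel_id}/{message_id}"


def message_hash_fields(guild_id: int, channel_id: int, message_id: int, text: str) -> dict:
    """Fields we might store in the hash (hypothetical layout).

    IDs are stored as strings because hash field values are stored as strings.
    The link itself is not stored; it is rebuilt from the IDs when needed.
    """
    return {
        "guild_id": str(guild_id),
        "channel_id": str(channel_id),
        "message_id": str(message_id),
        "text": text,
    }


fields = message_hash_fields(111, 222, 333, "example message")
link = message_link(
    int(fields["guild_id"]), int(fields["channel_id"]), int(fields["message_id"])
)
```

Storing only the IDs keeps the hash small, and the link can never go stale because it is derived each time.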
There is also a stemmer/tokenizer built into Redis:
https://redis.io/docs/latest/develop/interact/search-and-query/advanced-concepts/stemming/
I imagine this might scale better, since it would require fewer database interactions, but I have never used it and don't know how well it works.
Set up a function that, given a piece of text, calculates its uniqueness based on the occurrence of common words. The fewer common words, the greater the uniqueness.
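A rough sketch of what such a function could look like, using only the standard library. The assumptions here are mine: "uniqueness" is taken to be the fraction of words that are not in a common-word list, tokenization is a simple regex, and `COMMON_WORDS` is a tiny placeholder list (a real one might come from a stop-word corpus). Stemming is left out for brevity.

```python
import re

# Placeholder list for illustration only; swap in a real stop-word list.
COMMON_WORDS = frozenset({
    "a", "an", "and", "are", "i", "in", "is", "it",
    "of", "that", "the", "this", "to", "we", "you",
})

# Lowercase letters, digits, and apostrophes make up a word.
_WORD_RE = re.compile(r"[a-z0-9']+")


def uniqueness(text: str, common_words: frozenset = COMMON_WORDS) -> float:
    """Return a score in [0.0, 1.0]; fewer common words gives a higher score.

    The score is the share of tokens that are NOT in `common_words`.
    Text with no tokens at all scores 0.0 (an arbitrary choice).
    """
    tokens = _WORD_RE.findall(text.lower())
    if not tokens:
        return 0.0
    common = sum(1 for token in tokens if token in common_words)
    return 1.0 - common / len(tokens)


print(uniqueness("the redis"))  # one of two tokens is common
```

A ratio keeps the score independent of message length, so short and long messages can be compared directly. Whether empty messages should score 0.0 or be skipped entirely is open for discussion.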