thijsfranck / courageous-comets

This is the workspace of the Courageous Comets team for the Python Discord Summer Code Jam 2024! ☄️
https://thijsfranck.github.io/courageous-comets/
MIT License

feat: add word counter #11

Closed: thijsfranck closed this issue 1 month ago

thijsfranck commented 1 month ago

Set up a function that, given a piece of text, calculates its uniqueness based on the occurrence of common words. The fewer common words, the greater the uniqueness.
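A minimal sketch of such a function, assuming uniqueness is defined as the fraction of tokens that are not in a list of common words. The `COMMON_WORDS` set, the regex tokenizer, and the name `uniqueness` are illustrative assumptions, not something the team has settled on:

```python
import re

# Hypothetical, deliberately tiny list of common words.
# A real list (for example a stop-word corpus) would be much larger.
COMMON_WORDS = frozenset(
    {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it", "that", "this"}
)


def uniqueness(text: str, common_words: frozenset[str] = COMMON_WORDS) -> float:
    """Return a score in [0, 1]; fewer common words means a higher score."""
    # Naive tokenizer: lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        # Assumption: text without any words is treated as not unique at all.
        return 0.0
    common = sum(1 for token in tokens if token in common_words)
    return 1.0 - common / len(tokens)
```

With this definition, a message made up entirely of common words scores 0.0 and a message with none scores 1.0. Stemming the tokens before the lookup would let one entry in the list cover several inflected forms.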

elfkuzco commented 1 month ago

I did a bit of digging and it seems like we could use NLTK to stem and tokenize words.

I also found Sentence Transformers, which we could use to create vector representations of each message. That is more compact than storing the raw text. I suppose we could also store the message ID in the hash that would contain this text and create the Discord link to the message on demand.
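As a rough sketch of the "link on demand" idea: Discord message links have the form `https://discord.com/channels/<guild>/<channel>/<message>`, so the guild and channel IDs would need to be stored alongside the message ID. The field names and the helper `message_link` below are hypothetical:

```python
def message_link(guild_id: int, channel_id: int, message_id: int) -> str:
    """Build a Discord jump link from the three stored IDs."""
    return f"https://discord.com/channels/{guild_id}/{channel_id}/{message_id}"


# Hypothetical shape of the stored record: IDs only, no raw text and no URL.
record = {"guild_id": 1, "channel_id": 2, "message_id": 3}

# The link is derived when it is needed instead of being stored.
link = message_link(**record)
```

Storing only the IDs keeps each record small, and the link can be rebuilt whenever it is needed.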

What do you think? @thijsfranck @isaa-ctaylor

thijsfranck commented 1 month ago

There is also a stemmer/tokenizer built into Redis:

https://redis.io/docs/latest/develop/interact/search-and-query/advanced-concepts/stemming/

I imagine this might scale better since it would require fewer database interactions, but I have never used it, so I have no idea how well it works.