mustafa-tariqk / mindscape

Experience the truth of the trip
https://research.cs.queensu.ca/home/cisc498/index.html
MIT License

Sentiment Analysis Backend #34

Closed BasicallyOk closed 5 months ago

BasicallyOk commented 7 months ago

Is your feature request related to a problem? Please describe. Split off from #7. Provides sentiment analysis for chat logs.

Describe the solution you'd like Use the LLM to find a list of common experiences, not just sentiments. Essentially clustering, then labeling each cluster. Very difficult, hence impressive if done. To do this, we use a model to encode the submissions, then cluster them; the hard parts are deciding when a submission does not belong to any cluster, and whether moving a cluster centre should reassign some submissions. Afterwards, we feed the encoding at the centre (the mean of everything in the cluster) to the LLM and ask it to label the cluster for us. Another interesting problem is deciding what to encode: singular messages (more detail, terrible scalability) or the full chat (less detail, better scalability). I can't really tell how well either would work, so that may be worth a ticket of its own just for experimenting. This method relies almost exclusively on scale, since more submissions means better clustering; naturally, that will make testing very difficult.
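A minimal sketch of that pipeline, assuming sentence-transformers for the encoder and scikit-learn's k-means for the clustering (both are placeholder choices, not decided in this issue). Since a chat LLM takes text rather than a raw mean vector, the sketch labels each cluster by its most central submission instead:

```python
# Sketch: embed submissions, cluster them, then pick a representative
# text per cluster to hand to the LLM for labeling.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

def cluster_submissions(texts, n_clusters=5):
    embeddings = encoder.encode(texts, normalize_embeddings=True)
    km = KMeans(n_clusters=n_clusters, n_init="auto").fit(embeddings)
    representatives = {}
    for c in range(n_clusters):
        members = np.where(km.labels_ == c)[0]
        # The submission closest to the centre stands in for the cluster.
        dists = np.linalg.norm(embeddings[members] - km.cluster_centers_[c], axis=1)
        representatives[c] = texts[members[np.argmin(dists)]]
    # Each representative would be sent to the LLM with a
    # "name this experience" style prompt to produce the cluster label.
    return km.labels_, representatives
```

Using the nearest real submission rather than the centroid itself sidesteps the question of how to feed an embedding to the LLM at all.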

Describe alternatives you've considered Preset sentiments. Essentially, provide a fixed list of sentiments to analyze and potentially score. I imagine we can use the LLM's encoder and feed it the database (potentially via a LangChain vector DB, which can be linked to our SQL database, though that has a potential scalability issue). From there, you can store the percentage of submissions that convey each emotion in a Sentiment table in the database, recalculated dynamically after every new submission. Still fun to do, but kind of boring compared to the other solution.
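A hedged sketch of this alternative, again assuming sentence-transformers; the sentiment list and model name below are illustrative only:

```python
# Sketch: embed a fixed list of sentiments and score each submission
# by cosine similarity against them.
from sentence_transformers import SentenceTransformer

PRESET_SENTIMENTS = ["euphoria", "anxiety", "connectedness", "fear", "awe"]

encoder = SentenceTransformer("all-MiniLM-L6-v2")
sentiment_vecs = encoder.encode(PRESET_SENTIMENTS, normalize_embeddings=True)

def sentiment_scores(text: str) -> dict[str, float]:
    vec = encoder.encode([text], normalize_embeddings=True)[0]
    # With normalized vectors, cosine similarity is just a dot product.
    return dict(zip(PRESET_SENTIMENTS, (sentiment_vecs @ vec).tolist()))
```

The percentages in the Sentiment table would then just be the fraction of submissions whose score for a given sentiment clears some threshold.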

Additional context Something like what @IbDaGib made here, but showing statistics across everybody (see Ib's idea).

BasicallyOk commented 6 months ago

Multi-language support will be difficult, since the clustering depends on the embedding, which is different for every model. A good solution would be to assume that by the time more languages are required, there will be an existing English database, so experiences in other languages can only be clustered against the existing experience clusters. We can simply prompt the language model itself to align the experience with the existing cluster tags (possibly with some translation).
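As a rough illustration of that alignment step (the prompt wording and tag list are assumptions, not anything settled here):

```python
# Illustrative prompt for aligning a possibly non-English submission
# with existing English cluster tags.
def alignment_prompt(submission: str, tags: list[str]) -> str:
    return (
        "Here is a trip report, possibly not in English:\n"
        f"{submission}\n\n"
        "Which one of these experience tags fits it best? "
        f"Answer with exactly one tag: {', '.join(tags)}"
    )
```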

Since this is complicated, it's fairly low priority and can be left to future maintainers.

mustafa-tariqk commented 6 months ago

Your method seems difficult. I wouldn't want you spending too much time on this and detracting from other goals of the project. Remember, do what you know, do what's simple, do what's popular.

You know the type of leader I am, so I'm not gonna force you to go about things in any specific way, but I suggest you have a Hugging Face model ready as a backup first, to: a) test your model against, and b) have something in case the path you're going down doesn't yield satisfactory results.

Multi-emotion sentiment is a pretty hard problem that'll require extensive amounts of compute, so maybe we should also tone down our expectations. Consider using TextBlob to extract polarity (pos/neg) and subjectivity, or VADER (vaderSentiment). This is especially true given that our VM will have 1 vCPU and 1 GB of RAM.
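For reference, both baselines are a few lines and run comfortably on a box that size (the sample text is made up):

```python
# Lightweight sentiment baselines that fit a 1 vCPU / 1 GB VM.
from textblob import TextBlob
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

text = "The visuals were overwhelming but I felt deeply at peace."

blob = TextBlob(text)
print(blob.sentiment.polarity)      # -1.0 (negative) to 1.0 (positive)
print(blob.sentiment.subjectivity)  # 0.0 (objective) to 1.0 (subjective)

analyzer = SentimentIntensityAnalyzer()
print(analyzer.polarity_scores(text))  # {'neg': ..., 'neu': ..., 'pos': ..., 'compound': ...}
```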

To give a little background, sentiment/text categorization is what I did for my first QMIND project, an internship project, and my neural nets group project.

But yeah, no hard set rules, just my concerns. I trust you'll do what's best.

BasicallyOk commented 6 months ago

I have a decently solid plan of action that shouldn't take too long given what I have now. The plan is to use an online k-means algorithm; here are the steps every time a submission is made.
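Roughly, the per-submission update could look like this sketch, assuming FAISS for the nearest-centre search; the embedding dimension and new-cluster threshold below are illustrative, not final:

```python
# Hedged sketch of an online k-means step run on each new submission.
import numpy as np
import faiss

DIM = 384               # embedding dimension; depends on the encoder
NEW_CLUSTER_DIST = 1.0  # assumed squared-L2 threshold; tune empirically

centres: list[np.ndarray] = []  # float32 centroids
counts: list[int] = []          # submissions assigned to each centre

def assign(embedding: np.ndarray) -> int:
    """Assign one new embedding to a cluster, updating or spawning a centre."""
    embedding = embedding.astype(np.float32)
    if centres:
        index = faiss.IndexFlatL2(DIM)     # rebuilt per call; cache in practice
        index.add(np.stack(centres))
        dist, idx = index.search(embedding.reshape(1, -1), 1)
        if dist[0][0] < NEW_CLUSTER_DIST:  # IndexFlatL2 returns squared L2
            i = int(idx[0][0])
            counts[i] += 1
            # Online mean update: nudge the centre toward the new point.
            centres[i] += (embedding - centres[i]) / counts[i]
            return i
    # No centre is close enough: the submission starts its own cluster.
    centres.append(embedding.copy())
    counts.append(1)
    return len(centres) - 1
```

Relabelling a cluster (the LLM call) would then only need to fire once a centre has drifted, which fits doing it periodically rather than per submission.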

Sounds complicated, but I'm already 75% done. If testing doesn't look good, most of the code can be repurposed for the easy method (where we simply prompt GPT). Plus, if my research is correct, FAISS is pretty lightweight and can handle low RAM automatically. If performance does not look good, we can move to per-chat sentiment instead of per-message (a straightforward fix). Also, renaming clusters can happen periodically instead of after every submission to avoid lag.

BasicallyOk commented 6 months ago

Can't be done unless we have data to test and seed with. #68 will be done first.