mustafa-tariqk / mindscape

Experience the truth of the trip
https://research.cs.queensu.ca/home/cisc498/index.html
MIT License

Sentiment Analysis Backend #34

Closed BasicallyOk closed 5 months ago

BasicallyOk commented 7 months ago

Is your feature request related to a problem? Please describe. Split off from #7. Provides sentiment analysis for chat logs.

Describe the solution you'd like Use the LLM to find a list of common experiences, not just sentiments. Essentially clustering, then labeling each cluster. Very difficult, hence impressive if done. To do this, we use a model to encode the submissions, then cluster them; the hard parts are deciding when a submission does not belong to any cluster, and whether moving a cluster centre should reassign some submissions. Afterwards, we feed the encoding at the centre (the mean of everything in the cluster) to the LLM and ask it to label the cluster for us. Another interesting problem is deciding what to encode: singular messages (more detail, terrible scalability) or the full chat (less detail, better scalability). I can't really tell how well either would work, so that may be worth a ticket of its own just for experimenting. This method relies almost exclusively on scale, since more submissions means better clustering; naturally, that will make testing very difficult.
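A minimal sketch of that pipeline, assuming sentence-transformers for the encoder and scikit-learn's k-means for the clustering (both are placeholder choices, not decided in this issue). Since a chat LLM takes text rather than a raw mean vector, the sketch labels each cluster by its most central submission instead:

```python
# Sketch: embed submissions, cluster them, then pick a representative
# text per cluster to hand to the LLM for labeling.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

def cluster_submissions(texts, n_clusters=5):
    embeddings = encoder.encode(texts, normalize_embeddings=True)
    km = KMeans(n_clusters=n_clusters, n_init="auto").fit(embeddings)
    representatives = {}
    for c in range(n_clusters):
        members = np.where(km.labels_ == c)[0]
        # The submission closest to the centre stands in for the cluster.
        dists = np.linalg.norm(embeddings[members] - km.cluster_centers_[c], axis=1)
        representatives[c] = texts[members[np.argmin(dists)]]
    # Each representative would be sent to the LLM with a
    # "name this experience" style prompt to produce the cluster label.
    return km.labels_, representatives
```

Using the nearest real submission rather than the centroid itself sidesteps the question of how to feed an embedding to the LLM at all.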

Describe alternatives you've considered Preset sentiments. Essentially, provide a fixed list of sentiments to analyze and potentially score. I imagine we can use the LLM's encoder and feed it the database (potentially via a LangChain vector DB, which can be linked to our SQL database, though that has a potential scalability issue). From there, you can store the percentage of submissions that convey each emotion in a Sentiment table in the database, recalculated dynamically after every new submission. Still fun to do, but kind of boring compared to the other solution.
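A hedged sketch of this alternative, again assuming sentence-transformers; the sentiment list and model name below are illustrative only:

```python
# Sketch: embed a fixed list of sentiments and score each submission
# by cosine similarity against them.
from sentence_transformers import SentenceTransformer

PRESET_SENTIMENTS = ["euphoria", "anxiety", "connectedness", "fear", "awe"]

encoder = SentenceTransformer("all-MiniLM-L6-v2")
sentiment_vecs = encoder.encode(PRESET_SENTIMENTS, normalize_embeddings=True)

def sentiment_scores(text: str) -> dict[str, float]:
    vec = encoder.encode([text], normalize_embeddings=True)[0]
    # With normalized vectors, cosine similarity is just a dot product.
    return dict(zip(PRESET_SENTIMENTS, (sentiment_vecs @ vec).tolist()))
```

The percentages in the Sentiment table would then just be the fraction of submissions whose score for a given sentiment clears some threshold.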

Additional context Something like what @IbDaGib made here, but showing statistics across everybody (see Ib's idea).

BasicallyOk commented 6 months ago

Multi-language support will be difficult, since the clustering depends on the embedding, which is different for every model. A good solution would be to assume that by the time more languages are required, there will be an existing English database, so experiences in other languages can only be clustered against the existing experience clusters. We can simply prompt the language model itself to align the experience with the existing cluster tags (possibly with some translation).
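As a rough illustration of that alignment step (the prompt wording and tag list are assumptions, not anything settled here):

```python
# Illustrative prompt for aligning a possibly non-English submission
# with existing English cluster tags.
def alignment_prompt(submission: str, tags: list[str]) -> str:
    return (
        "Here is a trip report, possibly not in English:\n"
        f"{submission}\n\n"
        "Which one of these experience tags fits it best? "
        f"Answer with exactly one tag: {', '.join(tags)}"
    )
```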

Since this is complicated, it's fairly low priority and can be left to future maintainers.

mustafa-tariqk commented 6 months ago

Your method seems difficult. I wouldn't want you spending too much time on this and detracting from other goals of the project. Remember, do what you know, do what's simple, do what's popular.

You know the type of leader I am, so I'm not gonna force you to go about things in any specific way, but I suggest you have a Hugging Face model ready as a backup first, to: a) test your model against, and b) have something in case the path you're going down doesn't yield satisfactory results.

Multi-emotion sentiment is a pretty hard problem that'll require extensive amounts of compute, so maybe we should also tone down our expectations. Consider using TextBlob to extract polarity (pos/neg) and subjectivity, or VADER (vaderSentiment). This is especially true given that our VM will have 1 vCPU and 1 GB of RAM.
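For reference, both baselines are a few lines and run comfortably on a box that size (the sample text is made up):

```python
# Lightweight sentiment baselines that fit a 1 vCPU / 1 GB VM.
from textblob import TextBlob
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

text = "The visuals were overwhelming but I felt deeply at peace."

blob = TextBlob(text)
print(blob.sentiment.polarity)      # -1.0 (negative) to 1.0 (positive)
print(blob.sentiment.subjectivity)  # 0.0 (objective) to 1.0 (subjective)

analyzer = SentimentIntensityAnalyzer()
print(analyzer.polarity_scores(text))  # {'neg': ..., 'neu': ..., 'pos': ..., 'compound': ...}
```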

To give a little background, sentiment/text categorization is what I did for my first QMIND project, an internship project, and my neural nets group project.

But yeah, no hard set rules, just my concerns. I trust you'll do what's best.

BasicallyOk commented 6 months ago

I have a decently solid plan of action that shouldn't take too long given what I have now. The plan is to use an online k-means algorithm; here are the steps every time a submission is made.
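Roughly, the per-submission update could look like this sketch, assuming FAISS for the nearest-centre search; the embedding dimension and new-cluster threshold below are illustrative, not final:

```python
# Hedged sketch of an online k-means step run on each new submission.
import numpy as np
import faiss

DIM = 384               # embedding dimension; depends on the encoder
NEW_CLUSTER_DIST = 1.0  # assumed squared-L2 threshold; tune empirically

centres: list[np.ndarray] = []  # float32 centroids
counts: list[int] = []          # submissions assigned to each centre

def assign(embedding: np.ndarray) -> int:
    """Assign one new embedding to a cluster, updating or spawning a centre."""
    embedding = embedding.astype(np.float32)
    if centres:
        index = faiss.IndexFlatL2(DIM)     # rebuilt per call; cache in practice
        index.add(np.stack(centres))
        dist, idx = index.search(embedding.reshape(1, -1), 1)
        if dist[0][0] < NEW_CLUSTER_DIST:  # IndexFlatL2 returns squared L2
            i = int(idx[0][0])
            counts[i] += 1
            # Online mean update: nudge the centre toward the new point.
            centres[i] += (embedding - centres[i]) / counts[i]
            return i
    # No centre is close enough: the submission starts its own cluster.
    centres.append(embedding.copy())
    counts.append(1)
    return len(centres) - 1
```

Relabelling a cluster (the LLM call) would then only need to fire once a centre has drifted, which fits doing it periodically rather than per submission.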

Sounds complicated, but I'm already 75% done. If testing doesn't look good, most of the code can be repurposed for the easy method (where we simply prompt GPT). Plus, if my research is correct, FAISS is pretty lightweight and can handle low RAM automatically. If performance does not look good, we can move to per-chat sentiment instead of per-message (a straightforward fix). Also, renaming clusters can happen periodically instead of after every submission to avoid lag.

BasicallyOk commented 6 months ago

Can't be done unless we have data to test and seed with. #68 will be done first.