rafaljanwojcik / Unsupervised-Sentiment-Analysis

How to extract sentiment from opinions without any labels
MIT License
137 stars, 42 forks

running for roman urdu #1

Open nayanhalder opened 4 years ago

nayanhalder commented 4 years ago

The sample given here is for Polish. I noticed that no pretrained embedding is used for Polish.

I ran the same code for Roman Urdu (Urdu written in the English alphabet).

All cluster values refer to negative sentiment.

Can you help me?

I have 10k Roman Urdu texts, consisting of positive and negative texts only, so I want to run unsupervised clustering for sentiment analysis.

If you give me your email, I can also send you the Roman Urdu dataset.

rafaljanwojcik commented 4 years ago

Hello, thank you for choosing my article and method for your problem! I didn't use any pretrained embeddings; I trained them only on my dataset. With embeddings trained only on your own dataset, you shouldn't end up with all embeddings belonging to a single cluster, so you probably used pretrained embeddings without fine-tuning them on your dataset. To be honest, I'm not sure what exactly your question is. Could you please provide additional information or ask about something more specific? Best regards, Rafał

In reply to nayanhalder's message of Sat, 11 Jul 2020, 18:42.

nayanhalder commented 4 years ago

Hi, thanks for your reply. I now understand that a pretrained Urdu embedding cannot be used. I need to use the embeddings trained by your code and run k-means clustering.

It runs properly, but all sentiments are negative in the cluster output, even though the data is 50% positive and 50% negative.

Please provide your email address. I will send my code; it is very small.

nayanhalder commented 4 years ago

https://github.com/nayanhalder/sentiment-analysis

I have uploaded the code to my GitHub; you can check it there. The problem is that all entries are negative in sentiment_dictionary, the cluster output.
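A quick way to confirm this kind of collapse is to look at how many items land in each cluster before interpreting the clusters as sentiment. The snippet below is a minimal sketch using only the standard library; `cluster_balance`, `is_collapsed`, and the 5% threshold are hypothetical names and values chosen for illustration, not part of the repository's code.

```python
from collections import Counter


def cluster_balance(labels):
    """Return the share of items assigned to each cluster label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def is_collapsed(labels, min_share=0.05):
    """Return True if fewer than two clusters are used, or if the smallest
    cluster holds less than `min_share` of the items.

    `min_share` is an arbitrary illustrative threshold (assumption).
    """
    shares = cluster_balance(labels)
    return len(shares) < 2 or min(shares.values()) < min_share


# Hypothetical cluster assignments: almost everything falls into cluster 0.
labels = [0] * 999 + [1]
print(cluster_balance(labels))
print(is_collapsed(labels))
```

If the check reports a collapse, the sentiment assigned to each word will be nearly uniform regardless of which cluster is called positive, so the issue lies in the embeddings or the clustering rather than in the labelling step.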

nayanhalder commented 4 years ago

After the initial all-negative result with Roman Urdu, I downloaded the English Twitter airline dataset: https://www.kaggle.com/crowdflower/twitter-airline-sentiment/data

It works fine; I mean both positive and negative sentiment appear in sentiment_dictionary.

Confusion matrix for the English Twitter airline dataset:

|   | 0    | 1    |
|---|------|------|
| 0 | 5643 | 3444 |
| 1 | 894  | 1404 |

Scores for the English Twitter airline dataset:

accuracy 0.618972, precision 0.289604, recall 0.610966, f1 0.392947

My Roman Urdu dataset: http://archive.ics.uci.edu/ml/datasets/Roman+Urdu+Data+Set

Confusion matrix for the Roman Urdu dataset above:

|   | 0    | 1 |
|---|------|---|
| 0 | 5219 | 0 |
| 1 | 5790 | 1 |

Scores for the Roman Urdu dataset above:

accuracy 0.474114, precision 1.000000, recall 0.000173, f1 0.000345
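The reported scores can be recomputed from the confusion matrices. The sketch below assumes that rows are true labels, columns are predicted labels, and class 1 is the positive class; that reading is an assumption, but it reproduces the posted numbers. `binary_metrics` is a hypothetical helper written for this illustration.

```python
def binary_metrics(tn, fp, fn, tp):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts.

    Assumes class 1 is the positive class (assumption, not stated in the thread).
    """
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Counts read from the two matrices (row = true label, column = predicted).
airline = binary_metrics(tn=5643, fp=3444, fn=894, tp=1404)
urdu = binary_metrics(tn=5219, fp=0, fn=5790, tp=1)

for name, scores in (("airline", airline), ("roman urdu", urdu)):
    print(name, {k: round(v, 6) for k, v in scores.items()})
```

For the Roman Urdu run, precision is 1.0 only because a single text was predicted as class 1 and it happened to be correct; recall near zero shows that almost every text was placed in class 0.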

rafaljanwojcik commented 4 years ago

Hello, sorry for responding after such a long time! I will do my best to resolve this issue soon. It might require some additional functionality in the models used, or some additional data cleaning; it is also possible that this very simple method of sentiment analysis just cannot capture the more complex dependencies present in your data. Nevertheless, I will investigate and come back to you soon with some conclusions ;)