koaning / whatlies

Toolkit to help understand "what lies" in word embeddings. Also benchmarking!
https://koaning.github.io/whatlies/
Apache License 2.0

BytePairLanguage plots for Arabic tweets #262

Closed hashirabdulbasheer closed 3 years ago

hashirabdulbasheer commented 3 years ago

Hi

I checked out Rasa's whatlies on Arabic and English using BytePairLanguage. I used PCA and UMAP to see if similar messages cluster. But they don't; for both English and Arabic, the messages don't seem to cluster.

I used tweets from two categories:

| CATEGORY | ARABIC | ENGLISH |
| --- | --- | --- |
| TRANSFER | السلام عليكم يلزمنا حجز موعد للجوازات لاستلام اقامة بعد نقل خدمات عمالة مهنية ولكن لا اجد عند اجراء حجز الموعد مايخص استلام الاقامة | Peace be upon you. We need to book an appointment for passports to receive residency after transferring professional employment services, but I do not find when making appointment reservations about receiving residency |
| TRANSFER | نقل خدمات عبر ابشر ولكن من دون طباعة الاقامة، يعني لازم مراجعة الجوازات للطباعة، ايش الفايدة،اراجع الجوازات وانقل واطبع اذا كان في الاخير لازم مراجعة | Transferring services via Absher, but without printing the residency, I mean, it is necessary to review passports for printing, what is the payment, return passports, transfer and print if in the last one is necessary |
| TRANSFER | السلام عليكم هل بالإمكان تغيير مهنة العامل المنزلي إلى مهنه عمل أخرى لنفس الكفيل مع وجود العامل داخل المملكة. وشكرا | Peace be upon you. Is it possible to change the profession of the domestic worker to another profession for the same sponsor, with the presence of the worker inside the Kingdom? Thank you |
| TRAFFIC | السلام عليكم ورحمة الله وبركاتة يااخوان انا جتني مخالفة مرورية بالغلط وقدمت اعتراض وارفضوه وفي مكان بحياتي ماجيته ولاوصلته كيف الطريقة اسعفوني | Peace, mercy and blessings of God be upon you brothers. I accidentally got a traffic violation and lodged an objection and rejected it. |
| TRAFFIC | احاول بيع سيارتي عن طريق ابشر وعند اتمام بيع جميع البيانات تظهر هذه الجمله ( هذا الشخص لا يملك هذه السياره ) و عند مراجعه اداره مرور يفيدني الموظف ان عليها حظر نقل ملكيه من قبل الوكاله مع العلم اني انا الملك وتم مراجعه تايوتا الوكاله ولا يوجد حظر على المركبه | I am trying to sell my car through Absher, and upon completion of the sale of all the data, this sentence appears (This person does not own this car), and when reviewing the Traffic Department, the employee informs me that it must prohibit the transfer of ownership by the agency, knowing that I am the king and the Toyota agency has been reviewed and there is no ban on Vehicle |
| TRAFFIC | السلام عليكم تم تجديد مركبه عن طريق تطبيق الراجحي وعند نقل ملكيه المركبه في ابشر يرفض بسبب عدم السداد ؟؟ | Peace be upon you. His vehicle was renewed through the Al-Rajhi application and when the ownership of the vehicle was transferred in Absher, it was rejected due to non-payment ?? |

For the Arabic version, here are the graphs:

1) PCA: https://drive.google.com/file/d/1Dtdpaigzv6SqhLuL6GJT-HRK5QaZh_0z/view?usp=sharing
2) UMAP: https://drive.google.com/file/d/1WvawUrtNWNg7yzeAMCBmeC18guZGc21M/view?usp=sharing

For English:

1) PCA: https://drive.google.com/file/d/15kes9eKEnLognix_E5w9QZ_ixHjPab56/view?usp=sharing
2) UMAP: https://drive.google.com/file/d/12E5V_Q0B73nkUxdpdmIpYRb-bDcb9Hwd/view?usp=sharing

Any ideas?

thanks hashir

hashirabdulbasheer commented 3 years ago

Here is the code that I used

from whatlies.language import BytePairLanguage
from whatlies.transformers import Pca, Umap

tweets_ar = [
    "السلام عليكم يلزمنا حجز موعد للجوازات لاستلام اقامة بعد نقل خدمات عمالة مهنية ولكن لا اجد عند اجراء حجز الموعد مايخص استلام الاقامة",
    "نقل خدمات عبر ابشر ولكن من دون طباعة الاقامة، يعني لازم مراجعة الجوازات للطباعة، ايش الفايدة،اراجع الجوازات وانقل واطبع اذا كان في الاخير لازم مراجعة",
    "السلام عليكم هل بالإمكان تغيير مهنة العامل المنزلي إلى مهنه عمل أخرى لنفس الكفيل مع وجود العامل داخل المملكة. وشكرا",
    "السلام عليكم ورحمة الله وبركاتة يااخوان انا جتني مخالفة مرورية بالغلط وقدمت اعتراض وارفضوه وفي مكان بحياتي ماجيته ولاوصلته كيف الطريقة اسعفوني",
    "احاول بيع سيارتي عن طريق ابشر وعند اتمام بيع جميع البيانات تظهر هذه الجمله ( هذا الشخص لا يملك هذه السياره ) و عند مراجعه اداره مرور يفيدني الموظف ان عليها حظر نقل ملكيه من قبل الوكاله مع العلم اني انا الملك وتم مراجعه تايوتا الوكاله ولا يوجد حظر على المركبه",
    "السلام عليكم تم  تجديد مركبه عن طريق تطبيق الراجحي وعند نقل ملكيه المركبه في ابشر يرفض بسبب عدم السداد ؟؟"
]

tweets_en = [
    "Peace be upon you. We need to book an appointment for passports to receive residency after transferring professional employment services, but I do not find when making appointment reservations about receiving residency",
    "Transferring services via Absher, but without printing the residency, I mean, it is necessary to review passports for printing, what is the payment, return passports, transfer and print if in the last one is necessary",
    "Peace be upon you. Is it possible to change the profession of the domestic worker to another profession for the same sponsor, with the presence of the worker inside the Kingdom? Thank you",
    "Peace, mercy and blessings of God be upon you brothers. I accidentally got a traffic violation and lodged an objection and rejected it.",
    "I am trying to sell my car through Absher, and upon completion of the sale of all the data, this sentence appears (This person does not own this car), and when reviewing the Traffic Department, the employee informs me that it must prohibit the transfer of ownership by the agency, knowing that I am the king and the Toyota agency has been reviewed and there is no ban on Vehicle",
    "Peace be upon you. His vehicle was renewed through the Al-Rajhi application and when the ownership of the vehicle was transferred in Absher, it was rejected due to non-payment ??"
]

# lang_bp_ar = BytePairLanguage("ar", dim=300, vs=200_000)
lang_bp_en = BytePairLanguage("en", dim=300, vs=200_000)

embset = lang_bp_en[tweets_en]
p1 = (
    embset.transform(Pca(2))
    .plot_interactive(title="pca")
    .properties(width=500, height=500)
)
p2 = (
    embset.transform(Umap(2))
    .plot_interactive(title="umap")
    .properties(width=500, height=500)
)
p1.show()
p2.show()
koaning commented 3 years ago

Hi @hashirabdulbasheer, thanks for sharing your experiment 😄

There are a few ideas that come to mind.

  1. Your texts are relatively long. There are multiple sentences and each sentence seems to discuss a separate intent. My first gut feeling would be to split them up into separate sentences first and to check if they then cluster. What is happening internally is that the vectors for all the (sub)words are added together. Given a lot of subwords, I can certainly imagine a noisy vector at the end.
  2. You might want to try out a visualization that doesn't require a dimensionality reduction. The getting started guide now lists a similarity chart. If you first sort your (shorter) sentences and then plot, you might get another "view" into clusters that may/may not appear. There's a sketch of ideas 1 and 2 at the end of this comment.
  3. If you're interested in embeddings that are designed to handle longer sentences (which seems to be your use-case) then you might be interested in trying out BERT-style embeddings instead. I haven't worked with these myself, but the Arabic-BERT models listed here might be worth a try. They seem to be compatible with huggingface, so the code below might work;
from whatlies.language import HFTransformersLanguage

HFTransformersLanguage("asafaya/bert-base-arabic")

I'd be curious to see if the Arabic-BERT model makes a difference. But splitting up the text into shorter sentences might have the biggest impact.
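
As a minimal sketch of ideas 1 and 2 (assuming a naive full-stop split and that `plot_similarity` is the similarity chart the getting started guide refers to):

# naive split on full stops; a proper sentence splitter would do better
short_texts = [part.strip() for t in tweets_en for part in t.split(".") if part.strip()]

embset_short = lang_bp_en[short_texts]
embset_short.plot_similarity()  # pairwise similarity heatmap, no dimensionality reduction needed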

hashirabdulbasheer commented 3 years ago

Thanks a lot for your suggestions. I do not know if I can split them into smaller sentences because these are tweets, and the Arabic tweets don't seem to have full stops marking smaller sentences. Anyway, I will think about trying to make them smaller. Maybe I could just use x number of words as a sentence, as in the sketch below. I will try that and get back to you.
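
A minimal sketch of that chunking idea (the window of 8 words is an arbitrary choice):

def chunk_words(text, n=8):
    # split a tweet into pseudo-sentences of at most n words
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(0, len(words), n)]

chunks = [c for tweet in tweets_ar for c in chunk_words(tweet)]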

Here is the BERT result. The UMAP plot shows two distinct clusters while the PCA doesn't. I will try with more samples and get back to you.

PCA:

pca-bert-ar

UMAP:

umap-bert-ar
koaning commented 3 years ago

Could you try running it on just the words? That might also give an insight. Just to check: Arabic can be split using whitespace, right?

In general, for clustering sentences I've so far also found that sentence/context embeddings work very well with UMAP.
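
A minimal sketch of that word-level check (assuming the lang_bp_ar backend from your earlier snippet, uncommented, and that whitespace tokenisation is good enough here):

# unique whitespace-separated tokens across all tweets
words = sorted({w for tweet in tweets_ar for w in tweet.split()})

p = (
    lang_bp_ar[words]
    .transform(Umap(2))
    .plot_interactive(title="umap on single words")
    .properties(width=500, height=500)
)
p.show()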

hashirabdulbasheer commented 3 years ago

This is the similarity chart from Bert. It is a bit confusing. Check it out.

similarity-bert

koaning commented 3 years ago

Mhmm ... I agree ... the UMAP chart better describes the clusters.

Out of curiosity, what is your end goal for these embeddings? Rasa? If so, got labels?

hashirabdulbasheer commented 3 years ago

Just learning clustering and NLP. ;-) Also, I thought this might help someone who is looking into an Arabic version of Rasa or NLP.

I was trying to find clusters with Arabic in the Noble Quran dataset. I was trying to see if a cluster of verses in a chapter could give some insight into what that chapter was about.

I tried whatlies on it but the results didn't make any sense. So I thought I would try something simpler and thought of using tweets, because people comment on a particular post. So the post would form the category and people's comments would be natural language.

My idea was to see if this works on tweets. If it does, then I could try it on the Quran verses. But the Arabic there is a bit different since it's more Classical Arabic. Anyway, that was what I had in mind.

But when I checked the tweets I found that there is a lot going on in there. I mean, people are interacting a lot with the government through tweets. This could be a good project for the government too, if it could be automated. We could have Rasa reply to all the tweets automatically.

So that's the story :-)

koaning commented 3 years ago

Sounds like fairly interesting work!

Just to check ... how big is your tweet dataset? If you're up for it, I wouldn't mind helping you write a small guide that we can host on the documentation page. It doesn't have to be big, maybe a small demo would suffice.

I made this package to help out folks who want to explore "non-English" embeddings more. So an Arabic example with the tweets might be nice to host.

koaning commented 3 years ago

Also, if you've got a link to the dataset, I might be able to try out my quick bag of tricks on a colab notebook that I can share with you.

hashirabdulbasheer commented 3 years ago

That would be great. I forgot to mention that I was interested in improving and helping with whatlies and your work too.

I don't have it as a dataset. I was just copying it off Twitter. Could we build a dataset from Twitter? Each government department seems to have a Twitter page, people are posting their problems on it, and the government is replying back.

koaning commented 3 years ago

Might this suffice? https://www.kaggle.com/mksaad/arabic-sentiment-twitter-corpus

koaning commented 3 years ago

That dataset looks appropriate, and I will be able to quickly write some code for it if you think it is suitable. My only concern is that I prefer not to demonstrate any dataset that clearly contains very toxic language. I cannot judge this myself because I don't speak the language, but if your impression is that this should not be too much of a concern then I think this dataset might work.

koaning commented 3 years ago

One thing that's nice about that dataset though is that it also allows us to benchmark embeddings for a classification task.

hashirabdulbasheer commented 3 years ago

I am trying to understand: so the idea is to use this dataset to see if the tweets cluster automatically into positive and negative?

koaning commented 3 years ago

We can also attach a scikit-learn model behind the embeddings to quantify their predictive power.

hashirabdulbasheer commented 3 years ago

I think that's a great idea. Let's do it.

koaning commented 3 years ago

I'll come back with a first draft! :)

koaning commented 3 years ago

Here's the code I have to explore the embeddings.

import pandas as pd

from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

from whatlies.language import BytePairLanguage, HFTransformersLanguage
from whatlies.transformers import Umap

# three embedding backends: a small and a big byte-pair vocabulary, plus Arabic BERT
lang_bp1 = BytePairLanguage("ar", vs=10000, dim=300)
lang_bp2 = BytePairLanguage("ar", vs=200000, dim=300)
lang_hf = HFTransformersLanguage("asafaya/bert-base-arabic")

# load the negative and positive tweet files, shuffle, and reset the index
df = pd.concat([
    pd.read_csv("test_Arabic_tweets_negative_20190413.tsv", sep="\t", names=["label", "text"]),
    pd.read_csv("test_Arabic_tweets_positive_20190413.tsv", sep="\t", names=["label", "text"])
], axis=0).sample(frac=1).reset_index(drop=True)

# deduplicated sample of the first 1000 tweets, enough for a quick plot
small_text_list = list(set(df[:1000]['text']))

def mk_plot(lang, title=""):
    return (lang[small_text_list]
            .transform(Umap(2))
            .plot_interactive(annot=False)
            .properties(title=title, width=200, height=200))

# Altair's `|` operator places the three charts side by side
mk_plot(lang_bp1, "bp_small") | mk_plot(lang_bp2, "bp_big") | mk_plot(lang_hf, "huggingface")
koaning commented 3 years ago

The interactive charts suggest that there's certainly some form of clustering happening. I just cannot say if it's any good.

image

koaning commented 3 years ago

This code runs a comparison between the language backends. The BERT stuff is definitely the slowest part.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

X_train, X_test, y_train, y_test = train_test_split(list(df['text']), df['label'])

def run_experiment(embedder):
    # count-vectors serve as a baseline next to the three embedding backends
    embedders = {
        'cv': CountVectorizer(),
        'bp1': lang_bp1,
        'bp2': lang_bp2,
        'hf': lang_hf
    }

    # each backend feeds a logistic regression trained on the same split
    pipe = Pipeline([
        ("emb", embedders[embedder]),
        ("mod", LogisticRegression())
    ])

    y_pred = pipe.fit(X_train, y_train).predict(X_test)
    print(f"--- results for {embedders[embedder]} --- ")
    print(classification_report(y_test, y_pred))

for e in ['cv', 'bp1', 'bp2', 'hf']:
    run_experiment(e)
koaning commented 3 years ago

Mhmm ... crud the BERT model seems to throw an error. Will investigate.

koaning commented 3 years ago

Found the bug. There are a few long tweets that need to be removed from the dataframe.

df_ml = df.loc[lambda d: d['text'].str.len() < 200].head(500)
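
The train/test split is then redone on the filtered frame; a sketch of that step (the 125-row test set below matches the default 25% split of these 500 rows):

X_train, X_test, y_train, y_test = train_test_split(list(df_ml['text']), df_ml['label'])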

Here are the results on just this small part of the dataframe.

--- results for CountVectorizer() --- 
              precision    recall  f1-score   support

         neg       0.59      0.72      0.65        61
         pos       0.66      0.52      0.58        64

    accuracy                           0.62       125
   macro avg       0.62      0.62      0.61       125
weighted avg       0.62      0.62      0.61       125

--- results for BytePairLanguage(cache_dir=None, dim=None, lang=None, vs=None) --- 
              precision    recall  f1-score   support

         neg       0.60      0.59      0.60        61
         pos       0.62      0.62      0.62        64

    accuracy                           0.61       125
   macro avg       0.61      0.61      0.61       125
weighted avg       0.61      0.61      0.61       125

--- results for BytePairLanguage(cache_dir=None, dim=None, lang=None, vs=None) --- 
              precision    recall  f1-score   support

         neg       0.61      0.64      0.62        61
         pos       0.64      0.61      0.62        64

    accuracy                           0.62       125
   macro avg       0.62      0.62      0.62       125
weighted avg       0.62      0.62      0.62       125

--- results for HFTransformersLanguage(model_name_or_path=None) --- 
              precision    recall  f1-score   support

         neg       0.72      0.72      0.72        61
         pos       0.73      0.73      0.73        64

    accuracy                           0.73       125
   macro avg       0.73      0.73      0.73       125
weighted avg       0.73      0.73      0.73       125

Will now rerun on the big dataset.

hashirabdulbasheer commented 3 years ago

Great, got the interactive charts. I am studying them. I can see some clusters. For example, some sentences were prayers and I could see them together. I will study them more and get back to you.

hashirabdulbasheer commented 3 years ago

If you want to see the meaning in English, you could use the "Google Lens" mobile application to translate the Arabic popup from the graph. ;-)

hashirabdulbasheer commented 3 years ago

This is the popup that appears when we tap on a dot. Would it be possible to call the Google Translate API and add an English translation to this popup?

Screenshot 2020-12-03 at 8 20 37 PM
koaning commented 3 years ago

Theoretically, it's indeed possible. It'd require an API key from Google though, and technically you can already add this information without adding it as a feature in whatlies.

I'm currently running a grid-search on the predictions and the simple conclusion seems to be that the byte-pair embeddings barely contribute anything to the prediction. You might as well just use count-vectors. The BERT models do cause an improvement but they are horribly inefficient when it comes to compute time.

I'll let the grid-search run overnight and report back in the morning :)
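
For reference, here's a minimal sketch of that kind of grid-search (not the exact search I'm running; the parameter grid is a hypothetical example on the count-vector pipeline from earlier):

from sklearn.model_selection import GridSearchCV

pipe = Pipeline([
    ("emb", CountVectorizer()),
    ("mod", LogisticRegression(max_iter=1000)),
])

grid = GridSearchCV(
    pipe,
    param_grid={"mod__C": [0.01, 0.1, 1.0, 10.0]},  # hypothetical parameter grid
    cv=5,
    n_jobs=-1,
)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)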

hashirabdulbasheer commented 3 years ago

I do not know if this is useful. I checked out some of the groups/clusters that formed.

bp-small
| CLUSTER | ARABIC | ENGLISH | COMMENTS |
| --- | --- | --- | --- |
| 1 | 🌸 أصبحنا وأصبح الملك لله والحمد لله لا إله إلا الله وحده لا شريك له له الملك وله الحمد وهو على كل شيء قدير اللهم ا… | We became the king, and the king became God, praise be to God, there is no god but God alone, he has no partner, the king has the king, and praise be to him, and he is over all things. | Sentences that mention GOD |
| 1 | استغفر الله الذي لا إله إلا هو الحي القيوم و أتوب إليه . | I ask forgiveness for God, who has no god but He who is the living and living, and I repent to Him. | |
| 1 | 😪 صدقت والله.. انه قد أثر على قلوبنا وعقولنا بل وصحتنا واخلاقنا الا من رحم الله ووفقه لضبط نفسه وس… | I swear by God ... that it has affected our hearts and minds, and even our health and morals, except for those who have mercy and accordingly, to control themselves and ... | |
| 1 | خذ دقيقة من وقتك وقل سبحان الله والحمد لله ولا إله إلا الله والله أكبر ولآحول ولا قوة إلا بالله كن سببا في تذكير الكثيرين بذكر الله ♥ | Take a minute of your time and say: Glory be to God, praise be to God, and there is no god but God. | |
| 1 | الحمد لله .. كانت مباراه صعبه ولكن بتوفيق الله ثم دعمكم تجاوزنا المباراه 💙 | Praise be to God .. It was a difficult match, but with the grace of God and then your support, we passed the match 💙 | |
| 2 | أصبحت سلوكيات من ضيع ثقافته ليحاول ان يكون ندا لثقافتنا عن طريق الترويج لثقافات اخرى معروفه ومعلومة للكل ، استوقفت… | The behaviors of those who lost their culture in order to try to be equivalent to our culture by promoting other cultures known and known to all, have been stopped ... | Negative |
| 2 | يتحدثون عن اخلاق حسين ونجوم فرقهم نهاياتهم الرياضية أليمة ومخجلة نختلف ونتفق حول حسين ولكن المؤكد أن صحيفته الأخلاق… | They talk about the morals of Hussein and the stars of their teams. Their sporting ends are painful and shameful. We disagree and agree about Hussain, but it is certain that his newspaper is the ethics ... | |
| 2 | لا يمكن ان اتعاطف مع اي سعودي من رجال الاعمال قام بالاستعانه بإحانب وسرقوه بل اقول الله يزيدك يا من فضلت الاجنبي اب… | I cannot sympathize with any Saudi businessman who sought help from him affectionately and stole him. Rather, I say God will increase you, who preferred the foreigner, father ... | |
| 2 | كلنا محتاجين كل فترة نتأسف لنفسنا بجد عن سوء إختياراتنا وعن العشم الى مكنش فى محله, وعن نظرتنا فى قرايب صحاب حبايب… | We are all in need every period of time we seriously regret ourselves for the bad choices we made, for the hope that it was not in place, and for our view of the relatives of my beloved friends ... | |
| 3 | فيني نوم بس مابي انام 🙂 | I do not sleep, but I do not sleep 🙂 | Funny quotes |
| 3 | حلو جوج ابله مجتهده 😂 | Sweet goog idiot industrious 😂 | |
| 3 | شو صايرة 🤔 كل عمري هيك بس ما حدا عم ينتبه😂..صباحو ابو جوان 🙋🌹 | What happened to my entire life, but no one would pay attention ... Morning Abu Jawan | |
| 3 | شوفيه وانتي مخروسة 🐸 | Chauvet and you are mocking 🐸 | |
| 4 | اعلق احلامي على حافة #الغيم ان طحنا #غيث وان بقينا #سحابه 🌸 #مملكة_شوق_الريم #نبض_الامل_للدعم #ربيع_القلوب_للدعم… | I hang my dreams on the edge of the clouds, that we milled # Ghaith and that we remain # clouds 🌸 # Kingdom_Souq_Raim # Pulse_Hope_for support # Rabie_hearts_for support .. | Negative, I think |
| 4 | #مسلسل_ودك_يتكرر_برمضان رمضان صلاة وعباده 🌚 | The series "Wodak" is repeated in Ramadan, Ramadan prayers and worshipers 🌚 | |
| 4 | #الاتحاد_النصر #زلزل_الملعب_نصرنا_بيلعب #احتزم_يانصر_معك_رجال #العالمي آمين يارب العالمين والله يعطيهم العافيه 🌹💛… | # The Union_ Victory # The stadium shook our victory and played # Bezam_anasar_with you_men # The global Amen, Lord of the worlds, and God gives them wellness 🌹💛… | |
| 4 | لا_أحب_أن_أكون_آنثى_بالشكل_الذي_يرضي_الناس 👌 أنا_آنثى_بالطريقة_التي_تروق_لي_بلا_تصنع .. وهذا_يكفيني 😌 | I do not like to be a female in the way that pleases people I am female in the way that I like you without making ... this is enough for me 😌 | |
| 4 | وسط دارفور زالنجي ❤ #لم_تسقط_بعد | Central Darfur, Zalingei, has not yet fallen | |
| 5 | ال كيل 😭 | The agent 😭 | Small exclamations, I think |
| 5 | ♔➥ لۧا تقتڕب حډ #ٵٳلۧاﺣتڕۧاق! ♔➥ ﯛلۧا تبتعډ حډ #ٵٳلۧافتڕۧاق! ↺ ڪن لۧا بعيډ مڼ ٵٳحډ؟ ↺ ﯛلۧا قڕيب مڼ ٵٳح… | Don't get close to #! ♔➥! ۧ ٵٳ? ↺ ... | |
| 5 | الديود 😊 شفتوها 🙈🙊 | Diode see her 🙈🙊 | |
hashirabdulbasheer commented 3 years ago

Is it possible to print the selection from the chart? If I select a circle in the chart, it shows the tooltip containing the values now. Is it possible to print this out in the console? It would be easy to check translations that way since we could copy the text from the console and translate it.

hashirabdulbasheer commented 3 years ago

The funny thing about categorisation is that we can see any group and call it anything we want. For example, if we see apples and oranges then we call them 'fruits', but if apples happen to come with cabbages then we could call it 'food', etc. If we go about it randomly then it's never-ending.

Another thing that I noticed is that I am getting a different shape each time I run it. It is changing.

Screenshot 2020-12-04 at 10 31 24 AM Screenshot 2020-12-04 at 10 31 59 AM
koaning commented 3 years ago

@hashirabdulbasheer yes, it is possible to make a selection visible in the notebook, but you'll need some extra tools to make this easy. I've made a video on this topic here. The bulk labelling notebook demo can be found here. You'll need to change the notebook to make it fit your use-case but it should be relatively straightforward.

koaning commented 3 years ago

The reason why you see a different shape each time is that the Umap dimensionality reduction method is stochastic.

If you pass random_state=42 to the creation call then it should be the same.

from whatlies.transformers import Umap 

embset.transform(Umap(2, random_state=42))
hashirabdulbasheer commented 3 years ago

Great, thanks. Will try it out.

Regarding the cluster studies, will it help if I check whether a cluster is either positive or negative and not both? From the above study of 5 clusters, I think we can say each is either positive or negative.

What would be the ideal case? Is it two clusters, one having all positives and one having all negatives? I think that's never going to happen. But there is a chance that each cluster that gets formed is either a positive or a negative one and not a mix of both.

koaning commented 3 years ago

The clusters that pop up are caused by the embeddings, not by the training task. The clusters appear the way they do only because a pre-trained model determined, on a specific dataset (typically Wikipedia), that this arrangement makes sense.

This also means that we need to be careful when we interpret embeddings. In embedded space, "hot" and "cold" are probably close to each other because they are used in similar ways in a sentence (to describe the temperature of something). This means that word vectors that are "similar" in their vector space can have words attached with opposite meanings.

You might enjoy watching this video that explains this phenomenon in more detail.

hashirabdulbasheer commented 3 years ago

That's interesting. I will check it out, thanks a lot. Also, do you know if there is an Arabic Rasa chatbot demo available anywhere?

hashirabdulbasheer commented 3 years ago

I checked out human-learn. It is totally awesome. I liked it a lot.

I did the following studies with the Arabic tweets dataset and BytePairLanguage for Arabic.

lang_bp_ar = BytePairLanguage("ar", dim=300, vs=200_000)

For this dataset, the clusters were not well defined with BytePairLanguage. I mean, it was all together and it was hard to separate the clusters out. But when I checked out some clusters, they made sense. For others, I couldn't understand the pattern.

Here are some results

Cluster 1: Good Mornings

1a 1b 1c

Cluster 2: Prayers

2a 2b 2c

Cluster 3: No idea

Can't figure out this one.

3b 3b
hashirabdulbasheer commented 3 years ago

In your video, you mention that embeddings being in a cluster doesn't necessarily imply that they have the same meaning. It depends on how the words appeared in the training data.

This was very obvious when I tried the Quran dataset. In the Quran dataset, the Arabic is Classical Arabic; the words are different from Wikipedia words. The clusters that I found were of verses that had different meanings. I tried with both CountVectorLanguage and BytePairLanguage. Neither worked in that case.

However, in modern conversations like tweets and chats, it works because people use more modern Arabic, just like Wikipedia.

Is there anything else that we can try? Do you know if there is any technique that can work with the Quran Dataset?

koaning commented 3 years ago

I mean ... you could train your own embeddings with gensim, I guess? A rough sketch below.
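
A minimal sketch of that idea (assuming gensim 4.x and a hypothetical list of verse strings called verses; parameters are arbitrary starting points):

from gensim.models import Word2Vec

# train word2vec on the corpus itself so the vectors reflect Classical Arabic usage
tokenized = [verse.split() for verse in verses]
model = Word2Vec(sentences=tokenized, vector_size=100, window=5, min_count=2, epochs=50)

# sanity check: inspect the neighbours of a frequent word
print(model.wv.most_similar("الله", topn=10))

# save the keyed vectors so they can be explored afterwards, e.g. via
# whatlies' GensimLanguage backend (if I recall the API correctly)
model.wv.save("quran_w2v.kv")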

hashirabdulbasheer commented 3 years ago

Because in the Quran Arabic dataset, when using BERT, three sentences like

1) remain in it forever (ماكثين فيه أبدا)
2) took another path (فأتبع سببا)
3) and then took another path (ثم أتبع سببا)

are together. 2 and 3 are touching each other; they are similar words. 1 is slightly apart, but how could 1 be related to 2 and 3? No idea how BERT did that.

Screenshot 2020-12-06 at 10 58 06 PM Screenshot 2020-12-06 at 10 57 51 PM
koaning commented 3 years ago

One thing to remember is that we're also doing a huge dimensionality reduction here. That means that the observation here may also be due to chance.

hashirabdulbasheer commented 3 years ago

Thanks, that might explain it. I went back to words to understand it better.

There clearly is meaning captured for the words. Check it out.

Here are some results:

lang_bp_ar = BytePairLanguage("ar", dim=300, vs=200_000)

1) BytePairLanguage Arabic words

Arabic:

Screenshot 2020-12-07 at 7 12 39 AM

translation:

english

Arabic chart with legends translated:

english_chart

English Words

Screenshot 2020-12-07 at 7 36 16 AM

Arabic Words

Screenshot 2020-12-07 at 7 36 36 AM
hashirabdulbasheer commented 3 years ago

It's difficult to identify the clusters in the UMAP chart; however, we can see similar words together.

The other chart shows the relationships much more clearly. For example, 'male student', 'female student', 'university', and 'school' are related. 'Cat' and 'kitten' are clearly related. 'Female student' and 'woman' show some similarity too.

koaning commented 3 years ago

We need to be careful with the word "meaning" when we're talking about word embeddings.

A word embedding doesn't really capture meaning at all. It instead captures "how the word is being used in a sentence". This is why words like "hot" and "cold" appear so similar even though they are opposites of each other. That said, these embeddings can still be useful for prediction tasks such as predicting intents in a virtual assistant.

Have you had a look at my code that predicts sentiment?

hashirabdulbasheer commented 3 years ago

Thanks a lot. I haven't seen the sentiment prediction code, where is it?

koaning commented 3 years ago

It's here.

hashirabdulbasheer commented 3 years ago

Checked it out. What is the conclusion? BERT is better? CountVector and BytePair are similar?

hashirabdulbasheer commented 3 years ago

I don't have the expertise to interpret the results. But it could also be because tweets contain many English hashtags etc. Tweets are tricky. Are those numbers normal or low? They look low to me.

hashirabdulbasheer commented 3 years ago

There was a competition on sentiment analysis of Arabic tweets at a university here: https://wti.kaust.edu.sa/dexam/pages/events/2020/11/25/wti-calendar/introduction-of-the-competition

hashirabdulbasheer commented 3 years ago

They mention this paper: https://arxiv.org/pdf/2011.00578.pdf

ASAD: A TWITTER-BASED BENCHMARK ARABIC SENTIMENT ANALYSIS DATASET

koaning commented 3 years ago

Sorry for the radio silence, I went on holiday last week. But I'm back.

I'll get started on a page for the docs, would you be willing to review it?

hashirabdulbasheer commented 3 years ago

Hey, welcome back. Hope you had a great holiday. I was thinking about using the Twitter dataset mentioned in the paper since they had cleaned it up, but I haven't had a chance yet.

Sure, I will try to check it out. Let me know what needs to be done.