stanfordnlp / GloVe

Software in C and data files for the popular GloVe model for distributed word representations, a.k.a. word vectors or embeddings
Apache License 2.0

Poor Results on Large Corpus #210

Open KarahanS opened 1 year ago

KarahanS commented 1 year ago

Greetings,

I'm trying to train my own GloVe word embeddings for the Turkish language using a corpus of ~10 GB. I have enough disk capacity on my computer and 16 GB of memory. I created vocab.txt successfully and can confirm there is no problem with it. I believe I also generated the co-occurrence matrix successfully (it is ~35 GB), but the subsequent shuffling took too long and was suddenly terminated. Unlike the co-occurrence generation step, shuffling seems non-responsive; it doesn't really print anything to the console. So I decided to train my model directly on the unshuffled co-occurrence matrix, for 20 iterations. The cost per iteration looked roughly like this (the numbers are not precise, but the point is that the cost increased for the first 3 iterations and then gradually decreased to ~0.11):

itr=1   cost = ~2.5
itr=2   cost = ~10.5
itr=3   cost = ~14.5
itr=4   cost = ~12.5
itr=5   cost = ~10.5
        ...
itr=19  cost = ~0.14
itr=20  cost = ~0.11

Then I loaded the word vectors using the load_word2vec_format function provided by gensim and tested them on several analogy tasks; unfortunately, the results are terrible. So, here are my questions:

  1. How vital is shuffling? Can such terrible results be explained by the fact that I skipped the shuffling step?
  2. Or is a cost of ~0.11 not low enough to produce reasonable results? Should I have iterated longer?
  3. When I run the shuffling operation, I get an output like this:
    Using random seed 1680251209
    SHUFFLING COOCCURRENCES
    array size: 1020054732
    Shuffling by chunks: processed 0 lines.

    I tried printing some local variables and saw that they are increasing, so the program is actually running, but it feels like it will run forever (if it doesn't terminate with some error first). Is it really supposed to take that long (even longer than co-occurrence matrix generation)? I suspect my memory is not enough. If that's the case, is there any solution other than simply switching to different hardware or a remote server? (It would also be strange if my memory were enough for matrix generation but not for shuffling.)

Note: I'm training on Windows using Ubuntu under WSL, FYI.

KarahanS commented 1 year ago

Update: I waited a while for shuffling to finish, and it terminated with the following error:

$ build/shuffle -memory 16.0 -verbose 2 < out/cooccur.bin > out/cooccurrence.shuf.bin
Using random seed 1680251209
SHUFFLING COOCCURRENCES
array size: 1020054732
Shuffling by chunks: processed 1020054732 lines.
./demo.sh: line 45:   355 Killed                  $BUILDDIR/shuffle -memory $MEMORY -verbose $VERBOSE < $COOCCURRENCE_FILE > $COOCCURRENCE_SHUF_FILE

AngledLuffa commented 1 year ago

Getting Killed like that is almost definitely a memory issue.

I've never tried GloVe without shuffling, so I can't advise on whether or not this kind of curve is how it normally goes with unshuffled text. You could always take a smaller version of your dataset and compare shuffled vs. non-shuffled if you're curious.

However, the best place to start is probably with running shuffle, where you should be able to set the array size or memory limit lower. The default expectation is only 2G, though, so it's a little surprising that it isn't working when your system has 16G. Perhaps something in the way you are running it is giving it significantly less memory.
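
For example, rerunning the shuffle step with a smaller in-memory buffer would look roughly like this (the 4.0 below is just an illustrative value, not a specific recommendation):

$ build/shuffle -memory 4.0 -verbose 2 < out/cooccur.bin > out/cooccurrence.shuf.bin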

KarahanS commented 1 year ago

I have solved the memory error by decreasing the value of the memory parameter in the script. Now, I have trained my model with the following parameters:

VOCAB_MIN_COUNT=10
VECTOR_SIZE=300
MAX_ITER=100
WINDOW_SIZE=5
X_MAX=100

I expect my model to give better results on syntactic/semantic analysis tasks than Word2Vec trained with 5 epochs and 300-dimensional embeddings. Unfortunately, the GloVe results are worse than the Word2Vec results. Is there something wrong with my parameters? My corpus is ~10.5 GB: 1,384,961,747 tokens in total and 1,573,013 unique words (excluding words occurring less often than the minimum count).
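
For context, those are the demo.sh variables; as I understand demo.sh, they end up in the commands roughly as follows (the corpus/output paths, memory, and thread values here are just illustrative, not exactly what I ran):

$ build/vocab_count -min-count 10 -verbose 2 < corpus.txt > vocab.txt
$ build/cooccur -memory 8.0 -vocab-file vocab.txt -verbose 2 -window-size 5 < corpus.txt > cooccurrence.bin
$ build/shuffle -memory 8.0 -verbose 2 < cooccurrence.bin > cooccurrence.shuf.bin
$ build/glove -save-file vectors -threads 8 -input-file cooccurrence.shuf.bin -x-max 100 -iter 100 -vector-size 300 -binary 2 -vocab-file vocab.txt -verbose 2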

Some of the possible problems that come to my mind:

  • Is there a problem with the corpus? I compared the resulting vocab.txt file from GloVe with the one I had from Word2Vec, and they are almost identical, so there doesn't seem to be any problem extracting the vocabulary. I therefore don't think there is a technical problem with the corpus; if there were, we would see it in vocab.txt, right?
  • Hardware-related issues? I trained models both on my local machine (i7 11390H) and on a remote machine (Intel® Xeon® Gold 6342), and the results are similar.
  • Overfitting? I trained GloVe with 20 iterations as well and again got awful results. (That's why I switched to 100 iterations, which is also the number suggested in the paper for 300 dimensions.)

I'm stuck at this point and can't really see why the GloVe word vectors are performing so poorly. I'm open to suggestions for new ideas, parameter changes, etc., @AngledLuffa.

Note: Sorry for changing the title. My previous problem with shuffling is solved, thank you for that.

AngledLuffa commented 1 year ago

I don't know how to figure it out based on this information, but if you send me a sample of the text you are using to train, I can take a look and see if there is anything obvious.

KarahanS commented 1 year ago

Let me share some glimpses of the content. Here is the output of head -1 corpus.txt:

lovecraft'ın türkçe'deki ilk kitabı

Here is the output of head -5 corpus.txt:

lovecraft'ın türkçe'deki ilk kitabı
yazarın ikinci kitabı
lovecraft türkçe'de
cthulhu'nun çağrısı ve ardından deliliğin dağlarında adlı eserleri türkçe'ye çevrilen howard phillips lovecraft korku ve gerilim ustası bir yazar
beş mayıs howard phillips lovecraft'ın yaşamı boyunca yazdığı elli bir öyküden sekizini bir araya getiren cthulhu'nun çağrısı gotik edebiyatın klasik örneklerinden biri sayılıyor

Each example is separated by \n. Examples do not have to be single sentences; they can be a collection of several sentences. For instance, there is an entry like this:

Beşiktaş Teknik Direktörü Bernd Schuster , kulübeye çektiği İbrahim Üzülmez dışında son haftalardaki tertibiyle sahadaydı . 4'lü defansın önünde Mehmet Aurelio ile Ernst , onlarında önünde Guti , üçlü hücumcu olarak da sağda Tabata , solda Holosko ve ortada Nobre görev yaptı . Oyun anlayışında bir değişiklik düşünülmediğinden alışılagelmiş şablon içerisinde bir futbol vardı . Defans bloku kalenin uzağında kademeleniyor , kazanılan toplar Ernst ve Guti tarafından forvet elemanlarına servis ediliyordu . Dün gece gene Guti'nin ne kadar önemli bir oyuncu olduğu izlendi . Ayağından çıkan topların çoğunluğu arkadaşlarını pozisyona sokuyordu . 79'da Nobre'nin kafasına adeta topu kondurması ustalığının getirisiydi . Sarı-Kırmızılı takım topa daha çok sahip olmasına rağmen ataklarda çoğalamamanın sıkıntısını yaşadı . 2-3 önemli pozisyondan da istifade etmesini bilemediler .

Technically, this is one example composed of several sentences. We used the same corpus for Word2Vec as well, so such examples shouldn't be a problem (unless there is some more specific technical issue). As you can see, all tokens are separated by spaces.

If you can and want to spend more time and effort on it, here is the link to the corpus we are using: https://drive.google.com/file/d/1BhHG8-btnTcfndU5fvsvTG3mD9WGf6L0/view?usp=sharing

Additionally, here is our loss curve: [loss curve image]

@AngledLuffa

AngledLuffa commented 1 year ago

You would ideally have one sentence per line, I would say. It should still work, though, as we have trained English vectors with multiple sentences in a line and only noticed a small drop in performance.
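
If you want to try that, a minimal sketch of such a preprocessing pass might look like the following (this is not part of the GloVe tooling; it naively splits on the standalone " . " token your corpus uses, so abbreviations and other edge cases aren't handled, and the file names are just illustrative):

# Read the tokenized corpus and write an approximate one-sentence-per-line version.
with open("corpus.txt", encoding="utf-8") as src, \
     open("corpus_one_sentence_per_line.txt", "w", encoding="utf-8") as dst:
    for line in src:
        # The corpus is already tokenized, so full stops appear as a standalone " . " token.
        for sentence in line.strip().split(" . "):
            sentence = sentence.strip()
            if sentence:
                dst.write(sentence + "\n")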

I'll download the corpus and take a look, although I can't promise I'll find anything. GloVe is not too familiar to me (there just isn't anyone else to work on it at this point).

KarahanS commented 1 year ago

I'd be very grateful for any assistance you could provide. If you have time to train the model as well, please use window size = 5 unless there is an important reason not to; that's what we used for word2vec, so sticking to the same window size keeps the models comparable. Let me provide an example: the direct Turkish counterpart of the classic "man" - "woman" - "king" analogy in English. Below you can see how to load the GloVe vectors using gensim and test the analogy:

from gensim.models import KeyedVectors
word_vectors = KeyedVectors.load_word2vec_format("path/to/glove/vectors.txt", no_header=True, binary=False)
print(word_vectors.most_similar_cosmul(positive=['kadın', 'kral'], negative=['adam']))

The output is like this:

[('erkek', 0.8252280950546265),
 ('kraliçe', 0.8103123307228088),
 ('bebek', 0.8019083142280579),
 ('kralın', 0.8017817139625549),
 ('aile', 0.7960394024848938),
 ('çocuk', 0.7889254689216614),
 ('afgan', 0.7882615923881531),
 ('annesi', 0.7867284417152405),
 ('kadınların', 0.7853242754936218),
 ('arap', 0.7841709852218628)]

kraliçe means queen in Turkish, and that's the word we would expect as the first suggestion. Word2Vec gives kraliçe as the correct word with roughly 90% probability.

So if we somehow manage to make our model return kraliçe as the first word, that would be progress. You might ask what erkek is: it translates as male. So, interestingly, when we subtract man from king and add woman, our current GloVe model says the result is most similar to male.
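
For scoring more than a handful of analogies at once, a minimal sketch using gensim's built-in analogy evaluation could look like this (analogies_tr.txt is a hypothetical file of Turkish analogy questions in the word2vec questions-words format; we haven't shared such a file in this thread):

from gensim.models import KeyedVectors

word_vectors = KeyedVectors.load_word2vec_format("path/to/glove/vectors.txt", no_header=True, binary=False)
# The file should contain ": section" headers followed by lines of four words: a a* b b*
score, sections = word_vectors.evaluate_word_analogies("analogies_tr.txt")
print("overall analogy accuracy:", score)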

KarahanS commented 1 year ago

I came across some sources suggesting that Word2Vec performs better than GloVe for Turkish.

So I'm starting to think there is no technical issue with our results; it's simply the case that GloVe doesn't perform as well as Word2Vec for agglutinative languages like Turkish. If that's really the case, what would you say is the main reason, @AngledLuffa? (A problem with our setup is still a possibility, but it doesn't seem likely.)

AngledLuffa commented 1 year ago

I'm not sure when or if I'll have time to do a deep dive into this, but I will point out that fastText should generally be better for agglutinative languages, on account of looking at word pieces.
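
If you want to try that comparison, a minimal gensim sketch for training fastText on the same corpus might look like the following (the parameter values just mirror the word2vec setup described above; they are assumptions, not a tested recipe):

from gensim.models import FastText

# Train subword-aware embeddings on the tokenized corpus (one example per line).
model = FastText(corpus_file="corpus.txt", vector_size=300, window=5,
                 min_count=10, epochs=5, workers=8)
model.wv.save_word2vec_format("fasttext_vectors.txt", binary=False)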
