kawine / contextual

How Contextual are Contextualized Word Representations?

difficulty reproducing static embedding #2

Closed · dribnet closed this issue 4 years ago

dribnet commented 4 years ago

I'm having trouble reproducing the static embedding results from the paper. For reference, here is the paper's table of static embedding results:

[image: pc_static_embeddings]

And here are my current results when run on a comparable large corpus (~20k sentence pairs, ~8k word vocabulary); note that I've removed ELMo from my runs:

[image: my_embeddings]

My scores for GloVe and FastText indicate that the testing procedure is working, and they roughly match the paper's, suggesting my vocabulary is broad enough. However, there appears to be some systematic issue in creating good static embeddings from the first principal component, and it is independent of which language model I use.
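
For anyone else debugging this, here is a minimal sketch of the per-word computation as I understand it from the paper (the repo's actual code may differ; `occurrences` is a hypothetical dict mapping each word to the matrix of its contextualized vectors):

```python
from sklearn.decomposition import TruncatedSVD

def pc_static_embedding(vectors):
    """Return the first principal component of a word's
    (n_occurrences, hidden_dim) matrix of contextualized vectors."""
    svd = TruncatedSVD(n_components=1)
    svd.fit(vectors)
    return svd.components_[0]  # shape: (hidden_dim,)

# `occurrences` is assumed: word -> array of that word's vectors across contexts
static_vecs = {w: pc_static_embedding(v) for w, v in occurrences.items()}
```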

If the repo included a diagnostic or unit test, this might be easier for me to diagnose on my end. For example, it might be useful to include expected outputs when the code is run on the 99 sentence pairs in sts.csv, along the lines of the skeleton below. But I'm certainly open to any other tips or ideas for probing where the process might be failing.
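
Concretely, something like this is what I have in mind (the names here are hypothetical stand-ins, and the expected numbers would come from a known-good run on your end):

```python
def test_static_embeddings_on_sts():
    # `build_static_embeddings` stands in for the repo's actual entry point.
    vecs = build_static_embeddings("sts.csv", model="bert-base-cased")
    # Dimensions should match the model's hidden size (768 for bert-base).
    assert all(v.shape == (768,) for v in vecs.values())
    # A few spot-checked values from a reference run would go here.
```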


Note that the other sections seem to replicate well! For example, here is the average cosine similarity with anisotropy adjustment, from the paper and from my most recent run:

[image: mean_cosine_similarity_across_words]

Here's self-similarity. [Though note my lower scores on GPT-2; my intuition is that this results from removing duplicate sentences, which otherwise make up about 20% of the input data.]

[image: self_similarity_above_expected]

And here's intra-sentence similarity:

[image: mean_cosine_similarity_between_sentence_and_words]

kawine commented 4 years ago

This might be a versioning issue with sklearn (and the attendant changes in the function implementations). I've found that others have had success in reproducing the static embeddings (and IIRC, there will be some papers at ACL 2020 that do even better).
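
If it helps to compare environments, reporting the installed version is just:

```python
import sklearn
print(sklearn.__version__)
```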

In any case, I've pushed a fix that I think will work. Can you pull the new version of the code and see if you have better luck with it?

dribnet commented 4 years ago

Thanks Kawin, I've confirmed that truncating the SVD to 100 components fixes the issue, with no loss of performance relative to your published results. Here are my own results after re-running with the same data.

[image: new_results2]

And here again, for reference, is the corresponding table from your paper, which closely matches:

[image: pc_static_embeddings]

Thanks for taking the time to provide this fix. I was interested in investigating these embeddings further, and as a first goal wanted to be able to replicate your results.
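
In case it helps others hitting this, the gist of the change as I understand it is capping the number of SVD components; treat this as a paraphrase rather than the repo's actual diff:

```python
from sklearn.decomposition import TruncatedSVD

# Hypothetical illustration: X is an (n_samples, hidden_dim) matrix of
# contextualized vectors; the truncation caps the decomposition at 100 components.
svd = TruncatedSVD(n_components=100)
reduced = svd.fit_transform(X)  # shape: (n_samples, 100)
```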

kawine commented 4 years ago

Do you happen to know which version of scikit-learn was working for you? I was using scikit-learn 0.22.1 but even when I checked previous versions (going back to 0.18.0) I had the same issue.

Sorry, I ran the code back at my old university, so I can't recall the version that was installed.

Incidentally I've found in my testing that replacing the TruncatedSVD with simply an average of the vectors via mean_vector = np.average(embeddings, axis=0) also seems to work equally well on the embedding benchmarks. Here's my BERT TruncatedSVD pipeline results (top) compared to the average vector (bottom) on the same dataset. [image: svd_vs_avg2] https://user-images.githubusercontent.com/945979/81068160-0305bb00-8f34-11ea-8fde-ba301a124291.png

Interesting! This is a pretty intuitive result, since the idea behind PC embeddings was that the static vector for each word would implicitly weigh each word sense by how frequently it appeared (such that the overall variance explained is maximized). Taking the average instead of the PC has the same sort of effect -- more frequent senses have a greater weight -- so I'd expect the performance to be similar.
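
A quick toy check of that intuition (my own example, not from the repo): since sklearn's TruncatedSVD does not center the data, the first singular vector of a word's occurrence matrix tends to lie close to the mean vector, with the frequent sense dominating both:

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)
sense_a, sense_b = rng.normal(size=300), rng.normal(size=300)

# A frequent sense (80 occurrences) and a rare one (20), plus noise.
X = np.vstack([sense_a + 0.1 * rng.normal(size=300) for _ in range(80)]
              + [sense_b + 0.1 * rng.normal(size=300) for _ in range(20)])

pc1 = TruncatedSVD(n_components=1).fit(X).components_[0]
mean_vec = X.mean(axis=0)
cos = abs(pc1 @ mean_vec) / (np.linalg.norm(pc1) * np.linalg.norm(mean_vec))
print(cos)  # close to 1: the first PC and the mean vector nearly coincide
```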


chawannut157 commented 3 years ago

Hi Kawin, a question on this.

What's the difference between the GloVe benchmark in the paper (0.194 and 0.215, reported above by dribnet) and the one listed here? https://github.com/kudkudak/word-embeddings-benchmarks/wiki

I tried running the code using the same evaluation script, and I managed to get the same number as in the link above, which is around 0.371 for GloVe.

[Edit]: Sorry, my bad, I figured out that you are using an apples-to-apples comparison, not the full GloVe.

kawine commented 3 years ago

Hi! Yes, your edit is correct -- I made an apples-to-apples comparison, not the full GloVe.
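
For anyone reading along, that means every embedding was evaluated over the same restricted vocabulary rather than letting GloVe use its full vocabulary. A rough sketch of that kind of restriction (illustrative only; the pair format and `vocab` are assumptions, not the repo's code):

```python
def filter_pairs(pairs, vocab):
    """Keep only benchmark pairs (w1, w2, gold_score) whose words
    are both in the shared vocabulary."""
    return [(w1, w2, s) for (w1, w2, s) in pairs if w1 in vocab and w2 in vocab]
```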