truythu169 / snml-skip-gram

Applying Sequentially Normalized Maximum Likelihood in Skip-gram model

SNML-Codelength just rising #15

Open vonMickwitz opened 9 months ago

vonMickwitz commented 9 months ago

Hello,

I am attempting to calculate the SNML-Codelength for a German corpus for my Master's thesis, but I am not getting sensible results. The Codelength keeps increasing, and the scores on the analogy task range between 0 and 0.1 (I used a translation of the Google analogy dataset).

I am including some plots with my results. To generate them, I calculated the cumulative sum over the 'scope-{}-snml_length.csv' file, which, to my understanding, should give the total Codelength. I have tried different statistics and documents, but the results are consistently the same.
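For concreteness, this is roughly the kind of computation I mean (a minimal sketch; the 'scope-{}-snml_length.csv' name comes from the repository's output, but I am assuming the placeholder indexes the run for each dimensionality and that each file holds one codelength contribution per data record in a single column):

```python
import numpy as np
import pandas as pd

# Hypothetical list of embedding dimensionalities that were evaluated.
dims = [50, 100, 150]

cumulative = {}
for d in dims:
    # Assumption: one 'scope-{}-snml_length.csv' file per dimensionality,
    # containing one per-record SNML codelength value per row, no header.
    lengths = pd.read_csv('scope-{}-snml_length.csv'.format(d), header=None).iloc[:, 0].values
    # The total codelength up to record t is the cumulative sum of the
    # per-record contributions; this is what I plot against t.
    cumulative[d] = np.cumsum(lengths)

# The final Codelength for each dimensionality is the last cumulative value.
for d, c in cumulative.items():
    print(d, c[-1])
```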

The plots I am attaching were produced using negative sampling, with two configurations:

- 50,000 sentences, yielding 1,896,879 data records, with n-sampling = 15, epochs = 2, and batch size = 10.
- 5,000 sentences, yielding 95,000 data records, with n-sampling = 15, epochs = 1, and batch size = 1.

In both cases I set the minimum word count to 5 and the window size to 5, and I am not using Google Cloud Storage.

If you have any ideas or suggestions as to why the calculations do not lead to useful results, I would be grateful. As it stands, my results simply assign the smallest Codelength to the lowest dimensionality.

[Attached plots: plot_all, plot_by_dim, plot_differences]

truythu169 commented 9 months ago

Hi vonMickwitz

Thanks for reaching out about the proposed approach, and sorry for the late reply.

First of all, my recommendation is to test on a smaller dataset (e.g. the first 1,000 sentences of your dataset) to see whether an optimal dimensionality can be found with our criterion, and then apply the same logic to the larger dataset. A small helper like the one below can cut out such a test corpus.
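As a sketch only, assuming your corpus is a plain-text file with one sentence per line (the file names here are placeholders):

```python
# Take the first 1,000 sentences as a small test corpus.
# 'corpus_de.txt' and 'corpus_de_small.txt' are placeholder file names.
with open('corpus_de.txt', encoding='utf-8') as src, \
        open('corpus_de_small.txt', 'w', encoding='utf-8') as dst:
    for i, line in enumerate(src):
        if i >= 1000:
            break
        dst.write(line)
```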

In our experiments, the optimal dimensionality increases as the dataset gets larger. For instance, the word2vec model developed by Google keeps improving as the dimensionality increases, and we still do not know its best dimensionality, because the dataset used to train it is so huge.

I can see you tested dimensionalities up to 150, but the best one may be 160, 300, or even larger. After testing on a smaller dataset, you may consider running simpler criteria such as AIC or BIC, or cross-validation, to check whether an optimal dimensionality lies within your computation budget (for example, if you only have the time and hardware resources to estimate up to 1,000 dimensions). If the optimal dimensionality is not within that budget, I'm sorry, but you may have to give up on the current dataset and work with a smaller one.
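As a rough illustration of the AIC/BIC check (a sketch only, not the repository's code: it assumes you already have the total negative log-likelihood of the trained skip-gram model for each candidate dimensionality, and it approximates the parameter count as the two embedding matrices of size vocabulary x dimension):

```python
import numpy as np

def select_dimension(neg_log_likelihoods, dims, vocab_size, n_records):
    """Pick the dimensionality with the smallest AIC / BIC.

    neg_log_likelihoods: dict mapping dimension -> total negative log-likelihood
                         of the skip-gram model on the training data (assumed given).
    dims: candidate dimensionalities.
    vocab_size: number of words in the vocabulary.
    n_records: number of (word, context) training records.
    """
    aic, bic = {}, {}
    for d in dims:
        k = 2 * vocab_size * d              # input + output embedding parameters
        nll = neg_log_likelihoods[d]
        aic[d] = 2 * k + 2 * nll            # AIC = 2k - 2 ln L
        bic[d] = k * np.log(n_records) + 2 * nll  # BIC = k ln n - 2 ln L
    return min(aic, key=aic.get), min(bic, key=bic.get)

# Example usage with made-up numbers:
# best_aic, best_bic = select_dimension(
#     {50: 1.2e7, 100: 1.1e7, 150: 1.05e7},
#     dims=[50, 100, 150], vocab_size=30000, n_records=1896879)
```

If the criterion keeps preferring the largest dimensionality you can afford to test, that is a sign the optimum lies outside your budget.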