roamanalytics / mittens

A fast implementation of GloVe, with optional retrofitting
Apache License 2.0

Inconsistent results - Mittens VS standard Glove #10

Open olivierroncalez opened 5 years ago

olivierroncalez commented 5 years ago

I am using mittens with a pre-built co-occurrence matrix of domains, hoping that thematically related domains end up close to each other in the embedding space so that they can be clustered. Using the non-vectorized glove implementation from https://github.com/stanfordnlp/GloVe, I get very strong results. The current initialization is:

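from glove import Glove  # assumption: the glove_python package, whose API matches these calls
from scipy.sparse import coo_matrix

glove_model = Glove(no_components=50, learning_rate=0.03)
glove_model.fit(coo_matrix(matrix, dtype=float), epochs=50, no_threads=64, verbose=True)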

Finding the nearest domains to nintendo by cosine similarity yields good results.

find_nearest(glove_model, "nintendo", 10)

[('game', 0.955347329117499), ('zavvi', 0.9382098190168783), ('eurogamer', 0.9296358002057901), ('playstation', 0.9290108695965159), ('gamespot', 0.9241452666014682), ('gamesradar', 0.9210470827690169), ('365games', 0.9193152241566838), ('ign', 0.9178656620515147), ('ea', 0.912055674280889), ('forbiddenplanet', 0.9118661547211797)]

Given these results, I wanted to use mittens for two reasons: to take advantage of the vectorized implementation for speed, and to be able to extend glove into a retrofitted model. However, when I train a basic mittens GloVe model (without retrofitting to existing embeddings), the results come out quite poor, even with the same hyperparameters.

import numpy as np
from mittens import GloVe

glove_mittens_50_50 = GloVe(n=50, max_iter=50, learning_rate=0.03)
cooccurrence = matrix.toarray()  # dense ndarray; the original glove run used the sparse matrix
glove_mittens_trained_50_50 = glove_mittens_50_50.fit(cooccurrence)  # fit returns the embedding matrix

I built a pandas DataFrame from the resulting NumPy matrix, with the domains as the index, and wrote a function that calculates cosine similarity in the same way that the original glove model does.
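The helper itself is not shown in this post. A minimal sketch of what such a function might look like is given here, assuming the DataFrame has one row per domain, unique index labels, and numeric columns only. The body is an illustration, not the code that produced the results in this thread.

```python
import numpy as np
import pandas as pd


def find_nearest(embeddings: pd.DataFrame, word: str, n: int = 10):
    """Return the n index labels most cosine-similar to `word`.

    Assumes `embeddings` has unique index labels and one vector per row.
    """
    vectors = embeddings.to_numpy(dtype=float)

    # Normalise every row to unit length so a dot product equals cosine similarity.
    norms = np.linalg.norm(vectors, axis=1)
    norms[norms == 0] = 1.0  # guard against all-zero rows
    unit = vectors / norms[:, None]

    # Similarity of every row to the query row, excluding the query itself.
    query = unit[embeddings.index.get_loc(word)]
    sims = pd.Series(unit @ query, index=embeddings.index).drop(word)

    top = sims.nlargest(n)
    return list(zip(top.index, top.tolist()))
```

The return value is a list of `(label, similarity)` tuples sorted from most to least similar, which is the shape of the outputs quoted in this post.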

find_nearest(mittens_glove_df_50_50, "nintendo", 10)

[('hmrc', 0.9992567), ('anglingdirect', 0.999141), ('axa', 0.99907136), ('greatist', 0.99906415), ('techadvisor', 0.99906313), ('victorianplumbing', 0.99903136), ('dell', 0.9990228), ('imore', 0.99899846), ('carpetright', 0.99899185)]

As you can see, the results are not at all as expected: the nearest neighbours are unrelated domains, and every score is close to 0.999. Furthermore, while the original glove model appears to have converged and changes only very slightly when the number of iterations is increased, the vectorized glove in this package keeps changing.

find_nearest(mittens_glove_df_50_100, "github", 10)  # 100 iterations

[('yammer', 0.9993163), ('twitch', 0.9992425), ('axs', 0.9992203), ('rottentomatoes', 0.99920493), ('travelsupermarket', 0.99919695), ('lbc', 0.99919665), ('motors', 0.99918556), ('goodreads', 0.9991843), ('deezer', 0.9991767), ('nationalexpress', 0.99917376)]
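Scores that all sit near 0.999 suggest that the embedding rows point in almost the same direction, in which case the ranking carries little information. One way to check this is to look at the spread of pairwise cosine similarities before and after subtracting the mean vector. The sketch below assumes the embeddings are a 2-D NumPy array with one row per domain. The helper name `cosine_similarity_spread` is hypothetical, and this is a diagnostic only, not a fix.

```python
import numpy as np


def cosine_similarity_spread(vectors):
    """Return (mean, std) of the off-diagonal pairwise cosine similarities."""
    vectors = np.asarray(vectors, dtype=float)

    # Unit-normalise rows; clip the norms to avoid dividing by zero.
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.clip(norms, 1e-12, None)

    # Full similarity matrix, then drop the diagonal (self-similarity is always 1).
    sims = unit @ unit.T
    off_diagonal = sims[~np.eye(len(sims), dtype=bool)]
    return float(off_diagonal.mean()), float(off_diagonal.std())


def center_rows(vectors):
    """Subtract the mean vector so a shared offset no longer dominates the cosine."""
    vectors = np.asarray(vectors, dtype=float)
    return vectors - vectors.mean(axis=0, keepdims=True)
```

If the mean similarity is close to 1 on the raw embeddings but drops sharply on `center_rows(embeddings)`, a large shared component is dominating the vectors. That points towards training not having converged, or towards the scale of the co-occurrence counts, rather than towards the nearest-neighbour function.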

Is there a reason why this is the case? Am I doing anything wrong, or is there anything else you'd like me to try?

Thanks.

rajatcodes commented 4 years ago

The README says that Mittens needs a pre-trained matrix. This means you should have applied some kind of reweighting scheme to your co-occurrence matrix before using Mittens.

xjp08 commented 1 year ago

I ran into the same problem, and I think it has to do with how the co-occurrence matrix is generated.