nateraw / Lda2vec-Tensorflow

Tensorflow 1.5 implementation of Chris Moody's Lda2vec, adapted from @meereeum
MIT License

<UNK> Returned for Multiple Topics #56

Open dbl001 opened 5 years ago

dbl001 commented 5 years ago

"UNK " is added to the tokenizer word lists in nlppip.py because the from keras.preprocessing.text import Tokenizer is one-based.

self.tokenizer.word_index["<UNK>"] = 0
self.tokenizer.word_docs["<UNK>"] = 0
self.tokenizer.word_counts["<UNK>"] = 0

The TensorFlow word-embedding matrix and embedding lookup, however, are zero-based.
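A minimal sketch of that mismatch (with a made-up sentence): the Keras Tokenizer assigns word indexes starting at 1, while the TensorFlow embedding matrix is indexed from 0, so row 0 would otherwise go unused:

from keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer()
tokenizer.fit_on_texts(["the quick brown fox"])
print(tokenizer.word_index)  # e.g. {'the': 1, 'quick': 2, 'brown': 3, 'fox': 4} -- starts at 1

# Reserving the otherwise-unused row 0 for <UNK> lines the vocabulary up
# with the 0-based embedding matrix used by tf.nn.embedding_lookup:
tokenizer.word_index["<UNK>"] = 0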

word_embedding_000[0]
array([-0.72940636,  0.7893076 , -0.5647843 , -0.73255396,  0.7901778 ,
       -0.49344468,  0.11772466, -0.5727272 ,  0.527349  , -0.06881762,
        0.44169998, -0.20452452,  0.3124647 ,  0.86845255, -0.9390068 ,
       -0.6195681 ,  0.89950705,  0.3356259 , -0.8492527 ,  0.45032454,
        0.6324513 ,  0.75457215,  0.21222615, -0.44409204, -0.06979871,
       -0.6462743 , -0.36795807,  0.27780175,  0.94171906,  0.40449977,
       -0.16222072, -0.34851456, -0.9734571 , -0.46344304, -0.80052805,
        0.39213514, -0.23919392, -0.60179496,  0.34500718, -0.6585071 ,
        0.18976736, -0.49871182, -0.31101155,  0.8082261 ,  0.5178263 ,
       -0.9620471 , -0.98253274,  0.5575602 , -0.5283928 , -0.05512738,
       -0.46859574, -0.9827881 ,  0.4550724 , -0.4175427 , -0.6799257 ,
        0.32043505, -0.60924935, -0.08730078, -0.76487565, -0.11529756,
       -0.05081773, -0.423831  , -0.69595194, -0.39993382,  0.01512861,
        0.82286215, -0.96196485, -0.96162105, -0.69300675, -0.23160791,
       -0.8725774 , -0.62869287, -0.21675658,  0.22361946, -0.7145815 ,
        0.25228357,  0.300138  ,  0.1944983 , -0.20161653, -0.00947928,
       -0.50661993,  0.24620843,  0.8336489 , -0.6433666 ,  0.4633739 ,
        0.42356896, -0.2927196 ,  0.7726562 , -0.77078557,  0.42736077,
        0.2361381 ,  0.8253889 , -0.03234029,  0.16903758,  0.64719176,
        0.12639523,  0.468915  ,  0.36462903, -0.63329506,  0.46308804,
        0.9785025 , -0.60487294, -0.8659482 ,  0.80265903,  0.08614421,
       -0.6846776 , -0.2840774 , -0.05165243,  0.7902992 ,  0.7554364 ,
        0.07603502, -0.82541203, -0.03127742, -0.45349932, -0.6321502 ,
       -0.75881124,  0.10189629,  0.7766483 , -0.02184248,  0.30532098,
        0.40934992, -0.3520453 , -0.4991796 ,  0.89320135, -0.5294213 ,
        0.08958745, -0.2862544 ,  0.694613  , -0.2933941 , -0.2711556 ,
       -0.778697  , -0.90801215, -0.4771154 ,  0.9393649 ,  0.02598763,
       -0.6128385 ,  0.6687329 , -0.00300312,  0.39082742, -0.62328243,
       -0.1326313 , -0.04318118,  0.5147674 ,  0.30447197, -0.15042996,
       -0.29966593, -0.19948554, -0.15503025, -0.07965088, -0.18107772,
       -0.6654799 ,  0.16734552, -0.6545446 , -0.19038987,  0.11273432,
       -0.37501454, -0.01779771,  0.10266089,  0.6059449 ,  0.53478146,
        0.8791959 , -0.71896863, -0.50831914,  0.51859474,  0.7803166 ,
        0.85757375,  0.58769774, -0.01653957,  0.35751534, -0.66742086,
        0.09473515, -0.89558864,  0.5007875 ,  0.6572523 ,  0.47241664,
        0.5635514 ,  0.32414556, -0.53437877,  0.84779453,  0.6378653 ,
        0.81033015, -0.9580946 ,  0.4329822 ,  0.7842884 , -0.02432752,
       -0.26144147,  0.51170826,  0.18752575,  0.716552  ,  0.19081879,
        0.76230717,  0.95465493,  0.587734  ,  0.9609244 , -0.95637846,
       -0.8732126 , -0.4947157 ,  0.4163556 ,  0.08395147,  0.48358202,
        0.6750531 ,  0.6933727 , -0.66409326, -0.6555612 , -0.77092767,
        0.77507496,  0.6416006 , -0.10126472, -0.20890045,  0.12876058,
       -0.7351172 ,  0.68103194, -0.575778  ,  0.1444602 , -0.42351747,
       -0.81415844, -0.58244324, -0.6112335 , -0.16471076,  0.5918329 ,
        0.6705165 , -0.9932399 ,  0.1535554 ,  0.02513838, -0.6433432 ,
        0.0850389 , -0.10692096,  0.21783972, -0.00443554, -0.5312202 ,
        0.16654754,  0.1691029 ,  0.9144945 , -0.20212364, -0.7347467 ,
        0.1740458 , -0.8262415 , -0.05594969, -0.04339361,  0.439353  ,
       -0.00228357, -0.6715636 ,  0.879483  ,  0.10999107,  0.8576815 ,
       -0.38673759, -0.2496996 ,  0.8718543 ,  0.77182436, -0.91532016,
        0.8322928 , -0.95677876,  0.11354065,  0.31194258, -0.7994232 ,
        0.8070309 , -0.12008953, -0.555902  , -0.6638913 ,  0.4023559 ,
       -0.77688384,  0.12601566, -0.3632667 , -0.6541252 ,  0.10901499,
        0.3102548 , -0.40334034,  0.03114676, -0.7885685 , -0.20401645,
        0.939183  ,  0.17131758,  0.47609544, -0.17927122, -0.5007596 ,
        0.9717326 , -0.0057416 ,  0.81249833,  0.39427924,  0.18702984,
       -0.4081514 , -0.47332573, -0.0909853 , -0.5931864 ,  0.7257166 ,
        0.18550944,  0.21591997, -0.02170038, -0.0661478 , -0.67937946,
       -0.28355837,  0.7463348 , -0.32689762,  0.9659898 , -0.54855466,
        0.72903705, -0.32373667, -0.92316556,  0.01121569,  0.17884326],
      dtype=float32)

Curiously, the embedding vector after 200 epochs of training that is closest (by cosine similarity) to pre-training embedding vector 0 is:

import numpy as np

word_embedding_000 = np.load("word_weights_000.npy")
word_embedding_199 = np.load("word_weights_199.npy")

idx = np.array([cosine_similarity(x, word_embedding_000[0]) for x in word_embedding_199]).argmin()
print(idx)
2905
print(idx_to_word[2905])
disc
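Neither cosine_similarity nor idx_to_word is defined in the snippet above. For reference, a minimal cosine-similarity implementation; note that with a true similarity, larger means closer, so the nearest row is the argmax, and the argmin above only finds the closest vector if cosine_similarity returns a distance:

def cosine_similarity(a, b):
    # Cosine similarity: 1.0 = identical direction, 0.0 = orthogonal.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

sims = np.array([cosine_similarity(x, word_embedding_000[0]) for x in word_embedding_199])
idx = sims.argmax()  # closest trained vector to untrained row 0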

How could one embedding vector appear as the nearest neighbor of so many [orthogonal?] topics?

EPOCH: 85
LOSS 950.43896 w2v 8.754408 lda 941.6846 lda-sim 3.299621869012659
---------Closest 10 words to given indexes----------
Topic 0 : <UNK>, vending, confidential, offender, drainage, terrace, overtime, unintended, documentation, yan
Topic 1 : <UNK>, meaning, refrain, spent, largely, ran, equally, considered, decade, exact
Topic 2 : mim, lite, lea, recalibration, sonny, l, skip, unsold, vive, allen
Topic 3 : marathi, recalibration, assamese, uzbek, tagalog, gaelic, romansh, galician, razoo, recast
Topic 4 : loophole, vive, estonian, gaelic, slovenian, maracaibo, slovak, faroese, magyar, romansh
Topic 5 : <UNK>, vacant, jos, bleeding, kivu, bye, aunt, sundar, whilst, cowboy
Topic 6 : depending, closely, decided, applied, considered, spent, contrary, isolated, especially, frequently
Topic 7 : <UNK>, spinach, frost, slew, confined, yakan, ironically, dusty, shelf, bleeding
Topic 8 : basque, assamese, azerbaijani, razoo, haitian, kiswahili, recast, icelandic, nederlands, mommy
Topic 9 : spiky, recast, andhra, tauranga, revoke, recalibration, thread, mull, menacing, motoring
Topic 10 : <UNK>, rightly, inflammatory, severity, owen, incitement, disappearance, forge, magistrate, campaigner
Topic 11 : burke, chronicle, resend, fico, activation, tauranga, fetish, interstitial, unspoken, mommy
Topic 12 : <UNK>, confined, labrador, rope, modeling, shane, terrace, downpour, vernon, nutritional
Topic 13 : nederlands, suomi, allen, icelandic, seed, afrikaans, razoo, assamese, latvian, gaelic
Topic 14 : unacceptable, practically, exact, impression, mixture, certainly, hardly, toxic, younger, capture
Topic 15 : <UNK>, emergence, nose, straw, abundant, confined, copper, decreasing, ironically, litigation
Topic 16 : <UNK>, charlotte, straw, ironically, taps, spinach, yakan, confined, slew, maduro
Topic 17 : <UNK>, aggregate, crushing, knockout, versatile, distinctive, admired, pleasure, applause, wishing
Topic 18 : gaelic, romansh, slovenian, folder, resend, recast, assamese, creole, slovak, unspoken
Topic 19 : kiswahili, newsstand, ossetic, banat, assamese, faroese, creole, oriya, confucianism, romansh
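For context, topic lists like the above are typically produced by ranking every word vector by cosine similarity against each topic's embedding. A sketch under assumed names (topic_matrix of shape (n_topics, embed_dim), word_matrix of shape (vocab_size, embed_dim), idx_to_word mapping row index to token):

import numpy as np

def closest_words(topic_matrix, word_matrix, idx_to_word, k=10):
    # Normalize rows so a plain dot product equals cosine similarity.
    w = word_matrix / np.linalg.norm(word_matrix, axis=1, keepdims=True)
    t = topic_matrix / np.linalg.norm(topic_matrix, axis=1, keepdims=True)
    sims = t @ w.T                          # (n_topics, vocab_size)
    top = np.argsort(-sims, axis=1)[:, :k]  # k most similar words per topic
    return [[idx_to_word[i] for i in row] for row in top]

Note that this ranking runs over every vocabulary row, including index 0 (<UNK>), unless that row is explicitly excluded.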
stalhaa commented 4 years ago

@dbl001 What do you mean??