Open dbl001 opened 5 years ago
"<UNK>" is added to the tokenizer word lists in nlppip.py because the Keras Tokenizer (from keras.preprocessing.text import Tokenizer) is one-based: its word indices start at 1.
self.tokenizer.word_index["<UNK>"] = 0
self.tokenizer.word_docs["<UNK>"] = 0
self.tokenizer.word_counts["<UNK>"] = 0
The TensorFlow implementations of word embedding and embedding lookup are zero-based, so row 0 of the embedding matrix is a valid row.
word_embedding_000[0] array([-0.72940636, 0.7893076 , -0.5647843 , -0.73255396, 0.7901778 , -0.49344468, 0.11772466, -0.5727272 , 0.527349 , -0.06881762, 0.44169998, -0.20452452, 0.3124647 , 0.86845255, -0.9390068 , -0.6195681 , 0.89950705, 0.3356259 , -0.8492527 , 0.45032454, 0.6324513 , 0.75457215, 0.21222615, -0.44409204, -0.06979871, -0.6462743 , -0.36795807, 0.27780175, 0.94171906, 0.40449977, -0.16222072, -0.34851456, -0.9734571 , -0.46344304, -0.80052805, 0.39213514, -0.23919392, -0.60179496, 0.34500718, -0.6585071 , 0.18976736, -0.49871182, -0.31101155, 0.8082261 , 0.5178263 , -0.9620471 , -0.98253274, 0.5575602 , -0.5283928 , -0.05512738, -0.46859574, -0.9827881 , 0.4550724 , -0.4175427 , -0.6799257 , 0.32043505, -0.60924935, -0.08730078, -0.76487565, -0.11529756, -0.05081773, -0.423831 , -0.69595194, -0.39993382, 0.01512861, 0.82286215, -0.96196485, -0.96162105, -0.69300675, -0.23160791, -0.8725774 , -0.62869287, -0.21675658, 0.22361946, -0.7145815 , 0.25228357, 0.300138 , 0.1944983 , -0.20161653, -0.00947928, -0.50661993, 0.24620843, 0.8336489 , -0.6433666 , 0.4633739 , 0.42356896, -0.2927196 , 0.7726562 , -0.77078557, 0.42736077, 0.2361381 , 0.8253889 , -0.03234029, 0.16903758, 0.64719176, 0.12639523, 0.468915 , 0.36462903, -0.63329506, 0.46308804, 0.9785025 , -0.60487294, -0.8659482 , 0.80265903, 0.08614421, -0.6846776 , -0.2840774 , -0.05165243, 0.7902992 , 0.7554364 , 0.07603502, -0.82541203, -0.03127742, -0.45349932, -0.6321502 , -0.75881124, 0.10189629, 0.7766483 , -0.02184248, 0.30532098, 0.40934992, -0.3520453 , -0.4991796 , 0.89320135, -0.5294213 , 0.08958745, -0.2862544 , 0.694613 , -0.2933941 , -0.2711556 , -0.778697 , -0.90801215, -0.4771154 , 0.9393649 , 0.02598763, -0.6128385 , 0.6687329 , -0.00300312, 0.39082742, -0.62328243, -0.1326313 , -0.04318118, 0.5147674 , 0.30447197, -0.15042996, -0.29966593, -0.19948554, -0.15503025, -0.07965088, -0.18107772, -0.6654799 , 0.16734552, -0.6545446 , -0.19038987, 0.11273432, -0.37501454, -0.01779771, 
0.10266089, 0.6059449 , 0.53478146, 0.8791959 , -0.71896863, -0.50831914, 0.51859474, 0.7803166 , 0.85757375, 0.58769774, -0.01653957, 0.35751534, -0.66742086, 0.09473515, -0.89558864, 0.5007875 , 0.6572523 , 0.47241664, 0.5635514 , 0.32414556, -0.53437877, 0.84779453, 0.6378653 , 0.81033015, -0.9580946 , 0.4329822 , 0.7842884 , -0.02432752, -0.26144147, 0.51170826, 0.18752575, 0.716552 , 0.19081879, 0.76230717, 0.95465493, 0.587734 , 0.9609244 , -0.95637846, -0.8732126 , -0.4947157 , 0.4163556 , 0.08395147, 0.48358202, 0.6750531 , 0.6933727 , -0.66409326, -0.6555612 , -0.77092767, 0.77507496, 0.6416006 , -0.10126472, -0.20890045, 0.12876058, -0.7351172 , 0.68103194, -0.575778 , 0.1444602 , -0.42351747, -0.81415844, -0.58244324, -0.6112335 , -0.16471076, 0.5918329 , 0.6705165 , -0.9932399 , 0.1535554 , 0.02513838, -0.6433432 , 0.0850389 , -0.10692096, 0.21783972, -0.00443554, -0.5312202 , 0.16654754, 0.1691029 , 0.9144945 , -0.20212364, -0.7347467 , 0.1740458 , -0.8262415 , -0.05594969, -0.04339361, 0.439353 , -0.00228357, -0.6715636 , 0.879483 , 0.10999107, 0.8576815 , -0.38673759, -0.2496996 , 0.8718543 , 0.77182436, -0.91532016, 0.8322928 , -0.95677876, 0.11354065, 0.31194258, -0.7994232 , 0.8070309 , -0.12008953, -0.555902 , -0.6638913 , 0.4023559 , -0.77688384, 0.12601566, -0.3632667 , -0.6541252 , 0.10901499, 0.3102548 , -0.40334034, 0.03114676, -0.7885685 , -0.20401645, 0.939183 , 0.17131758, 0.47609544, -0.17927122, -0.5007596 , 0.9717326 , -0.0057416 , 0.81249833, 0.39427924, 0.18702984, -0.4081514 , -0.47332573, -0.0909853 , -0.5931864 , 0.7257166 , 0.18550944, 0.21591997, -0.02170038, -0.0661478 , -0.67937946, -0.28355837, 0.7463348 , -0.32689762, 0.9659898 , -0.54855466, 0.72903705, -0.32373667, -0.92316556, 0.01121569, 0.17884326], dtype=float32)
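The indexing mismatch described above can be sketched in plain Python. This is only an illustration under stated assumptions: `build_one_based_index` is a hypothetical helper that mimics the one-based numbering of the Keras Tokenizer (most frequent word gets index 1); it is not the Keras implementation and not the code in nlppip.py.

```python
from collections import Counter


def build_one_based_index(texts):
    """Hypothetical helper: rank words by frequency and number them from 1.

    This mimics the one-based convention of the Keras Tokenizer, which
    leaves index 0 unassigned.
    """
    counts = Counter(word for text in texts for word in text.lower().split())
    # Sort by descending count, then alphabetically to make ties deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {word: i for i, (word, _) in enumerate(ranked, start=1)}


texts = ["the cat sat", "the dog sat", "the end"]
word_index = build_one_based_index(texts)

# Without this entry, row 0 of a zero-based embedding matrix has no word.
word_index["<UNK>"] = 0

idx_to_word = {i: w for w, i in word_index.items()}
vocab_size = len(word_index)

# Every row 0..vocab_size-1 of the embedding matrix now maps to a token.
rows = sorted(idx_to_word)
print(rows == list(range(vocab_size)), idx_to_word[0])
```

With the `<UNK>` entry in place, a zero-based lookup of row 0 resolves to a named token instead of falling outside the vocabulary.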
Curiously, the embedding vector after 200 epochs of training that is closest (by cosine similarity) to the pre-training embedding vector at index 0 is:
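import numpy as np

word_embedding_000 = np.load("word_weights_000.npy")
word_embedding_199 = np.load("word_weights_199.npy")
# Note: argmin picks the closest vector only if cosine_similarity returns a
# distance; if it returns a similarity, argmax is needed instead.
idx = np.array([cosine_similarity(x, word_embedding_000[0]) for x in word_embedding_199]).argmin()
print(idx)
2905
print(idx_to_word[2905])
disc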
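The definition of `cosine_similarity` used in the snippet above is not shown in the issue, so it is unclear whether `argmin` returns the nearest or the farthest row. The following is a minimal sketch, assuming the standard similarity definition (dot product divided by the product of norms); the helper names `closest_row` and `farthest_row` and the toy vectors are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Standard cosine similarity; assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def closest_row(matrix, query):
    """Index of the row with the highest similarity to query (argmax)."""
    sims = [cosine_similarity(row, query) for row in matrix]
    return max(range(len(sims)), key=sims.__getitem__)


def farthest_row(matrix, query):
    """Index of the row with the lowest similarity to query (argmin)."""
    sims = [cosine_similarity(row, query) for row in matrix]
    return min(range(len(sims)), key=sims.__getitem__)


# Toy data standing in for the saved weight matrices.
query = [1.0, 0.0]
trained = [[-1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]

print(closest_row(trained, query), farthest_row(trained, query))
```

Under this definition, `argmin` over similarities selects the least similar row, so the result 2905 ("disc") would be the farthest vector rather than the closest unless the function in use returns a distance.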
How could one embedding vector appear in so many [orthogonal?] topics?
EPOCH: 85 LOSS 950.43896 w2v 8.754408 lda 941.6846 lda-sim 3.299621869012659
---------Closest 10 words to given indexes----------
Topic 0 : <UNK>, vending, confidential, offender, drainage, terrace, overtime, unintended, documentation, yan
Topic 1 : <UNK>, meaning, refrain, spent, largely, ran, equally, considered, decade, exact
Topic 2 : mim, lite, lea, recalibration, sonny, l, skip, unsold, vive, allen
Topic 3 : marathi, recalibration, assamese, uzbek, tagalog, gaelic, romansh, galician, razoo, recast
Topic 4 : loophole, vive, estonian, gaelic, slovenian, maracaibo, slovak, faroese, magyar, romansh
Topic 5 : <UNK>, vacant, jos, bleeding, kivu, bye, aunt, sundar, whilst, cowboy
Topic 6 : depending, closely, decided, applied, considered, spent, contrary, isolated, especially, frequently
Topic 7 : <UNK>, spinach, frost, slew, confined, yakan, ironically, dusty, shelf, bleeding
Topic 8 : basque, assamese, azerbaijani, razoo, haitian, kiswahili, recast, icelandic, nederlands, mommy
Topic 9 : spiky, recast, andhra, tauranga, revoke, recalibration, thread, mull, menacing, motoring
Topic 10 : <UNK>, rightly, inflammatory, severity, owen, incitement, disappearance, forge, magistrate, campaigner
Topic 11 : burke, chronicle, resend, fico, activation, tauranga, fetish, interstitial, unspoken, mommy
Topic 12 : <UNK>, confined, labrador, rope, modeling, shane, terrace, downpour, vernon, nutritional
Topic 13 : nederlands, suomi, allen, icelandic, seed, afrikaans, razoo, assamese, latvian, gaelic
Topic 14 : unacceptable, practically, exact, impression, mixture, certainly, hardly, toxic, younger, capture
Topic 15 : <UNK>, emergence, nose, straw, abundant, confined, copper, decreasing, ironically, litigation
Topic 16 : <UNK>, charlotte, straw, ironically, taps, spinach, yakan, confined, slew, maduro
Topic 17 : <UNK>, aggregate, crushing, knockout, versatile, distinctive, admired, pleasure, applause, wishing
Topic 18 : gaelic, romansh, slovenian, folder, resend, recast, assamese, creole, slovak, unspoken
Topic 19 : kiswahili, newsstand, ossetic, banat, assamese, faroese, creole, oriya, confucianism, romansh
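One way to probe the question above is to check whether the topic vectors are in fact close to orthogonal, since a single word vector can only be a near neighbour of many topics if those topics point in similar directions. The sketch below is a diagnostic under stated assumptions, not the project's own code: `pairwise_topic_similarity` is a hypothetical helper and the three toy topic vectors are invented for illustration.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Standard cosine similarity; assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def pairwise_topic_similarity(topic_vectors):
    """Map each topic pair (i, j), i < j, to its cosine similarity."""
    return {
        (i, j): cosine_similarity(topic_vectors[i], topic_vectors[j])
        for i, j in combinations(range(len(topic_vectors)), 2)
    }


# Toy topic vectors: topics 0 and 1 are orthogonal, topic 2 overlaps both.
topics = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
sims = pairwise_topic_similarity(topics)

for pair, value in sorted(sims.items()):
    print(pair, round(value, 4))
```

Pairs with similarity near 0 are close to orthogonal; pairs with high similarity could plausibly share the same nearest words, `<UNK>` included.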
@dbl001 What does this mean?