nateraw / Lda2vec-Tensorflow

Tensorflow 1.5 implementation of Chris Moody's Lda2vec, adapted from @meereeum
MIT License

Reproducible working example in new version of Lda2Vec #45

Closed by nateraw 5 years ago

nateraw commented 5 years ago

I've made TONS of changes over the last few weeks. This has caused things to break, and my working example no longer works :cry:. So, a new reproducible example needs to be made. This is closely related to #8, where you can see that we ended up with a working example. However, with the new changes, we should be able to reproduce it reliably, straight from running the run_20newsgroups.py file.

nateraw commented 5 years ago

This should work :smile: Pushing shortly

[Screenshots: lda2vec_04022019_1, lda2vec_04022019_2, lda2vec_log]

dbl001 commented 5 years ago

This latest version appears to be working better than anything I’ve seen in the past. I’m also training on my ‘stories.txt’ file. By the way, I’m running this on Tensorflow 1.12.

Any idea why your output has the word ‘X’ in so many topics?

EPOCH: 25 LOSS 9011.561 w2v 4.4812584 lda 9007.079
---------Closest 10 words to given indexes----------
Topic 0 : team, hockey, teams, nhl, league, season, players, edm, wsh, moncton
Topic 1 : god, people, think, know, jesus, christ, like, going, thing, believe
Topic 2 : team, scoring, season, game, flyers, galley, pts, sabres, player, sanderson
Topic 3 : windows, monitor, mac, price, jpeg, printer, buy, card, car, software
Topic 4 : think, alomar, going, better, pitching, know, baseball, like, players, innings
Topic 5 : jews, jewish, israel, palestine, christian, turkish, ottoman, israeli, occupied, turkey
Topic 6 : edm, assault, fij, wsh, annual, firearms, hicnet, hb, maryland, phi
Topic 7 : ftp, available, wsh, software, systems, anonymous, image, images, server, unix
Topic 8 : karina, apartment, infection, sumgait, doctor, candida, patients, mamma, patient, people
Topic 9 : armenian, armenians, azerbaijani, fij, said, karabagh, baku, armenia, apartment, sumgait
Topic 10 : god, bible, jesus, fij, belief, christian, faith, christ, db, christians
Topic 11 : gun, cipher, firearms, key, security, encrypted, police, cryptography, encryption, guns
Topic 12 : grounding, wire, gfci, wiring, ground, insulation, outlets, cec, outlet, conductor
Topic 13 : card, motherboard, drive, controller, vga, ram, simms, monitor, scsi, mb
Topic 14 : x, oname, wsh, eof, fij, mtl, xvoid, edm, bassel, buf
Topic 15 : gfci, grounding, circuits, orbit, launch, conductor, circuit, wiring, wire, solar
Topic 16 : drive, fij, disk, controller, mb, windows, card, bios, m, ram
Topic 17 : president, q, stephanopoulos, mr, stimulus, jobs, administration, myers, dole, going
Topic 18 : pixmap, xlib, xview, literature, biblical, bitmap, afternoon, texts, translations, translation
Topic 19 : jesus, bible, isaiah, christ, satan, read, matthew, psalm, prophecy, god

nateraw commented 5 years ago

No idea why "x" is showing up so much. Strange. I'll keep playing around with the preprocessing.

dbl001 commented 5 years ago

From run_20newsgroups.txt.
Some of these words don’t exist in 20_newsgroups.txt. Any idea what’s going on? (This is on Python 3.6 with Tensorflow 1.12.1. I’m trying with your requirements, e.g. Python 3.5 and Tensorflow 1.5.)

EPOCH: 90 LOSS 9011.282 w2v 4.1849346 lda 9007.098
---------Closest 10 words to given indexes----------
Topic 0 : istanbul, fij, bassel, noll, nyr, ve, den, mwra, nejm, wsh
Topic 1 : fij, apartment, db, armenian, sumgait, marina, armenians, nyr, uw, karina
Topic 2 : wolverine, adirondack, existence, sabretooth, cipher, nyr, fij, liefeld, cryptosystem, hobgoblin
Topic 3 : fij, mtl, moncton, den, xmu, utica, ott, edm, wsh, binghamton
Topic 4 : fij, interleave, scsi, asynchronous, jumpers, drives, mw, ide, bl, mfm
Topic 5 : fij, xmu, wsh, mwra, mtl, mydisplay, den, bassel, edm, bl
Topic 6 : fij, db, mwra, nyr, mtl, utica, wsh, bh, adirondack, uw
Topic 7 : , davidian, accidents, remind, studied, bull, forum, rbi, rocks, abstract
Topic 8 : den, springfield, utica, lens, rt, hb, sabretooth, explorer, binghamton, samuel
Topic 9 : jpeg, gif, better, image, images, tend, want, think, higher, people
Topic 10 : phone, keys, key, traffic, security, clipper, encryption, chip, service, car
Topic 11 : wiring, mwra, gfci, ott, db, fij, bl, mtl, wsh, den
Topic 12 : , timer, powered, rockefeller, ships, throttle, deadly, drag, deliver, initiative
Topic 13 : wsh, providence, edm, mwra, stl, phi, bl, moncton, utica, fij
Topic 14 : patients, infection, use, system, pain, symptoms, internal, cause, having, usually
Topic 15 : matthew, prophecy, luke, jesus, passages, apostles, peter, explanation, post, diagnosed
Topic 16 : jesus, christ, god, matthew, bible, gospel, lord, fij, christians, heaven
Topic 17 : drive, motherboard, card, simms, mhz, drives, connector, shipping, monitor, meg
Topic 18 : centaur, moncton, bmp, countersteering, cleveland, obfuscated, jagr, seagate, shipped, morris
Topic 19 : gateway, desktop, clone, diamond, unix, vlb, silicon, ati, connector, amiga
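One way to see how far tokens such as "fij" or "wsh" have spread is to count, for each word, how many topics list it. The following is only a sketch for inspecting the printed log text, not part of the repository: the function name `words_shared_across_topics` and the `min_topics` parameter are made up for illustration, and the sample is a shortened excerpt (first four words of topics 3, 5 and 6) from the EPOCH 90 output.

```python
import re
from collections import Counter


def words_shared_across_topics(log_text, min_topics=3):
    """Return {word: number_of_topics} for words listed in at least min_topics topics.

    Hypothetical helper: it only parses the "Topic N : w1, w2, ..." text
    printed during training and does not touch the model itself.
    """
    counts = Counter()
    # Split on the "Topic N :" markers, so the log may be one long line
    # or one topic per line. The chunk before the first marker is dropped.
    chunks = re.split(r"Topic\s+\d+\s*:", log_text)[1:]
    for chunk in chunks:
        # Use a set so a word is counted at most once per topic;
        # empty entries (such as the blank word in "Topic 7 : , davidian") are skipped.
        words = {w.strip() for w in chunk.split(",") if w.strip()}
        counts.update(words)
    return {w: n for w, n in counts.items() if n >= min_topics}


# Shortened excerpt of the EPOCH 90 output (first four words of three topics).
sample = (
    "Topic 3 : fij, mtl, moncton, den "
    "Topic 5 : fij, xmu, wsh, mwra "
    "Topic 6 : fij, db, mwra, nyr"
)
print(words_shared_across_topics(sample, min_topics=2))
```

Words that show up in a large share of the topics are good candidates to look up in the raw corpus and in the preprocessing output.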

dbl001 commented 5 years ago
  1. What are the three-character words in your screenshot at EPOCH 115? E.g. for topic 3: hfd, uww, fij, edm, det, nyr.

  2. Trying filtering on 20_newsgroups … Here’s my EPOCH 115 with your baseline Python and Tensorflow requirements in the lda2vec_test environment, with my WordNet filtering:

EPOCH: 115 LOSS 11528.346 w2v 4.221919 lda 11524.124
---------Closest 10 words to given indexes----------
Topic 0 : den, adirondack, noll, hobgoblin, interleave, hulk, di, allocation, het, wolverine
Topic 1 : controller, ide, bios, interface, drive, jumper, chip, data, quantum, d
Topic 2 : desert, matt, accomplished, adam, sigh, tradition, blessed, dying, mars, verse
Topic 3 : therapy, , hospital, diet, skin, infection, vitamin, tissue, coli, syndrome
Topic 4 : want, people, need, sure, think, found, good, got, know, try
Topic 5 : encryption, privacy, cryptography, chips, cellular, clipper, blah, attorney, chi, secure
Topic 6 : adirondack, scripture, scoring, genesis, adam, encryption, scorer, sin, stanley, play
Topic 7 : hobgoblin, den, refresh, allocation, noll, jonathan, hulk, wolverine, adirondack, flaming
Topic 8 : width, char, dram, cursor, colors, asynchronous, click, vat, invalid, refresh
Topic 9 : den, hobgoblin, blah, chairman, scorer, allocation, helicopter, edward, noll, karabagh
Topic 10 : apartment, marina, mamma, armenian, azerbaijani, shouting, balcony, turkish, directory, server
Topic 11 : grounding, outlet, wiring, jacket, conductor, breaker, wolverine, ground, metal, adirondack
Topic 12 : adirondack, den, stewart, providence, rookie, noll, hobgoblin, murphy, ga, phi
Topic 13 : holy, mr, jesus, god, written, z, hebrew, baptism, scripture, source
Topic 14 : firearm, trend, measure, ammunition, ratio, arthur, rifle, observation, statistic, knife
Topic 15 : example, question, today, god, true, agree, believe, way, course, good
Topic 16 : propaganda, integration, douglas, consumer, compromise, ah, mess, establishment, expressed, agenda
Topic 17 : phi, scoring, scorer, upgrade, portable, torque, otto, interface, tight, period
Topic 18 : year, think, want, good, sure, thing, got, little, work, lot
Topic 19 : leo, defamation, rod, reduction, unauthorized, assured, horn, puck, abraham, pit

and after more testing at ...

EPOCH: 150 LOSS 11528.199 w2v 4.143144 lda 11524.056
---------Closest 10 words to given indexes----------
Topic 0 : den, noll, adirondack, hobgoblin, interleave, wolverine, allocation, hulk, di, het
Topic 1 : controller, bios, interface, chip, ide, drive, floppy, d, cache, m
Topic 2 : matt, desert, scripture, , regularly, strange, thirty, sigh, writer, wind
Topic 3 : , therapy, extremely, cause, highly, day, days, eating, clean, disease
Topic 4 : want, found, sure, good, like, work, know, people, tell, look
Topic 5 : blah, encryption, exclusive, mobile, darren, cryptography, ha, cryptology, cryptographic, privacy
Topic 6 : adirondack, scripture, genesis, scorer, den, het, scoring, mat, providence, adam
Topic 7 : den, flaming, hobgoblin, jonathan, ut, noll, summarize, lens, hulk, rod
Topic 8 : asynchronous, width, char, ut, dram, retrieve, den, wolverine, refresh, default
Topic 9 : blah, den, hobgoblin, adirondack, triangle, edward, interleave, allocation, noll, chairman
Topic 10 : mamma, marina, den, apartment, noll, char, adirondack, allocation, balcony, compiler
Topic 11 : adirondack, jacket, outlet, grounding, wolverine, conductor, redesign, wiring, providence, cape
Topic 12 : den, adirondack, stewart, allocation, providence, mat, noll, blah, ga, dee
Topic 13 : den, interleave, mat, scorer, mario, hobgoblin, baptism, linked, infallible, allocation
Topic 14 : , trend, ammunition, observation, ratio, firearm, gray, deletion, fantasy, weapon
Topic 15 : think, work, course, note, example, need, agree, true, told, way
Topic 16 : propaganda, integration, , sir, unusual, spell, allen, compromise, pack, ai
Topic 17 : phi, tight, torque, scorer, probability, scoring, portable, mask, socket, engine
Topic 18 : sure, pretty, good, want, work, know, like, better, old, think
Topic 19 : rod, leo, assured, lightning, smooth, reduction, unauthorized, cook, sudden, defamation

Filtering out words that were not in WordNet reduced the 20_newsgroups vocabulary to 3563 words:

len(word_to_idx) 3563

But I still don’t know why some of these words are in the ‘reduced’ vocabulary. E.g. word_to_idx["dee"] 3422

Do the results look any better?
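For readers who want to try the same kind of filtering, here is a minimal sketch of a dictionary-based vocabulary filter. It is not the code used for the runs above: `filter_vocab` and the `is_known` predicate are hypothetical names, and the tiny `known` set stands in for a real WordNet lookup so the example runs without extra downloads.

```python
def filter_vocab(word_to_idx, is_known):
    """Keep only words accepted by is_known and renumber them from 0.

    Hypothetical helper: word_to_idx maps word -> integer id, and
    is_known is any callable returning True for words to keep.
    """
    # Walk the words in their original id order so relative order is preserved.
    kept = [w for w in sorted(word_to_idx, key=word_to_idx.get) if is_known(w)]
    # Renumber so the ids stay contiguous: 0 .. len(kept) - 1.
    return {w: i for i, w in enumerate(kept)}


# Stand-in for a dictionary lookup (an assumption, not real WordNet data).
known = {"drive", "dee", "jesus"}
vocab = {"drive": 0, "fij": 1, "dee": 2, "jesus": 3}

reduced = filter_vocab(vocab, lambda w: w in known)
print(reduced)
```

With NLTK installed and its WordNet corpus downloaded, the predicate could be `lambda w: bool(wordnet.synsets(w))` after `from nltk.corpus import wordnet`. Running that check directly on a surprising word such as "dee" is a quick way to find out whether the dictionary itself has an entry for it, which would explain why the word survives the filter.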

arijit1410 commented 5 years ago

Hi, I am still getting the same error and I can't understand why

nateraw commented 5 years ago

What error are you getting @arijit1410? If it is unrelated to getting the example working as shown, please feel free to post a new issue! If not (or if it has to do with making your own example work), please explain your situation here. Thanks! 🙂

Note: if you are getting any strange errors, please make sure you are using an environment with the same requirements as mine (see the discussion in #27).

arijit1410 commented 5 years ago

I don't have a GPU on my system, and it says that the environment needs tensorflow-gpu. Is that going to be a problem?

arijit1410 commented 5 years ago

InvalidArgumentError (see above for traceback): indices[496] = 5839 is not in [0, 5839)
[[Node: nce_loss/negative_sampling/nce_loss/embedding_lookup_1 = Gather[Tindices=DT_INT64, Tparams=DT_FLOAT, _class=["loc:@nce_biases"], validate_indices=true, _device="/job:localhost/replica:0/task:0/device:CPU:0"](nce_biases/read, nce_loss/negative_sampling/nce_loss/concat)]]

This is the error I am getting after running run_20newsgroups.py. I've cloned the latest version of the repo.

dbl001 commented 5 years ago

No, it should not be a problem.
Which versions of Tensorflow and Python are you running this on?

arijit1410 commented 5 years ago

I am running on python 3.5 and tensorflow 1.5

nateraw commented 5 years ago

@arijit1410 please post a new issue as this is unrelated to this issue.

When you do, please include the output of this (just to be safe)

import tensorflow as tf
print("Tensorflow Version = {}".format(tf.__version__))

It doesn't make sense that you're running into that error. We will help you figure it out :slightly_smiling_face:

dbl001 commented 5 years ago

Also, please remove any directories where preprocessing files from prior versions of Lda2vec might be saved.

E.g. -

$ pwd
/Users/davidlaxer/Lda2vec-Tensorflow/tests/twenty_newsgroups/data/clean_data
MacBook-Pro:clean_data davidlaxer$ ls -l
total 205744
-rw-r--r-- 1 davidlaxer staff 85064 Apr 8 12:26 doc_lengths.npy
-rw-r--r-- 1 davidlaxer staff 8551328 Apr 8 12:26 embedding_matrix.npy
-rw-r--r-- 1 davidlaxer staff 28632 Apr 8 12:26 freqs.npy
-rw-r--r-- 1 davidlaxer staff 68335 Apr 8 12:26 idx_to_word.pickle
-rw-r--r-- 1 davidlaxer staff 96531372 Apr 8 12:27 skipgrams.txt
-rw-r--r-- 1 davidlaxer staff 68335 Apr 8 12:26 word_to_idx.pickle
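Clearing that directory can also be scripted. The sketch below is an illustration only: `clear_preprocessed` is a hypothetical helper, not something shipped with the repository, and the demonstration runs against a throwaway temporary directory rather than a real checkout.

```python
import shutil
import tempfile
from pathlib import Path


def clear_preprocessed(data_dir):
    """Delete a preprocessing output directory and recreate it empty.

    Hypothetical helper; pass the clean_data directory of your own checkout.
    """
    path = Path(data_dir)
    if path.exists():
        # Remove the directory together with every stale file inside it.
        shutil.rmtree(path)
    # Recreate it so later preprocessing has somewhere to write.
    path.mkdir(parents=True)
    return path


# Demonstration on a temporary directory so nothing real is deleted.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "clean_data"
    demo.mkdir()
    (demo / "freqs.npy").write_bytes(b"stale")
    clear_preprocessed(demo)
    print(sorted(p.name for p in demo.iterdir()))
```

In a real checkout the argument would be the clean_data path shown in the listing above. After clearing it, rerun the preprocessing so that all files are regenerated by the same version of the code.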

dbl001 commented 5 years ago

After running: python load_20newsgroups.py, what's the output from running:

from lda2vec import utils, model
import numpy as np

# Path to preprocessed data
data_path  = "data/clean_data"
# Whether or not to load saved embeddings file
load_embeds = True

# Load data from files
(idx_to_word, word_to_idx, freqs, pivot_ids,
 target_ids, doc_ids, embed_matrix) = utils.load_preprocessed_data(data_path, load_embed_matrix=load_embeds)

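# Shape of the loaded embedding matrix
embed_matrix.shape

# Sort each id array in place so the smallest and largest ids are easy to inspect
pivot_ids.sort()
pivot_ids

target_ids.sort()
target_ids

doc_ids.sort()
doc_ids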
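The InvalidArgumentError reported earlier in the thread says that index 5839 was looked up in a table whose valid range is [0, 5839), i.e. 0 through 5838. A range check makes that kind of mismatch visible before training starts. This is a sketch under assumptions: `check_ids` is a made-up helper, and the small lists below are illustrative values, not data from the repository.

```python
def check_ids(name, ids, vocab_size):
    """Report whether every id lies in [0, vocab_size).

    Hypothetical diagnostic; ids can be a list or a 1-D NumPy array.
    """
    lo, hi = int(min(ids)), int(max(ids))
    ok = 0 <= lo and hi < vocab_size
    status = "ok" if ok else "OUT OF RANGE"
    print("{}: min={} max={} valid=[0, {}) -> {}".format(
        name, lo, hi, vocab_size, status))
    return ok


# Illustrative values only: a 5839-entry table accepts ids 0..5838.
vocab_size = 5839
check_ids("pivot_ids", [0, 17, 5838], vocab_size)
check_ids("target_ids", [3, 5839], vocab_size)
```

With the real arrays, `vocab_size` could be taken from `len(word_to_idx)`, assuming the lookup tables are sized from the vocabulary. If the largest id equals the vocabulary size, one possible cause is id arrays and vocabulary files written by different preprocessing runs.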