nateraw / Lda2vec-Tensorflow

TensorFlow 1.5 implementation of Chris Moody's lda2vec, adapted from @meereeum
MIT License

Why does the LDA loss not fall after 10 epochs? #63

Closed JennieGerhardt closed 4 years ago

JennieGerhardt commented 4 years ago

The words of different clusters are also similar.

EPOCH: 11 LOSS 18736.01 w2v 4.9724298 lda 18731.037

EPOCH: 12 LOSS 18736.383 w2v 5.3450413 lda 18731.037

EPOCH: 13 LOSS 18735.596 w2v 4.559466 lda 18731.037

EPOCH: 14 LOSS 18736.525 w2v 5.4879036 lda 18731.037

EPOCH: 15 LOSS 18736.188 w2v 5.1508136 lda 18731.037

---------Closest 10 words to given indexes----------
Topic 0 : patient, electronic, provider, addition, system, ehr, individual, including, compared, use
Topic 1 : cohort, patient, individual, year, study, compared, adult, adjusted, population, risk
Topic 2 : clinical, electronic, tool, addition, system, ehr, evaluated, process, clinician, documentation
Topic 3 : healthcare, system, electronic, provider, context, perspective, information, professional, application, health
Topic 4 : addition, applied, approach, algorithm, clinical, different, present, identified, combination, evaluated
Topic 5 : provider, physician, clinician, staff, perception, satisfaction, implementation, practice, experience, clinic
Topic 6 : addition, different, present, applied, approach, clinical, context, application, system, example
Topic 7 : system, different, application, approach, present, data, electronic, addition, context, clinical
Topic 8 : addition, approach, present, different, applied, clinical, context, system, application, example
Topic 9 : addition, applied, approach, clinical, different, algorithm, cohort, identified, present, case
Topic 10 : addition, applied, clinical, approach, different, present, algorithm, cohort, identified, case
Topic 11 : applied, algorithm, addition, approach, clinical, present, different, combination, example, case

dbl001 commented 4 years ago

The topic clusters will improve after 50-200 epochs. However, there are still issues ...


JennieGerhardt commented 4 years ago

The words in each topic are still similar after 100 epochs, and the loss stops falling after 10 epochs. My dataset consists of 10,000 short texts, about 21 MB in total. Is it because my dataset is too small?
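One quick sanity check on the logs above: the `lda` component is numerically identical every epoch, while only the `w2v` term moves. A hypothetical helper (not part of this repo) can parse the printed `EPOCH:` lines and confirm the plateau, which points at the topic-side variables not being updated rather than at dataset size:

```python
import re

# Matches the training output format seen in this thread:
# "EPOCH: 11 LOSS 18736.01 w2v 4.9724298 lda 18731.037"
EPOCH_RE = re.compile(
    r"EPOCH:\s*(\d+)\s+LOSS\s+([0-9.]+)\s+w2v\s+([0-9.]+)\s+lda\s+([0-9.]+)"
)

def parse_epoch_line(line):
    """Return the per-epoch loss components, or None for non-matching lines."""
    m = EPOCH_RE.search(line)
    if m is None:
        return None
    epoch, total, w2v, lda = m.groups()
    return {"epoch": int(epoch), "total": float(total),
            "w2v": float(w2v), "lda": float(lda)}

def lda_is_stuck(lines, tol=1e-6):
    """True when the lda component is numerically constant across all parsed lines."""
    ldas = [p["lda"] for p in (parse_epoch_line(l) for l in lines) if p]
    return len(ldas) >= 2 and max(ldas) - min(ldas) < tol

log = [
    "EPOCH: 11 LOSS 18736.01 w2v 4.9724298 lda 18731.037",
    "EPOCH: 12 LOSS 18736.383 w2v 5.3450413 lda 18731.037",
    "EPOCH: 13 LOSS 18735.596 w2v 4.559466 lda 18731.037",
]
print(lda_is_stuck(log))  # True: only the w2v term is moving
```

If this returns True on a full run, the next thing to inspect is whether the Dirichlet/topic loss tensor is actually connected to the optimizer's trainable variables.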

dbl001 commented 4 years ago

I don’t believe it’s the size of your dataset. I believe it’s an issue with the algorithm and/or the implementation.


JennieGerhardt commented 4 years ago

I only modified this line of code (and the path of the embeddings file). Original:

    embeddings_index = dict(get_coefs(*o.split(" ")) for o in open(EMBEDDING_FILE))

Modified:

    embeddings_index = dict(get_coefs(*o.split(" ")) for o in open(EMBEDDING_FILE, 'r', encoding='utf-8'))

The dataset is "20_newsgroups.txt" from the project files. The loss stops falling after 10 epochs, and I don't know where the issue is. Do you hit this problem when you run the project? Thank you.

EPOCH: 36 LOSS 71936.99 w2v 4.865297 lda 71932.125

EPOCH: 37 LOSS 71936.96 w2v 4.8396053 lda 71932.125

EPOCH: 38 LOSS 71936.484 w2v 4.3559113 lda 71932.125

EPOCH: 39 LOSS 71936.28 w2v 4.1550007 lda 71932.125

EPOCH: 40 LOSS 71936.445 w2v 4.323127 lda 71932.125

---------Closest 10 words to given indexes----------
Topic 0 : got, bike, like, guy, mtl, car, season, hit, wsh, little
Topic 1 : wsh, armenian, db, edm, mtl, xmu, nyr, jesus, widget, intrinsics
Topic 2 : said, armenian, people, armenians, azerbaijani, going, turkish, armenia, azerbaijanis, think
Topic 3 : car, bike, got, like, guy, season, miles, mtl, hit, tires
Topic 4 : god, jesus, christ, bible, christian, believe, faith, christians, think, scripture
Topic 5 : space, planetary, available, nasa, laboratory, earth, systems, software, research, satellites
Topic 6 : people, god, think, fact, religion, believe, bible, way, truth, reason
Topic 7 : drive, card, ram, controller, mb, disk, drives, floppy, scsi, pc
Topic 8 : bible, jesus, christian, god, biblical, religion, scripture, example, christians, theists
Topic 9 : know, like, think, way, things, thing, find, people, want, tell
Topic 10 : wsh, nyr, people, edm, mtl, fact, ott, gun, information, fij
Topic 11 : wsh, mtl, edm, nyr, nyi, fij, adirondack, mv, team, like
Topic 12 : team, season, flyers, play, galley, winnipeg, nyr, game, mtl, teams
Topic 13 : team, season, coach, mtl, fans, wsh, teams, nyi, sanderson, games
Topic 14 : x, file, oname, available, program, version, use, information, example, display
Topic 15 : going, think, encryption, q, know, clipper, chip, president, nsa, government
Topic 16 : wsh, mail, nyr, mtl, email, thanks, sj, ott, edm, john
Topic 17 : wire, grounding, ground, wiring, gfci, cec, conductor, outlets, insulation, outlet
Topic 18 : privacy, encryption, information, government, cryptography, law, public, file, security, electronic
Topic 19 : homicides, gun, nyr, handgun, homicide, mtl, firearms, wsh, handguns, seattle

EPOCH: 41 LOSS 71936.51 w2v 4.38324 lda 71932.125

EPOCH: 42 LOSS 71936.68 w2v 4.5584536 lda 71932.125

EPOCH: 43 LOSS 71936.09 w2v 3.9714596 lda 71932.125

EPOCH: 44 LOSS 71936.22 w2v 4.09763 lda 71932.125

EPOCH: 45 LOSS 71936.22 w2v 4.094322 lda 71932.125

---------Closest 10 words to given indexes----------
Topic 0 : got, bike, like, car, hit, season, guy, mtl, team, wsh
Topic 1 : wsh, armenian, edm, db, mtl, nyr, xmu, widget, jesus, intrinsics
Topic 2 : said, people, armenian, armenians, going, turkish, think, azerbaijani, armenia, know
Topic 3 : car, bike, got, like, guy, season, hit, team, miles, tires
Topic 4 : god, jesus, christ, bible, christian, believe, christians, think, faith, scripture
Topic 5 : space, planetary, available, nasa, satellite, earth, software, systems, research, satellites
Topic 6 : people, god, think, religion, believe, fact, bible, reason, way, gun
Topic 7 : drive, card, controller, ram, mb, disk, floppy, drives, pc, scsi
Topic 8 : bible, god, christian, jesus, religion, scripture, christians, biblical, theists, belief
Topic 9 : know, like, think, way, things, thing, find, people, want, tell
Topic 10 : wsh, nyr, people, edm, mtl, ott, gun, stl, fact, use
Topic 11 : wsh, mtl, edm, nyr, fij, nyi, team, mv, adirondack, like
Topic 12 : team, season, play, flyers, nyr, game, winnipeg, galley, pts, teams
Topic 13 : team, season, fans, wsh, mtl, coach, teams, players, hockey, nyi
Topic 14 : x, file, available, oname, program, use, version, information, software, motif
Topic 15 : going, encryption, q, think, know, clipper, president, chip, government, nsa
Topic 16 : wsh, mail, nyr, mtl, email, ott, thanks, sj, edm, john
Topic 17 : wire, grounding, ground, wiring, gfci, cec, outlets, insulation, conductor, metal
Topic 18 : privacy, encryption, information, government, cryptography, public, file, law, security, electronic
Topic 19 : gun, homicides, nyr, firearms, homicide, wsh, mtl, handgun, file, handguns

EPOCH: 46 LOSS 71936.64 w2v 4.517669 lda 71932.125

EPOCH: 47 LOSS 71936.4 w2v 4.2767406 lda 71932.125

EPOCH: 48 LOSS 71936.07 w2v 3.945362 lda 71932.125

dbl001 commented 4 years ago
  1. The loss does NOT decrease after 10 epochs.
  2. The topic vectors and the word-embedding vectors have issues (please see the other open issues).
  3. I have similar problems. I don't yet know what's wrong.
  4. Try reading the embeddings file like this: embeddings_index = dict(get_coefs(*o.split(" ")) for o in open(EMBEDDING_FILE, encoding="utf8", errors='ignore') if len(o) > 100)
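That guarded read can be sketched as a small standalone loader. This is a hedged illustration, not code from the repo: `load_embeddings`, `toy_glove.txt`, and the shrunken `min_line_len` threshold are hypothetical, standing in for the real `EMBEDDING_FILE` and the `len(o) > 100` filter above:

```python
import os
import tempfile

def get_coefs(word, *arr):
    # As in the snippet above: a word followed by its vector components.
    return word, [float(x) for x in arr]

def load_embeddings(path, min_line_len=10):
    """Build word -> vector, skipping short lines and lines whose components aren't floats."""
    index = {}
    with open(path, encoding="utf-8", errors="ignore") as f:
        for line in f:
            if len(line) < min_line_len:
                continue
            try:
                word, vec = get_coefs(*line.rstrip().split(" "))
            except ValueError:
                continue  # malformed row, e.g. non-numeric component
            index[word] = vec
    return index

# Tiny demo file standing in for the real EMBEDDING_FILE
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "toy_glove.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("patient 0.1 0.2 0.3 0.4 0.5\n")
        f.write("bad line\n")  # too short and non-numeric: skipped
        f.write("ehr 0.5 0.4 0.3 0.2 0.1\n")
    emb = load_embeddings(path)
    print(sorted(emb))  # ['ehr', 'patient']
```

The `errors="ignore"` and length filter mainly protect against encoding junk and header/blank lines in GloVe-style files; they don't change the training dynamics, so a flat lda loss would persist either way.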


JennieGerhardt commented 4 years ago


Thank you for your help.

dbl001 commented 4 years ago

These articles might help you to understand LDA and Lda2Vec

https://multithreaded.stitchfix.com/blog/2016/05/27/lda2vec/#topic=38&lambda=1&term=

https://brooksandrew.github.io/simpleblog/articles/latent-dirichlet-allocation-under-the-hood/

There’s also a PyTorch implementation of lda2vec you can try (it works a bit better).

https://github.com/TropComplique/lda2vec-pytorch


dbl001 commented 4 years ago

… and this:

https://towardsdatascience.com/light-on-math-machine-learning-intuitive-guide-to-latent-dirichlet-allocation-437c81220158

On Feb 4, 2020, at 11:45 AM, David Laxer davidl@softintel.com wrote:

These articles might help you to understand LDA and Lda2Vec

https://multithreaded.stitchfix.com/blog/2016/05/27/lda2vec/#topic=38&lambda=1&term= https://multithreaded.stitchfix.com/blog/2016/05/27/lda2vec/#topic=38&lambda=1&term=

https://brooksandrew.github.io/simpleblog/articles/latent-dirichlet-allocation-under-the-hood/

There’s also a pytorch implementation of Lda2vec you can try (which works a bit better).

https://github.com/TropComplique/lda2vec-pytorch

On Feb 4, 2020, at 11:40 AM, JennieGerhardt notifications@github.com wrote:

1. The loss does not fall after 10 epochs.
2. There is a problem with the topic vectors and the word embedding vectors (see the other open issues).
3. I have a similar problem. I don't know what is wrong yet.
4. embeddings_index = dict(get_coefs(*o.split(" ")) for o in open(EMBEDDING_FILE, encoding="utf8", errors='ignore') if len(o) > 100)

On Feb 3, 2020, at 8:31 PM, JennieGerhardt wrote:

I only modified this line of code and the file's read path:

embeddings_index = dict(get_coefs(o.split(" ")) for o in open(EMBEDDING_FILE))

changed to:

embeddings_index = dict(get_coefs(*o.split(" ")) for o in open(EMBEDDING_FILE, 'r', encoding='utf-8'))

The dataset is "20_newsgroups.txt" in the project files. The loss is no longer falling after 10 epochs, and I don't know where the issue is. Do you have this problem when you run this project? Thank you.

EPOCH: 36 LOSS 71936.99 w2v 4.865297 lda 71932.125
EPOCH: 37 LOSS 71936.96 w2v 4.8396053 lda 71932.125
EPOCH: 38 LOSS 71936.484 w2v 4.3559113 lda 71932.125
EPOCH: 39 LOSS 71936.28 w2v 4.1550007 lda 71932.125
EPOCH: 40 LOSS 71936.445 w2v 4.323127 lda 71932.125

---------Closest 10 words to given indexes----------
Topic 0 : got, bike, like, guy, mtl, car, season, hit, wsh, little
Topic 1 : wsh, armenian, db, edm, mtl, xmu, nyr, jesus, widget, intrinsics
Topic 2 : said, armenian, people, armenians, azerbaijani, going, turkish, armenia, azerbaijanis, think
Topic 3 : car, bike, got, like, guy, season, miles, mtl, hit, tires
Topic 4 : god, jesus, christ, bible, christian, believe, faith, christians, think, scripture
Topic 5 : space, planetary, available, nasa, laboratory, earth, systems, software, research, satellites
Topic 6 : people, god, think, fact, religion, believe, bible, way, truth, reason
Topic 7 : drive, card, ram, controller, mb, disk, drives, floppy, scsi, pc
Topic 8 : bible, jesus, christian, god, biblical, religion, scripture, example, christians, theists
Topic 9 : know, like, think, way, things, thing, find, people, want, tell
Topic 10 : wsh, nyr, people, edm, mtl, fact, ott, gun, information, fij
Topic 11 : wsh, mtl, edm, nyr, nyi, fij, adirondack, mv, team, like
Topic 12 : team, season, flyers, play, galley, winnipeg, nyr, game, mtl, teams
Topic 13 : team, season, coach, mtl, fans, wsh, teams, nyi, sanderson, games
Topic 14 : x, file, oname, available, program, version, use, information, example, display
Topic 15 : going, think, encryption, q, know, clipper, chip, president, nsa, government
Topic 16 : wsh, mail, nyr, mtl, email, thanks, sj, ott, edm, john
Topic 17 : wire, grounding, ground, wiring, gfci, cec, conductor, outlets, insulation, outlet
Topic 18 : privacy, encryption, information, government, cryptography, law, public, file, security, electronic
Topic 19 : homicides, gun, nyr, handgun, homicide, mtl, firearms, wsh, handguns, seattle

EPOCH: 41 LOSS 71936.51 w2v 4.38324 lda 71932.125
EPOCH: 42 LOSS 71936.68 w2v 4.5584536 lda 71932.125
EPOCH: 43 LOSS 71936.09 w2v 3.9714596 lda 71932.125
EPOCH: 44 LOSS 71936.22 w2v 4.09763 lda 71932.125
EPOCH: 45 LOSS 71936.22 w2v 4.094322 lda 71932.125

---------Closest 10 words to given indexes----------
Topic 0 : got, bike, like, car, hit, season, guy, mtl, team, wsh
Topic 1 : wsh, armenian, edm, db, mtl, nyr, xmu, widget, jesus, intrinsics
Topic 2 : said, people, armenian, armenians, going, turkish, think, azerbaijani, armenia, know
Topic 3 : car, bike, got, like, guy, season, hit, team, miles, tires
Topic 4 : god, jesus, christ, bible, christian, believe, christians, think, faith, scripture
Topic 5 : space, planetary, available, nasa, satellite, earth, software, systems, research, satellites
Topic 6 : people, god, think, religion, believe, fact, bible, reason, way, gun
Topic 7 : drive, card, controller, ram, mb, disk, floppy, drives, pc, scsi
Topic 8 : bible, god, christian, jesus, religion, scripture, christians, biblical, theists, belief
Topic 9 : know, like, think, way, things, thing, find, people, want, tell
Topic 10 : wsh, nyr, people, edm, mtl, ott, gun, stl, fact, use
Topic 11 : wsh, mtl, edm, nyr, fij, nyi, team, mv, adirondack, like
Topic 12 : team, season, play, flyers, nyr, game, winnipeg, galley, pts, teams
Topic 13 : team, season, fans, wsh, mtl, coach, teams, players, hockey, nyi
Topic 14 : x, file, available, oname, program, use, version, information, software, motif
Topic 15 : going, encryption, q, think, know, clipper, president, chip, government, nsa
Topic 16 : wsh, mail, nyr, mtl, email, ott, thanks, sj, edm, john
Topic 17 : wire, grounding, ground, wiring, gfci, cec, outlets, insulation, conductor, metal
Topic 18 : privacy, encryption, information, government, cryptography, public, file, law, security, electronic
Topic 19 : gun, homicides, nyr, firearms, homicide, wsh, mtl, handgun, file, handguns

EPOCH: 46 LOSS 71936.64 w2v 4.517669 lda 71932.125
EPOCH: 47 LOSS 71936.4 w2v 4.2767406 lda 71932.125
EPOCH: 48 LOSS 71936.07 w2v 3.945362 lda 71932.125
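The modified line above parses a pretrained-embedding text file into a dict. A slightly more defensive version of that parsing step, sketched here under the assumption of GloVe-style lines (`word v1 v2 ...`), skips malformed or mis-encoded lines instead of crashing; `load_embeddings` is a hypothetical helper, not part of this repo:

```python
import numpy as np

def load_embeddings(path, encoding="utf-8"):
    """Parse a GloVe-style text file: one token followed by its float
    components per line. Lines that fail to parse (bad headers,
    encoding glitches, non-numeric fields) are skipped rather than
    raising, which is a common source of crashes on this step."""
    index = {}
    with open(path, "r", encoding=encoding, errors="ignore") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            if len(parts) < 2:
                continue  # blank or header-only line
            try:
                index[parts[0]] = np.asarray(parts[1:], dtype="float32")
            except ValueError:
                continue  # non-numeric payload; skip instead of crashing
    return index
```

Note that silently skipping lines changes the vocabulary coverage, so it is worth logging how many lines were dropped when debugging a run like the one above.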

thank you for your help


JennieGerhardt commented 4 years ago

These articles might help you understand LDA and lda2vec:

https://multithreaded.stitchfix.com/blog/2016/05/27/lda2vec/#topic=38&lambda=1&term=
https://brooksandrew.github.io/simpleblog/articles/latent-dirichlet-allocation-under-the-hood/

There is also a PyTorch implementation of lda2vec you can try, which works a bit better: https://github.com/TropComplique/lda2vec-pytorch

it helps a lot! thank you!!!!

nateraw commented 4 years ago

The overall loss does not decrease after a certain number of epochs; that is expected behavior of the algorithm, and the parameters are still changing underneath it. The algorithm is also extremely sensitive to preprocessing, so if you change the preprocessing or the input data, you will have to re-tune the hyperparameters.
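For some intuition on why the printed `lda` term can sit still: the reported LOSS is just the sum of the two components (e.g. at epoch 36 above, 4.865297 + 71932.125 ≈ 71936.99), and the `lda` part corresponds to lda2vec's Dirichlet prior over document-topic proportions. The sketch below is my reading of the term from Moody's paper, not this repo's exact code; the `alpha` and `lam` values are illustrative. With `alpha < 1` the prior rewards peaked (sparse) topic proportions, and once proportions saturate its gradient flattens, which is consistent with a plateauing `lda` number while the `w2v` loss keeps wiggling:

```python
import numpy as np

def dirichlet_loss(doc_weights, alpha=0.7, lam=200.0):
    """Sparsity-encouraging Dirichlet prior on document-topic
    proportions (per Moody's lda2vec paper, illustrative constants).
    doc_weights: (n_docs, n_topics) unnormalized topic weights."""
    # Softmax over topics gives each document's topic proportions.
    e = np.exp(doc_weights - doc_weights.max(axis=1, keepdims=True))
    proportions = e / e.sum(axis=1, keepdims=True)
    # Negative log Dirichlet density (up to a constant), to be minimized.
    # With alpha < 1, peaked proportions give a lower loss than uniform ones.
    return -lam * np.sum((alpha - 1.0) * np.log(proportions + 1e-12))
```

Because this term is bounded below only by the epsilon clamp, implementations typically see it settle to a near-constant value early in training, exactly as in the logs here.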

Closing this, as it is not a code-related issue but an algorithmic one. For the record, I do not recommend using lda2vec for production or other serious use cases.