Closed dbl001 closed 5 years ago
Ah, I see why you're so interested in using no pretrained embeddings now...
My suggestion would be to do one of a few things: set `switch_epoch` to a higher value, or train the embedding matrix separately and load it in yourself (in order to give the words time to train). From what I'm seeing, it seems many of these terms are names of people. Strange. Also, the languages in your example seem to be uncommon ones, which would not be covered by most multilingual embedding sources. Looks like your best bet is to train from scratch. However... I've never dealt with this issue before with all the varying languages, so I really don't know whether it would work or not.
Closing because this is more a tricky logic issue than an issue with the repo
spaCy tokens have an attribute `lang_`. What about skipping the token if `token.lang_ != 'en'`?
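A minimal sketch of that proposed filter, using a blank English pipeline rather than `en_core_web_lg` so no model download is needed. One caveat worth noting: `Token.lang_` reports the language of the model's vocab rather than a per-token language detection, so with an English pipeline every token comes back as 'en'.

```python
import spacy

# Blank English tokenizer-only pipeline (the thread uses en_core_web_lg).
nlp = spacy.blank("en")
doc = nlp("hello monde")

# The proposed filter: keep only tokens whose lang_ is 'en'.
# With an English pipeline this keeps everything, including "monde".
kept = [t.text for t in doc if t.lang_ == "en"]
```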
The spaCy Token attributes `is_oov` and `lang_` in `en_core_web_lg` didn't help much with processing documents containing foreign words, or with filtering out 2-3 character words in 20newsgroups (which didn't appear to have much semantic relevance; they could be assembler code instructions, abbreviations, etc.). However, checking whether `token.lower_` is a valid WordNet synset appears to improve the results on all the data sets I have tested.
E.g. in nlppipe.py:

```python
import nltk
self.english_vocab = set(w.lower() for w in nltk.corpus.words.words())
```

nlppipe.py on #72:

```python
if token.is_alpha and not token.is_stop and token.lower_ in self.english_vocab:
```

Python 3.5, Tensorflow 1.5, Keras 2.1.6 and Spacy 2.0.9

```
EPOCH: 70
LOSS 11528.113 w2v 4.000897 lda 11524.112
---------Closest 10 words to given indexes----------
Topic 0 : den, noll, hobgoblin, adirondack, x, chi, interleave, char, phi, dole
Topic 1 : controller, bios, drive, ide, m, jumper, d, chip, disk, floppy
Topic 2 : scripture, sabbath, desert, tomb, priest, prophecy, testament, messiah, accomplished, thirty
Topic 3 : therapy, diet, cancer, vitamin, chronic, infection, clinical, coli, tissue, syndrome
Topic 4 : think, people, sure, going, know, got, like, seen, look, need
Topic 5 : encryption, privacy, cryptography, clipper, escrow, secure, chip, cryptographic, enforcement, security
Topic 6 : scripture, scorer, god, scoring, adirondack, christ, defensive, play, truth, jesus
Topic 7 : hobgoblin, flaming, jonathan, den, icon, wolverine, ghost, priced, consortium, ga
Topic 8 : cursor, width, window, colors, char, hardware, application, viewer, screen, invalid
Topic 9 : hobgoblin, scorer, den, chairman, phi, defence, micro, smokeless, ga, headquarters
Topic 10 : armenian, apartment, mamma, azerbaijani, marina, turkish, father, went, x, baku
Topic 11 : wiring, grounding, wire, ground, metal, conductor, outlet, insulation, breaker, bike
Topic 12 : adirondack, chi, otto, phi, season, providence, goalie, tor, stewart, pitcher
Topic 13 : jesus, mr, god, bible, christ, holy, christian, passage, testament, scripture
Topic 14 : firearm, trend, homicide, weapon, measure, accidental, revolver, statistic, rape, rifle
Topic 15 : god, truth, belief, bible, accept, true, exist, example, religion, existence
Topic 16 : propaganda, regulation, integration, welfare, gee, rifle, agenda, fund, country, regime
Topic 17 : scorer, card, engine, otto, power, phi, bike, configuration, condition, drive
Topic 18 : year, think, work, going, good, sure, look, pretty, know, people
Topic 19 : defamation, rod, diego, unto, hobgoblin, storm, arizona, reduction, unauthorized, carter
```
Interestingly, the results are worse using Tensorflow 1.12, Keras 2.2.4 and Spacy 2.0.12 than with the versions in your requirements.txt. I don't know why.
Python 3.6, Tensorflow 1.12, Keras 2.2.4 and Spacy 2.0.12
EPOCH: 70
LOSS 11528.196 w2v 4.346923 lda 11523.85
---------Closest 10 words to given indexes----------
Topic 0 : x, mr, apartment, know, want, said, q, azerbaijani, set, look
Topic 1 : scorer, den, adirondack, galley, chi, phi, scoring, pit, tor, bos
Topic 2 : pitching, year, better, season, defense, win, pitcher, think, good, probably
Topic 3 : bios, chip, card, serial, disk, ide, hardware, x, hi, thanks
Topic 4 : doom, cop, concealed, jonathan, song, na, director, dod, jacket, samuel
Topic 5 : wiring, wire, grounding, electrical, bike, insulation, engine, safety, car, want
Topic 6 : drive, father, bike, marina, floppy, w, saw, phone, room, normal
Topic 7 : think, reason, god, course, people, way, believe, question, true, argument
Topic 8 : adirondack, tor, bos, hobgoblin, smokeless, den, providence, phi, maple, ga
Topic 9 : gif, ram, edit, file, disk, viewer, format, output, color, cursor
Topic 10 : encryption, enforcement, privacy, escrow, constitution, clipper, chip, abiding, firearm, security
Topic 11 : m, bios, chip, delta, launch, space, x, controller, ram, meg
Topic 12 : grounding, conductor, metal, magnetic, comet, wiring, strip, homicide, di, shuttle
Topic 13 : jesus, israel, claim, matthew, christian, saying, killing, message, said, m
Topic 14 :
Interesting that the versions don't work out the same. Again, I can't test on anything higher than TF v1.5.0 until I either upgrade my hardware (too expensive) or figure out a way to run newer versions on my hardware. Problem stems from AVX instructions not working on old hardware (and AVX started to be used by default after version 1.5.0)
Can you test TensorFlow as CPU-only on Linux? The 20 newsgroups data set runs relatively fast.
It's probably worthwhile to figure out which part of the processing has the differences, e.g. load vs. run, or both. What do you think about my WordNet filter logic?
I can perhaps test the CPU version. You mean 1.12 CPU, right? Should be able to do this without messing anything up if I use a Docker container.
Also - I like the WordNet filter idea. Obviously it's a specific piece of the preprocessing pipeline that you have to do for your use case. Would maybe be cool to be able to add optional custom spaCy pipe components (https://explosion.ai/blog/spacy-v2-pipelines-extensions) to nlppipe.py so that you can extend it rather easily without having to redo everything I wrote.
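A spaCy pipe component is just a callable that takes a `Doc` and returns it, so an optional filter could plug in with very little glue. A rough sketch (the component name and the `kept_tokens` extension attribute are hypothetical, and the length check stands in for whatever vocab test a user supplies):

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Custom Doc extension to carry the filter's result downstream.
Doc.set_extension("kept_tokens", default=None, force=True)

def vocab_filter(doc):
    # Example filter: keep alphabetic tokens longer than two characters.
    doc._.kept_tokens = [t.text for t in doc if t.is_alpha and len(t) > 2]
    return doc

doc = vocab_filter(nlp("Go to the bios menu"))
```

Calling the component directly, as above, keeps the sketch compatible across spaCy versions; registering it with `nlp.add_pipe` would let it run automatically inside the pipeline.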
> You mean 1.12 CPU right?

Yes.
I’d be curious to see if you get better results on 20_newsgroups using the WordNet filter.
I think the results are only going to be better if there are a lot of non-English tokens. 20 newsgroups doesn't have a big problem with this, I don't think.
Here's an example of training on my 'stories.txt' file, which comprises ~5100 news stories captured from numerous news sites:

Many of these words are not in:

- glove.6B.300d.txt
- glove.840B.300d.txt
- en_core_web_lg
Should we filter out foreign words?
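One way to decide is to measure how much of the corpus vocabulary is actually missing from a pretrained embedding file before filtering anything. A small sketch (the file path and helper name are illustrative):

```python
# Hypothetical helper: fraction of a corpus vocabulary that is absent
# from a pretrained embedding vocabulary (e.g. glove.6B.300d.txt).
def oov_fraction(corpus_words, embedding_vocab):
    corpus = set(w.lower() for w in corpus_words)
    missing = corpus - embedding_vocab
    return len(missing) / max(len(corpus), 1)

# The embedding vocab would normally be read from the GloVe file, whose
# lines start with the word followed by its vector:
# with open("glove.6B.300d.txt") as f:
#     embedding_vocab = {line.split(" ", 1)[0] for line in f}
vocab = {"drive", "disk", "controller"}
rate = oov_fraction(["drive", "Disk", "azerbaijani"], vocab)  # one of three is OOV
```

If the OOV rate is high, training embeddings from scratch (or filtering aggressively) makes sense; if it is low, the pretrained vectors plus a small filter may be enough.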