Foreign words not in spacy model or GloVe

dbl001 commented 5 years ago

Here's an example of training on my 'stories.txt' file, which comprises ~5100 news stories captured from numerous news sites:

EPOCH: 15
LOSS 626.1764 w2v 6.0411406 lda 620.13525
---------Closest 10 words to given indexes----------
Topic 0 : mosharrof, mehazabien, assamese, shuvro, azerbaijani, newscasts, basque, l, mehzbin, galician
Topic 1 : newscasts, disqus, studentu, mehazabien, mosharrof, ticker, assamese, moldovan, mehzbin, azerbaijani
Topic 2 : newscasts, galician, azerbaijani, mehazabien, assamese, basque, flemish, moldovan, mosharrof, shuvro
Topic 3 : mosharrof, mehazabien, shuvro, newscasts, azerbaijani, mehzbin, assamese, l, allen, mim
Topic 4 : mosharrof, mehazabien, marathi, moldovan, assamese, galician, azerbaijani, faroese, shuvro, maltese
Topic 5 : institutionalizing, mehazabien, mosharrof, mehzbin, mim, melanie, shuvro, azerbaijani, safa, malay
Topic 6 : newscasts, mehazabien, mosharrof, shuvro, azerbaijani, assamese, studentu, sonny, basque, mehzbin
Topic 7 : mehazabien, mosharrof, marathi, shuvro, assamese, mehzbin, malay, basque, azerbaijani, oriya
Topic 8 : mehazabien, mosharrof, shuvro, newscasts, mim, mehzbin, allen, l, azerbaijani, assamese
Topic 9 : mehazabien, mosharrof, shuvro, mehzbin, l, kabir, mim, jovan, toya, allen
Topic 10 : mehazabien, mosharrof, oceania, assamese, messenger, newscasts, studentu, moldovan, shuvro, mehzbin
Topic 11 : mosharrof, mehazabien, azerbaijani, shuvro, assamese, newscasts, l, basque, safa, allen
Topic 12 : mehazabien, mosharrof, shuvro, mehzbin, assamese, newscasts, azerbaijani, l, mim, allen
Topic 13 : mehazabien, mosharrof, shuvro, mehzbin, l, azerbaijani, assamese, newscasts, mim, allen
Topic 14 : mehazabien, mosharrof, assamese, azerbaijani, shuvro, newscasts, l, basque, malay, mehzbin
Topic 15 : mehazabien, mosharrof, eish, shuvro, newscasts, azerbaijani, assamese, studentu, chery, mehzbin
Topic 16 : mehazabien, mosharrof, lithuanian, kurmanji, azerbaijani, kyrgyz, moldovan, assamese, burmese, belarusian
Topic 17 : newscasts, mehazabien, mosharrof, bookmark, allen, l, shuvro, disqus, talkup, azerbaijani
Topic 18 : mehazabien, mosharrof, newscasts, shuvro, azerbaijani, galician, assamese, bienenretter, aktuell, malay
Topic 19 : assamese, azerbaijani, mosharrof, allen, moldovan, mehzbin, shuvro, lithuanian, marathi, mehazabien

Many of these words are not in glove.6B.300d.txt or glove.840B.300d.txt, nor in en_core_web_lg.

Should we filter out foreign words?

nateraw commented 5 years ago

Ah, I see why you're so interested in using no pretrained embeddings now...

My suggestion would be to do one of 4 things:

1. Figure out the % of foreign words in your texts to see if you should just filter out all foreign words and leave the embeddings as English.
2. If you determine the % of foreign words is higher than English, I would use embeddings for that language and filter out English instead.
3. If you find there are many different languages in your text, you could try multi-lingual embeddings (if the languages are common, like English, German, French, etc.).
4. Train your own embedding matrix from scratch. (Remember to either set switch_epoch to a higher value, or train the embedding matrix separately and load it in yourself, in order to give the words time to train.)
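The first option, measuring the share of foreign words, could be sketched roughly as follows. This is a minimal sketch, assuming the text is already tokenized and that some English vocabulary set is at hand; `out_of_vocab_fraction` and the toy data are hypothetical and not part of this repo.

```python
from collections import Counter


def out_of_vocab_fraction(tokens, vocab):
    """Return (fraction of alphabetic tokens not in vocab, Counter of the missing tokens)."""
    lowered = [t.lower() for t in tokens if t.isalpha()]
    if not lowered:
        return 0.0, Counter()
    missing = Counter(t for t in lowered if t not in vocab)
    return sum(missing.values()) / len(lowered), missing


# Toy data for illustration only; in practice vocab might be the GloVe word list.
toy_vocab = {"the", "news", "story", "about", "was"}
toy_tokens = ["The", "news", "story", "was", "about", "mosharrof", "mehazabien", "shuvro"]

fraction, missing = out_of_vocab_fraction(toy_tokens, toy_vocab)
print(f"{fraction:.1%} of alphabetic tokens are out of vocabulary")
print(missing.most_common(3))
```

Looking at the most common missing tokens also shows whether they are mostly names, other languages, or site boilerplate, which may affect which of the four options makes sense.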

From what I'm seeing, it seems many of these terms are names of people. Strange. Also, the languages that are there in your example seem to be uncommon languages, which would not be covered by most multi-lingual embedding sources. Looks like your best bet is to train from scratch. However...I've never dealt with that issue before with all the varying languages, so I really don't know if it would work or not.

nateraw commented 5 years ago

Closing because this is more a tricky logic issue than an issue with the repo

dbl001 commented 5 years ago

spaCy tokens have an attribute lang. What about skipping the token if lang != 'en'?

dbl001 commented 5 years ago

The spaCy Token attributes 'is_oov' and 'lang' with nlp = "en_core_web_lg" didn't help much, either for processing documents with foreign words or for filtering out the 2-3 character words in 20newsgroups that didn't appear to have much semantic relevance (they could be assembler code instructions, abbreviations, etc.). However, checking whether the spaCy token.lower_ is a valid WordNet synset appears to be improving the results on all the data sets I have tested.

E.g. in nlppipe.py:

    import nltk
    self.english_vocab = set(w.lower() for w in nltk.corpus.words.words())

nlppipe.py on #72:

    if token.is_alpha and not token.is_stop and token.lower_ in self.english_vocab:

Python 3.5, Tensorflow 1.5, Keras 2.1.6 and Spacy 2.0.9

EPOCH: 70
LOSS 11528.113 w2v 4.000897 lda 11524.112
---------Closest 10 words to given indexes----------
Topic 0 : den, noll, hobgoblin, adirondack, x, chi, interleave, char, phi, dole
Topic 1 : controller, bios, drive, ide, m, jumper, d, chip, disk, floppy
Topic 2 : scripture, sabbath, desert, tomb, priest, prophecy, testament, messiah, accomplished, thirty
Topic 3 : therapy, diet, cancer, vitamin, chronic, infection, clinical, coli, tissue, syndrome
Topic 4 : think, people, sure, going, know, got, like, seen, look, need
Topic 5 : encryption, privacy, cryptography, clipper, escrow, secure, chip, cryptographic, enforcement, security
Topic 6 : scripture, scorer, god, scoring, adirondack, christ, defensive, play, truth, jesus
Topic 7 : hobgoblin, flaming, jonathan, den, icon, wolverine, ghost, priced, consortium, ga
Topic 8 : cursor, width, window, colors, char, hardware, application, viewer, screen, invalid
Topic 9 : hobgoblin, scorer, den, chairman, phi, defence, micro, smokeless, ga, headquarters
Topic 10 : armenian, apartment, mamma, azerbaijani, marina, turkish, father, went, x, baku
Topic 11 : wiring, grounding, wire, ground, metal, conductor, outlet, insulation, breaker, bike
Topic 12 : adirondack, chi, otto, phi, season, providence, goalie, tor, stewart, pitcher
Topic 13 : jesus, mr, god, bible, christ, holy, christian, passage, testament, scripture
Topic 14 : firearm, trend, homicide, weapon, measure, accidental, revolver, statistic, rape, rifle
Topic 15 : god, truth, belief, bible, accept, true, exist, example, religion, existence
Topic 16 : propaganda, regulation, integration, welfare, gee, rifle, agenda, fund, country, regime
Topic 17 : scorer, card, engine, otto, power, phi, bike, configuration, condition, drive
Topic 18 : year, think, work, going, good, sure, look, pretty, know, people
Topic 19 : defamation, rod, diego, unto, hobgoblin, storm, arizona, reduction, unauthorized, carter
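The vocabulary filter from this comment could look roughly like this outside of spaCy. It is a sketch under assumptions: `keep_token` is a hypothetical helper that works on plain strings, and the vocabulary and stop-word sets are tiny stand-ins (the snippet in the comment builds its vocabulary from `nltk.corpus.words.words()`).

```python
def keep_token(text, english_vocab, stop_words):
    """Keep alphabetic, non-stop-word tokens whose lowercase form is in the vocabulary."""
    lowered = text.lower()
    return text.isalpha() and lowered not in stop_words and lowered in english_vocab


# Stand-in sets for illustration only; the thread uses NLTK's word list instead.
english_vocab = {"encryption", "privacy", "the", "chip"}
stop_words = {"the"}

tokens = ["Encryption", "the", "mehazabien", "chip", "x86", "privacy"]
kept = [t.lower() for t in tokens if keep_token(t, english_vocab, stop_words)]
print(kept)
```

A side effect worth keeping in mind is that any legitimate term missing from the word list (product names, newer technical vocabulary) is dropped together with the foreign words.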

Interestingly, the results are worse using Tensorflow 1.12, Keras 2.2.4 and Spacy 2.0.12 than with the versions in your requirements.txt. I don't know why.

Python 3.6 Tensorflow 1.12 Keras 2.2.4 and Spacy 2.0.12

EPOCH: 70
LOSS 11528.196 w2v 4.346923 lda 11523.85
---------Closest 10 words to given indexes----------
Topic 0 : x, mr, apartment, know, want, said, q, azerbaijani, set, look
Topic 1 : scorer, den, adirondack, galley, chi, phi, scoring, pit, tor, bos
Topic 2 : pitching, year, better, season, defense, win, pitcher, think, good, probably
Topic 3 : bios, chip, card, serial, disk, ide, hardware, x, hi, thanks
Topic 4 : doom, cop, concealed, jonathan, song, na, director, dod, jacket, samuel
Topic 5 : wiring, wire, grounding, electrical, bike, insulation, engine, safety, car, want
Topic 6 : drive, father, bike, marina, floppy, w, saw, phone, room, normal
Topic 7 : think, reason, god, course, people, way, believe, question, true, argument
Topic 8 : adirondack, tor, bos, hobgoblin, smokeless, den, providence, phi, maple, ga
Topic 9 : gif, ram, edit, file, disk, viewer, format, output, color, cursor
Topic 10 : encryption, enforcement, privacy, escrow, constitution, clipper, chip, abiding, firearm, security
Topic 11 : m, bios, chip, delta, launch, space, x, controller, ram, meg
Topic 12 : grounding, conductor, metal, magnetic, comet, wiring, strip, homicide, di, shuttle
Topic 13 : jesus, israel, claim, matthew, christian, saying, killing, message, said, m
Topic 14 : , flash, galaxy, continental, designing, artificial, explorer, hobgoblin, courtesy, dollar
Topic 15 : den, hobgoblin, adirondack, noll, handler, phi, allocation, char, chi, wolverine
Topic 16 : , alexander, sincerely, reaching, artificial, studied, yeast, mistaken, likewise, zionism
Topic 17 : allocation, interleave, digest, linear, smoking, tor, hobgoblin, maple, scorer, briefing
Topic 18 : testament, jesus, christian, theology, holy, christ, scripture, ottoman, empire, bible
Topic 19 : price, good, offer, stereo, interested, mail, repair, great, buy, service
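One rough way to put a number on how two runs differ is the Jaccard overlap between the top-10 word lists of matching topics. The sketch below is only an illustration: `jaccard` is a hypothetical helper, the two lists are the encryption-themed topics copied from the two result listings in this comment, and pairing them by theme is a manual assumption since topic indexes are not aligned across runs.

```python
def jaccard(words_a, words_b):
    """Jaccard similarity between two word lists, treated as sets."""
    set_a, set_b = set(words_a), set(words_b)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


# Topic 5 from the Tensorflow 1.5 run.
run_a = ["encryption", "privacy", "cryptography", "clipper", "escrow",
         "secure", "chip", "cryptographic", "enforcement", "security"]
# Topic 10 from the Tensorflow 1.12 run.
run_b = ["encryption", "enforcement", "privacy", "escrow", "constitution",
         "clipper", "chip", "abiding", "firearm", "security"]

overlap = jaccard(run_a, run_b)
print(f"Jaccard overlap: {overlap:.2f}")
```

Overlap alone does not say which run is better; it only shows how much the top words moved between the two environments.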

nateraw commented 5 years ago

Interesting that the versions don't work out the same. Again, I can't test on anything higher than TF v1.5.0 until I either upgrade my hardware (too expensive) or figure out a way to run newer versions on my hardware. Problem stems from AVX instructions not working on old hardware (and AVX started to be used by default after version 1.5.0)

dbl001 commented 5 years ago

Can you test Tensorflow CPU-only on Linux? 20 newsgroups runs relatively fast.

It's probably worthwhile to figure out which part of the processing has the differences, e.g. load vs. run, or both. What do you think about my WordNet filter logic?
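To separate "load" from "run", one option is to fingerprint the preprocessing output in each environment and compare the hashes before any training happens. This is a minimal sketch, assuming the preprocessing step yields a list of token lists; `preprocessing_fingerprint` and the sample documents are hypothetical and not part of this repo.

```python
import hashlib
import json


def preprocessing_fingerprint(tokenized_docs):
    """Hash a list of token lists so two environments can be compared cheaply."""
    payload = json.dumps(tokenized_docs, ensure_ascii=False, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Hypothetical outputs of the preprocessing step in different environments.
docs_env_a = [["drive", "bios", "chip"], ["god", "bible"]]
docs_env_b = [["drive", "bios", "chip"], ["god", "bible"]]
docs_env_c = [["drive", "bios"], ["god", "bible"]]

same_ab = preprocessing_fingerprint(docs_env_a) == preprocessing_fingerprint(docs_env_b)
same_ac = preprocessing_fingerprint(docs_env_a) == preprocessing_fingerprint(docs_env_c)
print(same_ab, same_ac)
```

If the fingerprints match across the two setups, the tokenized input is identical and the difference would have to come from training; if they differ, the spaCy version is the first thing to look at.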

nateraw commented 5 years ago

I can perhaps test the CPU version. You mean 1.12 CPU, right? I should be able to do this without messing anything up if I use a Docker container.

Also, I like the WordNet filter idea. Obviously it's a specific piece of the preprocessing pipeline that you have to do for your use case. It would maybe be cool to be able to add optional custom spaCy pipe components (https://explosion.ai/blog/spacy-v2-pipelines-extensions) to nlppipe.py so that you can extend it rather easily without having to redo everything I wrote.
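The extension idea could be sketched in plain Python as a list of optional, user-supplied filters. This is not the spaCy pipeline API and not how nlppipe.py is written; `TokenFilterPipeline`, `add_filter`, and the toy vocabulary are hypothetical names used only to illustrate the design.

```python
class TokenFilterPipeline:
    """Hypothetical container for optional, user-supplied token filters."""

    def __init__(self):
        self._filters = []

    def add_filter(self, name, predicate):
        """Register a predicate; a token is kept only if every predicate returns True."""
        self._filters.append((name, predicate))
        return self

    def __call__(self, tokens):
        return [t for t in tokens if all(pred(t) for _, pred in self._filters)]


# Toy vocabulary for illustration only.
toy_vocab = {"wire", "ground", "outlet"}

pipeline = TokenFilterPipeline()
pipeline.add_filter("alphabetic", str.isalpha)
pipeline.add_filter("in_vocab", lambda t: t.lower() in toy_vocab)

filtered = pipeline(["Wire", "ground", "mehzbin", "42", "outlet"])
print(filtered)
```

With no filters registered the pipeline passes every token through, so the default behaviour stays unchanged for users who do not need a vocabulary filter.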

dbl001 commented 5 years ago

> You mean 1.12 CPU right?

Yes.

I'd be curious to see if you get better results on 20_newsgroups using the WordNet filter.

nateraw commented 5 years ago

I think the results are only going to be better if there are a lot of non-English tokens. Twenty newsgroups doesn't have a big problem with this, I don't think.

nateraw / Lda2vec-Tensorflow

Foreign words not in spacy model or GloVe #47