nateraw / Lda2vec-Tensorflow

Tensorflow 1.5 implementation of Chris Moody's Lda2vec, adapted from @meereeum
MIT License

Working Example #8

Closed nateraw closed 6 years ago

nateraw commented 6 years ago

I've been working on this code base for quite a while, but I have yet to see a working example. I've tried calculating the loss function differently, all sorts of hyperparameters, and different ways of preprocessing the data, but I still haven't seen this code, or the original author's, actually work.

So, if anybody wants to contribute an example that is reproducible, please let me know! Let me know if I can help explain what's going on in any of the files. Thank you.

dbl001 commented 6 years ago

Questions:

  1. Do you know why the loss function goes negative after the 5th epoch? Is it overfitting?

EPOCH: 1

EPOCH: 2 STEP 100 LOSS 220.99734 w2v 220.99734 lda -38858.785

EPOCH: 3 STEP 200 LOSS 206.99643 w2v 206.99643 lda -38903.184

EPOCH: 4 STEP 300 LOSS 214.60081 w2v 214.60081 lda -38978.082

EPOCH: 5 STEP 400 LOSS 189.11394 w2v 189.11394 lda -39061.438

EPOCH: 6 STEP 500 LOSS -39708.11 w2v 196.33139 lda -39904.44

EPOCH: 7 STEP 600 LOSS -41558.43 w2v 156.58838 lda -41715.02

EPOCH: 8

EPOCH: 9 STEP 700 LOSS -42843.523 w2v 152.64543 lda -42996.168

EPOCH: 10 STEP 800 LOSS -43950.285 w2v 154.05565 lda -44104.34

  2. How would I use the generated 'lda2vec' vectors? How do I access them from the Tensorflow graph to compute these vector operations?

E.g. Hacker News - story + question = StackOverflow


  3. Have you experimented with pyLDAvis to visualize the generated topics?

https://pyldavis.readthedocs.io/en/latest/

https://nbviewer.jupyter.org/github/cemoody/lda2vec/blob/master/examples/hacker_news/lda2vec/lda2vec.ipynb

nateraw commented 6 years ago

1) The loss goes negative after the "switch loss" epoch (in this case, 5) because once we reach the epoch set by the switch loss variable, we "switch on" the lda loss (which is negative). For the first 5 epochs, we were only training the word vectors. If you don't want this behavior, you can set the switch loss variable to 0.
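A minimal sketch of the switch behavior described above, checked against the numbers in the quoted training log. The function name `total_loss` and the exact comparison (`>` rather than `>=`, with 1-based epoch labels) are assumptions made for illustration, not the repository's actual code.

```python
def total_loss(epoch, w2v_loss, lda_loss, switch_loss_epoch):
    """Return the reported loss for a 1-based epoch label.

    Assumption: the LDA term is only added for epochs after switch_loss_epoch.
    """
    if epoch > switch_loss_epoch:
        return w2v_loss + lda_loss
    return w2v_loss


# Values copied from the training log quoted earlier in this thread.
print(total_loss(5, 189.11394, -39061.438, switch_loss_epoch=5))  # word loss only
print(total_loss(6, 196.33139, -39904.44, switch_loss_epoch=5))   # word loss + lda loss
```

Under these assumptions, the epoch 6 value is simply the sum of the two logged terms, which is why LOSS drops by tens of thousands in a single step rather than drifting there gradually.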

Now, as for why it is negative, this is something I've been struggling with. If you check out the dirichlet_likelihood.py file, you can see my experiments...

    # Normal
    #loss = (alpha - 1) * log_proportions
    # Negative - it works
    loss = -(alpha - 1) * log_proportions
    # Abs Proportions + negative
    #loss = -(alpha - 1) * tf.abs(log_proportions)

    return tf.reduce_sum(loss)

You can see that the one we are currently using is negative. The "normal" one above it is the one that @meereeum used in her implementation. That one is positive; however, when training with it, all of the document proportions converge to the same values for some reason, the loss converges, and basically nothing happens (in my experience, at least).

Now, you might be wondering why I decided to make it negative... The answer can be found in the original author's own dirichlet_likelihood file: https://github.com/cemoody/lda2vec/blob/master/lda2vec/dirichlet_likelihood.py. If you look, he put a negative sign on the return statement.

    loss = (alpha - 1.0) * log_proportions
    return -F.sum(loss)

Very confusing! On top of this, if you search through his examples, you will see that he never even uses the lambda variable discussed in the paper... I'm not sure if this was a mistake when he uploaded the code, but all he does is initialize it and then never use it. Look at the clambda variable here: https://github.com/cemoody/lda2vec/blob/master/examples/twenty_newsgroups/lda2vec/lda2vec_run.py
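On the sign question, a small NumPy sketch may help. It evaluates the unnormalized Dirichlet log-likelihood, sum((alpha - 1) * log p), for a uniform and a sparse topic mixture. The value `alpha = 0.25` and both example mixtures are made up for illustration; whether the repository uses an alpha below 1 is an assumption here.

```python
import numpy as np


def dirichlet_log_likelihood(proportions, alpha):
    """Unnormalized Dirichlet log-likelihood: sum((alpha - 1) * log(p))."""
    log_proportions = np.log(proportions)
    return float(np.sum((alpha - 1.0) * log_proportions))


alpha = 0.25  # hypothetical value below 1, e.g. 1 / n_topics with 4 topics
uniform = np.array([0.25, 0.25, 0.25, 0.25])
sparse = np.array([0.97, 0.01, 0.01, 0.01])

ll_uniform = dirichlet_log_likelihood(uniform, alpha)
ll_sparse = dirichlet_log_likelihood(sparse, alpha)

# With alpha < 1 the sparse mixture has the higher likelihood, so a quantity
# that is *minimized* has to be the negative of this sum to favor sparsity.
print("uniform: log-likelihood", ll_uniform, "loss", -ll_uniform)
print("sparse:  log-likelihood", ll_sparse, "loss", -ll_sparse)
```

If alpha is indeed below 1, minimizing the un-negated sum would reward uniform mixtures, which might be related to the "all document proportions converge to be the same" behavior described above. That is only one possible reading, not something confirmed in this thread. The negated sum is also unbounded below as proportions approach zero, which would fit the large negative lda values in the log.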

2) If you want to use the embeddings for vector arithmetic, you can extract them (and probably normalize them) and then do the vector math on the resulting arrays. If you want the raw embeddings straight from the graph, you can extract them like this:

# We called our model variable m
doc_embed = m.sesh.run(m.doc_embedding)
topic_embed = m.sesh.run(m.topic_embedding)
word_embed = m.sesh.run(m.word_embedding)

As for performing queries like the one you mentioned, I haven't even bothered, because I haven't seen the k_closest function return anything promising and the topic embedding vectors are all pretty much the same.
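For reference, the kind of query asked about ("Hacker News - story + question = StackOverflow") can be sketched in plain NumPy once an embedding matrix has been extracted. The helper names `normalize_rows` and `analogy`, the labels, and the 3-d vectors below are all hypothetical toy values; this is not the repository's k_closest function.

```python
import numpy as np


def normalize_rows(matrix):
    """Scale every row to unit L2 norm so dot products are cosine similarities."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    return matrix / np.maximum(norms, 1e-12)


def analogy(embed, labels, positive, negative, k=1):
    """Return the k labels closest to sum(positive) - sum(negative)."""
    unit = normalize_rows(embed)
    index = {label: i for i, label in enumerate(labels)}
    query = sum(unit[index[p]] for p in positive) - sum(unit[index[n]] for n in negative)
    query = query / max(np.linalg.norm(query), 1e-12)
    scores = unit @ query
    # Exclude the query terms themselves from the ranking.
    for term in list(positive) + list(negative):
        scores[index[term]] = -np.inf
    best = np.argsort(-scores)[:k]
    return [labels[i] for i in best]


# Toy 3-d vectors, made up for illustration only.
labels = ["hacker_news", "story", "question", "stackoverflow", "other"]
embed = np.array([
    [1.0, 1.0, 0.0],    # hacker_news
    [0.0, 1.0, 0.0],    # story
    [0.0, 0.0, 1.0],    # question
    [1.0, 0.0, 1.0],    # stackoverflow
    [-1.0, 0.0, -1.0],  # other
])
print(analogy(embed, labels, positive=["hacker_news", "question"], negative=["story"]))
```

With real embeddings, `embed` would be one of the extracted matrices and `labels` the matching vocabulary or document names. If the vectors are all nearly identical, as reported above, the ranking will be close to arbitrary.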

3) I started working on getting the pyldavis implementation working too, but like I said, if we can't get promising results there's really no reason to visualize them. Again, in the original author's code we can find an example of him preparing the data for pyldavis and copy that a bit. Here is the file he did this in: https://github.com/cemoody/lda2vec/blob/master/lda2vec/topics.py. If you take what I just wrote up about extracting the embedding matrices, you're most of the way there... The only thing I haven't done is get the variable he mentions called "doc_lengths". This needs to be added to the preprocessing. Basically we need an array indicating the number of tokens in each document. Apparently this variable is required by pyldavis. Anyway, here is what I have so far on getting that data ready...

def generate_ldavis_data():
    doc_embed = m.sesh.run(m.doc_embedding)
    topic_embed = m.sesh.run(m.topic_embedding)
    word_embed = m.sesh.run(m.word_embedding)

    # Extract all unique words in order of index 0-vocab_size
    vocabulary = [idx_to_word[i] for i in range(vocab_size)]

    # TODO: extract the number of tokens in each document (doc_lengths)

    # utils.py is a direct copy of the original author's "topics.py" file
    data = utils.prepare_topics(doc_embed, topic_embed, word_embed, vocabulary)
    return data
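The missing doc_lengths step described above could be sketched as follows. Both helper names are hypothetical, and the sketch assumes that preprocessing still has either the tokenized documents or one document ID per token; if only per-skipgram-pair document IDs are kept, the counts would need to be collected earlier in the pipeline.

```python
import numpy as np


def compute_doc_lengths(tokenized_docs):
    """Number of tokens in each document, in document order."""
    return np.array([len(tokens) for tokens in tokenized_docs], dtype=np.int64)


def doc_lengths_from_ids(token_doc_ids, n_docs):
    """Same counts when only a flat array of per-token document IDs is kept."""
    return np.bincount(np.asarray(token_doc_ids), minlength=n_docs)


# Toy corpus, including one empty document.
docs = [["topic", "models", "are", "fun"], ["hello", "world"], []]
print(compute_doc_lengths(docs))
print(doc_lengths_from_ids([0, 0, 0, 0, 1, 1], n_docs=3))
```

Passing `minlength` keeps documents with zero tokens in the output, so the array stays aligned with the rows of the document embedding.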

Thank you for showing interest in contributing, I appreciate it!!

nateraw commented 6 years ago

Oh...and just to be safe, I'll run the "normal" loss function on my server for at least 200 epochs and get back to you. I heard from one of the issues on a different version that it takes at least 20 epochs to get the topics to start to show.

dbl001 commented 6 years ago

Have you ever tried ‘sense2vec’ (as opposed to ‘word2vec’) in your pre-processing pipeline?


nateraw commented 6 years ago

Yes, if you pass the parameter "merge = True" to the nlppipe instantiation, you'll get sense2vec tokens :smile:

dbl001 commented 6 years ago

https://github.com/explosion/sense2vec


nateraw commented 6 years ago

Yeah, the premise behind sense2vec is that you can merge noun phrases into single tokens. When you pass merge=True, the nlppipe.py file will merge any noun phrases in nearly the same way as the original author did (his way of doing it is completely broken, so I had to improvise). You don't have to use that sense2vec wrapper for spacy, but feel free to try it! Let me know if it gives different results.
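To illustrate what merging noun phrases into single tokens means, here is a dependency-free sketch. The function `merge_noun_phrases`, the underscore joiner, and the hand-written spans are assumptions for illustration; in nlppipe.py the phrase boundaries would come from the parser rather than being typed in.

```python
def merge_noun_phrases(tokens, phrase_spans, joiner="_"):
    """Merge each (start, end) span of tokens into a single token.

    Spans are half-open, sorted, and non-overlapping.
    """
    merged = []
    position = 0
    for start, end in phrase_spans:
        merged.extend(tokens[position:start])            # tokens before the phrase
        merged.append(joiner.join(tokens[start:end]))    # the phrase as one token
        position = end
    merged.extend(tokens[position:])                     # tokens after the last phrase
    return merged


tokens = ["the", "new", "york", "times", "covers", "machine", "learning"]
spans = [(1, 4), (5, 7)]  # hand-written; a parser would normally supply these
print(merge_noun_phrases(tokens, spans))
```

After merging, each phrase gets its own vocabulary entry and therefore its own word vector, at the cost of a larger and sparser vocabulary.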

My preprocessing file is really just an option. You can preprocess the data any way you want, as long as you pass it into the model in the correct format.

dbl001 commented 6 years ago

Understood. Question: I installed your lda2vec into an Anaconda virtual environment with: python setup.py install.

Curiously, I can only import your code when running from the Lda2vec_tensorflow subdirectory. What am I missing?


dbl001 commented 6 years ago

    import sense2vec
    model = sense2vec.load('/Users/davidlaxer/Downloads/reddit_vectors-1.1.0')
    new_model = gensim.models.Word2Vec.load('/Users/davidlaxer/LSTM-Sentiment-Analysis/corpus_output_50.txt')
    word_vectors = new_model.wv
    wordsList = new_model.wv.index2word
    type(wordsList)

    freq, query_vector1 = model["flies|NOUN"]
    model.most_similar(query_vector1, n=5)
    (['flies|NOUN', 'gnats|NOUN', 'snakes|NOUN', 'birds|NOUN', 'grasshoppers|NOUN'], <MemoryView of 'ndarray' at 0x1a6920a048>)

    freq, query_vector2 = model["flies|VERB"]
    model.most_similar(query_vector2, n=5)
    (['flies|VERB', 'flys|VERB', 'flying|VERB', 'jumps|VERB', 'swoops|VERB'], <MemoryView of 'ndarray' at 0x1a6920a1f0>)

    model.data.similarity(query_vector1, query_vector2)
    0.6554745435714722


nateraw commented 6 years ago

Perhaps it is a permissions issue? Try to check out where it installed and see if you need to change the permissions with chmod. This is my first time creating a setup.py package, so it could be an error on my end too...

Edit: Also, now that I think about it, the way I am tokenizing currently doesn't tokenize nouns/verbs separately like you just showed. That would definitely be interesting. However, the original author didn't do this and his worked, so I guess the main issue would still be something in the model.

dbl001 commented 6 years ago

    $ python setup.py install
    /Users/davidlaxer/anaconda/envs/ai/lib/python3.6/site-packages/setuptools/dist.py:388: UserWarning: Normalizing '0.12.00' to '0.12.0'
      normalized_version,
    running install
    running bdist_egg
    running egg_info
    writing lda2vec.egg-info/PKG-INFO
    writing dependency_links to lda2vec.egg-info/dependency_links.txt
    writing top-level names to lda2vec.egg-info/top_level.txt
    reading manifest file 'lda2vec.egg-info/SOURCES.txt'
    writing manifest file 'lda2vec.egg-info/SOURCES.txt'
    installing library code to build/bdist.macosx-10.7-x86_64/egg
    running install_lib
    warning: install_lib: 'build/lib' does not exist -- no Python modules to install

    creating build/bdist.macosx-10.7-x86_64/egg
    creating build/bdist.macosx-10.7-x86_64/egg/EGG-INFO
    copying lda2vec.egg-info/PKG-INFO -> build/bdist.macosx-10.7-x86_64/egg/EGG-INFO
    copying lda2vec.egg-info/SOURCES.txt -> build/bdist.macosx-10.7-x86_64/egg/EGG-INFO
    copying lda2vec.egg-info/dependency_links.txt -> build/bdist.macosx-10.7-x86_64/egg/EGG-INFO
    copying lda2vec.egg-info/top_level.txt -> build/bdist.macosx-10.7-x86_64/egg/EGG-INFO
    zip_safe flag not set; analyzing archive contents...
    creating 'dist/lda2vec-0.12.0-py3.6.egg' and adding 'build/bdist.macosx-10.7-x86_64/egg' to it
    removing 'build/bdist.macosx-10.7-x86_64/egg' (and everything under it)
    Processing lda2vec-0.12.0-py3.6.egg
    Removing /Users/davidlaxer/anaconda/envs/ai/lib/python3.6/site-packages/lda2vec-0.12.0-py3.6.egg
    Copying lda2vec-0.12.0-py3.6.egg to /Users/davidlaxer/anaconda/envs/ai/lib/python3.6/site-packages
    lda2vec 0.12.0 is already the active version in easy-install.pth

    Installed /Users/davidlaxer/anaconda/envs/ai/lib/python3.6/site-packages/lda2vec-0.12.0-py3.6.egg
    Processing dependencies for lda2vec==0.12.0
    Finished processing dependencies for lda2vec==0.12.0
    (ai) David-Laxers-MacBook-Pro:Lda2vec-Tensorflow davidlaxer$ ls -l /Users/davidlaxer/anaconda/envs/ai/lib/python3.6/site-packages/lda2vec-0.12.0-py3.6.egg
    -rw-r--r-- 1 davidlaxer staff 875 Jul 9 13:29 /Users/davidlaxer/anaconda/envs/ai/lib/python3.6/site-packages/lda2vec-0.12.0-py3.6.egg
    (ai) David-Laxers-MacBook-Pro:Lda2vec-Tensorflow davidlaxer$ file !$
    file /Users/davidlaxer/anaconda/envs/ai/lib/python3.6/site-packages/lda2vec-0.12.0-py3.6.egg
    /Users/davidlaxer/anaconda/envs/ai/lib/python3.6/site-packages/lda2vec-0.12.0-py3.6.egg: Zip archive data, at least v2.0 to extract
    (ai) David-Laxers-MacBook-Pro:Lda2vec-Tensorflow davidlaxer$ cd tests/twenty_newsgroups/
    (ai) David-Laxers-MacBook-Pro:twenty_newsgroups davidlaxer$ ipython
    Python 3.6.6 |Anaconda, Inc.| (default, Jun 28 2018, 11:07:29)
    Type 'copyright', 'credits' or 'license' for more information
    IPython 6.4.0 -- An enhanced Interactive Python. Type '?' for help.

    In [1]: import lda2vec

    ModuleNotFoundError Traceback (most recent call last)
    in ()
    ----> 1 import lda2vec

    ModuleNotFoundError: No module named 'lda2vec'

    In [2]:

dbl001 commented 6 years ago

    (ai) David-Laxers-MacBook-Pro:twenty_newsgroups davidlaxer$ cd ../..
    (ai) David-Laxers-MacBook-Pro:Lda2vec-Tensorflow davidlaxer$ pwd
    /Users/davidlaxer/Lda2vec-Tensorflow
    (ai) David-Laxers-MacBook-Pro:Lda2vec-Tensorflow davidlaxer$ ipython
    Python 3.6.6 |Anaconda, Inc.| (default, Jun 28 2018, 11:07:29)
    Type 'copyright', 'credits' or 'license' for more information
    IPython 6.4.0 -- An enhanced Interactive Python. Type '?' for help.

    In [1]: import lda2vec

    In [2]:

dbl001 commented 6 years ago

    In [2]: lda2vec
    Out[2]: <module 'lda2vec' from '/Users/davidlaxer/Lda2vec-Tensorflow/lda2vec/__init__.py'>

nateraw commented 6 years ago

Maybe try sudo python setup.py install or sudo python setup.py develop too? I'm not sure what the issue is :cry:

dbl001 commented 6 years ago

    $ sudo python setup.py develop
    /Users/davidlaxer/anaconda/envs/ai/lib/python3.6/site-packages/setuptools/dist.py:388: UserWarning: Normalizing '0.12.00' to '0.12.0'
      normalized_version,
    running develop
    running egg_info
    writing lda2vec.egg-info/PKG-INFO
    writing dependency_links to lda2vec.egg-info/dependency_links.txt
    writing top-level names to lda2vec.egg-info/top_level.txt
    reading manifest file 'lda2vec.egg-info/SOURCES.txt'
    writing manifest file 'lda2vec.egg-info/SOURCES.txt'
    running build_ext
    Creating /Users/davidlaxer/anaconda/envs/ai/lib/python3.6/site-packages/lda2vec.egg-link (link to .)
    Removing lda2vec 0.12.0 from easy-install.pth file
    Adding lda2vec 0.12.0 to easy-install.pth file

    Installed /Users/davidlaxer/Lda2vec-Tensorflow
    Processing dependencies for lda2vec==0.12.0
    Finished processing dependencies for lda2vec==0.12.0
    (ai) David-Laxers-MacBook-Pro:Lda2vec-Tensorflow davidlaxer$ cd tests/twenty_newsgroups/
    (ai) David-Laxers-MacBook-Pro:twenty_newsgroups davidlaxer$ ipython
    Python 3.6.6 |Anaconda, Inc.| (default, Jun 28 2018, 11:07:29)
    Type 'copyright', 'credits' or 'license' for more information
    IPython 6.4.0 -- An enhanced Interactive Python. Type '?' for help.

    In [1]: import lda2vec

    In [2]: lda2vec
    Out[2]: <module 'lda2vec' from '/Users/davidlaxer/Lda2vec-Tensorflow/lda2vec/__init__.py'>

    In [3]:


nateraw commented 6 years ago

The develop one is meant to do that; it lets you change the code without having to reinstall every time you make changes. Also, when you ran python setup.py install, you were in the same directory as the setup.py file, correct? You can't be outside of that directory, as far as I understand.

Check this stackoverflow post: https://stackoverflow.com/questions/14865990/python-module-wont-install
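One way to narrow down an import problem like this is to ask Python where it would load the module from. The helper `locate_module` below is a hypothetical name that uses only the standard library. The `install_lib` warning in the earlier log ("no Python modules to install") and the 875-byte egg suggest the installed egg may not contain the package at all, though that diagnosis is a guess and not something confirmed in this thread.

```python
import importlib.util
import sys


def locate_module(name):
    """Return the file a top-level module would be imported from, or None if not found."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    return spec.origin


# Run this from different working directories and compare the results.
print("lda2vec resolves to:", locate_module("lda2vec"))
print("first sys.path entries:", sys.path[:3])
```

In an interactive session the current working directory is normally on `sys.path`, which would explain why the import succeeds from the repository root (the module is found in the source tree) but fails elsewhere.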

dbl001 commented 6 years ago

> Also, when you ran python setup.py install, you were in the same directory as the setup.py file, correct?

Yes.

Also, in a virtual environment you get the ‘pip’ inside the virtual environment.

    $ which pip
    /Users/davidlaxer/anaconda/envs/ai/bin/pip


dbl001 commented 6 years ago

The PCA visualization in tensorboard shows the first component describes 99.8% of the variance in the data, after 10 epochs. What happens after 200 Epochs? Do the topic clusterings start to appear?
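The figure quoted from TensorBoard can be reproduced offline on an extracted embedding matrix. The sketch below computes the explained-variance ratio with a plain SVD. The function name and the synthetic "collapsed" matrix are illustrative assumptions; real use would pass in one of the extracted embedding arrays instead.

```python
import numpy as np


def explained_variance_ratio(embeddings):
    """Fraction of total variance captured by each principal component."""
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2
    return variances / variances.sum()


rng = np.random.default_rng(0)
direction = np.array([[1.0, 2.0, -1.0, 0.5]])
# Toy "collapsed" embeddings: rows vary almost only along one direction.
collapsed = rng.normal(size=(200, 1)) @ direction + 0.01 * rng.normal(size=(200, 4))
print(explained_variance_ratio(collapsed))
```

A first component close to 1.0 means the vectors lie almost on a single line, which would match the earlier remark that the topic embedding vectors are all pretty much the same.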


nateraw commented 6 years ago

@dbl001 After running it yesterday I saw some topics starting to change after a while (60 epochs). However, my preprocessing seemed a little messy, so I changed it up a bit and tried to re-run it with cleaner data. Unfortunately, late last night my server box's hard drive filled up and crashed while I was working on it. I lost some files, but hopefully nothing too critical. I'm working on fixing it right now, and I'll let you know how it goes.

Edit: Note that what I was running was the "normal" loss function that is not negative; not the one that is currently being used by default.

dbl001 commented 6 years ago

I’m currently re-training with 250 epochs. I’ll let you know how that goes.

E.g.:

    # Train the model
    m.train(pivot_ids, target_ids, doc_ids, len(pivot_ids), 250, context_ids=False, switch_loss_epoch=50)

What is model.context_doc_embedding?

E.g.:

    mix_custs = model.sesh.run(tf.nn.softmax(model.context_doc_embedding))


nateraw commented 6 years ago

The short answer is that "context_doc_embedding" doesn't exist in this version. As for the long answer...

First, it should be m instead of model, that's my fault for leaving in those comments. In an older version I instantiated it as model. In fact, those commented out lines are from a completely different experiment with customer reviews data.

As for what "context_doc_embedding" is, it is also from an older version. However, I still have the same functionality working a different way with the "additional_features" variables found in the Lda2vec class.

image

The idea is that you can add additional context by passing in different unique IDs relating to documents. In the example above, you can see that each document has a unique ID as well as a unique zip code. Then, you can use these additional contexts to model topics over multiple contexts (ex. how do people from similar zip codes speak).
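To make the softmax call in the question above concrete: in lda2vec, each context (a document ID, or here a zip code) carries a row of unnormalized topic weights, and a softmax turns that row into a mixture over topics. The sketch below shows only that step in plain Python. The weight values are made up, and the variable names are illustrative rather than attributes of the Lda2vec class.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of unnormalized weights."""
    shift = max(row)
    exps = [math.exp(value - shift) for value in row]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up weights: 2 contexts (e.g. documents) x 3 topics.
doc_weights = [
    [2.0, 0.5, -1.0],
    [0.0, 0.0, 0.0],
]

# Each row becomes a set of topic proportions that sums to 1.
doc_topic_proportions = [softmax(row) for row in doc_weights]

for doc_id, proportions in enumerate(doc_topic_proportions):
    print(doc_id, [round(p, 3) for p in proportions])
```

The same normalization would apply to any additional context, so a zip-code weight matrix would yield one topic mixture per zip code.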

nateraw commented 6 years ago

I was able to do damage control on my lost files, and I got everything up and running again. After changing my preprocessing to clean some nonsense out of the 20 newsgroups dataset, it seems to be working much better. Here it is after 14 epochs. I'm going to let it run all night and will report back.

image

nateraw commented 6 years ago

With the most recent push, I added a reproducible example. Additionally, I added pyLDAvis support.

pyldavis_example

Because of this, I am going to close this thread. More examples will come. Thank you @dbl001 for helping talk this through.

dbl001 commented 6 years ago

So, do you think disambiguating word-senses in the word vectors will decrease noise and improve the ‘lda2vec’ results?

nateraw commented 6 years ago

I can't confirm this, but I'm guessing sense2vec won't be as helpful on small datasets like 20 newsgroups. I think large datasets with unique vocabulary (such as Hacker News) would work well with sense2vec, though.

dbl001 commented 6 years ago

What do you think of this example ...

I scraped news stories from many countries using OpenEventData's scraper: https://github.com/openeventdata/phoenix_pipeline

I generated a unicode file with 1,000 stories (medium) and 5,000+ stories (large) and generated topics:

---------Closest words to given indexes----------
Topic 0 : narendra, ysr, liye, hanya, mitra, aaj, tharoor, aja, sushil, ananth, vijayan
Topic 1 : voa, espanha, pyongyang, zarif, warburton, gaziantep, melania, lenora, mirek, về, acevedo
Topic 2 : ntv, voa, bjp, reuters, noticias, narendra, ysr, médias, naidu, petersen, wta
Topic 3 : mirek, bengaluru, mattis, anz, kozhikode, ananth, mosul, către, phẩm, suria, europäische
Topic 4 : jpost, irna, nela, europäische, poderosa, ulpan, tayyip, с, orban, shinzo, akademie
Topic 5 : manawatu, palmerston, jpost, greymouth, wanganui, wiesel, către, liên, carnoustie, whangarei, europäische
Topic 6 : jpost, voa, syria, aviv, hezbollah, gaza, bashar, assad, pkk, tehran, aleppo
Topic 7 : incredibles, hrvatski, moldovan, nela, maru, dyna, bror, siêu, gwm, alton, unsworth
Topic 8 : discurso, espanha, edson, noticias, schumer, nela, către, telemundo, urbano, graça, siêu
Topic 9 : siêu, whangarei, penzance, unsworth, liên, masterton, palmerston, colville, marin, manawatu, и
Topic 10 : karunanidhi, nisar, najib, baluchistan, hossain, mehmet, jayanthi, grm, tayyip, anwar, naqvi
Topic 11 : nmg, lenora, mirek, mattis, hrw, khaleej, ncs, safa, с, đại, lamu
Topic 12 : khaleej, на, с, unhcr, espanha, bror, browder, и, către, poderosa, azerbaijani
Topic 13 : nmg, khaleej, poderosa, lamu, louw, unhcr, espanha, hrw, fao, sahel, на
Topic 14 : cfi, mattis, sergei, gauteng, siêu, psl, vijayan, baylor, elon, currie, jnf
Topic 15 : pretoria, kwazulu, espanha, ashburton, nmg, sfgate, mpumalanga, gauteng, anz, whangarei, invercargill
Topic 16 : espanha, grm, nollywood, hrw, gauteng, iain, enugu, edson, schutz, gilberto, giggs
Topic 17 : ndtv, biya, către, olha, hbo, discurso, tsai, nela, arcangel, mehmet, gaziantep
Topic 18 : deira, ajman, sharjah, dhabi, siddiqui, burj, emirati, safa, riyadh, maghreb, mulk
Topic 19 : sfgate, marin, merced, mateo, joaquin, monterey, guilfoyle, rafael, oliveira, valdez, yosemite
STEP 3140100 LOSS 442.70493 w2v 4.248924 lda 438.456

stories_medium.txt

screen shot 2018-07-24 at 1 12 16 pm screen shot 2018-07-24 at 1 11 57 pm

stories.txt.gz

nateraw commented 6 years ago

This is a very interesting example! Do you have code to run this from start to finish? If so/when you do, we can add it and I'll add you as a collaborator. Great work, and thank you. On an off topic note, I actually implemented sense2vec in the preprocessing pipeline now instead of what I was doing. It works great!

Something else I really want to set up is an easy-to-use visualization of topics over time. This should be baked into our preprocessing too, allowing us to store dates (if available/applicable) and link them to documents.

image

You can check out how Chris Moody did it here: https://multithreaded.stitchfix.com/blog/2016/05/27/lda2vec/

dbl001 commented 6 years ago

Please find attached my code:

I was using the spaCy model en_core_web_lg. I suppose I could try:

https://spacy.io/models/xx#xx_ent_wiki_sm

I’m not convinced the topics being generated make sense. What do you think?

dbl001 commented 6 years ago

I can generate many more stories, constrain the languages, etc. I settled on 1,000 stories, because the tokenizing didn’t work on 5,000+.

nateraw commented 6 years ago

I could be wrong, but I'm not seeing the attached code. Also, I think the language thing is probably what is hurting the topics. I noticed too that a couple had some strange words mixed in. Overall it seemed like it was doing okay though. Perhaps with a few more constraints you will see better results.

Also, what happened with the tokenizer with 5000+? Were you using my file or your own for preprocessing? I ran into memory errors with some larger files and had to find workarounds.

dbl001 commented 6 years ago

Yes, I was running into issues (memory?) with large files in your code. It appears to be in tokenize(), in the for loop with the pipeline code.

Code wasn’t included. 😬. I’ll resend tonight.

nateraw commented 6 years ago

Yeah, it actually has to do with the batch_size in the pipeline. For large documents, you'd want to cut that down. I don't think I allow for that in the current version, so I'll try to update that. I ran into the same errors... a segmentation fault, if I remember correctly.
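For context, spaCy's `nlp.pipe` accepts a `batch_size` argument that controls how many texts are buffered at once. The sketch below shows the same idea without spaCy, so it runs anywhere: `iter_batches` and `tokenize_in_batches` are hypothetical names, and `str.split` is only a stand-in for the real tokenizer in the preprocessing code.

```python
from itertools import islice


def iter_batches(items, batch_size):
    """Yield successive lists of at most `batch_size` items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def tokenize_in_batches(texts, batch_size=100):
    """Tokenize `texts` one batch at a time.

    `str.split` stands in for the real tokenizer (an assumption for this
    sketch); only one batch of raw text is being handled at any moment.
    """
    for batch in iter_batches(texts, batch_size):
        for text in batch:
            yield text.lower().split()


docs = ["Hacker News story", "Stack Overflow question", "20 newsgroups post"]
for tokens in tokenize_in_batches(docs, batch_size=2):
    print(tokens)
```

With very long documents, a smaller `batch_size` trades some throughput for a lower peak memory footprint.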

dbl001 commented 6 years ago

Yes.

dbl001 commented 6 years ago

For some reason I can’t upload a .py or a .ipynb file into the github issue. So, I emailed you the files.

dbl001 commented 6 years ago

I got further … I’m running on an AWS EC2 r4.2xlarge instance (4 CPUs, 32gb RAM):

I tried loading stories.txt:

ubuntu@ip-10-0-1-107:~/Lda2vec-Tensorflow$ wc stories.txt
5172 10128815 68793150 stories.txt

Setting ‘batch_size’ = 100 in tokenize():

Using TensorFlow backend.
It took 1805.0306975841522 seconds to run tokenizer method
converting data to w2v indexes
trimming 0's
converting to skipgrams
step 0 of 5172
step 500 of 5172
step 1000 of 5172
step 1500 of 5172
step 2000 of 5172
step 2500 of 5172
step 3000 of 5172
step 3500 of 5172
step 4000 of 5172
step 4500 of 5172
step 5000 of 5172

MemoryError Traceback (most recent call last)

in ()
    111 if i % 500 == 0:
    112 print("step", i, "of", num_examples)
--> 113 temp_df = pd.DataFrame(data)
    114 temp_df.to_csv(file_out_path, sep="\t", index=False, header=None, mode="a")
    115 del temp_df

~/anaconda/envs/ai/lib/python3.6/site-packages/pandas/core/frame.py in __init__(self, data, index, columns, dtype, copy)
    367 if is_named_tuple(data[0]) and columns is None:
    368 columns = data[0]._fields
--> 369 arrays, columns = _to_arrays(data, columns, dtype=dtype)
    370 columns = _ensure_index(columns)
    371

~/anaconda/envs/ai/lib/python3.6/site-packages/pandas/core/frame.py in _to_arrays(data, columns, coerce_float, dtype)
   6282 if isinstance(data[0], (list, tuple)):
   6283 return _list_to_arrays(data, columns, coerce_float=coerce_float,
-> 6284 dtype=dtype)
   6285 elif isinstance(data[0], collections.Mapping):
   6286 return _list_of_dict_to_arrays(data, columns,

~/anaconda/envs/ai/lib/python3.6/site-packages/pandas/core/frame.py in _list_to_arrays(data, columns, coerce_float, dtype)
   6359 else:
   6360 # list of lists
-> 6361 content = list(lib.to_object_array(data).T)
   6362 return _convert_object_array(content, columns, dtype=dtype,
   6363 coerce_float=coerce_float)

pandas/_libs/src/inference.pyx in pandas._libs.lib.to_object_array()

MemoryError:

dbl001 commented 6 years ago

This 'tweak' solved the Pandas memory issue (load_stories.py):

write_every = 100

E.g.

Using TensorFlow backend.
It took 1659.5098674297333 seconds to run tokenizer method
converting data to w2v indexes
trimming 0's
converting to skipgrams
step 0 of 5172
step 500 of 5172
step 1000 of 5172
step 1500 of 5172
step 2000 of 5172
step 2500 of 5172
step 3000 of 5172
step 3500 of 5172
step 4000 of 5172
step 4500 of 5172
step 5000 of 5172
The whole program took 2890.286608695984 seconds
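For anyone hitting the same MemoryError, the idea behind the `write_every` tweak is to flush accumulated rows to disk in small chunks instead of building one huge DataFrame. The sketch below illustrates that pattern with the standard `csv` module rather than pandas. `write_rows_in_chunks` is a hypothetical helper, not the code in load_stories.py, and it counts rows where the original counts documents.

```python
import csv


def write_rows_in_chunks(rows, file_out_path, write_every=100):
    """Append `rows` to a tab-separated file, flushing every `write_every` rows.

    Only the current chunk is held in memory at any time.
    Returns the total number of rows written.
    """
    buffer = []
    written = 0
    with open(file_out_path, "a", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t")
        for row in rows:
            buffer.append(row)
            if len(buffer) >= write_every:
                writer.writerows(buffer)
                written += len(buffer)
                buffer.clear()
        if buffer:  # flush whatever is left after the loop
            writer.writerows(buffer)
            written += len(buffer)
    return written
```

Because `rows` can be a generator, the skipgram pairs never need to exist in memory all at once; a smaller `write_every` lowers peak memory at the cost of more frequent writes.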

nateraw commented 6 years ago

I will push 100 as a default size for that. Same thing with the batch size. I'll make sure both are able to be changed easily. Right now those numbers feel kind of hidden, so I'll adjust that. Thank you!!

dbl001 commented 6 years ago

Did you make any improvements in the LDA topic processing?

I've been working with my news stories data file and find that the topics 'lda2vec' has learned don't seem quite right. E.g.:

i. there's redundancy in the words pyLDAvis has selected for the topics,

ii. the topics appear clustered together and overlapping instead of spanning the space,

iii. ldaloss doesn't decrease with more training epochs,

iv. adjusting alpha changes the ldaloss range but doesn't improve the topic learning.

v. Have you tried with negative loss? E.g.

https://datascience.stackexchange.com/questions/13216/intuitive-explanation-of-noise-contrastive-estimation-nce-loss

---------Closest words to given indexes----------
Topic 0 : news, new, for, latest, africa, list, newsletter, more, will, go, about
Topic 1 : news, information, review, show, latest, update, list, videos, new, will, video
Topic 2 : saudi, arabia, news, africa, pakistan, iran, syria, latest, to, nigeria, uae
Topic 3 : new, also, should, for, article, will, list, be, review, can, information
Topic 4 : for, new, news, latest, finance, with, karnataka, term, share, report, list
Topic 5 : news, show, information, video, france, more, videos, report, review, africa, york
Topic 6 : update, report, latest, news, post, information, review, submit, publish, page, data
Topic 7 : new, news, latest, for, review, with, report, the, more, information, article
Topic 8 : news, africa, latest, newsletter, new, information, nigeria, johannesburg, for, please, australia
Topic 9 : news, newsletter, information, latest, report, israel, syria, the, list, article, website
Topic 10 : news, please, new, report, should, post, will, latest, help, information, give
Topic 11 : news, africa, nigeria, latest, pakistan, malaysia, new, please, india, newsletter, australia
Topic 12 : news, report, new, latest, search, for, show, week, information, next, to
Topic 13 : new, news, information, latest, review, report, list, also, will, article, the
Topic 14 : syria, afghanistan, iran, pakistan, news, france, yemen, africa, spain, europe, israel
Topic 15 : sheikh, abdullah, abu, ali, saudi, pakistan, mohammed, khan, allah, arabia, iran
Topic 16 : news, pakistan, latest, afghanistan, africa, saudi, report, syria, iran, india, yemen
Topic 17 : news, information, new, report, latest, article, for, should, show, the, will
Topic 18 : news, new, for, the, report, latest, search, to, information, will, with
Topic 19 : news, information, show, new, for, the, will, and, report, with, about
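One rough way to put a number on the redundancy described above is the Jaccard overlap between the top-word lists of two topics. This is a minimal sketch, not something the repository provides: `jaccard` is a hypothetical helper, and the three word lists are copied from the topics printed above.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two collections of words."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


# Top words copied from three of the topics listed above.
topics = {
    13: "new news information latest review report list also will article the".split(),
    15: "sheikh abdullah abu ali saudi pakistan mohammed khan allah arabia iran".split(),
    17: "news information new report latest article for should show the will".split(),
}

# Compare every pair of topics; values near 1 mean near-duplicate topics.
for (i, words_i), (j, words_j) in combinations(topics.items(), 2):
    print(f"Topic {i} vs Topic {j}: {jaccard(words_i, words_j):.2f}")
```

Pairs with high overlap made up mostly of generic words ("news", "latest", "report") suggest that stopword-like terms are dominating, which may point back at preprocessing rather than at the model.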

screen shot 2018-09-22 at 1 14 17 pm

stalhaa commented 5 years ago

@dbl001 can you please run your lda2vec code with my dataset file and share its results?

dbl001 commented 5 years ago

I can take a look. Can you provide me with your data file?

stalhaa commented 5 years ago

Thank you so much. This data file contains 1200 abstracts of different research papers. I want a model which generates at least 100 topics from the corpus. abstract.txt

dbl001 commented 5 years ago

EPOCH: 358 ...
---------Closest words to given indexes----------
Topic 0 : csg, lagrangian, rbf, btf, polyhedral, composit, remesh, laplacian, fourier, parametrization, ldr
Topic 1 : csg, lagrangian, remesh, voronoi, antialias, laplacian, octree, saliency, parametrization, btf, nurbs
Topic 2 : csg, lagrangian, parametrization, antialias, parameterization, remesh, laplacian, btf, polyhedral, multiscale, fourier
Topic 3 : parametrization, quadrilateral, polyhedral, discretization, remesh, csg, lagrangian, voronoi, affine, laplacian, piecewise
Topic 4 : remesh, csg, lagrangian, btf, polyhedral, voronoi, parametrization, incompressible, multiscale, nurbs, laplacian
Topic 5 : csg, remesh, btf, lagrangian, rbf, laplacian, composit, voronoi, precompute, ndf, ldr
Topic 6 : btf, csg, rbf, ldr, prt, antialias, rgb, laplacian, remesh, carlo, gpu
Topic 7 : btf, csg, ldr, composit, rbf, prt, lagrangian, dof, laplacian, rgb, carlo
Topic 8 : photometric, parametrization, laplacian, psychophysical, csg, ndf, lagrangian, rbf, composit, rgb, colorization
Topic 9 : prt, csg, ldr, btf, rbf, laplacian, antialias, rgb, composit, remesh, lagrangian
Topic 10 : prt, csg, rbf, btf, remesh, ndf, ldr, precompute, biped, composit, retarget
Topic 11 : lagrangian, polyhedral, incompressible, isotropic, csg, tetrahedral, parametrization, multiscale, discretization, conformal, deformable
Topic 12 : csg, antialias, remesh, laplacian, btf, rbf, colorization, ndf, saliency, prt, precompute
Topic 13 : csg, rbf, ldr, prt, btf, lagrangian, biped, composit, locomotion, haptic, rgb
Topic 14 : remesh, polyhedral, nurbs, quadrilateral, lagrangian, csg, voronoi, parametrization, laplacian, spline, discretization
Topic 15 : biped, haptic, prt, colorization, btf, ldr, locomotion, psychophysical, composit, csg, retarget
Topic 16 : csg, tetrahedral, remesh, discretization, btf, rbf, antialias, laplacian, voronoi, quadrilateral, resampling
Topic 17 : csg, btf, antialias, colorization, retarget, laplacian, remesh, precompute, rbf, prt, ndf
Topic 18 : laplacian, btf, remesh, antialias, prt, rbf, csg, composit, retarget, saliency, colorization
Topic 19 : csg, btf, remesh, lagrangian, voronoi, composit, biped, laplacian, rbf, ldr, prt
Topic 20 : prt, csg, btf, rbf, biped, remesh, retarget, lagrangian, laplacian, dof, ldr
Topic 21 : reflectance, specular, csg, colorization, radiosity, btf, photometric, rgb, rbf, lagrangian, laplacian
Topic 22 : csg, btf, rbf, laplacian, remesh, lagrangian, antialias, fourier, carlo, ldr, voronoi
Topic 23 : csg, lagrangian, parametrization, laplacian, voronoi, parameterization, octree, discretization, antialias, remesh, btf
Topic 24 : dof, haptic, prt, rgb, ldr, hdr, btf, csg, composit, laplacian, chameleon
Topic 25 : csg, remesh, lagrangian, btf, tetrahedral, laplacian, antialias, discretization, rbf, voronoi, polyhedral
Topic 26 : csg, lagrangian, incompressible, btf, discretization, remesh, parametrization, gpu, parameterization, laplacian, multiscale
Topic 27 : composit, prt, rbf, ldr, btf, retarget, csg, laplacian, precompute, hdr, kd
Topic 28 : btf, lagrangian, remesh, polyhedral, biped, spline, quadrilateral, voronoi, nurbs, prt, laplacian
Topic 29 : csg, prt, retarget, btf, biped, remesh, ldr, precompute, lagrangian, speedup, voronoi
Topic 30 : csg, gpu, speedup, btf, ldr, rbf, cpu, bidirectional, api, quad, lagrangian
Topic 31 : csg, btf, rbf, ldr, laplacian, prt, composit, antialias, remesh, ndf, colorization
Topic 32 : composit, remesh, btf, laplacian, saliency, parametrization, csg, voronoi, multiscale, lagrangian, deconvolution
Topic 33 : csg, precompute, rbf, remesh, laplacian, radiosity, prt, antialias, btf, rgb, composit
Topic 34 : lagrangian, csg, composit, multiscale, rbf, parametrization, btf, remesh, laplacian, incompressible, isotropic
Topic 35 : csg, btf, rbf, remesh, prt, ldr, laplacian, precompute, ndf, lagrangian, composit
Topic 36 : csg, composit, btf, remesh, multiscale, rbf, parametrization, laplacian, incompressible, parameterization, rgb
Topic 37 : csg, rbf, laplacian, btf, remesh, voronoi, lagrangian, precompute, parametrization, fourier, composit
Topic 38 : csg, laplacian, fourier, voxel, lagrangian, antialias, gaussian, rgb, resampling, convolution, parametrization
Topic 39 : csg, rbf, btf, laplacian, remesh, composit, antialias, precompute, voronoi, lagrangian, ndf
Topic 40 : csg, gpu, btf, opengl, carlo, rbf, api, remesh, speedup, cpu, workstation
Topic 41 : csg, colorization, btf, retarget, haptic, saliency, composit, prt, biped, remesh, laplacian
Topic 42 : lagrangian, incompressible, remesh, discretization, parametrization, polyhedral, tetrahedral, csg, quadrilateral, tensor, voronoi
Topic 43 : csg, remesh, lagrangian, laplacian, precompute, discretization, btf, voronoi, parametrization, resampling, polyhedral
Topic 44 : csg, rbf, btf, remesh, composit, antialias, voronoi, lagrangian, ldr, laplacian, kd
Topic 45 : btf, csg, rbf, prt, ndf, antialias, gpu, remesh, ldr, precompute, laplacian
Topic 46 : btf, csg, composit, ldr, rbf, rgb, laplacian, prt, hdr, colorization, remesh
Topic 47 : luminance, photometric, defocus, deconvolution, rgb, laplacian, colorization, hdr, dof, reflectance, rbf
Topic 48 : csg, btf, lagrangian, remesh, rbf, parametrization, voronoi, precompute, multiscale, parameterization, composit
Topic 49 : csg, retarget, btf, prt, biped, remesh, multiscale, lagrangian, composit, saliency, rbf
Topic 50 : csg, btf, antialias, rbf, remesh, ndf, lagrangian, laplacian, colorization, prt, voronoi
Topic 51 : csg, btf, composit, rbf, remesh, lagrangian, ldr, precompute, voronoi, prt, laplacian
Topic 52 : rbf, csg, ldr, composit, btf, antialias, precompute, ndf, rgb, hdr, remesh
Topic 53 : csg, lagrangian, incompressible, btf, rbf, multiscale, polyhedral, remesh, prt, parametrization, voronoi
Topic 54 : csg, btf, laplacian, remesh, multiscale, voronoi, parametrization, composit, deconvolution, fourier, lagrangian
Topic 55 : csg, lagrangian, precompute, rbf, parametrization, octree, voronoi, remesh, antialias, polyhedral, fourier
Topic 56 : csg, voxel, precompute, photon, laplacian, fourier, radiosity, prt, rgb, photometric, reflectance
Topic 57 : laplacian, csg, composit, rbf, deconvolution, antialias, saliency, photometric, btf, dof, rgb
Topic 58 : polyhedral, lagrangian, incompressible, discretization, multiscale, remesh, btf, csg, laplacian, simplicial, parametrization
Topic 59 : csg, incompressible, remesh, lagrangian, parametrization, btf, parameterization, rbf, discretization, ldr, bidirectional
Topic 60 : csg, btf, remesh, lagrangian, parametrization, laplacian, discretization, rbf, parameterization, composit, precompute
Topic 61 : btf, polyhedral, csg, voronoi, laplacian, polygonal, lagrangian, parametrization, freeform, 3-d, occlude
Topic 62 : csg, antialias, rbf, laplacian, resampling, deconvolution, fourier, antialiasing, rgb, precompute, composit
Topic 63 : csg, rbf, btf, precompute, antialias, remesh, laplacian, speedup, colorization, deconvolution, voronoi
Topic 64 : btf, laplacian, csg, rbf, antialias, colorization, composit, voronoi, saliency, hdr, ldr
Topic 65 : csg, voxel, laplacian, irradiance, reflectance, ndf, rbf, voronoi, rgb, halftone, photometric
Topic 66 : csg, laplacian, discretization, btf, remesh, rbf, fourier, parametrization, speedup, precompute, octree
Topic 67 : csg, btf, remesh, lagrangian, rbf, polyhedral, voronoi, prt, multiscale, laplacian, composit
Topic 68 : btf, csg, composit, rbf, laplacian, lagrangian, remesh, voronoi, ldr, precompute, ndf
Topic 69 : lagrangian, multiscale, biped, polyhedral, saliency, kinematic, spline, composit, csg, deformable, remesh
Topic 70 : csg, remesh, btf, laplacian, lagrangian, parametrization, polyhedral, ndf, prt, saliency, voronoi
Topic 71 : csg, antialias, remesh, btf, laplacian, fourier, parametrization, discretization, voronoi, precompute, lagrangian
Topic 72 : csg, btf, remesh, lagrangian, biped, voronoi, retarget, polyhedral, rbf, nurbs, composit
Topic 73 : prt, remesh, lagrangian, csg, voronoi, biped, saliency, rbf, polyhedral, laplacian, btf
Topic 74 : btf, prt, ldr, rbf, csg, composit, biped, retarget, remesh, ndf, lagrangian
Topic 75 : reflectance, anisotropy, csg, lagrangian, refraction, parametrization, btf, laplacian, isotropic, specular, psychophysical
Topic 76 : csg, composit, dof, colorization, photometric, btf, rbf, ldr, defocus, saliency, prt
Topic 77 : reflectance, photometric, parametrization, irradiance, defocus, specular, laplacian, luminance, fourier, voxel, gaussian
Topic 78 : csg, rbf, btf, remesh, laplacian, composit, lagrangian, precompute, parametrization, voronoi, fourier
Topic 79 : remesh, parametrization, csg, lagrangian, laplacian, discretization, voronoi, quadrilateral, polyhedral, saliency, piecewise
Topic 80 : csg, octree, remesh, gpu, lagrangian, rbf, laplacian, btf, discretization, carlo, prt
Topic 81 : colorization, stereoscopic, csg, dof, prt, composit, rbf, btf, retarget, antialiasing, laplacian
Topic 82 : lagrangian, laplacian, csg, parametrization, fourier, gaussian, discretization, isotropic, piecewise, parameterization, convolution
Topic 83 : csg, btf, radiosity, antialias, lagrangian, precompute, rbf, laplacian, fourier, photometric, composit
Topic 84 : biped, lagrangian, kinematic, remesh, multiscale, locomotion, btf, incompressible, csg, haptic, discretization
Topic 85 : btf, rbf, csg, composit, prt, laplacian, voronoi, remesh, saliency, precompute, ldr
Topic 86 : lagrangian, csg, incompressible, btf, precompute, rbf, remesh, multiscale, biped, deformable, simulator
Topic 87 : csg, btf, discretization, remesh, lagrangian, rbf, laplacian, parametrization, voronoi, fourier, polyhedral
Topic 88 : csg, laplacian, voronoi, btf, rbf, antialias, lagrangian, saliency, remesh, fourier, discretization
Topic 89 : csg, remesh, lagrangian, rbf, discretization, voronoi, polyhedral, btf, laplacian, parametrization, tetrahedral
Topic 90 : composit, btf, csg, retarget, biped, prt, colorization, 3-d, rbf, dof, rgb
Topic 91 : laplacian, csg, ldr, btf, lagrangian, rbf, composit, deconvolution, gaussian, parametrization, psychophysical
Topic 92 : csg, btf, rbf, remesh, laplacian, lagrangian, voronoi, gpu, prt, speedup, octree
Topic 93 : csg, btf, rbf, prt, remesh, colorization, antialias, ndf, laplacian, ldr, composit
Topic 94 : radiosity, fourier, parametrization, poisson, csg, gaussian, irradiance, antialias, reflectance, photon, isotropic
Topic 95 : antialias, antialiasing, laplacian, rgb, resampling, csg, precompute, radiosity, specular, photometric, luminance
Topic 96 : csg, rbf, btf, composit, antialias, remesh, ldr, prt, ndf, rgb, precompute
Topic 97 : csg, btf, remesh, lagrangian, voronoi, laplacian, rbf, retarget, nurbs, saliency, antialias
Topic 98 : csg, btf, remesh, rbf, lagrangian, ldr, composit, laplacian, ndf, prt, precompute
Topic 99 : csg, lagrangian, btf, discretization, remesh, voronoi, tetrahedral, rbf, parametrization, multiscale, antialias
STEP 1265100 LOSS 30863.713 w2v 4.185615 lda 30859.527

I’m attaching the two .py files. Not sure they will get through

dbl001 commented 5 years ago

Uploaded the files with .txt extensions. Please rename to .py files to run:

load_abstract.txt run_abstract.txt

stalhaa commented 5 years ago

@dbl001 thank you so much for this, but I think these topics don't seem quite right. Can you please send me your complete lda2vec code, the code which you have run on your system? And what are the specs of your machine? Here is my email id: sana.talha26@gmail.com

dbl001 commented 5 years ago

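ubuntu@ip-10-0-1-130:~/Lda2vec-Tensorflow$ git branch
* master
ubuntu@ip-10-0-1-130:~/Lda2vec-Tensorflow$ git show
commit 1f15771390d88e3283ccd580e537a9b1378c43c7
Author: Nathan Raw <nxr9266@g.rit.edu>
Date: Fri Nov 9 13:55:29 2018 -0800

    Update setup.py

diff --git a/setup.py b/setup.py
index c13b398..d30df9d 100644
--- a/setup.py
+++ b/setup.py
@@ -1,13 +1,10 @@
 from setuptools import find_packages
 from distutils.core import setup

-#install_requires = ["spacy==2.0.5","numpy==1.14.3","pandas==0.21.1","tensorflo

 setup(name="lda2vec",
       version="0.13.00",
       description="Tools for interpreting natural language",
       author="Nathan Raw",
       author_email="nxr9266@rit.edu",
-      install_requires=install_requires,
-      packages=find_packages("lda2vec"),
+      packages=find_packages(),
       url="")
ubuntu@ip-10-0-1-130:~/Lda2vec-Tensorflow$ ipython
Python 2.7.15 |Anaconda custom (64-bit)| (default, Nov 13 2018, 23:04:45)
Type "copyright", "credits" or "license" for more information.

IPython 4.1.2 -- An enhanced Interactive Python.
? -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help -> Python's own help system.
object? -> Details about 'object', use 'object??' for extra details.

In [1]: import tensorflow

In [2]: print(tensorflow.__version__)
1.12.0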
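To answer the "specs" question in one go, a small report like the following may be easier to paste than separate terminal sessions. This is only a sketch: the function names are hypothetical, the package list is taken from the commented-out install_requires line plus the tensorflow import above, and `importlib.metadata` needs Python 3.8 or newer, which is more recent than the interpreters shown in this thread.

```python
import platform
from importlib import metadata


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is not installed."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def environment_report(packages=("tensorflow", "spacy", "numpy", "pandas")):
    """Collect the interpreter version, OS description, and package versions."""
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": installed_versions(packages),
    }


if __name__ == "__main__":
    print(environment_report())
```

Reading versions from package metadata avoids importing TensorFlow just to print its version, so the report also works on a machine where the import itself fails.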


dbl001 commented 5 years ago

The results don’t look right to me either. I’ve compared them to other LDA algorithms.
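One rough way to put a number on "doesn't look right" is to measure how much the topics' top words overlap. The sketch below is not part of Lda2vec-Tensorflow. It computes pairwise Jaccard overlap for Topics 94, 96, and 98 as printed in the output earlier in this thread, and the function names are made up for illustration.

```python
from itertools import combinations

# Top words copied from the training output posted earlier in this thread.
topics = {
    94: ["radiosity", "fourier", "parametrization", "poisson", "csg", "gaussian",
         "irradiance", "antialias", "reflectance", "photon", "isotropic"],
    96: ["csg", "rbf", "btf", "composit", "antialias", "remesh", "ldr", "prt",
         "ndf", "rgb", "precompute"],
    98: ["csg", "btf", "remesh", "rbf", "lagrangian", "ldr", "composit",
         "laplacian", "ndf", "prt", "precompute"],
}


def jaccard(words_a, words_b):
    """Jaccard overlap of two word lists: |A & B| / |A | B| (0.0 if both are empty)."""
    a, b = set(words_a), set(words_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pairwise_overlap(topic_words):
    """Return {(i, j): overlap} for every unordered pair of topic ids."""
    return {
        (i, j): jaccard(topic_words[i], topic_words[j])
        for i, j in combinations(sorted(topic_words), 2)
    }


overlaps = pairwise_overlap(topics)
# Print the pairs from most to least similar.
for pair, score in sorted(overlaps.items(), key=lambda item: -item[1]):
    print(pair, round(score, 3))
```

Topics 96 and 98 share 9 of their 13 distinct words, so many near-duplicate pairs like this would suggest the topics have collapsed onto the same vocabulary rather than separating the corpus.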


dbl001 commented 5 years ago

(ai) ubuntu@ip-10-0-1-130:~/Lda2vec-Tensorflow$ ipython
Python 3.6.7 |Anaconda, Inc.| (default, Oct 23 2018, 19:16:44)
Type 'copyright', 'credits' or 'license' for more information
IPython 6.4.0 -- An enhanced Interactive Python. Type '?' for help.


stalhaa commented 5 years ago

@dbl001 why is this so? Do you have any idea how to improve these results? Is there any issue with the data file?

dbl001 commented 5 years ago

I’m not sure at the moment. I sent Nathan comparison results on a few test datasets between Lda2vec-Tensorflow and gensim, etc. I didn’t hear back.


nateraw commented 5 years ago

@dbl001 thank you for your help as always. @stalhaa the results may not be the best because lda2vec is a research algorithm. Its results are not as reliable as those of traditional topic modeling algorithms.