kzhai / InfVocLDA

Online Latent Dirichlet Allocation with Infinite Vocabulary using Variational Inference
https://github.com/kzhai/InfVocLDA
Apache License 2.0

launch issues #3

Closed DmitryKey closed 8 years ago

DmitryKey commented 9 years ago

Hi, what are the recommended Python and NLTK versions to run this project?

At first, Python complained about a missing GoodTuringProbDist, so I substituted SimpleGoodTuringProbDist.

Now getting:

python launch.py --input_directory=../../input/ --output_directory=../../output/ --corpus_name=de-news --truncation_level=4000 --number_of_topics=10 --number_of_documents=9800 --vocab_prune_interval=10 --batch_size=100 --alpha_beta=1000
========== ========== ========== ========== ==========
output_directory=../../output/de-news/15Sep18-223048-D9800-K10-T4000-P10-I10-B100-O98-t64-k0.75-at0.1-ab1000/
input_directory=../../input/de-news
corpus_name=de-news
dictionary_file=None
number_of_documents=9800
number_of_topics=10
truncation_level=4000
vocab_prune_interval=10
snapshot_interval=10
batch_size=100
online_iterations=98
tau=64.0
kappa=0.75
alpha_theta=0.1
alpha_beta=1000.0
========== ========== ========== ========== ==========
successfully load all training documents...
Traceback (most recent call last):
  File "launch.py", line 248, in <module>
    main()
  File "launch.py", line 189, in main
    import hybrid;
  File "/home/dmitry/projects/github/topics/InfVocLDA/src/infvoc/hybrid.py", line 13, in <module>
    import nchar;
  File "/home/dmitry/projects/github/topics/InfVocLDA/src/infvoc/nchar.py", line 13, in <module>
    from nltk.util import ingrams
ImportError: cannot import name ingrams

kzhai commented 9 years ago

Hi, Dmitry,

Please refer to the command line in README.md. You need to invoke the program as a python module.

python -m infvoc.launch --input_directory=../input/ --output_directory=../output/ --corpus_name=de-news --truncation_level=4000 --number_of_topics=10 --number_of_documents=9800 --vocab_prune_interval=10 --batch_size=100 --alpha_beta=1000

Best, Ke

DmitryKey commented 9 years ago

Hi Ke,

I'm having some trouble installing ingrams to run your code as a module. What python version are you running on?

kzhai commented 9 years ago

The problem I foresee is due not to the Python version but to the NLTK version. When infvoc was implemented I was using NLTK 2.0, and the NLTK 3.0 API differs significantly from it.

kzhai commented 9 years ago

I am using Python 2.7 by the way.

DmitryKey commented 8 years ago

I'm using Python 2.7.6. I'm not sure which NLTK version; I installed a recent one.
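Since the NLTK version is the open question here, a minimal sketch for reporting both versions may help. The helper name `describe_environment` is hypothetical and not part of InfVocLDA; it only relies on `sys.version_info` and the `nltk.__version__` attribute.

```python
import sys


def describe_environment():
    """Return the running Python version and the installed NLTK version.

    Hypothetical helper for illustration; not part of InfVocLDA.
    """
    try:
        import nltk
        # Fall back to "unknown" if the attribute is absent for any reason.
        nltk_version = getattr(nltk, "__version__", "unknown")
    except ImportError:
        # NLTK is not installed in this environment.
        nltk_version = None
    python_version = "%d.%d.%d" % tuple(sys.version_info[:3])
    return {"python": python_version, "nltk": nltk_version}


env = describe_environment()
print(env)
```

A version starting with 3 would point to the API changes discussed in this thread.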

kzhai commented 8 years ago

I think I have replicated the errors. The problem is due to a major refactoring in NLTK between 2.0 and 3.0, so many classes and modules are missing. I am still looking into the problem. As a workaround, I have commented out the n-gram character model, so you should be able to run it now. Also, keep in mind that I am planning a structural refactor of this repo very soon.

Best, Ke
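For readers who want to keep the n-gram character model rather than comment it out, one possible approach is an import shim. This is a sketch under assumptions, not the fix applied in the repo: it assumes nchar.py only needs a lazy `ingrams(sequence, n)` call, and it relies on `nltk.util.ngrams` returning a generator in NLTK 3.x. The pure-Python fallback is hypothetical and exists only so the sketch runs without NLTK installed.

```python
# Compatibility sketch: expose a single name, `ingrams`, on either NLTK line.
try:
    # NLTK 2.x provides a lazy n-gram iterator under this name.
    from nltk.util import ingrams
except ImportError:
    try:
        # NLTK 3.x dropped `ingrams`; `ngrams` is itself a generator there.
        from nltk.util import ngrams as ingrams
    except ImportError:
        # Hypothetical fallback (not part of NLTK or InfVocLDA) used only
        # when NLTK is not installed at all.
        def ingrams(sequence, n):
            """Yield successive n-tuples from `sequence`."""
            window = []
            for item in sequence:
                window.append(item)
                if len(window) == n:
                    yield tuple(window)
                    window.pop(0)

# Character bigrams of a single word.
bigrams = list(ingrams("topic", 2))
print(bigrams)
```

Any padding arguments the original code may pass to `ingrams` are not covered by this sketch.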

OlanaMi commented 8 years ago

Hi guys,

I've adopted your code for my project and thank you very much for putting it on GitHub!

I'm wondering about the output of export_beta() in infvoc/hybrid.py, though. I see that you collect the weights into what is effectively just a dictionary (why use nltk.FreqDist, by the way?) and then output the first top_display entries from the list of keys. As I understand it, the order of the keys in that list is determined by Python, so the first top_display items are not necessarily the top contributing words for the topic.

I've modified the output for my purposes to the following:

    srt = sorted(freqdist.items(), key=operator.itemgetter(1), reverse=True)
    for key, value in srt[:top_display]:
        output.write(self._index_to_word[key] + "\t" + str(value) + "\n")

but I'm wondering whether I misunderstand something in your code.

Best, Olana

kzhai commented 8 years ago

Hi, Olana,

You are absolutely right, there is no obvious reason to use the nltk.FreqDist for that purpose. I will refine the code sometime to incorporate your change.

Best, Ke

OlanaMi commented 8 years ago

Hi Ke,

my issue was more with the output being unsorted than with using the FreqDist as such, so that's what I wanted to know :) is there something I am missing in the logic of export_beta() or was it really a bug that outputted unsorted (hence, not necessarily the most representative of the topics) words?

Cheers, Olana

kzhai commented 8 years ago

Hi, Olana,

I see the confusion now. In the version of NLTK this code was written against, calling freqdist.keys() returned all keys sorted by their value by default (https://github.com/kzhai/InfVocLDA/blob/master/src/infvoc/hybrid.py#L490). Many people (including me) complained about the computational cost of that sort on every call. Since the new NLTK release, the FreqDist API has changed to build on collections.Counter. It introduces a new function, most_common(), which returns the key-value pairs sorted by value, and keys() may now return the keys in no particular order. In short, you are right about the logic behind export_beta(): the words should be sorted according to their probabilities.

Best, Ke
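The difference Ke describes can be illustrated with collections.Counter alone, since FreqDist in NLTK 3 builds on it. This is a minimal sketch; the words, weights, and variable names below are made up for illustration and do not come from the repo.

```python
from collections import Counter

# Hypothetical topic weights keyed by word (illustrative values only).
weights = Counter({"market": 0.12, "bank": 0.31, "euro": 0.27, "trade": 0.05})

top_display = 2

# keys() makes no promise about ordering by weight, so slicing it
# does not reliably give the most probable words.
# most_common(n) returns the n highest-valued (key, value) pairs,
# ordered from largest to smallest value.
top_words = weights.most_common(top_display)

for word, weight in top_words:
    print("%s\t%s" % (word, weight))
```

Using most_common(top_display) gives the same result as sorting the items by value in descending order and slicing, as in Olana's change above.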

OlanaMi commented 8 years ago

thanks for the explanation! all is clear now :)

gauravkoradiya commented 5 years ago

> thanks for the explanation! all is clear now :)

Can you share your output file?