Closed DmitryKey closed 8 years ago
Hi, Dmitry,
Please refer to the command line in README.md. You need to invoke the program as a python module.
python -m infvoc.launch --input_directory=../input/ --output_directory=../output/ --corpus_name=de-news --truncation_level=4000 --number_of_topics=10 --number_of_documents=9800 --vocab_prune_interval=10 --batch_size=100 --alpha_beta=1000
Best, Ke
Hi Ke,
I'm having some trouble installing ingrams to run your code as a module. What python version are you running on?
The problem I foresee is due not to the Python version but to the NLTK version. When infvoc was implemented, I was using NLTK 2.0, and the NLTK 3.0 API differs significantly from 2.0.
I am using Python 2.7 by the way.
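To see which versions are actually installed, something like the following should work. This is only a minimal sketch: the helper name `describe_environment` is made up for illustration and is not part of infvoc.

```python
import sys


def describe_environment():
    """Return (python_version, nltk_version); nltk_version is None if NLTK is not installed."""
    python_version = "%d.%d.%d" % sys.version_info[:3]
    try:
        import nltk  # imported lazily so the check still runs without NLTK
        nltk_version = nltk.__version__
    except ImportError:
        nltk_version = None
    return python_version, nltk_version


python_version, nltk_version = describe_environment()
print("Python %s, NLTK %s" % (python_version, nltk_version))
```

If the NLTK version printed starts with 3, the API differences mentioned above are the likely cause of import errors.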
I'm using Python 2.7.6. I'm not sure which NLTK version; I installed a fresh one.
I think I have replicated the errors. The problem is due to a major refactoring in NLTK from 2.0 to 3.0, so many classes and modules are missing. I am still looking into the problem. As a workaround, I have commented out the n-gram character model, so you should be able to run it now. Also, keep in mind that I am planning a structural refactor of this repo very soon.
Best, Ke
Hi guys,
I've adopted your code for my project and thank you very much for putting it on GitHub!
I'm wondering about the output in infvoc/hybrid.py export_beta() though. I see that you collect the weights into what is actually just a dictionary (why use nltk.FreqDist, by the way?) and then output the first top_display entries from the list of keys. As I understand it, the order of the keys in that list is determined by Python, so the first top_display items are not necessarily the top contributing words for the topic.
I've modified the output for my purposes to the following one:
import operator
srt = sorted(freqdist.items(), key=operator.itemgetter(1), reverse=True)
for key, value in srt[:top_display]:
    output.write(self._index_to_word[key] + "\t" + str(value) + "\n")
but I'm wondering whether I misunderstand something in your code.
Best, Olana
Hi, Olana,
You are absolutely right, there is no obvious reason to use the nltk.FreqDist for that purpose. I will refine the code sometime to incorporate your change.
Best, Ke
Hi Ke,
My issue was more with the output being unsorted than with using FreqDist as such, so that's what I wanted to know :) Is there something I am missing in the logic of export_beta(), or was it really a bug that output unsorted (hence, not necessarily the most representative of the topics) words?
Cheers, Olana
Hi, Olana,
I see the confusion now. By default (in the version of NLTK used when this code was written), all keys are sorted by their value when you call the freqdist.keys() function (https://github.com/kzhai/InfVocLDA/blob/master/src/infvoc/hybrid.py#L490). Many people (including me) have complained about the computational cost of sorting every time that function is called. Since the new NLTK release, the FreqDist API has changed to build on the collections.Counter object, which introduces a new function most_common() that returns a list of key-value pairs sorted by value, and the keys() function may now return the keys in an "unordered" fashion. In any case, you are right about the logic behind export_beta(): the words should be sorted according to their probabilities.
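The difference can be sketched with collections.Counter directly, since the NLTK 3 FreqDist subclasses it. The weights, the words, and the variable names below are made up for illustration and are not taken from infvoc.

```python
from collections import Counter

# Hypothetical topic weights keyed by word index (illustrative values only).
weights = Counter({0: 0.05, 1: 0.40, 2: 0.15, 3: 0.30, 4: 0.10})
# Hypothetical index-to-word mapping, standing in for self._index_to_word.
index_to_word = {0: "haus", 1: "regierung", 2: "jahr", 3: "partei", 4: "stadt"}

top_display = 3

# most_common(n) returns the n (key, value) pairs with the largest values,
# in descending order of value.
top_words = [(index_to_word[key], value)
             for key, value in weights.most_common(top_display)]

# keys() makes no promise about value order, so slicing it does not
# select the highest-weighted words.
first_keys = list(weights.keys())[:top_display]

for word, value in top_words:
    print(word + "\t" + str(value))
```

Slicing the result of most_common() is what gives the top contributing words; slicing keys() only gives whichever keys happen to come first.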
Best, Ke
thanks for the explanation! all is clear now :)
can u share your output file?
Hi, What are the recommended python and nltk versions to run this project?
At first, python complained about missing GoodTuringProbDist. I provided SimpleGoodTuringProbDist.
Now getting: