uwdata / termite-data-server

Data Server for Topic Models
BSD 3-Clause "New" or "Revised" License

Dead project? Older version in other github still works, here is a how-to for the record. #31

Open carlosparadis opened 7 years ago

carlosparadis commented 7 years ago

In case someone is interested in running this: as of today I was able to run the earlier version of this project, published back in the day by the same author, available here:

https://github.com/StanfordHCI/termite/blob/master/README.old

While trying to find a way to make the current project run, I found that this .txt file works as an example input:

https://github.com/YingHsuan/termite_data_server/blob/master/apps/mobile_payment_mallet/data/corpus.txt

It is a good example that matches the input format of the old version.

The old readme was friendlier about making sense of how to run the code. One thing to look out for: the setup will throw an error when it gets to compiler-latest.zip, saying it couldn't move the file. As of today (I was surprised to see all the download links still working despite being 4 years old!), the .zip contains a jar file whose name includes the version. Simply extract that jar and rename it to closure.jar inside the lib folder. Re-running the script will then rename it to the intended name and finish installing.
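For anyone scripting that workaround, here is a minimal Python sketch; the paths and jar names are assumptions based on the description above, so adjust them to your checkout:

import zipfile
from pathlib import Path

# Assumed locations: wherever the setup script downloaded
# compiler-latest.zip, and the project's lib folder.
lib_dir = Path("lib")
zip_path = lib_dir / "compiler-latest.zip"

with zipfile.ZipFile(zip_path) as zf:
    # The zip ships a single jar whose name embeds the version,
    # e.g. closure-compiler-vYYYYMMDD.jar.
    jar_name = next(n for n in zf.namelist() if n.endswith(".jar"))
    zf.extract(jar_name, lib_dir)

# Rename the versioned jar to the name the setup script looks for.
# (A later comment in this thread notes the expected name may actually
# be compiler.jar -- check what the setup script expects.)
(lib_dir / jar_name).rename(lib_dir / "closure.jar")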

For running the script, I had some issues with the config file path, but the script lets you make the three paths explicit:

./execute.py --corpus-path ~/Desktop/finance_corpus.txt carlos_lda.cfg --model-path example-project2/topic-model/ --data-path example-project2/

It will create the folders for you, or overwrite them if they exist. The paths provided on the command line above, from left to right, are the same ones required by the .cfg file from top to bottom. Provide the corpus.txt from the link above (or any file that follows the format doc-id\ttext) and it should work. In practice, I ran into errors somewhere along the pipeline with other inputs; this corpus, which is already tokenized, ran like a breeze, so I imagine it is best to tokenize with some other library before feeding data in (see the sketch below).
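As an illustration of that input format, here is a small Python sketch that writes a pre-tokenized corpus as doc-id\ttext lines. The documents and the tokenizer are placeholders, and the output file name simply matches the path used in the command above:

import re

# Assumed input: your own (doc_id, raw_text) pairs.
docs = [
    ("doc1", "Topic models summarize large text collections."),
    ("doc2", "Termite then visualizes the fitted topics."),
]

# Minimal lowercase tokenization; a real pipeline might use nltk or
# spaCy here, and drop stopwords.
def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

# One document per line: doc-id, a tab, then space-separated tokens.
with open("finance_corpus.txt", "w") as f:
    for doc_id, text in docs:
        f.write(doc_id + "\t" + " ".join(tokenize(text)) + "\n")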

Finally, there was an issue on the old project where the author pointed out that a small corpus may throw an error due to running out of vocabulary or something along those lines.

The visualization for this file took about half an hour to complete on a 2016 MacBook with 16 GB of RAM. By contrast, running LDA with the R topicmodels package takes about 3 minutes, and loading the result into another visualization project that cites this one (LDAvis on GitHub) takes about 1 minute.

I wish the visualization didn't attempt to redo the entire process from scratch, but instead took the data as input, as the other authors did (i.e. the matrices and a few vectors). That would help a lot with reusability.

If anyone is interested in what the output looks like in the end, here it is:

[Screenshot: Termite visualization of the example corpus, 2017-07-07]

You can also select multiple topics. Sadly, the old version does not include the document view pane, and the project now seems abandoned.

carlosparadis commented 7 years ago

Just a follow-up on this: testing on my own dataset, with more terms and more documents, it finished in a matter of minutes. I am not sure whether the earlier slowness was due to a first run or whether the example corpus simply had some other caveat I was unaware of. It also seems LDAvis doesn't implement serialization, so it's worth keeping both around! Thanks for the code 👍

xinnyuann commented 5 years ago

@carlosparadis I think by saying "rename it to closure.jar", you mean "rename it to compiler.jar". Thank you so much for explaining the workaround and testing it. I'm trying to use it with my own data for a project, but I'm not sure if it's going to work... worth a try though~

carlosparadis commented 5 years ago

@Coraxin Happy to see the workaround is of potential use to someone :) My group actually succeeded in decoupling the mallet dependency from this code, so that something else, like the R package topicmodels, or anything else that provides the LDA tables, can be used instead. While I wouldn't call that a final version either, maybe the commits where we cut the dependencies can aid your understanding of the code for your own use:

https://github.com/sailuh/termite/commits/master

Otherwise, you may want to try LDAvis (https://github.com/cpsievert/LDAvis), which is much easier to use, is still maintained, and was inspired by improving on this one (per their research publication citing this work).

Best of luck!

xinnyuann commented 5 years ago

@carlosparadis Thank you so much for your prompt reply. It's amazing to see people working on a more compatible version of Termite. I have a follow-up question to the previous comment. I was trying to reproduce the visualization using termite-old as you did, with the same config file (example_config_file.cfg) and the same corpus/token file (finance_corpus.txt). Everything went well until the step "combining similarity matrix...", which used up all the memory on my Jupyter server (0.011 TB) and still threw a memory error! How did you work around this memory issue, or did you ever hit it when running Termite? Thank you again in advance~

[Screenshot: error output, 2019-04-11]

carlosparadis commented 5 years ago

@xinnyuann My apologies, I missed the notification of your response; I hope you were able to address the issue. Are you referring to the implementation in this repo or to our fork?

In our fork, we skipped mallet's preprocessing of the data, obtained the intermediate files that represent the topic matrix and the necessary metadata, and provided them to the remainder of the code, which is responsible for the visualization alone.

That way, the only job of the code was to generate the visualization, which, albeit a bit laggy, worked.

Using a package like topicmodels in R should suffice to create a pipeline that replaces mallet. I have created a small package, topicflowr, which is public but far from polished, that does the job and fits the data to LDAvis. I have not yet had the chance to reshape the data for Termite, although it is just a formatting issue; the data is fundamentally the same. A rough sketch of that hand-off follows.
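This is only a sketch of the general idea: the file names and layout here are hypothetical, not the actual intermediates our fork uses, and the random matrices stand in for a real fitted model.

import numpy as np

# Assumed LDA outputs from any library (R topicmodels, gensim, ...):
# a topic-term matrix (topics x vocabulary) and a doc-topic matrix.
rng = np.random.default_rng(0)
vocab = ["market", "payment", "mobile", "bank", "risk"]
topic_term = rng.dirichlet(np.ones(len(vocab)), size=3)  # 3 topics
doc_topic = rng.dirichlet(np.ones(3), size=10)           # 10 documents

# Write the matrices and vocabulary as plain-text intermediates so a
# visualization stage can consume them without rerunning mallet.
np.savetxt("topic_term.txt", topic_term, delimiter="\t")
np.savetxt("doc_topic.txt", doc_topic, delimiter="\t")
with open("vocab.txt", "w") as f:
    f.write("\n".join(vocab) + "\n")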