src-d / tmsc


tmsc reports "No files were processed" for common repos #8

Closed bluecloudws closed 6 years ago

bluecloudws commented 6 years ago

tmsc is not able to discover topics for some common repos. The output reports that "No files were processed". I ran tmsc against the trending GitHub repos and got similar results. Testing against Spark did work (though not for Mesos, as seen below). Thank you for any help or pointers.

INFO:repo_cloner:Cloning from https://github.com/mesos/spark...
INFO:repo_cloner:Finished cloning https://github.com/mesos/spark
INFO:repo_cloner:Classifying the files...
INFO:repo_cloner:Result: {}
INFO:repo2bow:Fetching and processing UASTs...
WARNING:repo2bow:No files were processed for https://github.com/mesos/spark
Zend framework, Magento - Technologies 0.00
ODBC-like interfaces - General IT 0.00
Media containers - General IT 0.00
People databases in French - Human languages 0.00
DataTables (JQuery Plugin) - Technologies 0.00
Web Frontend, Yii CHtml - Technologies 0.00
VOIP, nginx modules - General IT 0.00
Maps, geography - Concepts 0.00
Fonts, DirectX - Technologies 0.00
Message brokers - General IT 0.00

vmarkovtsev commented 6 years ago

The reason is simple: Python and Java are hardcoded as the supported languages in the legacy code from 2017. Your log states that

WARNING:repo2bow:No files were processed for

Try a Python or a Java repo. Meanwhile, we will update tmsc as soon as the new upstream models are trained.
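Before running tmsc on a repository, it can help to check whether a local checkout contains any Python or Java files at all. The snippet below is only a rough sketch of such a pre-check: the function name `count_supported_files` and the extension-to-language mapping are assumptions made for illustration, and tmsc's own file classifier may well decide differently.

```python
from collections import Counter
from pathlib import Path

# Assumption: a simple extension mapping used only for this sketch.
# It is not taken from tmsc and may not match its classifier.
SUPPORTED_EXTENSIONS = {".py": "Python", ".java": "Java"}


def count_supported_files(repo_dir):
    """Count files per supported language in a local checkout (hypothetical helper)."""
    root = Path(repo_dir)
    counts = Counter()
    for path in root.rglob("*"):
        # Skip Git metadata so only working-tree files are counted.
        if ".git" in path.relative_to(root).parts:
            continue
        language = SUPPORTED_EXTENSIONS.get(path.suffix.lower())
        if language is not None and path.is_file():
            counts[language] += 1
    return dict(counts)
```

If the returned dictionary is empty for a checkout, the repository probably has nothing in the two supported languages, which would be consistent with a "No files were processed" warning.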

bluecloudws commented 6 years ago

Thanks for the response. I tried running tmsc against Java-based repos and it was able to discover topics. On a related note, can you provide some pointers on how I can train the model to recognize additional topics?

vmarkovtsev commented 6 years ago

The theory is described in the paper (https://arxiv.org/abs/1704.00135); the practice is in https://github.com/src-d/ml/blob/master/doc/topic_modeling.md

If you need topics with specific seed words, it should be possible after studying the official BigARTM documentation (https://bigartm.readthedocs.io/en/stable/), though we have no docs of our own on that.