src-d / tmsc

[WIP] make tmsc work with git-based modelforge and sourced.ml #12

Open bzz opened 6 years ago

bzz commented 6 years ago

This is a one-Sunday-afternoon attempt to make tmsc great again.

It is still WIP: the usage of the BOW model from modelforge should be removed, as discussed in https://github.com/src-d/models/issues/11

Early feedback is warmly appreciated, though; it would help get this ready to merge at some point.

The current version is able to run and produce results:

$ python3 -m tmsc https://github.com/apache/spark

                Parallel and distributed processing - General IT    4.49
                Machine Learning, sklearn-like APIs - General IT    3.88
               Java/JS + async + JSON serialization - General IT    3.77
                            Cryptography: libraries - General IT    3.23
                        SQL, working with databases - General IT    3.18
                Java string input/output - Programming languages    3.16
                          Java: Spring, Hibernate - Technologies    3.11
                              Operations on numbers - General IT    3.02
                               Distributed clusters - General IT    2.69
           Functional programming, Scala - Programming languages    2.64

Full log

```
$ python3 -m tmsc https://github.com/apache/spark
/usr/local/Cellar/python3/3.6.3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/runpy.py:125: RuntimeWarning: 'tmsc.__main__' found in sys.modules after import of package 'tmsc', but prior to execution of 'tmsc.__main__'; this may result in unpredictable behaviour
  warn(RuntimeWarning(msg))
INFO:GitIndex:Index is cached
INFO:topics:Reading /Users/alex/.source{d}/topics/default.asdf...
INFO:docfreq:Reading /Users/alex/.source{d}/docfreq/default.asdf...
INFO:docfreq:Building the docfreq dictionary...
INFO:topic_detector:Loaded topics model: {'created_at': datetime.datetime(2017, 9, 18, 12, 27, 56, 74233), 'dependencies': [{'created_at': datetime.datetime(2017, 6, 19, 9, 59, 14, 766638), 'dependencies': [], 'model': 'docfreq', 'uuid': 'f64bacd4-67fb-4c64-8382-399a8e7db52a', 'version': [1, 0, 0]}], 'model': 'topics', 'uuid': 'c70a7514-9257-4b33-b468-27a8588d4dfa', 'version': [0, 3, 0]}
320 topics, 2015336 tokens
First 10 tokens: ['ulcancel', 'domainlin', 'trudi', 'fncreateinstancedbaselin', 'wbnz', 'lmultiplicand', 'otronumero', 'qxln', 'gvgq', 'polaroidish']
Topics: labeled, first 10: ['Zend framework, Magento - Technologies', 'AngularJS, promises - Technologies', 'Drupal - Technologies', 'HTML DOM - General IT', 'Cryptography: ciphers and certificates - General IT', 'HTML tags - General IT', 'Countries, Moodle - Technologies', '3D modelling and rendering, WebGL - Technologies', 'POSIX terminal interface, serial interface, image capture - General IT', 'Popular Wordpress plugins - Technologies']
non-zero elements: 16892389 (0.026194)
INFO:docfreq:Pruning to min 20 occurrences
INFO:docfreq:Size: 5720096 -> 416370
INFO:topic_detector:Loaded docfreq model: {'created_at': datetime.datetime(2017, 6, 19, 9, 59, 14, 766638), 'dependencies': [], 'model': 'docfreq', 'uuid': 'f64bacd4-67fb-4c64-8382-399a8e7db52a', 'version': [1, 0, 0]}
Number of words: 416370
Random 10 words: {'aaa': 6322, 'aaaa': 2676, 'aaaaa': 861, 'aaaaaa': 1163, 'aaaaaaa': 341, 'aaaaaaaa': 156, 'aaaaaaaaa': 119, 'aaaaaaaaaa': 189, 'aaaaaaaaaaa': 30, 'aaaaaaaaaaaa': 90}
Number of documents: 112273
INFO:bblfsh:Detected bblfsh server: 0.0.0.0:9432
WARNING:topic_detector:No BOW cache was loaded.
INFO:repo_cloner:Cloning from https://github.com/apache/spark...
INFO:repo_cloner:Finished cloning https://github.com/apache/spark
INFO:repo_cloner:Classifying the files...
INFO:repo_cloner:Result: {'ANTLR': 1, 'Batchfile': 19, 'C': 1, 'CSS': 6, 'CSV': 20, 'Csound': 3, 'Dockerfile': 3, 'HTML': 4, 'Java': 760, 'JavaScript': 16, 'Makefile': 2, 'Markdown': 4, 'PLSQL': 23, 'PLpgSQL': 62, 'PowerShell': 1, 'Python': 126, 'R': 68, 'RMarkdown': 1, 'Roff': 4, 'SQL': 158, 'SQLPL': 52, 'Scala': 2803, 'Shell': 77, 'Text': 260, 'Thrift': 2, 'reStructuredText': 6}
INFO:repo2bow:Fetching and processing UASTs...
INFO:repo2bow:https://github.com/apache/spark pending tasks: 880
INFO:repo2bow:https://github.com/apache/spark pending tasks: 872
INFO:repo2bow:https://github.com/apache/spark pending tasks: 864
INFO:repo2bow:https://github.com/apache/spark pending tasks: 856
INFO:repo2bow:https://github.com/apache/spark pending tasks: 848
INFO:repo2bow:https://github.com/apache/spark pending tasks: 840
INFO:repo2bow:https://github.com/apache/spark pending tasks: 832
INFO:repo2bow:https://github.com/apache/spark pending tasks: 824
INFO:repo2bow:https://github.com/apache/spark pending tasks: 816
INFO:repo2bow:https://github.com/apache/spark pending tasks: 808
INFO:repo2bow:https://github.com/apache/spark pending tasks: 800
INFO:repo2bow:https://github.com/apache/spark pending tasks: 792
INFO:repo2bow:https://github.com/apache/spark pending tasks: 784
INFO:repo2bow:https://github.com/apache/spark pending tasks: 776
INFO:repo2bow:https://github.com/apache/spark pending tasks: 768
INFO:repo2bow:https://github.com/apache/spark pending tasks: 760
INFO:repo2bow:https://github.com/apache/spark pending tasks: 752
INFO:repo2bow:https://github.com/apache/spark pending tasks: 744
INFO:repo2bow:https://github.com/apache/spark pending tasks: 736
INFO:repo2bow:https://github.com/apache/spark pending tasks: 728
INFO:repo2bow:https://github.com/apache/spark pending tasks: 720
INFO:repo2bow:https://github.com/apache/spark pending tasks: 712
INFO:repo2bow:https://github.com/apache/spark pending tasks: 704
INFO:repo2bow:https://github.com/apache/spark pending tasks: 696
INFO:repo2bow:https://github.com/apache/spark pending tasks: 688
INFO:repo2bow:https://github.com/apache/spark pending tasks: 680
INFO:repo2bow:https://github.com/apache/spark pending tasks: 672
INFO:repo2bow:https://github.com/apache/spark pending tasks: 664
INFO:repo2bow:https://github.com/apache/spark pending tasks: 656
INFO:repo2bow:https://github.com/apache/spark pending tasks: 648
INFO:repo2bow:https://github.com/apache/spark pending tasks: 640
INFO:repo2bow:https://github.com/apache/spark pending tasks: 632
INFO:repo2bow:https://github.com/apache/spark pending tasks: 624
INFO:repo2bow:https://github.com/apache/spark pending tasks: 616
INFO:repo2bow:https://github.com/apache/spark pending tasks: 608
INFO:repo2bow:https://github.com/apache/spark pending tasks: 600
INFO:repo2bow:https://github.com/apache/spark pending tasks: 592
INFO:repo2bow:https://github.com/apache/spark pending tasks: 584
INFO:repo2bow:https://github.com/apache/spark pending tasks: 576
INFO:repo2bow:https://github.com/apache/spark pending tasks: 568
INFO:repo2bow:https://github.com/apache/spark pending tasks: 560
INFO:repo2bow:https://github.com/apache/spark pending tasks: 552
INFO:repo2bow:https://github.com/apache/spark pending tasks: 544
INFO:repo2bow:https://github.com/apache/spark pending tasks: 536
INFO:repo2bow:https://github.com/apache/spark pending tasks: 528
INFO:repo2bow:https://github.com/apache/spark pending tasks: 520
INFO:repo2bow:https://github.com/apache/spark pending tasks: 512
INFO:repo2bow:https://github.com/apache/spark pending tasks: 504
INFO:repo2bow:https://github.com/apache/spark pending tasks: 496
INFO:repo2bow:https://github.com/apache/spark pending tasks: 488
INFO:repo2bow:https://github.com/apache/spark pending tasks: 480
INFO:repo2bow:https://github.com/apache/spark pending tasks: 472
INFO:repo2bow:https://github.com/apache/spark pending tasks: 464
INFO:repo2bow:https://github.com/apache/spark pending tasks: 456
INFO:repo2bow:https://github.com/apache/spark pending tasks: 448
INFO:repo2bow:https://github.com/apache/spark pending tasks: 440
INFO:repo2bow:https://github.com/apache/spark pending tasks: 432
INFO:repo2bow:https://github.com/apache/spark pending tasks: 424
INFO:repo2bow:https://github.com/apache/spark pending tasks: 416
INFO:repo2bow:https://github.com/apache/spark pending tasks: 408
INFO:repo2bow:https://github.com/apache/spark pending tasks: 400
INFO:repo2bow:https://github.com/apache/spark pending tasks: 392
INFO:repo2bow:https://github.com/apache/spark pending tasks: 384
INFO:repo2bow:https://github.com/apache/spark pending tasks: 376
INFO:repo2bow:https://github.com/apache/spark pending tasks: 368
INFO:repo2bow:https://github.com/apache/spark pending tasks: 360
INFO:repo2bow:https://github.com/apache/spark pending tasks: 352
INFO:repo2bow:https://github.com/apache/spark pending tasks: 344
INFO:repo2bow:https://github.com/apache/spark pending tasks: 336
INFO:repo2bow:https://github.com/apache/spark pending tasks: 328
INFO:repo2bow:https://github.com/apache/spark pending tasks: 320
INFO:repo2bow:https://github.com/apache/spark pending tasks: 312
WARNING:repo2bow:/var/folders/rx/z9zyr71d70x92zwbn3rrjx4c0000gn/T/repo2-0hkgj3su/apache&spark_github.com/sql/hive-thriftserver/src/gen/java/org/apache/hive/service/cli/thrift/TCLIService.java was skipped: it is too big - 516093 bytes
INFO:repo2bow:https://github.com/apache/spark pending tasks: 304
INFO:repo2bow:https://github.com/apache/spark pending tasks: 296
INFO:repo2bow:https://github.com/apache/spark pending tasks: 288
INFO:repo2bow:https://github.com/apache/spark pending tasks: 280
INFO:repo2bow:https://github.com/apache/spark pending tasks: 272
INFO:repo2bow:https://github.com/apache/spark pending tasks: 264
INFO:repo2bow:https://github.com/apache/spark pending tasks: 256
INFO:repo2bow:https://github.com/apache/spark pending tasks: 248
INFO:repo2bow:https://github.com/apache/spark pending tasks: 240
INFO:repo2bow:https://github.com/apache/spark pending tasks: 232
INFO:repo2bow:https://github.com/apache/spark pending tasks: 224
INFO:repo2bow:https://github.com/apache/spark pending tasks: 216
INFO:repo2bow:https://github.com/apache/spark pending tasks: 208
INFO:repo2bow:https://github.com/apache/spark pending tasks: 200
INFO:repo2bow:https://github.com/apache/spark pending tasks: 192
INFO:repo2bow:https://github.com/apache/spark pending tasks: 184
INFO:repo2bow:https://github.com/apache/spark pending tasks: 176
INFO:repo2bow:https://github.com/apache/spark pending tasks: 168
INFO:repo2bow:https://github.com/apache/spark pending tasks: 160
INFO:repo2bow:https://github.com/apache/spark pending tasks: 152
INFO:repo2bow:https://github.com/apache/spark pending tasks: 144
INFO:repo2bow:https://github.com/apache/spark pending tasks: 136
INFO:repo2bow:https://github.com/apache/spark pending tasks: 128
INFO:repo2bow:https://github.com/apache/spark pending tasks: 120
INFO:repo2bow:https://github.com/apache/spark pending tasks: 112
INFO:repo2bow:https://github.com/apache/spark pending tasks: 104
INFO:repo2bow:https://github.com/apache/spark pending tasks: 96
INFO:repo2bow:https://github.com/apache/spark pending tasks: 88
INFO:repo2bow:https://github.com/apache/spark pending tasks: 80
INFO:repo2bow:https://github.com/apache/spark pending tasks: 72
INFO:repo2bow:https://github.com/apache/spark pending tasks: 64
INFO:repo2bow:https://github.com/apache/spark pending tasks: 56
INFO:repo2bow:https://github.com/apache/spark pending tasks: 48
WARNING:repo2bow:/var/folders/rx/z9zyr71d70x92zwbn3rrjx4c0000gn/T/repo2-0hkgj3su/apache&spark_github.com/python/pyspark/sql/tests.py was skipped: it is too big - 270463 bytes
INFO:repo2bow:https://github.com/apache/spark pending tasks: 40
INFO:repo2bow:https://github.com/apache/spark pending tasks: 32
INFO:repo2bow:https://github.com/apache/spark pending tasks: 24
INFO:repo2bow:https://github.com/apache/spark pending tasks: 16
INFO:repo2bow:https://github.com/apache/spark pending tasks: 8
INFO:repo2bow:https://github.com/apache/spark pending tasks: 0
                Parallel and distributed processing - General IT    4.49
                Machine Learning, sklearn-like APIs - General IT    3.88
               Java/JS + async + JSON serialization - General IT    3.77
                            Cryptography: libraries - General IT    3.23
                        SQL, working with databases - General IT    3.18
                Java string input/output - Programming languages    3.16
                          Java: Spring, Hibernate - Technologies    3.11
                              Operations on numbers - General IT    3.02
                               Distributed clusters - General IT    2.69
           Functional programming, Scala - Programming languages    2.64
```

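The log reports the document-frequency table being pruned "to min 20 occurrences", shrinking it from 5720096 to 416370 entries. A minimal sketch of what such a pruning step might look like is given below. The function name `prune_docfreq` is hypothetical and is not taken from modelforge or sourced.ml, and the inclusive threshold (`>=`) is an assumption. The first two sample counts come from the log; the third is made up for illustration.

```python
from typing import Dict


def prune_docfreq(docfreq: Dict[str, int], min_count: int = 20) -> Dict[str, int]:
    """Keep only tokens whose document frequency is at least `min_count`.

    Hypothetical helper: the inclusive comparison is an assumption,
    the real implementation may differ.
    """
    return {token: count for token, count in docfreq.items() if count >= min_count}


# 'aaa' and 'aaaaaaaaaaa' appear in the log above; 'rare_token' is invented.
sample = {"aaa": 6322, "aaaaaaaaaaa": 30, "rare_token": 3}
pruned = prune_docfreq(sample)
print(pruned)
```

Dropping rare tokens this way keeps the vocabulary small, which is presumably why the smaller table is the one reported as loaded.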
vmarkovtsev commented 6 years ago

Gigantic effort @bzz!

zurk commented 6 years ago

All right, I think this (https://github.com/src-d/ml/blob/master/sourced/ml/cmd/repos2bow.py) can be helpful. We use it to convert repositories to BOW models. It is this complex because we also use it in the Apollo project, but the basic pipeline for creating a BOW model can be seen in the initial version of repo2bow: https://github.com/zurk/ml/blob/d7a093de39e90db9a9c74515d6b2029240de7b96/sourced/ml/cmd_entries/repos2bow.py

I am not sure how familiar you are with the new sourced-ml, @bzz. If you want, we can have a call and I will explain the main aspects to you.
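For readers following the thread, the general bag-of-words idea behind a repo-to-BOW pipeline can be sketched roughly as follows. This is not the sourced.ml or vecino code: `split_identifier` and `repo_to_bow` are hypothetical names, the identifier-splitting regex is a simplification, and the TF-IDF weighting is an assumption about how a docfreq model could be applied.

```python
import math
import re
from collections import Counter
from typing import Dict, Iterable, List


def split_identifier(identifier: str) -> List[str]:
    """Split snake_case / camelCase identifiers into lowercase tokens (simplified)."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+", identifier)
    return [part.lower() for part in parts]


def repo_to_bow(identifiers: Iterable[str],
                docfreq: Dict[str, int],
                n_docs: int) -> Dict[str, float]:
    """Build a TF-IDF weighted bag of words (hypothetical helper).

    Tokens missing from `docfreq` are dropped, on the assumption that the
    vocabulary is fixed by the docfreq model.
    """
    counts = Counter(tok for ident in identifiers for tok in split_identifier(ident))
    return {
        tok: tf * math.log(n_docs / docfreq[tok])
        for tok, tf in counts.items()
        if tok in docfreq
    }


# Illustrative input; the identifiers and frequencies are made up.
identifiers = ["readFile", "read_line", "parseHTTPResponse"]
docfreq = {"read": 100, "file": 50, "http": 10}
bow = repo_to_bow(identifiers, docfreq, n_docs=1000)
print(bow)
```

In the real pipeline the identifiers would come from UASTs produced by the bblfsh server rather than from a hard-coded list.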

vmarkovtsev commented 6 years ago

This is an excellent chance to improve our documentation btw.

zurk commented 6 years ago

Yeah, good idea. We have something here: https://docs.sourced.tech/sourced-ml but it only tells you how to use it and says nothing about development.

I think I can add more docstrings to our codebase. @bzz, if you can, please let me know about anything that is confusing or hard to understand in sourced-ml, and I will add docstrings there first. I am asking because it is hard to spot the most problematic places from the inside :)

vmarkovtsev commented 6 years ago

@bzz The core part here is extracting the BOW. You can use the revamped function from Vecino now: https://github.com/src-d/vecino/blob/master/vecino/repo2bow.py

bzz commented 6 years ago

Yes, that is exactly the missing component that I had to resurrect from git history 🚀

Is it OK to use vecino as a dependency here?

vmarkovtsev commented 6 years ago

@bzz It is completely fine to copy-paste for now - we will add this to sourced-ml once we have time.