nexB / scancode-toolkit

ScanCode detects licenses, copyrights, dependencies by "scanning code" ... to discover and inventory open source and third-party packages used in your code. Sponsored by NLnet project https://nlnet.nl/project/vulnerabilitydatabase, the Google Summer of Code, Azure credits, nexB and other generous sponsors!
https://github.com/nexB/scancode-toolkit/releases/

Improve Programming language detection and classification #1445

Open pombredanne opened 5 years ago

pombredanne commented 5 years ago

Description

ScanCode programming language detection is not as accurate as it could be, and it is important to get this right to drive further automation. We also need to automatically classify each file into facets when possible.

The goal of this ticket is twofold. First, improve the quality of programming language detection, which uses only Pygments today and could use another tool (e.g. a Bayesian classifier such as GitHub Linguist, or enry?). Second, create and implement a flexible framework of rules to automate assigning files to facets, which could use some machine learning and a classifier.

See https://github.com/nexB/aboutcode/wiki/GSOC-2019#improve-programming-language-detection-and-classification-in-scancode

Here are some current tools for general file type and programming language detection.

In use today:

(We also use a Shannon entropy detector and binaryornot to detect binaries.)
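For readers unfamiliar with the idea, here is a minimal sketch of how a Shannon entropy check can flag likely-binary content. It is an illustration only, not ScanCode's actual detector: the function names and the `6.0` bits-per-byte threshold are assumptions chosen for the example.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Return the Shannon entropy of data in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)  # maps each byte value to its number of occurrences
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def looks_binary(data: bytes, threshold: float = 6.0) -> bool:
    """Hypothetical heuristic: a NUL byte or high entropy suggests binary content.

    The threshold is an assumed value for illustration, not a tuned one.
    """
    return b"\x00" in data or shannon_entropy(data) > threshold
```

Source code tends to reuse a small set of characters, so its entropy is typically well below that of compressed or encrypted data; in practice the threshold would need tuning on real files.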

Things to look at could include:

See also: #1036 #1012 and #426 #1355 #1201

Ritvyk commented 5 years ago

Hi there! We can also use regular expressions to detect the programming language the code is written in. I will soon upload source code for this; I am working on it right now!
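As a rough illustration of what a regex-based approach could look like (this is not the code mentioned in the comment above, and the pattern set and function names are hypothetical), each language gets a few characteristic patterns and the language with the most matching patterns wins:

```python
import re

# Hypothetical, deliberately tiny rule set: each language maps to patterns
# that are characteristic of its source code.
LANGUAGE_PATTERNS = {
    "Python": [
        r"^\s*def \w+\(.*\):",
        r"^\s*import \w+",
        r"^\s*from \w+(\.\w+)* import ",
    ],
    "C": [r"^\s*#include\s*<\w+\.h>", r"\bint main\s*\("],
    "Java": [r"^\s*public\s+(final\s+)?class\s+\w+", r"^\s*import java\."],
}

COMPILED = {
    lang: [re.compile(p, re.MULTILINE) for p in patterns]
    for lang, patterns in LANGUAGE_PATTERNS.items()
}


def score_languages(text):
    """Return {language: number of its patterns that matched at least once}."""
    return {
        lang: sum(1 for p in patterns if p.search(text))
        for lang, patterns in COMPILED.items()
    }


def detect_language(text):
    """Return the best-scoring language, or None if nothing matches or on a tie."""
    ranked = sorted(score_languages(text).items(), key=lambda kv: kv[1], reverse=True)
    (best_lang, best_score), (_, runner_up_score) = ranked[0], ranked[1]
    if best_score == 0 or best_score == runner_up_score:
        return None
    return best_lang
```

Returning `None` on a tie keeps ambiguous input (for example a lone `import java.util.List;` line, which matches one Python pattern and one Java pattern) from being reported as a definite language.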

mjherzog commented 4 years ago

Some recent examples of errors for Programming Language (pygments):

pombredanne commented 4 years ago

@mjherzog thanks. I pushed an updated Pygments library in 4aaec8c58d6813f7428b010bae1494fdf45ac5c8, but this is only a first baby step.

mjherzog commented 4 years ago

I don't know how/if this factors into a solution, but I would say that "false positives" are the main concern. It would be better for a .rST file to be reported as No Value Detected for Programming Language than as a false positive for VB.Net.
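One way to express this preference in code is to let the detector abstain when it is not confident. The sketch below assumes some upstream classifier already produces per-language confidence scores between 0 and 1; the function name, the score format, and both thresholds are hypothetical.

```python
from typing import Mapping, Optional


def pick_language(
    scores: Mapping[str, float],
    min_confidence: float = 0.8,  # assumed cutoff, would need tuning
    min_margin: float = 0.2,      # assumed lead required over the runner-up
) -> Optional[str]:
    """Return the top-scoring language only when it is a clear winner.

    Returns None (i.e. "No Value Detected") rather than risk a false positive.
    """
    if not scores:
        return None
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    top_lang, top_score = ranked[0]
    second_score = ranked[1][1] if len(ranked) > 1 else 0.0
    if top_score < min_confidence or top_score - second_score < min_margin:
        return None
    return top_lang
```

With these assumed thresholds, `pick_language({"VB.NET": 0.45, "reStructuredText": 0.40})` returns `None`, while `pick_language({"Python": 0.95, "Ruby": 0.03})` returns `"Python"`.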