Hyperpolyglot is a fast programming language detector written in Rust, based on GitHub's Linguist Ruby library. It can detect the programming language of a single file or the language makeup of a directory. For more details on how the language detection is done, see the Linguist README.
Installing
cargo install hyperpolyglot
Usage
hyply [PATH]
Output
85.00% Rust
15.00% RenderScript
Adding as a dependency
[dependencies]
hyperpolyglot = "0.1.0"
Detect
use hyperpolyglot::Detection;
use std::path::Path;

// Detect the language of a single file.
let detection = hyperpolyglot::detect(Path::new("src/bin/main.rs"));
assert_eq!(Ok(Some(Detection::Heuristics("Rust"))), detection);
Breakdown
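use hyperpolyglot::{get_language_breakdown, Detection};
use std::collections::HashMap;
use std::path::PathBuf;

// Map each detected language to the files (and detection results) found under src/.
let breakdown: HashMap<&'static str, Vec<(Detection, PathBuf)>> = get_language_breakdown("src/");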
println!("{:?}", breakdown.get("Rust"));
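To get from a breakdown map to percentages like those in the Output section, one option is to count the files per language and divide by the total. The sketch below is not part of the hyperpolyglot API: `file_percentages` and `format_row` are hypothetical helper names, and the function is generic over the entry type so that it accepts the `Vec<(Detection, PathBuf)>` values shown above as well as any other per-file record.

```rust
use std::collections::HashMap;

/// Turns a per-language breakdown into (language, percent of files) pairs,
/// sorted from most to least common. Hypothetical helper, not a crate function.
pub fn file_percentages<T>(
    breakdown: &HashMap<&'static str, Vec<T>>,
) -> Vec<(&'static str, f64)> {
    // Total number of files across all languages.
    let total: usize = breakdown.values().map(Vec::len).sum();
    if total == 0 {
        return Vec::new();
    }
    let mut rows: Vec<(&'static str, f64)> = breakdown
        .iter()
        .map(|(language, files)| (*language, 100.0 * files.len() as f64 / total as f64))
        .collect();
    // Highest share first; ties are broken alphabetically for stable output.
    rows.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap().then(a.0.cmp(b.0)));
    rows
}

/// Formats one row in the style of the CLI output, e.g. `85.00% Rust`.
pub fn format_row(language: &str, percent: f64) -> String {
    format!("{:.2}% {}", percent, language)
}
```

Passing `&breakdown` to `file_percentages` and printing each pair with `format_row` gives one line per language, ordered by share of files.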
The probability of the language occurring is not taken into account when classifying. All languages are assumed to have equal probability.
An additional heuristic was added for .h files.
Vim and Emacs modelines are not considered in the detection process.
Generated and Binary files are not excluded from the breakdown function.
When calculating the language makeup of a directory, file count is used instead of byte count.
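The first point above, equal language probability, can be illustrated with a small token classifier. This is a minimal sketch that assumes a naive-Bayes-style model over token counts with add-one smoothing. It is not hyperpolyglot's actual classifier, and `LanguageModel` and `classify` are hypothetical names. A full naive Bayes score would add `ln P(language)` to each candidate; when every language has the same prior, that term is identical for all candidates and can be dropped.

```rust
use std::collections::{HashMap, HashSet};

/// Token counts for one language, gathered from training samples (toy model).
pub struct LanguageModel {
    pub name: &'static str,
    pub token_counts: HashMap<&'static str, u32>,
    pub total_tokens: u32,
}

impl LanguageModel {
    pub fn new(name: &'static str, counts: &[(&'static str, u32)]) -> Self {
        let token_counts: HashMap<&'static str, u32> = counts.iter().copied().collect();
        let total_tokens: u32 = token_counts.values().sum();
        LanguageModel { name, token_counts, total_tokens }
    }

    /// Sum of ln P(token | language), using add-one smoothing so that
    /// unseen tokens do not produce a probability of zero.
    fn log_likelihood(&self, tokens: &[&str], vocab_size: usize) -> f64 {
        tokens
            .iter()
            .map(|t| {
                let count = self.token_counts.get(*t).copied().unwrap_or(0) as f64;
                ((count + 1.0) / (self.total_tokens as f64 + vocab_size as f64)).ln()
            })
            .sum()
    }
}

/// Picks the language with the highest likelihood. No prior term is added,
/// which is equivalent to assuming all languages are equally probable.
pub fn classify(models: &[LanguageModel], tokens: &[&str]) -> Option<&'static str> {
    // Vocabulary size for smoothing: distinct tokens across all models.
    let vocab: HashSet<&str> = models
        .iter()
        .flat_map(|m| m.token_counts.keys().copied())
        .collect();
    models
        .iter()
        .map(|m| (m.name, m.log_likelihood(tokens, vocab.len())))
        .max_by(|a, b| a.1.partial_cmp(&b.1).unwrap())
        .map(|(name, _)| name)
}
```

Under this assumption, a language that is rare in practice competes on equal footing with a common one, so the result depends only on how well the file's tokens match each language's samples.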
samples dir
Tool | mean (ms) | median (ms) | min (ms) | max (ms) |
---|---|---|---|---|
hyperpolyglot (multi-threaded) | 1,188 | 1,186 | 1,166 | 1,226 |
hyperpolyglot (single-threaded) | 2,424 | 2,424 | 2,414 | 2,442 |
enry | 21,619 | 21,566 | 21,514 | 21,855 |
Linguist | 42,407 | 42,386 | 42,070 | 42,856 |
Rust Repo
Tool | mean (ms) | median (ms) | min (ms) | max (ms) |
---|---|---|---|---|
hyperpolyglot (multi-threaded) | 3,808 | 3,751 | 3,708 | 4,253 |
hyperpolyglot (single-threaded) | 8,341 | 8,334 | 8,276 | 8,437 |
enry | 82,300 | 82,215 | 82,021 | 82,817 |
Linguist | 196,780 | 197,300 | 194,033 | 202,930 |
Linux Kernel
Tool | mean (s) | median (s) | min (s) | max (s) |
---|---|---|---|---|
hyperpolyglot (multi-threaded) | 3.7574 | 3.7357 | 3.7227 | 3.9021 |
hyperpolyglot (single-threaded) | 7.5833 | 7.5683 | 7.5445 | 7.6489 |
enry | 137.6046 | 137.4229 | 137.1955 | 138.8694 |
All of the programming language detectors are far from perfect, and hyperpolyglot is no exception. Its language detections mirror those of Linguist and enry for most files, with the biggest divergences coming from files that need to fall back on the classifier. Files that can be detected through a common known filename, an extension, or by following the set of heuristics should approach 100% accuracy.
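The difference in accuracy between the table-driven stages and the classifier is easier to see in a staged lookup. The sketch below is a toy illustration, not hyperpolyglot's implementation: the `Stage` enum, the `detect_staged` function, and the tiny lookup tables are all hypothetical, and the heuristics stage is left out for brevity. Exact filename and extension matches are deterministic, while the final stage delegates to whatever classifier the caller supplies.

```rust
use std::path::Path;

/// Which stage produced the answer (hypothetical, for illustration only).
#[derive(Debug, PartialEq)]
pub enum Stage {
    Filename,
    Extension,
    Classifier,
}

// Toy lookup tables; a real detector would use far larger ones.
const FILENAMES: &[(&str, &str)] = &[("Makefile", "Makefile"), ("Dockerfile", "Dockerfile")];
const EXTENSIONS: &[(&str, &str)] = &[("rs", "Rust"), ("py", "Python"), ("rb", "Ruby")];

/// Tries an exact filename match, then an extension match, and only then
/// falls back to the caller-supplied classifier.
pub fn detect_staged(
    path: &Path,
    classify: impl Fn(&Path) -> Option<&'static str>,
) -> Option<(Stage, &'static str)> {
    // Stage 1: well-known filenames.
    if let Some(name) = path.file_name().and_then(|n| n.to_str()) {
        if let Some((_, lang)) = FILENAMES.iter().find(|(known, _)| *known == name) {
            return Some((Stage::Filename, *lang));
        }
    }
    // Stage 2: file extension.
    if let Some(ext) = path.extension().and_then(|e| e.to_str()) {
        if let Some((_, lang)) = EXTENSIONS.iter().find(|(known, _)| *known == ext) {
            return Some((Stage::Extension, *lang));
        }
    }
    // Stage 3: statistical fallback, the least reliable stage.
    classify(path).map(|lang| (Stage::Classifier, lang))
}
```

Returning the stage alongside the language lets a caller treat classifier results with more caution than table-driven ones.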
Licensed under either of
at your option.
Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.