github-linguist / linguist

Language Savant. If your repository's language is being reported incorrectly, send us a pull request!
MIT License

Rust files are being misclassified as RenderScript files #3998

Status: Closed (urschrei closed this issue 6 years ago)

urschrei commented 6 years ago

Problem Description

Several files in the rust-geo repository are being misclassified as RenderScript. Because these files have large line counts, the repository's overall language is misreported as well.

URL of the affected repository:

https://github.com/georust/rust-geo

Misclassified files: https://github.com/georust/rust-geo/search?l=renderscript

Last modified on:

2018-01-23

Expected language:

Rust

Detected language:

RenderScript

lildude commented 6 years ago

This is happening because the content of those files doesn't contain anything that the heuristic at https://github.com/github/linguist/blob/8da6ddf9d97ee1cd1d7119f3e8a5249df7d1590a/lib/linguist/heuristics.rb#L427-L433 can use to distinguish the two languages.

The content also doesn't vary enough in language structure for the classifier to get it right. In fact, it is hard even for a human to guess the language of those files: they don't look like code at all, but rather like a list of coordinates, which would technically make them data.
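To illustrate the fall-through described above, here is a minimal Python sketch (not Linguist's actual Ruby code). The two patterns, the function name guess_rs_language, and the sample file contents are assumptions made for illustration only; the real rules are the ones at the linked heuristics.rb lines.

```python
import re

# Illustrative patterns only: assumed for this sketch, not copied from
# the linked heuristics.rb lines.
RUST_HINT = re.compile(r"^(use |fn |mod |pub |macro_rules|impl|#!?\[)", re.MULTILINE)
RENDERSCRIPT_HINT = re.compile(r"#include|#pragma\s+(rs|version)|__attribute__")


def guess_rs_language(source):
    """Return a language name, or None when neither pattern matches."""
    if RUST_HINT.search(source):
        return "Rust"
    if RENDERSCRIPT_HINT.search(source):
        return "RenderScript"
    return None


# Hypothetical file content: a bare list of coordinates, as described above.
coordinate_file = "vec![\n    [1.0, 2.0],\n    [3.0, 4.0],\n]\n"
# Hypothetical file content: ordinary Rust source with recognisable keywords.
ordinary_rust = "use std::fmt;\n\nfn main() {}\n"

print(guess_rs_language(coordinate_file))
print(guess_rs_language(ordinary_rust))
```

Under these assumed patterns, the coordinate-only content matches neither rule, so the decision falls through to the statistical classifier, while the ordinary Rust source is caught by the first rule.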

Because the vec at the beginning of each file appears to be a custom function or method name rather than a standard language keyword or method, there's nothing we can do to improve the automatic classification of these files either. Your only option is to add a manual override that either marks those files as Rust or makes Linguist ignore them.
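As a rough sketch, such an override would live in a .gitattributes file at the root of the repository. The first rule uses Linguist's linguist-language attribute; the path in the second rule is hypothetical and would need to be replaced with the real location of the coordinate files.

```gitattributes
# Classify every .rs file in the repository as Rust.
*.rs linguist-language=Rust

# Alternatively, exclude the coordinate files from the language statistics.
# The path below is hypothetical; substitute the real location of the files.
path/to/coordinate_data/*.rs linguist-vendored
```

Only one of the two rules should be needed: the first relabels the files, the second removes them from the statistics altogether.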