src-d / enry

A faster file programming language detector
https://blog.sourced.tech/post/enry/
Apache License 2.0

Language detection accuracy measurements #246

Open bzz opened 5 years ago

bzz commented 5 years ago

Enry currently consists of a sequence of matching strategies that narrow down the possible language options based on different available information:

As a user, since each strategy can be used independently, I would like to know how accurate the language detection will be for each of the distinct use cases.

Use cases

Evaluation

Right now, the only measure we have of the overall accuracy of the language detection process is binary (similar to linguist): whether or not all the files in linguist/examples/ are classified.

This issue is about picking a better way of quantifying the prediction quality for the three use cases above.

Steps

The focus of this task is not to get the best possible evaluation, but rather to quickly kick off the automation with at least some evaluation that can be improved in subsequent work.