sourcerer-io / sourcerer-app

🦄 Sourcerer app makes a visual profile from your GitHub and git repositories.
https://sourcerer.io/start

Support multiple file extensions which map to the same language (simplify language identification) #495

Open dancompton opened 5 years ago

dancompton commented 5 years ago

Some languages support multiple file extensions that result in no behavioral difference in compilation. Jsonnet is one of them: its frequently used extensions are .jsonnet, .libsonnet, and .TEMPLATE (used by DataStax).

In object Heuristics, fun analyze, we first attempt to find the extractor factory by file extension. For the above case, this will fail approximately 50% of the time unless we introduce file extensions in the HeuristicsMap that are not listed in Languages.kt.

These dangling mappings could be problematic if a language type is removed. Would it be possible to simplify this code by creating a single map from language name to its extensions and extractor, e.g. mapOf<String, Triple<String, List<String>, ExtractorFactory>> holding (languageName, languageExtensions, extractorImplementation)?

An auxiliary map can be constructed that maps extensions back to the language name for speedup in language identification.
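
Concretely, that might look like the following Kotlin sketch (all names here are hypothetical; ExtractorFactory stands in for whatever extractor interface the app actually uses):

// Hypothetical single source of truth: language name -> (name, extensions, extractor).
interface ExtractorFactory {
    fun extract(content: String): List<String>
}

class JsonnetExtractor : ExtractorFactory {
    override fun extract(content: String): List<String> = emptyList()
}

val languages: Map<String, Triple<String, List<String>, ExtractorFactory>> = mapOf(
    "jsonnet" to Triple("jsonnet", listOf("jsonnet", "libsonnet", "TEMPLATE"), JsonnetExtractor())
)

// Auxiliary reverse map, derived from the table above: extension -> language name,
// for constant-time language identification by extension.
val extensionToLanguage: Map<String, String> = languages.values
    .flatMap { (name, extensions, _) -> extensions.map { ext -> ext to name } }
    .toMap()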

sergey48k commented 5 years ago

@yaronskaya Could you please comment?

dancompton commented 5 years ago

@sergey48k @yaronskaya

Depending on your philosophy and deployment strategy, a new microservice could be a good way to break out language detection. It could rely primarily on (or proxy) https://github.com/yoeo/guesslang, which reports roughly 90% accuracy across various language-guessing tasks (from snippets alone). The service's API might look like:

syntax = "proto3";

import "google/protobuf/any.proto";
import "google/api/annotations.proto";

package sourcerer.linguist.v1;

// Mirrors the language list in Languages.kt.
enum Language {
  UNKNOWN = 0;
  GOLANG = 1;
  KOTLIN = 2;
  // ... other values from Languages.kt
}

// Linguist makes a best-effort guess at a snippet's language.
service Linguist {
  rpc GetLanguageBySnippet(GetLanguageBySnippetRequest)
      returns (GetLanguageBySnippetResponse);

  rpc CheckHealth(HealthCheckRequest) returns (HealthCheckResponse) {
    option (google.api.http) = {
      get: "/_status"
    };
  }
}

message GetLanguageBySnippetRequest {
  string request_ulid = 1;  // sortable in time (https://github.com/ulid/spec)
  string snippet = 2;
  map<string, string> metadata = 3;  // could contain file extension or other identifying data
}

message GetLanguageBySnippetResponse {
  string response_ulid = 1;
  ErrorStatus error_status = 2;
  Language language = 3;
  float confidence = 4;
}

message ErrorStatus {
  string message = 1;
  repeated google.protobuf.Any details = 2;
}

// HealthCheckRequest is a request for the serving status of a service.
message HealthCheckRequest {
  // service is the service name.
  string service = 1;
}

// HealthCheckResponse wraps the requested serving status type.
message HealthCheckResponse {
  enum ServingStatus {
    UNKNOWN = 0;
    SERVING = 1;
    NOT_SERVING = 2;
  }
  ServingStatus status = 1;
  string version = 2;
}

yaronskaya commented 5 years ago

Hi @dan-compton. Languages.kt stores just language names, not extensions. Extensions are stored in HeuristicsMap. So if a language has multiple extensions, you should add an extension -> Extractor mapping for each extension of that language (see how the mapping was implemented for cpp, for example). If multiple languages share a common extension, we use language regexes to decide which language the file contains. Our language detection was inspired by Linguist, but we have our own implementation to keep data confidential.
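
For jsonnet, that pattern might look roughly like this (a Kotlin sketch with hypothetical names, not the actual Heuristics.kt entries):

// A stand-in for the app's real extractor implementation (hypothetical).
class JsonnetExtractor

// Each jsonnet extension maps to the same factory, mirroring how multiple
// C++ extensions (cpp, cc, cxx, ...) all map to one extractor.
val jsonnetFactory: () -> JsonnetExtractor = { JsonnetExtractor() }

val heuristicsMap: Map<String, () -> JsonnetExtractor> = mapOf(
    "jsonnet" to jsonnetFactory,
    "libsonnet" to jsonnetFactory,
    "TEMPLATE" to jsonnetFactory
)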

dancompton commented 5 years ago

@yaronskaya

https://github.com/sourcerer-io/sourcerer-app/blob/develop/src/main/kotlin/app/extractors/Heuristics.kt#L278-L357 demonstrates that the extensions used in Heuristics.kt are indeed provided by Languages.kt in some cases. In others, regular expressions with no quantifiable accuracy are used. I'm suggesting that this language-detection behavior be normalized and encapsulated in its own service.

I acknowledge your point, but if we continue down the path of adding extra extensions to the map in Heuristics, we are left with dangling extensions if, say, a language is removed. There's no single location where one can obtain a mapping from a language name to its file extensions.

With respect to confidentiality (which I do not fully understand), one of Linguist's key principles is to let the language author implement the language detector using some form of grammar and parser. This approach does two things:

  1. Offloads the work of ensuring language recognition onto the language owner.
  2. Improves accuracy, because detection is based on actual parsing and the parser is implemented by the foremost expert on the language.

I'd also like to reiterate that this can be handled by machine learning: guesslang is pushing 90% accuracy in testing -- and that's just from snippets (no extension).

yaronskaya commented 5 years ago

@dan-compton Regarding confidentiality, I mean that code snippets shouldn't leave the sourcerer-app. Another thing that worries me is that we would have to run the classifier on every snippet; for big repositories that would take a very long time. Also, there are only 20 supported languages.
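
If runtime cost is the main concern, one mitigation would be to keep the classifier as a fallback only, so most files in a large repository never touch the model. A Kotlin sketch under that assumption (classify is a hypothetical hook, e.g. a guesslang-backed call):

// Hypothetical hybrid lookup: cheap extension match first, classifier only as fallback.
fun detectLanguage(
    extension: String,
    snippet: String,
    extensionToLanguage: Map<String, String>,
    classify: (String) -> String?
): String? = extensionToLanguage[extension] ?: classify(snippet)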

dancompton commented 5 years ago

@yaronskaya Sorry -- I have been busy with handoffs and birthday celebrations. First of all, you bring up a good point regarding the lack of support for many of the languages that Linguist supports.

You're right that guesslang only supports 20 languages, so I'm going to go ahead and implement the code in the existing framework -- look for a PR tonight or tomorrow. Linguist supports 453 types of structured textual document, so guesslang, by default, is not good enough for this use case. It is extensible, but we would need to train a model for each new language (though I see this as the parallel of creating a regex classifier for each language we need to support).

The key issue here, I believe, is summed up in this chart: https://guesslang.readthedocs.io/en/latest/_images/co-occurrence.png

Indeed, many languages co-occur in snippets of limited size, or share symbols and keywords used in contexts that might easily confuse a classifier. One example they provide:

As shown in the co-occurrence graph, one of Guesslang's limitations is the proportion of C++ files mistaken for C files. That was expected because it is OK to write pure C source code in a C++ file.

This makes perfect sense. For the particular issue that I'm addressing (adding Jsonnet support), a classifier might prove difficult to implement because Jsonnet is "just" sugared JSON; in fact, JSON is valid Jsonnet and can be used within a Jsonnet program.

In summary:

yaronskaya commented 5 years ago

@dan-compton I wonder: why not just define the jsonnet language by its extensions? That would not create false positives, since no other language uses jsonnet's extensions.