sourcerer-io / sourcerer-app

🦄 Sourcerer app makes a visual profile from your GitHub and git repositories.
https://sourcerer.io/start

Support multiple file extensions which map to the same language (simplify language identification) #495

Open dancompton opened 5 years ago

dancompton commented 5 years ago

Some languages support multiple file extensions that result in no behavioral difference in compilation. Jsonnet is one of them: its frequently used extensions are .jsonnet, .libsonnet, and .TEMPLATE (used by DataStax).

In object Heuristics, fun analyze, we first attempt to find the extractor factory by file extension. For the above case, this will fail approximately 50% of the time unless we introduce file extensions in the HeuristicsMap that are not listed in Languages.kt.

These dangling mappings could be problematic if a language type is removed. Would it be possible to simplify this code by creating a single map from language name to its extensions and extractor, e.g. mapOf<String, Triple<String, List<String>, ExtractorFactory>> holding (languageName, languageExtensions, extractorImplementation)?

An auxiliary map can be constructed that maps extensions back to the language name for speedup in language identification.
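
Concretely, that might look like the following Kotlin sketch (all names here are hypothetical; ExtractorFactory stands in for whatever extractor interface the app actually uses):

// Hypothetical single source of truth: language name -> (name, extensions, extractor).
interface ExtractorFactory {
    fun extract(content: String): List<String>
}

class JsonnetExtractor : ExtractorFactory {
    override fun extract(content: String): List<String> = emptyList()
}

val languages: Map<String, Triple<String, List<String>, ExtractorFactory>> = mapOf(
    "jsonnet" to Triple("jsonnet", listOf("jsonnet", "libsonnet", "TEMPLATE"), JsonnetExtractor())
)

// Auxiliary reverse map, derived from the table above: extension -> language name,
// for constant-time language identification by extension.
val extensionToLanguage: Map<String, String> = languages.values
    .flatMap { (name, extensions, _) -> extensions.map { ext -> ext to name } }
    .toMap()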

sergey48k commented 5 years ago

@yaronskaya Could you please comment?

dancompton commented 5 years ago

@sergey48k @yaronskaya

Depending on your philosophy and deployment strategy, a new microservice could be a good way to break out language detection. It could rely primarily on (or proxy) https://github.com/yoeo/guesslang, which reports roughly 90% accuracy across various language-guessing tasks (from snippets alone). The service's API might look like:

syntax = "proto3";

import "google/protobuf/any.proto";
import "google/api/annotations.proto";

package sourcerer.linguist.v1;

// Mirrors the language list in Languages.kt.
enum Language {
  UNKNOWN = 0;
  GOLANG = 1;
  KOTLIN = 2;
  // ... other values from Languages.kt
}

// Linguist makes a best-effort guess at a snippet's language.
service Linguist {
  rpc GetLanguageBySnippet(GetLanguageBySnippetRequest)
      returns (GetLanguageBySnippetResponse);

  rpc CheckHealth(HealthCheckRequest) returns (HealthCheckResponse) {
    option (google.api.http) = {
      get: "/_status"
    };
  }
}

message GetLanguageBySnippetRequest {
  string request_ulid = 1;  // sortable in time (https://github.com/ulid/spec)
  string snippet = 2;
  map<string, string> metadata = 3;  // could contain file extension or other identifying data
}

message GetLanguageBySnippetResponse {
  string response_ulid = 1;
  ErrorStatus error_status = 2;
  Language language = 3;
  float confidence = 4;
}

message ErrorStatus {
  string message = 1;
  repeated google.protobuf.Any details = 2;
}

// HealthCheckRequest is a request for the serving status of a service.
message HealthCheckRequest {
  // service is the service name.
  string service = 1;
}

// HealthCheckResponse wraps the requested serving status type.
message HealthCheckResponse {
  enum ServingStatus {
    UNKNOWN = 0;
    SERVING = 1;
    NOT_SERVING = 2;
  }
  ServingStatus status = 1;
  string version = 2;
}

yaronskaya commented 5 years ago

Hi @dan-compton. Languages.kt stores just language names, not extensions. Extensions are stored in HeuristicsMap. So if a language has multiple extensions, you should add an extension -> Extractor mapping for each extension of that language (see how the mapping was implemented for cpp, for example). If multiple languages share a common extension, we use language regexes to decide which language the file contains. Our language detection was inspired by Linguist, but we have our own implementation to keep data confidential.
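
For jsonnet, that pattern might look roughly like this (a Kotlin sketch with hypothetical names, not the actual Heuristics.kt entries):

// A stand-in for the app's real extractor implementation (hypothetical).
class JsonnetExtractor

// Each jsonnet extension maps to the same factory, mirroring how multiple
// C++ extensions (cpp, cc, cxx, ...) all map to one extractor.
val jsonnetFactory: () -> JsonnetExtractor = { JsonnetExtractor() }

val heuristicsMap: Map<String, () -> JsonnetExtractor> = mapOf(
    "jsonnet" to jsonnetFactory,
    "libsonnet" to jsonnetFactory,
    "TEMPLATE" to jsonnetFactory
)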

dancompton commented 5 years ago

@yaronskaya

https://github.com/sourcerer-io/sourcerer-app/blob/develop/src/main/kotlin/app/extractors/Heuristics.kt#L278-L357 demonstrates that the extensions used in Heuristics.kt are indeed provided by Languages.kt in some cases. In others, regular expressions with no quantifiable accuracy are used. I'm suggesting that this language-detection behavior be normalized and encapsulated in its own service.

I acknowledge your point, but if we continue down the path of adding extra extensions to the map in Heuristics, we are left with dangling extensions if, say, a language is removed. There's no single location where one can obtain a mapping from a language name to its file extensions.

With respect to confidentiality (which I do not fully understand), one of Linguist's key principles is to let the language author implement the language detector using some form of grammar and parser. This approach does two things:

  1. Offloads the work of ensuring language recognition onto the language owner.
  2. Improves accuracy, because detection is based on actual parsing and the parser is implemented by the foremost expert on the language.

I'd also like to reiterate that this can be handled by machine learning: guesslang is pushing 90% accuracy in testing -- and that's just from snippets (no extension).

yaronskaya commented 5 years ago

@dan-compton Regarding confidentiality, I mean that code snippets shouldn't leave the sourcerer-app. Another thing that worries me is that we would have to run the classifier on every snippet; for big repositories that would take a very long time. Also, there are only 20 supported languages.
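
If runtime cost is the main concern, one mitigation would be to keep the classifier as a fallback only, so most files in a large repository never touch the model. A Kotlin sketch under that assumption (classify is a hypothetical hook, e.g. a guesslang-backed call):

// Hypothetical hybrid lookup: cheap extension match first, classifier only as fallback.
fun detectLanguage(
    extension: String,
    snippet: String,
    extensionToLanguage: Map<String, String>,
    classify: (String) -> String?
): String? = extensionToLanguage[extension] ?: classify(snippet)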

dancompton commented 5 years ago

@yaronskaya Sorry -- I have been busy with handoffs and birthday celebrations. First of all, you bring up a good point regarding the lack of support for many of the languages that Linguist supports.

You're right that guesslang only supports 20 languages, so I'm going to go ahead and implement the code in the existing framework -- look for a PR tonight or tomorrow. Linguist supports 453 types of structured textual document, so guesslang, by default, is not good enough for this use case. It is extensible, but we would need to train a model for each new language (though I see this as the parallel of creating a regex classifier for each language we need to support).

The key issue here, I believe, is summed up in this chart: https://guesslang.readthedocs.io/en/latest/_images/co-occurrence.png

Indeed, many languages co-occur in snippets of limited size, or share symbols and keywords used in contexts that might easily confuse a classifier. One example they provide:

As shown in the co-occurrence graph, one of Guesslang's limitations is the proportion of C++ files mistaken for C files. That was expected because it is OK to write pure C source code in a C++ file.

This makes perfect sense. For the particular issue that I'm addressing (adding Jsonnet support), a classifier might prove difficult to implement because Jsonnet is "just" sugared JSON; in fact, JSON is valid Jsonnet and can be used within a Jsonnet program.

In summary:

yaronskaya commented 5 years ago

@dan-compton I wonder: why not just define the jsonnet language by its extensions? That would not create false positives, since no other language uses jsonnet's extensions.