Closed: sebflaesch closed this issue 11 months ago.
The existing code samples for the Genero language should be reviewed. We can take care of this and make a pull request from the https://github.com/sebflaesch/linguist-genero fork.
Why? We don't change samples unless they're blatantly wrong and have never been correct: if they've been valid once before, there is almost certainly going to be code on GitHub using that old format/approach.

Update: if the language has evolved, new samples should be added for the new usage, but this is only really needed if the classifier is having problems identifying files.
@lildude: Why replace the sample files? The Linguist docs recommend avoiding "Hello World" samples, and we (FourJs) would like to provide state-of-the-art samples here. The original samples were (kindly) provided by one of our customers back in 2020, but we agreed that they are not perfect, contain invalid programming patterns, and can be replaced by much better samples that follow the coding standards of the Genero language (for example, using UPPERCASE for language keywords).

I am new here and do not know what the "classifier" is. If the "classifier" is based on file extensions: there are .4gl and .per file changes/replacements, and one new library.sch file, which is a generated file that can be ignored by the Linguist parsers but is mandatory to compile the provided .4gl / .per sample files with the Genero fglcomp/fglform compilers. Do you see any issue with this?
> I am new here and do not know what the "classifier" is.
https://github.com/github-linguist/linguist/blob/master/docs/how-linguist-works.md details how Linguist works, but essentially it works like a funnel, with each strategy attempting to whittle the list of languages for a specific file down to a single language. The classifier is the last step: a Bayesian classifier that is a "last resort" and the least reliable strategy. It's important to point out that Linguist considers files in isolation (think gists); the rest of the content of a repo has no impact on the language detection.

The classifier is trained using the samples, which are broken up into tokens, and these tokens are then used to analyze files individually. The samples are not used for anything else and aren't shipped with the gem (only the resulting token file), hence it's the only place we allow GPL licensed files. As we tokenise the samples, things like case, function names, styling, formatting etc. are not relevant and not considered by the classifier, so changing samples to meet "best practices" doesn't have much impact, if any, hence we rarely need to change samples.
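To give a feel for what that means in practice, here's a minimal Python sketch of a token-frequency (naive Bayes) classifier in the spirit of what's described above; it is not Linguist's actual implementation (which is Ruby), and the tokenizer here is deliberately crude:

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(source):
    # Crude stand-in for a real tokenizer: drop string and numeric literals,
    # keep identifiers/keywords and runs of punctuation.
    source = re.sub(r'"[^"]*"|\'[^\']*\'', ' ', source)
    source = re.sub(r'\b\d+(\.\d+)?\b', ' ', source)
    return re.findall(r'[A-Za-z_]\w*|[{}()\[\];.,<>=+*/-]+', source)

class SampleClassifier:
    """Trained on the per-language sample files, and nothing else."""

    def __init__(self):
        self.tokens = defaultdict(Counter)   # language -> token -> count
        self.totals = Counter()              # language -> total token count

    def train(self, language, sample_source):
        toks = tokenize(sample_source)
        self.tokens[language].update(toks)
        self.totals[language] += len(toks)

    def classify(self, source, candidate_languages):
        scores = {}
        for lang in candidate_languages:
            total = self.totals[lang] or 1
            vocab = len(self.tokens[lang]) or 1
            # Sum of Laplace-smoothed log-probabilities of each token under `lang`.
            scores[lang] = sum(
                math.log((self.tokens[lang][tok] + 1) / (total + vocab))
                for tok in tokenize(source)
            )
        return max(scores, key=scores.get)
```

Conceptually, training on the existing samples and then calling classify(new_source, ["Genero", "Genero Forms"]) is all the classifier stage does.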
If the "classifier" is based on file extensions: There are .4gl and .per file changes/replacements, and one new library.sch file, which is a generated file that can be ignored by the Linguist parsers, but is mandatory to compile the provided .4gl / .per sample files with Genero fglcomp/fglform compilers. Do you see any issue with this?
As detailed above, the classifier isn't directly based on the extensions; only the tokenized content is relevant to the classifier, and as mentioned before, each file is considered in isolation.
So with that in mind, there are a few issues:

.sch is not currently associated with Genero, so it will need to meet the usage requirements as with any new extension being added for a language, and in this case, as it is already associated with another language, we'll need a heuristic (i.e. a regex that uniquely identifies these files) too; a rough sketch follows below.

Specifically regarding:
> but is mandatory to compile the provided .4gl / .per sample files
Linguist doesn't care about this. As I mentioned before, we tokenize the samples and analyse in isolation.
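For illustration, a content-based heuristic of the kind mentioned above boils down to something like the following Python sketch. The patterns are entirely hypothetical placeholders (real Linguist heuristics are regex entries in heuristics.yml, and a real pattern would have to uniquely identify Genero .sch schema files against whatever other language already uses that extension):

```python
import re

# Hypothetical patterns for illustration only: a real heuristic needs regexes
# that genuinely and uniquely identify each language's .sch files.
SCH_HEURISTICS = [
    ("Genero", re.compile(r"hypothetical-genero-schema-marker")),
    ("SomeOtherLanguage", re.compile(r"hypothetical-other-marker")),
]

def disambiguate_sch(file_content):
    matches = [lang for lang, pattern in SCH_HEURISTICS if pattern.search(file_content)]
    # A heuristic only decides when exactly one language's pattern matches;
    # otherwise the file falls through to the (less reliable) Bayesian classifier.
    return matches[0] if len(matches) == 1 else None
```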
And after all of that, you probably don't need to do anything, as .4gl is unique to "Genero" and .per is unique to "Genero Forms", so the classifier is never used for these files. It'll only come into play if another language is added with one of these extensions in the future, and even then we try to encourage the use of heuristics for more accurate classification and only leave it to the classifier if it's not that simple.
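Putting the funnel together, an illustrative (and very simplified) Python sketch of that decision looks like this; the extension map and the two stubs are stand-ins, not Linguist's real strategy chain:

```python
EXTENSIONS = {
    ".4gl": ["Genero"],
    ".per": ["Genero Forms"],
    # A hypothetical future conflict would look like:
    # ".sch": ["SomeOtherLanguage", "Genero"],
}

def run_heuristics(extension, content, candidates):
    # Stand-in for content-based heuristics (preferred over the classifier).
    return None

def run_classifier(content, candidates):
    # Stand-in for the Bayesian classifier (the last resort).
    return candidates[0]

def detect_language(filename, content):
    extension = "." + filename.rsplit(".", 1)[-1].lower()
    candidates = EXTENSIONS.get(extension, [])
    if len(candidates) == 1:
        return candidates[0]            # unique extension: the classifier never runs
    if len(candidates) > 1:
        return (run_heuristics(extension, content, candidates)
                or run_classifier(content, candidates))
    return None

# Today both Genero extensions resolve immediately from the extension alone:
assert detect_language("orders.4gl", "MAIN\nEND MAIN") == "Genero"
assert detect_language("orders.per", "LAYOUT\nEND") == "Genero Forms"
```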
Thanks @lildude for the details. All of this is quite new to me and I need some time to digest it. Just a thought about the classifier (and maybe I am missing something):

> we tokenize the samples and analyse in isolation

I would expect the "TextMate files" to be THE reference to identify what a language's syntax is, and I would not expect other tools to have to find out what syntax/keywords correspond to a specific file extension...

> I would expect the "TextMate files" to be THE reference to identify what a language's syntax is, and I would not expect other tools to have to find out what syntax/keywords correspond to a specific file extension...
These aren't used by linguist at all. They're purely collected for the highlighting engine which is a completely independent internal service and are only used once linguist has identified the language of the file. It's more efficient to collect these grammars when the language is added than attempting to maintain the things in two separate places. It's however not very efficient to use them for language detection.
@lildude: I have reviewed my new "state of the art" samples, which contain only .4gl and .per files, placed respectively in the TOP/samples/Genero and "TOP/samples/Genero Forms" directories. Would that change be accepted as a pull request?
> Would that change be accepted as a pull request?
Maybe. You'll need to clearly explain why you're deleting the current samples. As I mentioned above, we don't remove or change samples unless they're blatantly wrong and not reflective of the language at all.
Oh, and you can update the samples and grammar in the same PR... we're not fussy about this sort of thing. We also squash commits on merge, so we don't care about the commit history; go wild.
Thank you @lildude! Finally, I think you can close this issue as a duplicate of #6629, since we want to provide a single PR.
Changes have shipped. Closing.