github-linguist / linguist

Language Savant. If your repository's language is being reported incorrectly, send us a pull request!
MIT License
12.14k stars 4.2k forks source link

Replace Genero language syntax files with official files #6629

Closed sebflaesch closed 9 months ago

sebflaesch commented 9 months ago

New official syntax language files provided by FourJs are available here: https://github.com/FourjsGenero/GeneroFgl.tmbundle

We need to replace the existing files created by hand by Bryan Jackson (@alienriver49) in 2020.

Bryan, we discussed this today and I thank you again for your valued contribution. Please just confirm here that you accept that we replace your original syntax files with ours.

Cheers! Seb

lildude commented 9 months ago

Closing as a dupe of https://github.com/github-linguist/linguist/issues/6628

lildude commented 9 months ago

Oh wait. You mean the grammar files. Re-opening, but note your repo is not accessible.

sebflaesch commented 9 months ago

@lildude #6629 and #6628 can be provided as different pull requests so please can we keep both open?

sebflaesch commented 9 months ago

@lildude I have just started to work on these and doing github forking and pull requests, so please forgive my mistakes. I will contact the author of https://github.com/FourjsGenero/GeneroFgl.tmbundle to make it public!

alienriver49 commented 9 months ago

I am good with my version of the grammars being replaced with the official grammars maintained by the author(s) of the language. Thank you again @sebflaesch!

sebflaesch commented 9 months ago

https://github.com/FourjsGenero/GeneroFgl.tmbundle is now public. We will work on a pull request.

sebflaesch commented 9 months ago

I think I need some help on this:

The new TextMate files (https://github.com/FourjsGenero/GeneroFgl.tmbundle) we would like to be used by Linguist/Github are defining new language names and identifiers (for the two distinct file types .4gl and .per), that do not match the names and ids that are in place.

Consequently, I guess that a simple "--replace" option with script/add-grammar would not work.

In such case, what is the best practice? Keep in mind that the author of the current definition (@alienriver49) is one of our customers and is ok to replace all of the existing definitions by ours.

In fact, we have currently two "language" definitions: 1) For .4gl files: "Genero" (that would change to "Genero 4gl") 2) For .per files: "Genero Forms" (that would change to "Genero Forms (per)" or a better name like "Genero per")

It is possible to do a complete drop of the existing definitions, and add again the new languages definitions with the script/add-grammar command? I could not find the instructions to drop a definition, in https://github.com/github-linguist/linguist/blob/master/CONTRIBUTING.md.

Regarding samples (issue #6628), my understanding is that the directory names under TOP/samples must match the language names, correct? If yes, we should provide a single pull request for both #6628 and #6629. And we should certainly find a better name for the .per files, because non of the existing language names I can see use ( ) parentheses...

sebflaesch commented 9 months ago

Seems that the support for Genero language was added in a single commit:

commit adc53192f33c96ab34207f37d6c085efb78699ed

So if this works it would not be a big deal to revert this commit, and add a new language definition for Genero... Right?

lildude commented 9 months ago

The new TextMate files (https://github.com/FourjsGenero/GeneroFgl.tmbundle) we would like to be used by Linguist/Github are defining new language names and identifiers (for the two distinct file types .4gl and .per), that do not match the names and ids that are in place.

Consequently, I guess that a simple "--replace" option with script/add-grammar would not work.

It most definitely does work 😁 If you stick with the instructions for replacing a grammar in the CONTRIBUTING.md file as they're written you should be good to go. You shouldn't need to do anything that isn't documented in that section.

Use the path to the current vendor directory (not path) for the --replace argument.

So if you start with the "Genero" grammar, you'd run script/add-grammar --replace genero.tmbundle https://github.com/FourjsGenero/GeneroFgl.tmbundle.

You won't need to do anything for the "Genero Forms" grammar as the script will detect the same grammar is now used for both languages and update the relevant files.

As for dropping the current definitions: don't. You can certainly rename them, but don't change the ID. This is stored and heavily cached in a lot of locations outside of the main GitHub database and is not an easy task to change or correct. Many third parties rely on this ID too. As such, once an ID is set, it is never changed and a language is never removed.

As for the language names: I'd advise against using parentheses (it should work for GitHub, but we can't guarantee elsewhere) and use the names users are most likely to know and use. For most people this is the commerical name of the language.

sebflaesch commented 9 months ago

@lildude:

By ID do you mean the "language_id" field in lib/linguist/languages.yml, that is produced by the script/update-ids?

For now, in my local fork, I could manage to revert the original commit, setup the ruby/docker env and add from scratch the new languages "Genero 4gl" (instead of "Genero") and "Genero per" (instead of "Genero Forms"), with the regular script/add-grammar process, and created new IDs with script/update-ids ...

Would it be an issue, if I reuse the IDs ("language_id") that where generated by the first commit, but using the new names "Genero 4gl" / "Genero per" of my new adding?

This would save me some more additional work, as the only change I would have to do is:

diff --git a/lib/linguist/languages.yml b/lib/linguist/languages.yml
index bc63995a..61885902 100644
--- a/lib/linguist/languages.yml
+++ b/lib/linguist/languages.yml
@@ -2278,7 +2278,7 @@ Genero 4gl:
   - ".4gl"
   tm_scope: source.genero-4gl
   ace_mode: text
-  language_id: 241301092
+  language_id: 986054050
 Genero per:
   type: programming
   color: "#d8df39"
@@ -2286,7 +2286,7 @@ Genero per:
   - ".per"
   tm_scope: source.genero-per
   ace_mode: text
-  language_id: 828945790
+  language_id: 902995658
 Genie:
   type: programming
   ace_mode: text

Seb

lildude commented 9 months ago

By ID do you mean the "language_id" field in lib/linguist/languages.yml, that is produced by the script/update-ids?

Yes.

For now, in my local fork, I could manage to revert the original commit, setup the ruby/docker env and add from scratch the new languages "Genero 4gl" (instead of "Genero") and "Genero per" (instead of "Genero Forms"), with the regular script/add-grammar process, and created new IDs with script/update-ids ...

There's no need to revert the original commit; it's really not necessary. Follow the steps for replacing the grammar, commit those changes, rename the entries as you wish, commit those changes, rename the samples directories to match the new language names, commit your changes, update the scope lines if the grammars use different scope names, push.

Several files will be changed but the only changes in the languages.yml file will be the two name lines and the scope lines if different scope name are used in the grammar.

I will reject the PR if you change the IDs as I said before: these must not change.

sebflaesch commented 9 months ago

Ok no worries: I clearly understood that the language IDs must NOT change. I was wondering if I could just reuse what I have prepared, just by using the existing IDs. But I will try to use the "replace" method and the renaming as you suggest.

sebflaesch commented 9 months ago

There's no need to revert the original commit; it's really not necessary. Follow the steps for replacing the grammar, ...

So I followed: https://github.com/github-linguist/linguist/blob/master/CONTRIBUTING.md#changing-the-source-of-a-syntax-highlighting-grammar:

$ script/add-grammar --replace Genero https://github.com/sebflaesch/GeneroFgl.tmbundle
Checking Docker is installed and running
add-grammar: Submodule '' does not exist. Aborting.

or (using the new name):

$ script/add-grammar --replace "Genero 4gl" https://github.com/sebflaesch/GeneroFgl.tmbundle
Checking Docker is installed and running
add-grammar: Submodule '' does not exist. Aborting.

What's next?

I thought I should patch the tm_scope identifiers, to match the files of the new TM repo, but that did not help (I really wonder how the "replace" process can guess what is to be replaced if these are different!)

sf@toro (issue-6629):~/genero/devel/github/linguist/linguist-genero$ git diff
diff --git a/grammars.yml b/grammars.yml
index 82904434..dd5fdb9e 100644
--- a/grammars.yml
+++ b/grammars.yml
@@ -390,8 +390,8 @@ vendor/grammars/gemfile-lock-tmlanguage:
 vendor/grammars/gemini-vscode:
 - source.gemini
 vendor/grammars/genero.tmbundle:
-- source.genero
-- source.genero-forms
+- source.genero-4gl
+- source.genero-per
 vendor/grammars/gettext.tmbundle:
 - source.po
 vendor/grammars/gnuplot-tmbundle:
diff --git a/lib/linguist/languages.yml b/lib/linguist/languages.yml
index 1becf7ad..15e77ac0 100644
--- a/lib/linguist/languages.yml
+++ b/lib/linguist/languages.yml
@@ -2276,7 +2276,7 @@ Genero:
   color: "#63408e"
   extensions:
   - ".4gl"
-  tm_scope: source.genero
+  tm_scope: source.genero-4gl
   ace_mode: text
   language_id: 986054050
 Genero Forms:
@@ -2284,7 +2284,7 @@ Genero Forms:
   color: "#d8df39"
   extensions:
   - ".per"
-  tm_scope: source.genero-forms
+  tm_scope: source.genero-per
   ace_mode: text
   language_id: 902995658
 Genie:
lildude commented 9 months ago

I gave you the exact command to run earlier 😉

This won’t update languages.yml but will take care of everything else.

lildude commented 9 months ago

Start the PR and we can continue this there as it’s best suited in the PR.

sebflaesch commented 9 months ago

Oups ok Colin, using

$ script/add-grammar --replace genero.tmbundle ...

worked.

Before doing any PR, I can share the testing fork I am working on, if you like: https://github.com/sebflaesch/linguist-genero/tree/issue-6629 That's just for testing/preparing the PR.

I need also to sync with the author of https://github.com/FourjsGenero/GeneroFgl.tmbundle This is the real repo to be eventually used, and the names are not yet correct, so I did also a fork: https://github.com/sebflaesch/GeneroFgl.tmbundle

Thanks for your assistance!

sebflaesch commented 9 months ago

@lildude : I have worked on changes, but this is only for validation! please do not apply!

Would you mind check the changes I have prepared and tell me if something is missing? Commits are visible in my testing fork: https://github.com/github-linguist/linguist/compare/master...sebflaesch:linguist-genero:issue-6629

Note that there is NO license file (our TM repo has "The Unlicense") - is this an issue? Should it be an MIT license?

lildude commented 9 months ago

Diff looks good from a quick glance.

Note that there is NO license file (our TM repo has "The Unlicense") - is this an issue? Should it be an MIT license?

We need a license file, even for the The Unlicense. The text from https://choosealicense.com/licenses/unlicense/ in a LICENSE file will do the trick.

sebflaesch commented 9 months ago

Sorry I was not clear: We have that "unlicense" text: https://github.com/FourjsGenero/GeneroFgl.tmbundle/blob/master/LICENSE.txt PR will come after some renaming in https://github.com/FourjsGenero/GeneroFgl.tmbundle/tree/master. Thanks a lot @lildude !

sebflaesch commented 9 months ago

@lildude : How can I check that my local fork of linguist detects the new "Genero 4gl" and "Genero per" languages replacing the existing languages "Genero" and "Genero Forms"?

When using the github-linguist command, I still see the old languages... What I am missing here?

sf@toro (issue-6629):~/genero/devel/github/linguist/linguist-genero$ bin/github-linguist "samples/Genero 4gl/books1.4gl" 
samples/Genero 4gl/books1.4gl: 183 lines (140 sloc)
  type:      Text
  mime type: text/plain
  language:  Genero

sf@toro (issue-6629):~/genero/devel/github/linguist/linguist-genero$ bin/github-linguist "samples/Genero per/books1.per" 
samples/Genero per/books1.per: 38 lines (34 sloc)
  type:      Text
  mime type: text/plain
  language:  Genero Forms

Is this because I am on a branch (issue-6629) of my repo that is a fork of linguist?

sf@toro (issue-6629):~/genero/devel/github/linguist/linguist-genero$ git status
On branch issue-6629
Your branch is up to date with 'origin/issue-6629'.

I have also tried git-linguist command:

sf@toro (issue-6629):~/genero/devel/github/linguist/linguist-genero$ bin/git-linguist --commit=84b3fe0c8584ca64102caf1c621982b7820f358f -f breakdown
{"Dockerfile":[".devcontainer/Dockerfile","Dockerfile","tools/grammars/Dockerfile"],"Shell":[".devcontainer/onCreateCommand.sh","script/add-grammar","script/bootstrap","script/build-grammars-tarball","script/cibuild","script/grammar-compiler","tools/grammars/docker/build"],"Ruby":["Brewfile","Gemfile","Rakefile","bin/git-linguist","bin/github-linguist","ext/linguist/extconf.rb","github-linguist.gemspec","lib/linguist.rb","lib/linguist/blob.rb","lib/linguist/blob_helper.rb","lib/linguist/classifier.rb","lib/linguist/file_blob.rb","lib/linguist/generated.rb","lib/linguist/grammars.rb","lib/linguist/heuristics.rb","lib/linguist/language.rb","lib/linguist/lazy_blob.rb","lib/linguist/repository.rb","lib/linguist/samples.rb","lib/linguist/sha256.rb","lib/linguist/shebang.rb","lib/linguist/strategy/extension.rb","lib/linguist/strategy/filename.rb","lib/linguist/strategy/manpage.rb","lib/linguist/strategy/modeline.rb","lib/linguist/strategy/xml.rb","lib/linguist/tokenizer.rb","lib/linguist/version.rb","script/cross-validation","script/fast-submodule-update","script/list-grammars","script/normalise-url","script/sort-submodules","script/update-ids","test/helper.rb","test/test_blob.rb","test/test_classifier.rb","test/test_file_blob.rb","test/test_generated.rb","test/test_grammars.rb","test/test_heuristics.rb","test/test_instrumentation.rb","test/test_language.rb","test/test_pedantic.rb","test/test_repository.rb","test/test_samples.rb","test/test_sha256.rb","test/test_strategies.rb","test/test_tokenizer.rb"],"C":["ext/linguist/lex.linguist_yy.c","ext/linguist/lex.linguist_yy.h","ext/linguist/linguist.c"],"Lex":["ext/linguist/tokenizer.l"],"Go":["tools/grammars/cmd/grammar-compiler/main.go","tools/grammars/compiler/converter.go","tools/grammars/compiler/cson.go","tools/grammars/compiler/data.go","tools/grammars/compiler/errors.go","tools/grammars/compiler/loader.go","tools/grammars/compiler/loader_fs.go","tools/grammars/compiler/loader_url.go","tools/grammars/compiler/pcre.go","tools/grammars/compiler/pcre_test.go","tools/grammars/compiler/proto.go","tools/grammars/compiler/walker.go","tools/grammars/pcre/pcre.go"]}
lildude commented 9 months ago

@sebflaesch It looks like you've not updated the names in the languages.yml file.

sebflaesch commented 9 months ago

@lildude quite confusing... Was the script/add-grammar --replace genero.tmbundle ... command not supposed to do that? Do I have to update the lib/linguist/languages.yml file before/after executing that command?

lildude commented 9 months ago

Was the script/add-grammar --replace genero.tmbundle ... command not supposed to do that?

Nope. That script only changes the grammar submodule and related vendor/README.md and cached license files.

That's why I said:

Follow the steps for replacing the grammar, commit those changes, rename the entries as you wish, commit those changes, rename the samples directories to match the new language names, commit your changes, update the scope lines if the grammars use different scope names, push.

Do I have to update the lib/linguist/languages.yml file before/after executing that command?

It doesn't matter. It only needs to be done once.

sebflaesch commented 9 months ago

OK I did the renaming in languages.yml, I have pushed it to my fortk/branch and the change is visible here: https://github.com/github-linguist/linguist/commit/313bef7041be8a03d86ba4caa8366032d68d011b

Now when I run the github-linguist command, how can I see my "Genero 4gl" and "Genero per" languages?

sf@toro (issue-6629):~/genero/devel/github/linguist/linguist-genero$ ./bin/github-linguist 
67.14%  280830     Ruby
23.11%  96639      C
6.60%   27621      Go
1.60%   6695       Lex
1.25%   5216       Shell
0.30%   1258       Dockerfile

My guess is that it shows only the top most matching languages, but obviously, in this linguist repo, there are very few .4gl and .per sources so it would be something like 0.01% or less...

I have also tried:

sf@toro (issue-6629):~/genero/devel/github/linguist/linguist-genero$ bin/git-linguist stats -c 8f3d8d86e9fb30c23ede2c5fc83c2925870ac8d6
{"Dockerfile":1258,"Shell":5216,"Ruby":280830,"C":96639,"Lex":6695,"Go":27621}

I have executed the tests:

sf@toro (issue-6629):~/genero/devel/github/linguist/linguist-genero$ bundle exec rake test
...
  7) Error:
TestRepository#test_linguist_override_generated?:
Rugged::OdbError: object not found - no match for id (01d6b9c637a7a6581fe456c600725b68f355b295)
    /home/sf/genero/devel/github/linguist/linguist-genero/lib/linguist/repository.rb:128:in `lookup'
    /home/sf/genero/devel/github/linguist/linguist-genero/lib/linguist/repository.rb:128:in `current_tree'
    /home/sf/genero/devel/github/linguist/linguist-genero/lib/linguist/repository.rb:123:in `read_index'
    /home/sf/genero/devel/github/linguist/linguist-genero/test/test_repository.rb:137:in `test_linguist_override_generated?'

1727 runs, 35032 assertions, 0 failures, 7 errors, 1 skips
sebflaesch commented 9 months ago

"Genero 4gl" / "Genero per" does not show up when running:


bundle exec script/cross-validation --test
...
GLSL/recurse1.frag GOOD
GLSL/recurse1.fs GOOD
Genie/Class.gs GOOD
Genie/Hello.gs GOOD
Genie/IDataLoader.gs GOOD
Genie/web.gs BAD (JavaScript)
Gerber Image/LIDARLite.ncl GOOD
Gnuplot/defense_plotter.plt GOOD
...
lildude commented 9 months ago

Please open a PR. This discussion should be happening in the PR, not the issue. It's also a lot easier to see and also gives us constant confirmation of behaviour via the tests.

sebflaesch commented 9 months ago

@lildude here is the PR: https://github.com/github-linguist/linguist/pull/6632 Sorry for all the noise, I am new at this and wanted to make sure that the PR is good enough. Thanks a lot for you help!

lildude commented 9 months ago

Now when I run the github-linguist command, how can I see my "Genero 4gl" and "Genero per" languages?

This is because Linguist does not count the samples directory and that's where all your files are in the linguist repo:

https://github.com/github-linguist/linguist/blob/6a9a3e413a0b07d4560978a98da0bd4d68ee1f77/lib/linguist/documentation.yml#L33-L34

I have executed the tests:

You need to fetch the test/attributes branch. This is fetched if you use the devcontainer or Codespace.

lildude commented 9 months ago

Sorry for all the noise, I am new at this and wanted to make sure that the PR is good enough.

No probs. Noisy PRs are fine as this is where all the dev work is taking place. Context switching between issue and code makes this a lot harder to track.

lildude commented 9 months ago

Changes have shipped. Closing.

sebflaesch commented 9 months ago

@lildude : Seems that something is not working... I see "Other" language instead of "Genero 4gl" or "Genero per", in Genero repositories such as: https://github.com/FourjsGenero/ex_checkbox_tree Should I open another issue?

lildude commented 9 months ago

Should I open another issue?

No need. This is because the language has been renamed but the cache not invalidated. We don't invalidate the cache when making changes via Linguist as this would be incredibly resource intensive given the number of repositories involved. Pushing a new change will force a re-analysis and things should reflect the new name.

You can see the result of this in my fork (I'll delete this later) in which I edited the README to force a reanalysis:

New languages

sebflaesch commented 9 months ago

ok got it! thanks a lot!

sebflaesch commented 9 months ago

Any trick to force a Linguist re-analysis of sources without pushing changes into a given repo?

lildude commented 9 months ago

Any trick to force a Linguist re-analysis of sources without pushing changes into a given repo?

Nope. This is by design as per the same reasoning for not invalidating the caches.