github-linguist / linguist

Language Savant. If your repository's language is being reported incorrectly, send us a pull request!
MIT License

Oberon source files are erroneously classified as Modula-2 and/or Shell #3888

Closed btreut closed 5 years ago

btreut commented 6 years ago

All Oberon source files are incorrectly classified as Modula-2; see, for example, Andreas Pirklbauer's repositories here: https://github.com/andreaspirklbauer

There are also other repositories where Oberon is misclassified as Shell: https://github.com/AlexIljin/OPCL

Admittedly, Oberon is a language in the tradition of Pascal and Modula-2, but it is a different language; see: http://www.inr.ac.ru/~info21/pdf/Modula-Oberon-June-2007.pdf (or: https://dl.acm.org/citation.cfm?id=1238847&dl=ACM&coll=DL&CFID=825109146&CFTOKEN=75438417 and https://dl.acm.org/ft_gateway.cfm?id=1238847&type=ppt&path=%2F1240000%2F1238847%2Fsupp%2FModula%202%20Oberon%2Eppt&supp=1&dwn=1&CFID=825109146&CFTOKEN=75438417).

Currently I do not have enough understanding of linguist to add it myself, but I might offer help and more information, e.g. source code samples, language definition in EBNF, and/or a parser.

regards

Bernhard Treutwein

pchaigno commented 6 years ago

Please have a look at the guidelines to add support for a new language to Linguist. If Oberon meets the requirement on in-the-wild usage and you can provide the necessary information, I can open the pull request for you.

btreut commented 6 years ago

I had a look at the Guidelines before filing the issue. I think that Oberon meets the requirement on in-the-wild usage.

What do you need?

pchaigno commented 6 years ago

A color, a grammar, a link to a GitHub search result showing that it meets the requirement on in-the-wild usage, links to sample files with permissive licenses, the list of file extensions, etc.

btreut commented 6 years ago

Here is a grammar: https://github.com/PhilippeSigaud/Pegged/blob/master/pegged/examples/oberon2.d

More info here: https://en.wikipedia.org/wiki/Oberon_(programming_language)

An EBNF grammar of Oberon can be found here: http://www.ethoberon.ethz.ch/EBNF.html. The EBNF of Oberon-2 (extracted from: http://www.ssw.uni-linz.ac.at/Research/Papers/Oberon2.pdf) is appended here: Oberon-2_EBNF.txt

The file extension for source files is generally .Mod or .mod, although any extension might be encountered.

and a bunch of projects on github using Oberon:

lildude commented 6 years ago

If you’ve got the bandwidth, we’d happily accept a PR implementing this.

btreut commented 6 years ago

Unfortunately, I have no spare time to invest, and the following has not changed at all (citing from my original post):

Currently I do not have enough understanding of linguist to add it myself, but I might offer help and more information, e.g. source code samples, language definition in EBNF, and/or a parser.

pchaigno commented 6 years ago

If Oberon meets the requirement on in-the-wild usage and you can provide the necessary information, I can open the pull request for you.

My offer still stands :smiley:

GitHub only supports TextMate, Sublime Text, and Atom grammars for syntax highlighting. The search result link for in-the-wild usage is needed (see this example with placeholders, we can count the number of repositories from the number of files ourselves). For the sample files, we prefer if they have a permissive license; these might work.

btreut commented 6 years ago

My offer still stands ...

Oops, I provided links to grammars and appended an EBNF grammar on Nov. 30, 2017. Apparently you wanted something different; what should I provide?

pchaigno commented 6 years ago

The grammar can only be a TextMate, Sublime Text, or Atom grammar, and, as I wrote above, the search result link for in-the-wild usage is required (a short list of repositories won't do).

pchaigno commented 6 years ago

Flagging as stale.

btreut commented 6 years ago

This search results in 94 hits. On the first two result pages, with 20 hits each, only four repos do not contain Oberon sources (gorilla-cpm, jpoial/forth, sblendorio/mod-xterm-cpm and OS2World/UTIL-DISK-Compress).

I currently don't have any idea how I would exclude results where the keywords IMPLEMENTATION or DEFINITION are present. That would narrow the search results to Oberon sources.

pchaigno commented 6 years ago

Like this search query example, but for each extension. We'll then count the number of repositories in those search results with another tool. You'll want to use Oberon-specific keywords to limit the search results to Oberon files.

btreut commented 6 years ago

thanks.

But hmm, I am not aware of any Oberon-specific keywords that are not also present in Modula-2. Some have slightly different meanings, and some Modula-2 keywords do not exist in Oberon. I have to think about this more.

pchaigno commented 6 years ago

Without a way to distinguish the two languages, it's going to be very difficult to estimate their usage on GitHub.com or to classify .mod files... Until we have such a discriminator, I'd recommend we close this issue since, as far as I understand, the current misclassification doesn't break syntax highlighting.

btreut commented 6 years ago

But then it will appear that Modula-2 is still a living language, which I seriously doubt.

There is a lot of legacy Modula-2 software, which may be given a second life on GitHub (see e.g. parts of OS2World), but I don't think there is any significant new development going on in Modula-2. I have no idea how you could differentiate between legacy code and code in current use.

btreut commented 6 years ago

Paul Reed suggested on the Oberon mailing list to change the classification from Modula-2 to Modula-2/Oberon.

btreut commented 6 years ago

Until we have such a discriminator

I think I've found two important discriminators between Modula-2 and Oberon, but I still have to think about how these can be formulated as regexes.

I can state these first two discriminators in plain words:

  1. Any source file containing the keywords DEFINITION or IMPLEMENTATION is definitely Modula-2 and not Oberon.

  2. Every source file containing a construct RECORD (identifier) ... END is definitely Oberon.

Of course, there may be any amount of whitespace (including comments of the form (* ... *)) between RECORD and (id).

pchaigno commented 6 years ago

For the second one, what if the file only contains RECORD identifier? Would that be enough to identify it as Oberon?

btreut commented 6 years ago

This search definitely finds Component Pascal, yet another Oberon dialect. Most probably these files are classified as binaries, but they are in a kind of rich-text format that is quite common in the Oberon world.

The next search also finds Oberon files: Extension Ob2

btreut commented 6 years ago

... what if the file only contains RECORD identifier? Would that be enough to identify it as Oberon?

No, not at all. The construct RECORD ... END is legal Modula-2.

The construct RECORD (father-record) ... END is the Oberon version of inheritance, which does not exist in Modula-2.

Both constructs are the equivalent of a C struct declaration. The Oberon form inherits all fields from its father record. Of course, RECORD ... END is legal Oberon too, but here Modula-2 and Oberon are once again too similar to tell apart.
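
The two discriminators discussed in this thread could be sketched roughly as follows. This is only an illustration in Python, not Linguist's actual heuristics; the names strip_comments and classify_mod_source are hypothetical, and the sketch assumes upper-case keywords and ignores string literals that happen to contain comment delimiters.

```python
import re
from typing import Optional

# Rule 1 (from the thread): DEFINITION / IMPLEMENTATION only occur in Modula-2.
MODULA2_RE = re.compile(r"\b(?:DEFINITION|IMPLEMENTATION)\b")

# Rule 2 (from the thread): RECORD (BaseType) ... END is Oberon type extension.
# The base type may be qualified, e.g. RECORD (Objects.Object).
OBERON_RE = re.compile(
    r"\bRECORD\s*\(\s*[A-Za-z][A-Za-z0-9_]*(?:\.[A-Za-z][A-Za-z0-9_]*)?\s*\)"
)


def strip_comments(source: str) -> str:
    """Replace each (* ... *) comment, which may nest, with a single space."""
    out, depth, i = [], 0, 0
    while i < len(source):
        pair = source[i:i + 2]
        if pair == "(*":
            depth += 1
            i += 2
        elif pair == "*)" and depth > 0:
            depth -= 1
            i += 2
            if depth == 0:
                out.append(" ")
        else:
            if depth == 0:
                out.append(source[i])
            i += 1
    return "".join(out)


def classify_mod_source(source: str) -> Optional[str]:
    """Return "Modula-2", "Oberon", or None when neither rule applies."""
    code = strip_comments(source)
    if MODULA2_RE.search(code):
        return "Modula-2"
    if OBERON_RE.search(code):
        return "Oberon"
    # A plain RECORD ... END is legal in both languages, so stay undecided.
    return None
```

Comments are removed first so that a keyword mentioned inside a comment does not trigger rule 1, and so that a comment between RECORD and the parenthesized base type does not hide rule 2. Files that match neither rule remain ambiguous, which mirrors the point made above about RECORD ... END.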

stale[bot] commented 5 years ago

This issue has been automatically closed because it has not had activity in a long time. Please feel free to reopen it or create a new issue.