Closed crisluengo closed 5 years ago
Is there a reason this was changed?
The change came in https://github.com/github/linguist/pull/3246 made by @pchaigno which I think provides a good bit of context. Reverting will re-introduce the original problem which is likely to be more prevalent than the probably quite unique example in your file.
The only option would be to improve the regexes used for Mathematic and/or MATLAB to cater for this scenario, though this may be easier said than done - I know nothing about either language too 😄
Is matching both the opening and the closing tag too expensive?
Probably not, especially if we simplify the search string for the opening comment to be (*
.
@lildude, I don't really understand the PR you linked. The regular expression to check for Mathematica is applied after it has been established that the file is not Objective C, if I understand the code correctly. In that PR there was this code:
disambiguate ".m" do |data|
if ObjectiveCRegex.match(data)
Language["Objective-C"]
elsif data.include?(":- module")
Language["Mercury"]
elsif /^: /.match(data)
Language["MUF"]
elsif /^\s*;/.match(data)
Language["M"]
elsif /\*\)$/.match(data)
Language["Mathematica"]
elsif /^\s*%/.match(data)
Language["Matlab"]
elsif /^\w+\s*:\s*module\s*{/.match(data)
Language["Limbo"]
end
This means that it doesn't matter if the regular expression matches Objective C or not. The whole issue they tried to solve there was because someone copy-pasted the regexp from here to a different project, and found it didn't work as intended. I don't see why this project needed to change because of that?
In current code the logic is the same. The Mathematica regexp needs to distinguish Mathematica files from MATLAB and Limbo. I don't know about Limbo, but (*
cannot appear at the beginning of a line in a MATLAB file, it would be illegal syntax. That makes it a good test. *)
can appear at the end of a comment line, even if it is rare.
The issue that is linked from that PR has this comment, which shows that *)
at the end of a line can still happen in Objective C. The Objective C regexp happens to not match that particular file, fixing the Mathematica regexp because of that is the wrong solution -- you need to fix the Objective C test, because that is run first.
This means that it doesn't matter if the regular expression matches Objective C or not.
No. Each heuristic rule is not meant to match all files for the target language. They are, by definition, heuristic. Therefore, each heuristic rule in this list has to consider all other languages.
The Objective C regexp happens to not match that particular file, fixing the Mathematica regexp because of that is the wrong solution -- you need to fix the Objective C test, because that is run first.
No. If it's easier to address that case in the Mathematica regexp that's probably how we will address it. We don't want to create crazy regexps that match 100% of files (which is often not possible anyway).
Is matching both the opening and the closing tag too expensive?
Probably not, especially if we simplify the search string for the opening comment to be
(*
.
@crisluengo Do you want to send a pull request to implement this change?
@pchaigno Sure, I can do this tonight or later this week.
I propose the regexp \(\*.+\*\)$
. It looks like comments don't need to be at the end of a line, but this refines the current heuristic to remove false positives, it doesn't need to match every single comment in the file. I'd fear that we'd add false positives if we remove the $
from the regexp (for example to match a C expression like (*(int*)ptr)
).
I was thinking we could go for
and:
- pattern: '(\*'
- pattern: '\*)$'
to avoid the costly .+
pattern, which may cause a lot of backtracking.
Sounds good.
@Christoph-Lauer 🤔 Your file is being identified as Mathematica because it contains the syntax specifically identified and discussed as Mathematica in this issue... you have (* ... *)
on two lines L82 and L120 so your file matches the heuristics that were implemented off the back of this issue.
From what I can find, these are valid Mathematica comments, but not MATLAB comments which uses%
so it would suggest the current heuristic has correctly identified your file as Mathematica and not MATLAB.
If that really is valid MATLAB syntax, we'd be grateful for a PR to update our heuristics so this scenario can be identified correctly. Until then, you'll need to implement an override.
As an aside, this really should have been posed as a new discussion given this issue has been closed for over two years.
Preliminary Steps
Please confirm you have...
Problem Description
One of my many MATLAB files is being incorrectly identified as Mathematica. (Yes, I will create a
.gitattributes
file to fix this, this issue is to hopefully improve your heuristics.)The current heuristic looks for the pattern
\*\)$
, which identifies Mathematica. If it fails, it continues with the MATLAB pattern. This one file contains this comment, which matches the pattern:Previously, Mathematica was identified by the opening comment tag:
^\s*\(\*
(see #1652). That would be an illegal sequence in a MATLAB file, as far as I know, and therefore seems to be a much better test. Is there a reason this was changed?Is matching both the opening and the closing tag too expensive?
Otherwise, I could suggest a different strategy: if a directory contains hundreds of files with the same extension, and most of them are MATLAB, it is very likely that they are all MATLAB. To me it seems highly unlikely that a repository would have, in one directory, multiple files with the same extension that are not all the same language. Such a test could help you skip the regular expression tests on many files.
URL of the affected repository:
Repository: https://github.com/DIPlib/diplib/
The affected file is: https://github.com/DIPlib/diplib/blob/master/dipimage/@dip_image/dip_image.m
Last modified on:
2018-11-20
Expected language:
MATLAB
Detected language:
Mathematica