MATLAB file misidentified as Mathematica

crisluengo commented 5 years ago

Preliminary Steps

Please confirm you have...

[x] reviewed How Linguist Works,
[x] reviewed the Troubleshooting docs,
[x] considered implementing an override,
[x] verified an issue has not already been logged for your issue (linguist issues).

Problem Description

One of my many MATLAB files is being incorrectly identified as Mathematica. (Yes, I will create a .gitattributes file to fix this, this issue is to hopefully improve your heuristics.)

The current heuristic looks for the pattern \*\)$, which identifies Mathematica. If it fails, it continues with the MATLAB pattern. This one file contains this comment, which matches the pattern:

%       LCH (or L*C*H*)

Previously, Mathematica was identified by the opening comment tag: ^\s*\(\* (see #1652). That would be an illegal sequence in a MATLAB file, as far as I know, and therefore seems to be a much better test. Is there a reason this was changed?

Is matching both the opening and the closing tag too expensive?

Otherwise, I could suggest a different strategy: if a directory contains hundreds of files with the same extension, and most of them are MATLAB, it is very likely that they are all MATLAB. To me it seems highly unlikely that a repository would have, in one directory, multiple files with the same extension that are not all the same language. Such a test could help you skip the regular expression tests on many files.

URL of the affected repository:

Repository: https://github.com/DIPlib/diplib/

The affected file is: https://github.com/DIPlib/diplib/blob/master/dipimage/@dip_image/dip_image.m

Last modified on:

2018-11-20

Expected language:

MATLAB

Detected language:

Mathematica

lildude commented 5 years ago

Is there a reason this was changed?

The change came in https://github.com/github/linguist/pull/3246 made by @pchaigno which I think provides a good bit of context. Reverting will re-introduce the original problem which is likely to be more prevalent than the probably quite unique example in your file.

The only option would be to improve the regexes used for Mathematic and/or MATLAB to cater for this scenario, though this may be easier said than done - I know nothing about either language too 😄

pchaigno commented 5 years ago

Is matching both the opening and the closing tag too expensive?

Probably not, especially if we simplify the search string for the opening comment to be (*.

crisluengo commented 5 years ago

@lildude, I don't really understand the PR you linked. The regular expression to check for Mathematica is applied after it has been established that the file is not Objective C, if I understand the code correctly. In that PR there was this code:

    disambiguate ".m" do |data|
      if ObjectiveCRegex.match(data)
        Language["Objective-C"]
      elsif data.include?(":- module")
        Language["Mercury"]
      elsif /^: /.match(data)
        Language["MUF"]
      elsif /^\s*;/.match(data)
        Language["M"]
      elsif /\*\)$/.match(data)
        Language["Mathematica"]
      elsif /^\s*%/.match(data)
        Language["Matlab"]
      elsif /^\w+\s*:\s*module\s*{/.match(data)
        Language["Limbo"]
      end

This means that it doesn't matter if the regular expression matches Objective C or not. The whole issue they tried to solve there was because someone copy-pasted the regexp from here to a different project, and found it didn't work as intended. I don't see why this project needed to change because of that?

In current code the logic is the same. The Mathematica regexp needs to distinguish Mathematica files from MATLAB and Limbo. I don't know about Limbo, but (* cannot appear at the beginning of a line in a MATLAB file, it would be illegal syntax. That makes it a good test. *) can appear at the end of a comment line, even if it is rare.

The issue that is linked from that PR has this comment, which shows that *) at the end of a line can still happen in Objective C. The Objective C regexp happens to not match that particular file, fixing the Mathematica regexp because of that is the wrong solution -- you need to fix the Objective C test, because that is run first.

pchaigno commented 5 years ago

This means that it doesn't matter if the regular expression matches Objective C or not.

No. Each heuristic rule is not meant to match all files for the target language. They are, by definition, heuristic. Therefore, each heuristic rule in this list has to consider all other languages.

The Objective C regexp happens to not match that particular file, fixing the Mathematica regexp because of that is the wrong solution -- you need to fix the Objective C test, because that is run first.

No. If it's easier to address that case in the Mathematica regexp that's probably how we will address it. We don't want to create crazy regexps that match 100% of files (which is often not possible anyway).

pchaigno commented 5 years ago

Is matching both the opening and the closing tag too expensive?

Probably not, especially if we simplify the search string for the opening comment to be (*.

@crisluengo Do you want to send a pull request to implement this change?

crisluengo commented 5 years ago

@pchaigno Sure, I can do this tonight or later this week.

I propose the regexp $\*.+\*$$. It looks like comments don't need to be at the end of a line, but this refines the current heuristic to remove false positives, it doesn't need to match every single comment in the file. I'd fear that we'd add false positives if we remove the $ from the regexp (for example to match a C expression like (*(int*)ptr)).

pchaigno commented 5 years ago

I was thinking we could go for

and:
- pattern: '(\*'
- pattern: '\*)$'

to avoid the costly .+ pattern, which may cause a lot of backtracking.

crisluengo commented 5 years ago

Sounds good.

lildude commented 3 years ago

@Christoph-Lauer 🤔 Your file is being identified as Mathematica because it contains the syntax specifically identified and discussed as Mathematica in this issue... you have (* ... *) on two lines L82 and L120 so your file matches the heuristics that were implemented off the back of this issue.

From what I can find, these are valid Mathematica comments, but not MATLAB comments which uses% so it would suggest the current heuristic has correctly identified your file as Mathematica and not MATLAB.

If that really is valid MATLAB syntax, we'd be grateful for a PR to update our heuristics so this scenario can be identified correctly. Until then, you'll need to implement an override.

As an aside, this really should have been posed as a new discussion given this issue has been closed for over two years.

github-linguist / linguist