github-linguist / linguist

Language Savant. If your repository's language is being reported incorrectly, send us a pull request!
MIT License
BASIC is incorrectly labeled as Visual Basic .NET in many repositories #5156

Closed. telnet23 closed this issue 3 years ago.

telnet23 commented 3 years ago

Problem Description

URL of the affected repository:

https://github.com/search?q=10+20+30+GOTO+extension%3Abas&type=Code

While not every BASIC program begins the line numbering at 10, increments the line numbering by 10, and contains the GOTO statement, the search does return numerous BASIC programs that have been incorrectly labeled as Visual Basic .NET.

Of the roughly 13,000 search results, it appears that the majority are BASIC programs that have been incorrectly labeled as Visual Basic .NET. The presence of line numbering is a dead giveaway that these are not Visual Basic .NET programs.

Last modified on:

Today

Expected language:

"BASIC"

It is true that there are numerous dialects of BASIC (e.g. Applesoft, Commodore, GW-BASIC), but most dialects share similar syntax and reserved words, and all dialects are collectively known as BASIC. At the very least, a file with the extension .bas and line numbering should be labeled as BASIC rather than Visual Basic .NET, which is clearly incorrect.

Detected language:

"Visual Basic .NET"

lildude commented 3 years ago

> While not every BASIC program begins the line numbering at 10, increments the line numbering by 10, and contains the GOTO statement, the search does return numerous BASIC programs that have been incorrectly labeled as Visual Basic .NET.
>
> Of the roughly 13,000 search results, it appears that the majority are BASIC programs that have been incorrectly labeled as Visual Basic .NET.

The language you're seeing in the search results comes from the cached results of an analysis performed before https://github.com/github/linguist/pull/4725 was merged and deployed way back in 2019.

Thanks to that PR, the .bas extension is now only associated with VBA:

https://github.com/github/linguist/blob/9eb9472be957108fc48b6c9f725f2100c18b7a5e/lib/linguist/languages.yml#L5837-L5845

The changes in that PR were deployed early last year. However, we do not go back and reanalyse every repository that could be affected, as this is incredibly resource intensive. These repositories will continue to report the old language until their files are updated, which forces re-analysis and cache invalidation.

> At the very least, a file with the extension .bas and line numbering should be labeled as BASIC rather than Visual Basic .NET, which is clearly incorrect.

Linguist doesn't currently know about "BASIC", but it does know about several other variants. Once https://github.com/github/linguist/pull/4998 has been finished and merged, it'll know about FreeBasic too and will have the .bas extension associated with that language as well.

I have no idea how BASIC differs from the other forms Linguist knows about or FreeBasic, but if it's distinctly different and ideally easily distinguishable via a regex for the heuristics, we'd welcome a PR adding support.
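As a rough illustration of the kind of regex check discussed here, the sketch below flags source text in which lines begin with a line number followed by a common classic-BASIC keyword. It is not Linguist's heuristics format or code: the pattern, the keyword list, the function name `looks_like_line_numbered_basic`, and the `min_matches` threshold are all assumptions made for this example, and a real heuristic would need testing against actual samples of each .bas language.

```python
import re

# Hypothetical pattern (not taken from Linguist): a line that starts with a
# line number followed by a keyword common to many classic BASIC dialects.
LINE_NUMBERED_BASIC = re.compile(
    r"^\s*\d+\s+(?:REM|PRINT|GOTO|GOSUB|LET|IF|FOR|NEXT|INPUT|DIM|END)\b",
    re.IGNORECASE | re.MULTILINE,
)


def looks_like_line_numbered_basic(source: str, min_matches: int = 2) -> bool:
    """Return True if at least `min_matches` lines look like numbered BASIC.

    `min_matches` is an assumed threshold, chosen so that a single stray
    numbered line does not decide the classification on its own.
    """
    return len(LINE_NUMBERED_BASIC.findall(source)) >= min_matches


classic_basic = '10 PRINT "HELLO"\n20 GOTO 10\n'
vb_net = (
    "Module Hello\n"
    "    Sub Main()\n"
    '        Console.WriteLine("Hello")\n'
    "    End Sub\n"
    "End Module\n"
)

print(looks_like_line_numbered_basic(classic_basic))
print(looks_like_line_numbered_basic(vb_net))
```

Requiring a keyword after the number, rather than matching any leading digits, is a deliberate choice to reduce false positives from data files or numbered lists that happen to use the .bas extension.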

telnet23 commented 3 years ago

@lildude Thanks for the information. I've opened a pull request at #5166.