github-linguist / linguist

Language Savant. If your repository's language is being reported incorrectly, send us a pull request!
MIT License

Delphi code misidentified as Pascal #905

Closed Faq closed 9 years ago

Faq commented 10 years ago

It seems all Delphi projects show up as Pascal now. Examples: https://github.com/Faq/TXTGenerator https://github.com/Gurux/Gurux.DLMS.Delphi

EarlGlynn commented 10 years ago

If all Delphi code must be marked as Pascal, please change all references to Linux on GitHub to UNIX.

pchaigno commented 10 years ago

Is there a simple way to differentiate Pascal and Delphi based on the code source that could be used to make a heuristic?

EarlGlynn commented 10 years ago

A "Delphi project" requires a .dpr or .dproj (newer versions of Delphi), which is a "make" file for a Delphi project. I believe either command line or GUI programs should have one of these in the same folder as other .pas source code files.

All Delphi GUI programs would have one or more .dfm (Delphi form) files, but a Delphi command line utility would not.

So perhaps a directory having .pas files, but no .dpr/.dproj or no .dfm files could be labeled "Pascal" but otherwise could be labeled as a "Delphi" project if these files were present.
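A minimal sketch of that directory-level rule, assuming the classifier can see every file name in a folder (which, as noted later in the thread, Linguist could not do at the time). The function name and the returned labels are hypothetical and are not part of Linguist.

```python
from pathlib import Path

# Marker extensions taken from the description above.
DELPHI_MARKERS = {".dpr", ".dproj", ".dfm"}


def classify_directory(filenames):
    """Label a folder "Delphi", "Pascal", or None based only on file names."""
    suffixes = {Path(name).suffix.lower() for name in filenames}
    if suffixes & DELPHI_MARKERS:
        # Any project or form file marks the whole folder as Delphi.
        return "Delphi"
    if ".pas" in suffixes:
        # Pascal sources with no Delphi project or form files.
        return "Pascal"
    return None
```

For example, `classify_directory(["Main.pas", "App.dpr"])` takes the Delphi branch, while a folder containing only `.pas` files falls through to "Pascal".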

Here is some additional info about Delphi-related file types and source control: http://stackoverflow.com/questions/438414/delphi-file-types

When the heuristics fail, why not provide a way for the owner to specify the type from a controlled list?

pchaigno commented 10 years ago

I don't think linguist is capable, for the moment, of working on a folder rather than on a file. Thus, a heuristic would have to be something in the code.

> When the heuristics fail, why not provide a way for the owner to specify the type from a controlled list?

I don't know but it's a recurring question, maybe @arfon could explain...

EarlGlynn commented 10 years ago

If the label "Pascal" is attached to the whole repository, I'd argue it should reflect the whole repository, not just a few files.

I'd rather see "unknown" on my Delphi repository instead of "Pascal". There are many flavors of Pascal: somewhere I have UCSD Pascal and Turbo Pascal examples. If allowed, I'd label them "UCSD Pascal" and "Turbo Pascal", not just "Pascal", if I ever put them on GitHub.

Why not let the person that knows put the correct label on the repository instead of using a forced heuristic? A machine learning algorithm could use these known classifications to look for misclassified ones.

I was surprised I didn't get to classify the Delphi repository when I uploaded it. I'd argue the "Pascal" label is somewhere between "misleading" and "wrong".

arfon commented 10 years ago

@pchaigno @EarlGlynn I think some heuristics would work best here.

Are there any good ways to disambiguate between the two languages by syntax or file extensions?

EarlGlynn commented 10 years ago

I gave a description of Delphi-related file extensions above. The link discussed Delphi file extensions for source control, which could be used to classify a set of files in a repository as a Delphi project.

We may need to agree to disagree on approach here. Perhaps heuristics can often classify individual files, but I don't understand why a heuristic classifying a repository is better than the person developing the code.

Since Delphi was introduced, ~1995, I don't remember ever searching for "Pascal" to find something that is "Delphi". Delphi grew out of Turbo Pascal, not simply "Pascal."

Why is the author not qualified to make the classification? Why isn't the heuristic applied only when the author fails to specify one?

I don't understand why the category assigned to the repository for my GitHub page is "JavaScript". I won't be publishing anything about JavaScript there, yet I don't see how to change the assignment.

dbohdan commented 9 years ago

Module imports such as

uses SysUtils;

are one way to tell Turbo Pascal, Free Pascal and Delphi apart from other Pascal dialects. However, I don't think you can reliably differentiate between Free Pascal ("FP"), Turbo Pascal ("TP") and Delphi themselves when operating on a single .pas file; a lot of FP code is valid TP/Delphi code and vice versa.
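As a rough sketch of the first point, a `uses` clause can be found with a regular expression. This assumes that comments and string literals do not contain text that looks like a `uses` clause; the function names are hypothetical. It only signals a unit-based dialect and does not separate TP, FP and Delphi from each other.

```python
import re

# Captures the comma-separated unit list of a "uses" clause.
# Comments and string literals are not stripped, so false positives are possible.
USES_CLAUSE = re.compile(r"\buses\s+([^;]+);", re.IGNORECASE)


def imported_units(source):
    """Return the lower-cased unit names found in all uses clauses."""
    units = []
    for clause in USES_CLAUSE.findall(source):
        for entry in clause.split(","):
            words = entry.split()
            if words:
                # Keep only the unit name; .dpr files may append "in 'path'".
                units.append(words[0].lower())
    return units


def has_unit_imports(source):
    """True if the source imports at least one unit."""
    return bool(imported_units(source))
```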

One difference between FP and Delphi that you could leverage is the encoding for Unicode strings: Delphi uses UTF-16 while Free Pascal uses UTF-8. You could detect the encoding using character frequency analysis, for which there are existing libraries. Of course, files without any Unicode strings would remain ambiguous.
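A simplified sketch of the encoding check, assuming the premise above holds for the files in question. Instead of full character frequency analysis it looks only at byte order marks and the share of NUL bytes, which is high when mostly-ASCII text is stored as UTF-16. The function name and the 0.3 threshold are arbitrary choices for illustration.

```python
import codecs


def guess_encoding(data: bytes):
    """Return "utf-16", "utf-8", or None when the bytes are ambiguous."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    if not data:
        return None
    # ASCII-heavy text encoded as UTF-16 has a NUL in roughly every other byte.
    if data.count(0) / len(data) > 0.3:
        return "utf-16"
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return None
    # Pure ASCII decodes as UTF-8 too, so it tells us nothing.
    return "utf-8" if any(byte >= 0x80 for byte in data) else None
```

Files that contain only ASCII come back as `None`, which matches the caveat that files without any Unicode strings remain ambiguous.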

Personally, I wouldn't mind if TP/FP/Delphi code was identified as a single category distinct from plain Pascal, although I can't think of a good name for it. Edit: Actually, I can: "Object Pascal".

arfon commented 9 years ago

We have a couple of different ways to override language detection now in Linguist which is probably as good as we can do right now. Please take a look at these over here: https://github.com/github/linguist#overrides

EarlGlynn commented 9 years ago

I really do not understand this thread. Why are there summaries by language of GitHub repositories when you refuse to put the right label on the repository? You even refuse to let the authors put the right labels on the repository. This "resolution" simply does not make any sense to me.

nunopicado commented 8 years ago

Still nothing on this? arfon's suggestion was tried, but does not work. If I add

*.pas linguist-language=Delphi

to the .gitattributes file, the language changes from Pascal to Component Pascal, a completely different language that is neither Pascal nor Object Pascal/Delphi.

It's funny how people use GitHub statistics to argue that Delphi does not show up among the top languages used today, when GitHub doesn't even recognize the existence of the language.

Would it be so hard to make it work correctly? I don't mind having to add a .gitattributes file, if it would only work!

arfon commented 8 years ago

@nunopicado - this is because delphi is listed as an alias for Component Pascal.

Probably the best thing we can do right now is to list Delphi as a language in languages.yml but not add any extensions, as we still don't have a way to reliably identify a Delphi project on a per-file basis.

Doing this would allow the overrides to work (*.pas linguist-language=Delphi) but little else :-\
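To illustrate why the override currently lands on Component Pascal, here is a toy model of name-or-alias lookup. The dictionaries are hypothetical stand-ins for entries in languages.yml and `find_language` is not Linguist's actual (Ruby) implementation.

```python
def find_language(languages, name):
    """Resolve a name against language names first-come, then their aliases."""
    wanted = name.lower()
    for language, meta in languages.items():
        if wanted == language.lower() or wanted in meta.get("aliases", []):
            return language
    return None


# Hypothetical state today: "delphi" exists only as an alias.
before = {
    "Pascal": {"aliases": []},
    "Component Pascal": {"aliases": ["delphi"]},
}

# Hypothetical state after the proposed change: Delphi is its own entry
# with no extensions, so only explicit overrides can select it.
after = {
    "Pascal": {"aliases": []},
    "Component Pascal": {"aliases": []},
    "Delphi": {"aliases": [], "extensions": []},
}
```

With `before`, looking up "Delphi" resolves through the alias to Component Pascal; with `after`, it resolves to the new Delphi entry.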

nunopicado commented 8 years ago

Thanks @arfon, for your reply.

I do think Delphi should be separated from Component Pascal, and the same goes for Object Pascal. Delphi and Object Pascal could be marked as aliases of one another (one as a language, one as an alias; I don't really mind which is which). Even though there are different flavours of Object Pascal, at least it fits the description.

There are some file extensions which are exclusive to Delphi. Those could be added. The problem is with .pas files, which is a common extension.

It was mentioned that there must be something in the code to differentiate standard Pascal from Delphi/Object Pascal. Well, I guess there is.

Standard Pascal is not object oriented, so there are no classes. Delphi, on the other hand, is heavily object oriented, so there will probably not be many .pas files that lack the keyword class.

Would that be enough to create a rule for linguist?
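A rough sketch of what such a rule might look like, assuming a class declaration of the form `TName = class` is a good enough signal. The pattern and function name are hypothetical. The check does not strip comments or strings, and it would also match Free Pascal code that declares classes, so it separates "Object Pascal" from plain Pascal rather than Delphi specifically.

```python
import re

# "= class" as a whole word, e.g. "TForm1 = class(TForm)".
# The word boundary keeps identifiers such as "classify" from matching.
CLASS_DECLARATION = re.compile(r"=\s*class\b", re.IGNORECASE)


def guess_pas_language(source):
    """Guess the language of a .pas file from a class declaration."""
    if CLASS_DECLARATION.search(source):
        return "Delphi"
    return "Pascal"
```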

arfon commented 8 years ago

> Standard Pascal is not object oriented, so there are no classes. Delphi, on the other hand, is heavily object oriented, so there will probably not be many .pas files that lack the keyword class.

Very possibly! Basically we need to be able to write a heuristic (a regular expression) that is applied on a per-file extension basis. Here's a good example for .es which is used by JavaScript and Erlang: https://github.com/github/linguist/blob/master/lib/linguist/heuristics.rb#L139-L145

@nunopicado - Would you be interested in attempting this in a Pull Request?

nunopicado commented 8 years ago

Strange as it may seem, I have never used regular expressions! :) But I'll give it a try. I'll check your example and try to create something that can differentiate .pas files. I'll get back to you on this. Thank you @arfon! ;)

arfon commented 8 years ago

> Strange as it may seem, I never used regular expressions! :)

Welcome to the dark side 😉

> I'll check your example and try to create something that can differentiate .pas files. I'll get back to you on this.

👍 thanks. We can definitely help you get this polished up.

nunopicado commented 8 years ago

👍

JazzMaster commented 6 years ago

Still unpatched. It is seen as C++ and Java when it's clearly FPC sources for SDL.

pchaigno commented 6 years ago

@JazzMaster Could you open a separate issue with the appropriate details so that we can look into a fix?