github-linguist / linguist

Language Savant. If your repository's language is being reported incorrectly, send us a pull request!

OCaml code is reported as Standard ML #2208

Closed samoht closed 9 years ago

samoht commented 9 years ago

Everything was working fine until a few days ago: all my new projects are now being reported as written in Standard ML instead of OCaml. See https://github.com/samoht/ocaml-huffman-code.

dbuenzli commented 9 years ago

Was going to report that as well. My repos are being converted to SML on new pushes. See e.g.

https://github.com/dbuenzli/rtime https://github.com/dbuenzli/mtime

dbuenzli commented 9 years ago

Makes good jokes though... https://github.com/ocaml/ocaml

pchaigno commented 9 years ago

My bad, it's a side effect of #2087. The Bayesian classifier doesn't seem to be able to distinguish the two languages. Do you know which keywords we could use to distinguish them (keywords that exist in one of the languages and not the other)?

dbuenzli commented 9 years ago

I'm not very knowledgeable in SML but this (a little bit old) page has a few hints. I'd suggest

For reference here is the list of OCaml reserved keywords.

@johnwhitington may have a definitive answer.

dbuenzli commented 9 years ago

(could appear as identifiers in OCaml though, but rather unlikely)

Well or in comments, so I would rule out datatype. The other ones mentioned seem sufficiently peculiar.

johnwhitington commented 9 years ago

Some thoughts.

Reserved words in Standard ML:

abstype and andalso as case datatype do else end exception fn fun handle if in infix infixr let local nonfix of op open orelse raise rec then type val with withtype while

In OCaml:

and         as          assert      asr         begin       class
constraint  do          done        downto      else        end
exception   external    false       for         fun         function
functor     if          in          include     inherit     initializer
land        lazy        let         lor         lsl         lsr
lxor        match       method      mod         module      mutable
new         object      of          open        or          private
rec         sig         struct      then        to          true
try         type        val         virtual     when        while
with

So, if we remove words which might be very common value (variable) names, then good positive indicators of OCaml would be

assert class external functor match mutable struct try inherit module virtual

And good positive indicators of Standard ML might be

abstype datatype handle infixr nonfix withtype local andalso orelse

Unfortunately, many Standard ML programs may not contain any of these.

Perhaps the strongest common discriminator of OCaml would be the two-keyword sequence "let rec" appearing in a .ml file.
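
To make the keyword idea concrete, here is a minimal Ruby sketch of counting indicator words. The constant and method names are hypothetical and this is not Linguist's actual classifier. One assumption differs from the lists above: `struct` and `functor` are left out of the OCaml list, because both also occur in Standard ML's module language (`structure Bar = struct ... end`).

```ruby
# Hypothetical helpers for illustration only; not part of Linguist.
# struct and functor are omitted on purpose: SML's module language uses them too.
OCAML_INDICATORS = %w[assert class external match mutable try inherit module virtual].freeze
SML_INDICATORS   = %w[abstype datatype handle infixr nonfix withtype local andalso orelse].freeze

# Counts whole-word occurrences of any of the given keywords.
def count_keywords(source, keywords)
  pattern = /\b(?:#{keywords.join('|')})\b/
  source.scan(pattern).length
end

# Returns "OCaml", "Standard ML", or nil when the counts are tied.
def guess_by_keywords(source)
  ocaml = count_keywords(source, OCAML_INDICATORS)
  sml   = count_keywords(source, SML_INDICATORS)
  # "let rec" counts as one extra OCaml hint, as suggested above.
  ocaml += source.scan(/\blet\s+rec\b/).length
  return nil if ocaml == sml
  ocaml > sml ? "OCaml" : "Standard ML"
end
```

A file containing none of these words yields `nil`, which matches the caveat above that many Standard ML programs may not contain any of the indicators. The sketch also counts words inside comments and strings, so it is only a rough vote.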

samoht commented 9 years ago

One other option: use the project context. If the project name contains the string ocaml or if the filename contains ocaml (case insensitive), or if there are some .mli files in the project then that's definitely an OCaml project.
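
A rough Ruby sketch of that project-context check, for illustration. The method name and its arguments (a project name and a list of file paths) are assumptions made here, not an existing Linguist API.

```ruby
# Hypothetical helper: true when the project context points to OCaml.
# project_name: repository name; filenames: paths of files in the project.
def ocaml_project_context?(project_name, filenames)
  # Case-insensitive check on the project name.
  return true if project_name.downcase.include?("ocaml")

  filenames.any? do |path|
    # Either the file name mentions ocaml, or it is an OCaml interface file.
    File.basename(path).downcase.include?("ocaml") || File.extname(path) == ".mli"
  end
end
```

For example, `ocaml_project_context?("rtime", ["src/rtime.mli"])` is true because of the .mli file, while a project named "HOL" containing only .sml files gives false.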

raphael-proust commented 9 years ago

The ratio of -> vs => can be a good indicator. In OCaml, the former is very common (pattern matches, lambdas, arrow types) and the latter is neither a keyword nor a bound identifier. In SML the former is for types only and the latter is very common (pattern matches, lambdas).

It might be that the mere presence of => would be a good enough classifier for sml.
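
As a sketch of the arrow-counting idea in Ruby (method names are hypothetical, and a tie is simply reported as undecided):

```ruby
# Hypothetical helpers: count "->" and "=>" and compare them.
def arrow_counts(source)
  { thin: source.scan("->").length, fat: source.scan("=>").length }
end

# Returns "OCaml", "Standard ML", or nil when the counts are equal.
def guess_by_arrows(source)
  counts = arrow_counts(source)
  return nil if counts[:thin] == counts[:fat]
  counts[:fat] > counts[:thin] ? "Standard ML" : "OCaml"
end
```

Note that this counts arrows in comments and string literals as well, so it only approximates the ratio described above.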

pchaigno commented 9 years ago

> It might be that the mere presence of => would be a good enough classifier for sml.

We could use the presence of => for SML and a regular expression on a -> construction for OCaml (see heuristics.rb for examples). What do you think?

samoht commented 9 years ago

@pchaigno module, let rec and -> seem to be a good way to disambiguate OCaml code.

raphael-proust commented 9 years ago

disambiguate "SML", "OCaml" do |data|
  if /=> /.match(data)
    Language["SML"]
  elsif /module|let rec /.match(data)
    Language["OCaml"]
  end
end

Something like that. I've never written ruby before so can't guarantee anything.

dmbaturin commented 9 years ago

SML uses "signature" and "structure" where OCaml uses "module type" and "module". This can never be seen in OCaml:

signature Foo = sig ... end
structure Bar [ : Foo | :> Foo ] = struct ... end

Case expression and anonymous function syntax is distinct:

(* SML *)
case <expr> of

(* OCaml *)
match <expr> with

(* SML *)
fn x => <expr>
(* OCaml *)
fun x -> <expr>

SML "val foo = ..." binding syntax also cannot occur in OCaml. OCaml "val" keyword is only used in module signatures where it looks like

val foo : int -> int

Since virtually every ML program contains bindings, this probably can be a good indicator.

SML let expression syntax is "let ... in ... end" with multiple bindings between "let" and "in", which also never occurs in OCaml.
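
The constructs listed above could be turned into regular expressions along these lines. This is a Ruby sketch with hypothetical names, not what Linguist ships. One caveat on the `val` pattern: OCaml object definitions can also contain `val x = ...` instance variables, so a match is a hint rather than proof.

```ruby
# Hypothetical pattern lists for illustration only.
SML_ONLY = [
  /^\s*(?:signature|structure)\s+\w+/, # signature Foo = sig ... end
  /\bval\s+\w+\s*=/,                   # val foo = ...
  /\bfn\s+[^=]*=>/,                    # fn x => <expr>
  /\bcase\b.*\bof\b/                   # case <expr> of
].freeze

OCAML_ONLY = [
  /^\s*module\s+(?:type\s+)?[A-Z]\w*/, # module Foo / module type Foo
  /\bmatch\b.*\bwith\b/,               # match <expr> with
  /\bfun\s+[^-]*->/                    # fun x -> <expr>
].freeze

# Returns how many of the given patterns occur at least once in the source.
def matching_constructs(source, patterns)
  patterns.count { |re| source.match?(re) }
end
```

Comparing `matching_constructs(data, SML_ONLY)` with `matching_constructs(data, OCAML_ONLY)` gives a simple vote for one language or the other.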

amirmc commented 9 years ago

Hi @pchaigno, Curious where things stand with this. I see a PR from @samoht and if this is ok, can it be merged? It may not seem obvious right now but this issue is affecting the discovery of new repos/projects that use OCaml.

pchaigno commented 9 years ago

@amirmc I answered in #2227. I should also clarify that I am not from GitHub, I'm just a regular contributor.

amirmc commented 9 years ago

Whoops! My bad. I saw an Octocat and just jumped to conclusions. :) Thanks for helping out with this (and labtocat is pretty cool, btw).

samoht commented 9 years ago

@johnwhitington or @kayceesrk, do you know if .ML is a usual extension for Standard ML programs?

johnwhitington commented 9 years ago

I don't know what is standard, but a quick Google search suggests that both MLton and SML/NJ expect .sml:

http://mlton.org/Installation

http://www.smlnj.org/doc/FAQ/usage.html#loadFile

kayceesrk commented 9 years ago

.sml and .sig are the standard in MLton.

amirmc commented 9 years ago

I tried to do some quick searching on GitHub (not straightforward given the current state) and found the following Standard ML project - https://github.com/HOL-Theorem-Prover/HOL. It seems substantive and active, given the number of watchers, forks and stars. Files end in .sml

@mn200, sorry to tag you in a thread out of the blue but perhaps you can help us clarify whether .ML is a usual extension for Standard ML projects.

dmbaturin commented 9 years ago

.sml is nearly universal among SML users (and implementations), I've never seen anyone using .ml. You can see a lot of SML code here: http://github.com/standardml

Conversely, .ml and .mli are nearly universal among OCaml users.

mn200 commented 9 years ago

As others have said, .sig and .sml are standard for SML. Some do use .ML (see for example the sources in the Isabelle system), but this is less common (and certainly overlaps with OCaml usage if you ignore case).
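
The extension conventions reported in this thread can be summarised as a small lookup. This is a Ruby sketch: the table is an assumption drawn from the comments above, not Linguist's languages.yml, and the names are hypothetical.

```ruby
# Assumed mapping based on the comments in this thread.
CANDIDATES_BY_EXTENSION = {
  ".sml" => ["Standard ML"],
  ".sig" => ["Standard ML"],
  ".ml"  => ["OCaml"],
  ".mli" => ["OCaml"],
  ".ML"  => ["Standard ML", "OCaml"] # upper case is ambiguous (e.g. Isabelle sources)
}.freeze

# Returns the candidate languages for a file name, or [] if the extension is unknown.
def candidates_for(filename)
  ext = File.extname(filename)
  # Try the exact extension first so ".ML" stays ambiguous, then fall back to lower case.
  CANDIDATES_BY_EXTENSION.fetch(ext) { CANDIDATES_BY_EXTENSION.fetch(ext.downcase, []) }
end
```

Only files whose extension maps to more than one candidate would then need a content-based heuristic.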

arfon commented 9 years ago

#2227 is in production now and the results are looking good:

[screenshot, 2015-03-18 10:38 am]

amirmc commented 9 years ago

Wonderful. Thank you, all.

I expect it might take a little while to propagate but it's already looking much better.

arfon commented 9 years ago

Great!

yminsky commented 9 years ago

There are lots of repos that still seem to be wrong. For example, near and dear to my heart:

https://github.com/janestreet/core_kernel

is supposedly 77.8% SML. When should we expect this rerun to be complete?

dbuenzli commented 9 years ago

@yminsky it seems that updates are performed when you push to the repo, see https://help.github.com/articles/my-repository-is-marked-as-the-wrong-language/

(though I also still have mislabellings after having done so e.g. https://github.com/dbuenzli/mtime, I don't know if you have to actually touch the files).

larsbrinkhoff commented 9 years ago

Would the regexps ^# and/or ;; be unique to OCaml?

mn200 commented 9 years ago

Both are possible but unlikely in SML.

fun f (x : {fld : int}) = 
#fld x;;

is weird looking but valid SML.
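
A quick Ruby check of that point, using the snippet above as input. The helper name is hypothetical.

```ruby
# Hypothetical helper: reports which OCaml-toplevel-style markers occur in a source.
def toplevel_markers(source)
  {
    hash_at_line_start: source.match?(/^#/),   # ^ matches at the start of every line in Ruby
    double_semicolon:   source.include?(";;")
  }
end

sml_snippet = "fun f (x : {fld : int}) =\n#fld x;;\n"
markers = toplevel_markers(sml_snippet)
puts markers[:hash_at_line_start] # true for this valid SML input
puts markers[:double_semicolon]   # true as well
```

Both markers fire on valid SML here, so neither regexp is strictly unique to OCaml, even if such code is rare in practice.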

arfon commented 9 years ago

> There are lots of repos that still seem to be wrong. For example, near and dear to my heart:

@yminsky - repository stats are only updated when there is a push event. I've manually recalculated the statistics for https://github.com/janestreet/core_kernel and things are looking much better.

arfon commented 9 years ago

@dbuenzli - any update to any file after a new version of Linguist has been released will completely rebuild the language statistics.

larsbrinkhoff commented 9 years ago

I submitted #2270 which should at least catch those few OCaml files which use a shebang.

larsbrinkhoff commented 9 years ago

I guess ^# would only catch a few usages of OCaml toplevel directives.

Another idea: isn't it the case that module at the start of a line is fairly common in OCaml files (about 90% of them it seems), and not common in Standard ML?

See e.g. https://github.com/search?q=language%3Asml+module+-extension%3Asig+-extension%3Acache&type=Code

mn200 commented 9 years ago

module is not an SML keyword, and so would be unlikely in column 1.
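
As a sketch, the column-1 check could look like this in Ruby (hypothetical names; `\b` keeps identifiers such as `module_name` from matching):

```ruby
# Matches the word "module" at the very start of any line.
MODULE_AT_LINE_START = /^module\b/

# Hypothetical helper: true when some line begins with the OCaml keyword "module".
def module_at_line_start?(source)
  source.match?(MODULE_AT_LINE_START)
end
```

Indented occurrences, such as a nested `module` inside a `struct`, are deliberately ignored by this pattern.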

dsheets commented 9 years ago

Here is a list of repositories that are classified by GitHub/Linguist as SML but contain the string "ocaml" in their name, description, or README. Some of these repositories are actually SML (and some of those contain files incorrectly classified as OCaml) but the most popular ones are definitely OCaml.

$ ./search.native repo ocaml --language sml --sort stars
98 results returned of 98 total

frenetic-lang/frenetic (Standard ML) [67 stars] https://github.com/frenetic-lang/frenetic The Frenetic Programming Language and Runtime System

ocaml/ocaml-re (Standard ML) [46 stars] https://github.com/ocaml/ocaml-re Pure OCaml regular expressions, with support for Perl and POSIX-style strings

zoggy/stog (Standard ML) [34 stars] https://github.com/zoggy/stog XML documents and web site compiler.

Cumulus/Cumulus (Standard ML) [27 stars] https://github.com/Cumulus/Cumulus A friendly and minimalist link sharing website

rdicosmo/parmap (Standard ML) [26 stars] https://github.com/rdicosmo/parmap Parmap is a minimalistic library allowing to exploit multicore architecture for OCaml programs with minimal modifications.

nojb/ocaml-imap (Standard ML) [25 stars] https://github.com/nojb/ocaml-imap Non-blocking IMAP4rev1 client library for OCaml

johnwhitington/cpdf-source (Standard ML) [18 stars] https://github.com/johnwhitington/cpdf-source PDF Command Line Tools Source

dbuenzli/tgls (Standard ML) [17 stars] https://github.com/dbuenzli/tgls Thin bindings to OpenGL {3,4} and OpenGL ES {2,3} for OCaml

dbuenzli/tsdl (Standard ML) [16 stars] https://github.com/dbuenzli/tsdl Thin bindings to SDL for OCaml

akabe/slap (Standard ML) [16 stars] https://github.com/akabe/slap BLAS and LAPACK binding in OCaml with type-based static size checking for matrix operations

mackwic/To.ml (Standard ML) [14 stars] https://github.com/mackwic/To.ml Implementation in OCaml of the Toml minimal langage

mjambon/biniou (Standard ML) [14 stars] https://github.com/mjambon/biniou Extensible binary data format, like JSON but faster

ocaml/opam2web (Standard ML) [13 stars] https://github.com/ocaml/opam2web A tool to generate a website from an OPAM repository

c-cube/cconv (Standard ML) [12 stars] https://github.com/c-cube/cconv combinators for type conversion (serialization/deserialization) to/from several formats. See this blog post (outdated): http://cedeela.fr/universal-serialization-and-deserialization.html

axiles/ocaml-efl (Standard ML) [11 stars] https://github.com/axiles/ocaml-efl An OCaml interface to the Enlightenment Foundation Libraries (EFL) and Elementary

pyrocat101/opal (Standard ML) [11 stars] https://github.com/pyrocat101/opal Self-contained monadic parser combinators for OCaml

mirage/ocaml-crunch (Standard ML) [10 stars] https://github.com/mirage/ocaml-crunch Convert a filesystem into a static OCaml module

modlfo/firmata (Standard ML) [10 stars] https://github.com/modlfo/firmata Ocaml library to control Firmata boards like Arduino

mirage/ocaml-fat (Standard ML) [8 stars] https://github.com/mirage/ocaml-fat Read and write FAT format filesystems from OCaml

tel/ocaml-cats (Standard ML) [8 stars] https://github.com/tel/ocaml-cats Signatures of the category theoretic style; a experiment in flattery

coccinelle/herodotos (Standard ML) [8 stars] https://github.com/coccinelle/herodotos Tracking code patterns through software versions

mirage/ocaml-pcap (Standard ML) [6 stars] https://github.com/mirage/ocaml-pcap Ocaml code for generating and analysing pcap (packet capture) files

nojb/ocaml-gsasl (Standard ML) [6 stars] https://github.com/nojb/ocaml-gsasl OCaml bindings for the GNU SASL library using Ctypes

arlencox/mlbdd (Standard ML) [6 stars] https://github.com/arlencox/mlbdd A not-quite-so-simple Binary Decision Diagrams implementation for OCaml

RobertHarper/TILT-Compiler (Standard ML) [6 stars] https://github.com/RobertHarper/TILT-Compiler Standard ML compiler based on typed intermediate languages.

hcarty/ocaml-gdal (Standard ML) [5 stars] https://github.com/hcarty/ocaml-gdal OCaml bindings to the GDAL and OGR Libraries

infidel/ocaml-mdns (Standard ML) [5 stars] https://github.com/infidel/ocaml-mdns OCaml implementation of the Multicast DNS protocol

tokenrove/shred-for-satan (Standard ML) [5 stars] https://github.com/tokenrove/shred-for-satan MIDI-driven metronome

ahrefs/ocaml-qfs (Standard ML) [4 stars] https://github.com/ahrefs/ocaml-qfs

jonsterling/ocaml-modular-typechecking (Standard ML) [4 stars] https://github.com/jonsterling/ocaml-modular-typechecking Modular type checking using open types

rgrinberg/stringext (Standard ML) [4 stars] https://github.com/rgrinberg/stringext Extra string functions for OCaml

mirage/mirage-net-unix (Standard ML) [4 stars] https://github.com/mirage/mirage-net-unix Ethernet networking interface for Unix Mirage applications using tuntap

tobiasBora/phluor_tools (Standard ML) [4 stars] https://github.com/tobiasBora/phluor_tools A framework to organise a website based on ocsigen (Ocaml)

mirage/mirage-net-xen (Standard ML) [4 stars] https://github.com/mirage/mirage-net-xen Xen Netfront ethernet device driver for Mirage

mirage/mirage-console (Standard ML) [4 stars] https://github.com/mirage/mirage-console Portable console handling for Mirage applications

mirage/mirage-block-xen (Standard ML) [4 stars] https://github.com/mirage/mirage-block-xen Client and server implementations of the xen paravirtualised block driver protocol

OCamlPro/operf-macro (Standard ML) [4 stars] https://github.com/OCamlPro/operf-macro Some macro-benchmarks for operf and an OPAM repository for them

jhckragh/SMLDoc (Standard ML) [4 stars] https://github.com/jhckragh/SMLDoc SMLDoc, detached from the SML# distribution

avsm/ocaml-dockerfile (Standard ML) [3 stars] https://github.com/avsm/ocaml-dockerfile OCaml interface for creating Dockerfiles

struktured/ocaml-prob-cache (Standard ML) [3 stars] https://github.com/struktured/ocaml-prob-cache Polymorphic probability caches in OCaml, including a distributed riak backed cache.

lpw25/compiler_eq (Standard ML) [3 stars] https://github.com/lpw25/compiler_eq Tool for comparing OCaml compilers

aluuu/frmttr (Standard ML) [3 stars] https://github.com/aluuu/frmttr Type-safe sprintf analog in OCaml

scvalex/Super-Max (Standard ML) [3 stars] https://github.com/scvalex/Super-Max A catch-all for game-related projects

tokenrove/zookicker (Standard ML) [3 stars] https://github.com/tokenrove/zookicker 1GAM February backup plan

mietek/et-language (Standard ML) [3 stars] https://github.com/mietek/et-language ET (IPL) language interpreters and literature

yanguango/visual_sort (Standard ML) [2 stars] https://github.com/yanguango/visual_sort Sorting Visualization based on OCaml

linerlock/featherweight-java (Standard ML) [2 stars] https://github.com/linerlock/featherweight-java An experimental implementation of (extended) featherweight-java (FJ) written in OCaml.

samoht/mirmin (Standard ML) [2 stars] https://github.com/samoht/mirmin Example of a Mirage unikernels using Irmin

whitequark/ocamlnet (Standard ML) [1 stars] https://github.com/whitequark/ocamlnet An automatically updated mirror of https://godirepo.camlcity.org/svn/lib-ocamlnet2/trunk/code

OCamlPro/ocaml-benchs (Standard ML) [1 stars] https://github.com/OCamlPro/ocaml-benchs Sources of the set of benchmarks distributed in OCamlPro/opam-bench-repo

bkc39/ocaml-prelude (Standard ML) [1 stars] https://github.com/bkc39/ocaml-prelude Includes the functions you need, that INRIA didn't.

tel/ocaml-collage (Standard ML) [1 stars] https://github.com/tel/ocaml-collage

tel/ocaml-abt (Standard ML) [1 stars] https://github.com/tel/ocaml-abt Abstract binding trees

choeger/modelica.ml (Standard ML) [1 stars] https://github.com/choeger/modelica.ml Modelica frontend implemented in OCaml

stephlm2dev/SchmilkaHashCode (Standard ML) [1 stars] https://github.com/stephlm2dev/SchmilkaHashCode Team Schmilka for Google Hash Code 2015

smondet/locoseq (Standard ML) [1 stars] https://github.com/smondet/locoseq Automatically exported from code.google.com/p/locoseq

melsman/sml-llvm (Standard ML) [1 stars] https://github.com/melsman/sml-llvm Standard ML Bindings for LLVM

gameboy1024/minijavac (Standard ML) [1 stars] https://github.com/gameboy1024/minijavac A school project where we tries to develop a compiler for a fictional language called minijava.

massimo-nocentini/theory-of-programming-languages (Standard ML) [1 stars] https://github.com/massimo-nocentini/theory-of-programming-languages Bag for my work during the course of Theory of Programming Languages at University of Florence

simonegasperoni/funzionale (Standard ML) [0 stars] https://github.com/simonegasperoni/funzionale ocaml

thomas-huet/coop-ocaml (Standard ML) [0 stars] https://github.com/thomas-huet/coop-ocaml coop is a cooperative threads library

MFreidank/ocaml_exercising (Standard ML) [0 stars] https://github.com/MFreidank/ocaml_exercising

Lokibes/obelisk-ocaml (Standard ML) [0 stars] https://github.com/Lokibes/obelisk-ocaml Automatically exported from code.google.com/p/obelisk-ocaml

zakhar/ocaml-onnt (Standard ML) [0 stars] https://github.com/zakhar/ocaml-onnt Automatically exported from code.google.com/p/ocaml-onnt

taquangtrung/ocaml-tools (Standard ML) [0 stars] https://github.com/taquangtrung/ocaml-tools

i-am-jd/ocaml-onnt (Standard ML) [0 stars] https://github.com/i-am-jd/ocaml-onnt Automatically exported from code.google.com/p/ocaml-onnt

suisse91/ocaml_mylist (Standard ML) [0 stars] https://github.com/suisse91/ocaml_mylist

jrrk/ocaml-for-ios (Standard ML) [0 stars] https://github.com/jrrk/ocaml-for-ios Automatically exported from code.google.com/p/ocaml-for-ios

zoggy/ocamldoc-generators (Standard ML) [0 stars] https://github.com/zoggy/ocamldoc-generators A collection of custom ocamldoc generators.

domsj/orocksdb (Standard ML) [0 stars] https://github.com/domsj/orocksdb An OCaml RocksDb binding using ocaml-ctypes

HerbertJordan/otest (Standard ML) [0 stars] https://github.com/HerbertJordan/otest OCaml testing framework

fetburner/OFold (Standard ML) [0 stars] https://github.com/fetburner/OFold fold in OCaml influenced by "Programming in OCaml"

fetburner/OCat (Standard ML) [0 stars] https://github.com/fetburner/OCat cat in OCaml influenced by "Programming in OCaml"

fetburner/owc (Standard ML) [0 stars] https://github.com/fetburner/owc wc in OCaml influenced by "Programming in OCaml

SusanHuang/MinimalistGrammarWithCoordination (Standard ML) [0 stars] https://github.com/SusanHuang/MinimalistGrammarWithCoordination Minimalist Grammar with Coordination (OCAML)

art1pirat/img_pipieline (Standard ML) [0 stars] https://github.com/art1pirat/img_pipieline Ocaml image Pipeline (using camlimages)

rcefala/pascaml (Standard ML) [0 stars] https://github.com/rcefala/pascaml A pascal interpreter written in OCaml

fpottier/pprint (Standard ML) [0 stars] https://github.com/fpottier/pprint A pretty-printing combinator library for OCaml

thomas-huet/lwt-pgocaml (Standard ML) [0 stars] https://github.com/thomas-huet/lwt-pgocaml Wrapper to use Lwt with PG'OCaml

coutar-a/My_list (Standard ML) [0 stars] https://github.com/coutar-a/My_list Alternative implementation of lists in Ocaml

sfritz/a-song-of-ones-and-zeros (Standard ML) [0 stars] https://github.com/sfritz/a-song-of-ones-and-zeros Conway's Game of Life in OCaml

antoyo/tq (Standard ML) [0 stars] https://github.com/antoyo/tq Text-User Interface Library with Widgets Written in OCaml

juster/ffp (Standard ML) [0 stars] https://github.com/juster/ffp The FFP language of John Backus in Ocaml

mohamedaf/Projet1-CompilationAvancee (Standard ML) [0 stars] https://github.com/mohamedaf/Projet1-CompilationAvancee Compilation d'un programme en OCAML en un programme C équivalent

IzzyRahaman/99MLProblems (Standard ML) [0 stars] https://github.com/IzzyRahaman/99MLProblems Attempts at solving the 99 Ocaml Problems originally derived from the 99 Prolog Problems in ML

daherb/Kreis-Kugel (Standard ML) [0 stars] https://github.com/daherb/Kreis-Kugel Trying to find a way to place points with equal distance onto a circle/sphere

cicku/stcntroll (Standard ML) [0 stars] https://github.com/cicku/stcntroll This is a "handy" but enigmatic tool to pick up the lucky dog ^_^/

pgalland/ProgProblems (Standard ML) [0 stars] https://github.com/pgalland/ProgProblems

remyzorg/ppx_comprehension (Standard ML) [0 stars] https://github.com/remyzorg/ppx_comprehension Syntax extension point for list comprhension

iraikov/pprint (Standard ML) [0 stars] https://github.com/iraikov/pprint Pretty printing library for Standard ML

iraikov/mpi-mlton (Standard ML) [0 stars] https://github.com/iraikov/mpi-mlton MPI bindings for Standard ML / MLton

BernardBeefheart/ml-games (Standard ML) [0 stars] https://github.com/BernardBeefheart/ml-games jouer avec ML (Standard ML)

spacemanaki/lexluthor (Standard ML) [0 stars] https://github.com/spacemanaki/lexluthor a library for building lexical analyzers

Alexis211/SystemeReseaux-Projet (Standard ML) [0 stars] https://github.com/Alexis211/SystemeReseaux-Projet

gfxmonk/passe (Standard ML) [0 stars] https://github.com/gfxmonk/passe

khuumi/SNL (Standard ML) [0 stars] https://github.com/khuumi/SNL

bdkoepke/pfds (Standard ML) [0 stars] https://github.com/bdkoepke/pfds Purely Functional Data Structures

velour/caml-spt (Standard ML) [0 stars] https://github.com/velour/caml-spt Automatically exported from code.google.com/p/caml-spt

dsheets commented 9 years ago

It would appear that the search index has a cache that is/was out-of-date. Only some of the repositories I just reported are now misclassified. Sorry for the confusion.

arfon commented 9 years ago

Yup - I'm going through these manually now re-indexing them.

keleshev commented 9 years ago

In case you are interested, some more examples of (recently pushed to) repositories with files misidentified as SML: