ndmitchell / hoogle

Haskell API search engine
http://hoogle.haskell.org/

Hoogle for OCaml #203

Open UnixJunkie opened 7 years ago

UnixJunkie commented 7 years ago

We have needed this tool for 10 years...

ndmitchell commented 7 years ago

Description of what is required is at http://neilmitchell.blogspot.co.uk/2011/03/hoogle-for-your-language-ie-f-scala-ml.html. We're currently at:

A volunteer needs to generate some Hoogle input files containing details of the modules/functions/packages etc. to be searched. These files should be plain text, but can be in a language-specific format, e.g. ML syntax for type signatures. For a rough idea of how these files could look see this example - for Haskell I get these files from Hackage. The code to generate these input files can be written in any language, and can live outside Hoogle.
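As a rough illustration of the kind of generator described above, here is a minimal sketch in Python. The line layout (`@package`, `@version`, `module`, `name :: signature`) is an assumption loosely modelled on the text files Haddock emits for Hoogle; the function name `render_hoogle_input` and the sample version number are hypothetical, and the exact directives Hoogle accepts should be checked against its own documentation.

```python
def render_hoogle_input(package, version, modules):
    """Render one package as plain text, one declaration per line.

    `modules` maps a module name to a list of (value_name, signature)
    pairs. Signatures are kept in ML syntax, as the description allows.
    The directive names used here are an assumption, not a confirmed spec.
    """
    lines = [f"@package {package}", f"@version {version}", ""]
    for module_name, values in modules.items():
        lines.append(f"module {module_name}")
        for name, signature in values:
            lines.append(f"{name} :: {signature}")
        lines.append("")  # blank line between modules
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical sample data; the version string is made up.
    print(render_hoogle_input("ao", "0.0", {"Ao": [("play", "t -> string -> ()")]}))
```

Keeping the generator as a plain text emitter means it can live outside Hoogle, as the description suggests.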

UnixJunkie commented 7 years ago

@avsm @yminsky @diml @lefessan

UnixJunkie commented 7 years ago

The INRIA SED might also be interested in helping with that: @shindere @thierry-martinez

UnixJunkie commented 7 years ago

@dbuenzli

UnixJunkie commented 7 years ago

@samoht

UnixJunkie commented 7 years ago

I generated all the .mli files I could install with OPAM. They are here: https://github.com/UnixJunkie/ocaml-4.04.0-mlis @ndmitchell does this help?

UnixJunkie commented 7 years ago

I did this for the latest stable version of OCaml.

ndmitchell commented 7 years ago

That looks good. Next thing we'd need is a Haskell parser for the subset of OCaml contained in mli files.

UnixJunkie commented 7 years ago

I asked more competent people. I hope some will manifest themselves.

nojb commented 7 years ago

Hello! I have just started working on this (https://github.com/nojb/haskell-ocaml-parser). It is my first time writing Haskell, but I hope to have something working in the next several days (work allowing).

shindere commented 7 years ago

Many thanks @nojb!

ndmitchell commented 7 years ago

I'm happy enough with Happy, although I would say my first approach is usually to use something like parsec/attoparsec (typically the latter). Use whichever you prefer though.

nojb commented 7 years ago

That was my first thought as well, but the OCaml syntax is large and complicated. Moreover the official compiler also uses a yacc-based parser. Using a similar technology makes it easier for me and, I think, will make it easier to maintain in the long run.

ndmitchell commented 7 years ago

That makes a lot of sense. Are you planning to release your OCaml parser as a library that I depend on, or to merge the code into Hoogle? I'm happy either way, although I imagine an OCaml parser could be generally useful.

nojb commented 7 years ago

I haven't given it much thought yet, but indeed I think it makes sense to release as a library.

thierry-martinez commented 7 years ago

Another approach would be to apply a preprocessing phase implemented in OCaml (and that would be able to use compiler-libs for example) that would output a text file with easier-to-parse signatures. Something that worried me about the project of having Hoogle for OCaml is that the OCaml module system makes bare signatures carry very little information (you will get a lot of functions of type t -> t, or something like that). Expanding type signatures to fully qualified types probably requires a non-trivial treatment that could benefit from compiler-libs and that would be difficult to reproduce in an ad-hoc parser.
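To make the `t -> t` concern concrete, here is a toy sketch in Python of what qualifying bare type names could look like. It is purely syntactic and only an assumption about the general idea: the helper name `qualify` is hypothetical, and it ignores `open`, `include`, functors and aliases, which is exactly the part that would need compiler-libs.

```python
import re

# A lowercase identifier that is not a type variable ('a) and is not
# already qualified (Foo.t). OCaml identifiers may contain primes.
_IDENT = re.compile(r"(?<![\w.'])[a-z_][\w']*")


def qualify(module_path, signature, local_types):
    """Prefix type names defined in `module_path` with that path.

    `local_types` is the set of type names the module itself declares.
    Everything else (built-ins, type variables, qualified names) is
    left untouched.
    """
    def repl(match):
        name = match.group(0)
        return f"{module_path}.{name}" if name in local_types else name

    return _IDENT.sub(repl, signature)


if __name__ == "__main__":
    print(qualify("Buffer", "t -> int -> unit", {"t"}))
```

With qualified names, a search for `Buffer.t -> int` no longer matches every module that happens to define its own `t`.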

nojb commented 7 years ago

I like the idea of applying a preprocessing step on the OCaml side to make it easier to parse on the Haskell side.

Regarding the second part of your suggestion: if I understand correctly, you are saying that to "know" which type a particular type constructor refers to requires more than just a syntactic analysis (for example, to take into account "opens" and "includes"). This means that we probably need to take the .cmtis as input instead of the .mlis. What do you think?

UnixJunkie commented 7 years ago

If you need more files to be pushed in there: https://github.com/UnixJunkie/ocaml-4.04.0-mlis just ping me

nojb commented 7 years ago

@ndmitchell Hi Neil, to make matters a little more concrete, could you post a link to the internal Hoogle representation that we must target (step "2" of your blog post)? Thanks!

ndmitchell commented 7 years ago

https://github.com/ndmitchell/hoogle/blob/master/src/Input/Item.hs has all the data types. Item is the root type, Sig is where most of the complexity lies since that is type signatures.

thierry-martinez commented 7 years ago

@nojb I think that .cmi files are just fine. I wrote a small tool that dumps .cmi files in Haskell syntax: it should be directly parsable with Hoogle without any custom parser. Queries will need to be preprocessed, though. https://github.com/thierry-martinez/hooglebackend

nojb commented 7 years ago

Great! Does this mean we can already use Hoogle with OCaml (even if only in a basic manner)?

UnixJunkie commented 7 years ago

I pushed all .cmi files in there too: https://github.com/UnixJunkie/ocaml-4.04.0-mlis We have 3518 cmi files and 1881 mli files.

UnixJunkie commented 7 years ago

Should I run Thierry's dumper on all the .cmi files and store its outputs? I can do this tomorrow if needed. Just ping me.

nojb commented 7 years ago

I think we are not there quite yet. I played around with Thierry's tool a little bit. For example, running hooglebackend foo.cmi, where foo.ml is

open Map
module M = Make (String)

gives

module Foo where
module Foo__2EM where
data Tkey
data Tt t0
empty :: Tt a
is_empty :: (Tt a -> Tbool)
mem :: (Tkey -> (Tt a -> Tbool))
add :: (Tkey -> (a -> (Tt a -> Tt a)))
singleton :: (Tkey -> (a -> Tt a))
remove :: (Tkey -> (Tt a -> Tt a))
merge :: ((Tkey -> (Toption a -> (Toption b -> Toption c))) -> (Tt a -> (Tt b -> Tt c)))
union :: ((Tkey -> (a -> (a -> Toption a))) -> (Tt a -> (Tt a -> Tt a)))
compare :: ((a -> (a -> Tint)) -> (Tt a -> (Tt a -> Tint)))
equal :: ((a -> (a -> Tbool)) -> (Tt a -> (Tt a -> Tbool)))
iter :: ((Tkey -> (a -> Tunit)) -> (Tt a -> Tunit))
fold :: ((Tkey -> (a -> (b -> b))) -> (Tt a -> (b -> b)))
for_all :: ((Tkey -> (a -> Tbool)) -> (Tt a -> Tbool))
exists :: ((Tkey -> (a -> Tbool)) -> (Tt a -> Tbool))
filter :: ((Tkey -> (a -> Tbool)) -> (Tt a -> Tt a))
partition :: ((Tkey -> (a -> Tbool)) -> (Tt a -> (Tt a, Tt a, Tt a)))
cardinal :: (Tt a -> Tint)
bindings :: (Tt a -> Tlist (Tkey, Tkey, a))
min_binding :: (Tt a -> (Tkey, Tkey, a))
max_binding :: (Tt a -> (Tkey, Tkey, a))
choose :: (Tt a -> (Tkey, Tkey, a))
split :: (Tkey -> (Tt a -> (Tt a, Tt a, Toption a, Tt a)))
find :: (Tkey -> (Tt a -> a))
map :: ((a -> b) -> (Tt a -> Tt b))
mapi :: ((Tkey -> (a -> b)) -> (Tt a -> Tt b))
module Foo where

I can see two issues right away:

ndmitchell commented 7 years ago

I would advise against targeting Item directly from an external tool. Item is very much an internal detail of Hoogle, so anything operating on it should live in Hoogle itself. Translating to Haskell (or Haskell-like) is probably a better approach.

zhenya1007 commented 7 years ago

One observation/suggestion: tools like ocp-index and ocamlspot manage to keep a rather accurate account of the types in one's source tree. They do rely on .cmt/.cmti files being present, but it doesn't sound like that's a big problem for this project. Perhaps it would make sense to look at extending one of those tools (or even ocamlbrowser) to dump their database of types in a tree as S-expressions (say), so parsing on the Haskell side is not too difficult?

UnixJunkie commented 7 years ago

In my opam setup in a VM with as many packages as I could install: there are more .cmi files than any other type (3518 .cmi files, 1880 .mli files, 1713 .cmt files, 1328 .cmti files). So, I guess it is better to target .cmi files if we want to have as many libraries as possible being indexed.

thierry-martinez commented 7 years ago

I updated https://github.com/thierry-martinez/hooglebackend : I chose a JSON-compatible format, using only lists, strings and integers. I checked that there exist some JSON parsers in Haskell, and it should be trivial to parse from scratch anyway. Type manifests are now handled correctly (thanks @nojb !).
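To illustrate why a format restricted to lists, strings and integers is easy to parse from scratch, here is a minimal recursive-descent sketch in Python (the same structure would carry over to a hand-written Haskell parser). The record shape in the usage example is hypothetical and is not taken from hooglebackend's actual output; only `"` and `\` escapes are handled.

```python
def parse(text):
    """Parse a value made only of lists, double-quoted strings and integers."""
    value, pos = _value(text, _skip_ws(text, 0))
    pos = _skip_ws(text, pos)
    if pos != len(text):
        raise ValueError(f"trailing input at offset {pos}")
    return value


def _skip_ws(text, pos):
    while pos < len(text) and text[pos] in " \t\r\n":
        pos += 1
    return pos


def _value(text, pos):
    if pos >= len(text):
        raise ValueError("unexpected end of input")
    ch = text[pos]
    if ch == "[":
        return _list(text, pos + 1)
    if ch == '"':
        return _string(text, pos + 1)
    if ch == "-" or ch.isdigit():
        return _integer(text, pos)
    raise ValueError(f"unexpected character {ch!r} at offset {pos}")


def _list(text, pos):
    items = []
    pos = _skip_ws(text, pos)
    if pos < len(text) and text[pos] == "]":
        return items, pos + 1  # empty list
    while True:
        item, pos = _value(text, _skip_ws(text, pos))
        items.append(item)
        pos = _skip_ws(text, pos)
        if pos < len(text) and text[pos] == ",":
            pos += 1
        elif pos < len(text) and text[pos] == "]":
            return items, pos + 1
        else:
            raise ValueError(f"expected ',' or ']' at offset {pos}")


def _string(text, pos):
    chars = []
    while pos < len(text):
        ch = text[pos]
        if ch == '"':
            return "".join(chars), pos + 1
        if ch == "\\":  # only \" and \\ are supported in this sketch
            pos += 1
            if pos >= len(text) or text[pos] not in '"\\':
                raise ValueError(f"unsupported escape at offset {pos}")
            ch = text[pos]
        chars.append(ch)
        pos += 1
    raise ValueError("unterminated string")


def _integer(text, pos):
    end = pos + 1
    while end < len(text) and text[end].isdigit():
        end += 1
    return int(text[pos:end]), end


if __name__ == "__main__":
    # Hypothetical record shape, for illustration only.
    print(parse('["val", "length", ["arrow", "string", "int"], 0]'))
```

Because the parser is strict about the `,` separator, a list written with `;` is rejected rather than silently misread.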

UnixJunkie commented 7 years ago

I added a .cmi.json file for each .cmi file in there: https://github.com/UnixJunkie/ocaml-4.04.0-mlis. The .json files were created using Thierry's hooglebackend software.

nojb commented 7 years ago

@thierry-martinez The JSON list delimiter should be , instead of ;.

thierry-martinez commented 7 years ago

@nojb Oops! Fixed, thanks!

UnixJunkie commented 7 years ago

I updated all the .json files with the latest version of Thierry's hooglebackend software.

UnixJunkie commented 7 years ago

What should we do now?

ndmitchell commented 7 years ago

The indices certainly look plausible. The next step is for Hoogle to consume that output, generate its internal data types, and build an index. If one of you wants to do it, I can point you at where it should go. If it's me, it will probably be a few weeks before I get to it.

UnixJunkie commented 7 years ago

I'm not proficient in Haskell, but if there is some testing needed at some point, to run a few queries in the Hoogle for OCaml, just ask me. I am eager to test it. :)

UnixJunkie commented 7 years ago

@ndmitchell please don't forget this issue, thanks! :^)

UnixJunkie commented 7 years ago

Neil, this is your chance to become a hero for the OCaml community too (I guess you are already a semi god in the Haskell one).

UnixJunkie commented 7 years ago

Neil, look what I am currently using, out of complete despair: https://github.com/UnixJunkie/hoogle_for_ocaml/blob/master/hoogle_for_ocaml.sh Of course I want the real thing too, and I hope that one day I will be able to approach the legendary productivity of Haskell programmers.

ngzhian commented 7 years ago

I saw the JSON output here and thought that, rather than writing something to parse the JSON, hooglebackend could be modified to emit Hoogle-compatible docs (for example) instead.

My WIP is here: a fork of the original hooglebackend, modified to emit Haddock-compatible docs instead of JSON. It's incomplete, and what it can emit so far looks like this. I'm hoping this can be more easily consumed by Hoogle.

I have 2 questions:

  1. Does Hoogle care about casing? OCaml type names are lowercase, whereas Haskell's are capitalised.
  2. OCaml writes type application postfix (int option), whereas Haskell writes it prefix (Maybe Int). Should the output be normalised, or can Hoogle handle both forms? (If it has to be normalised, the frontend might also have to normalise an OCaml user's query.)
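If normalisation did turn out to be necessary, the two questions above could be handled by a rewrite along these lines. This is a minimal Python sketch under strong assumptions: it only covers arrows and single-argument postfix application, rejects parentheses and tuples, and merely capitalises names rather than mapping them (so `option` becomes `Option`, not `Maybe`). The function name `to_haskell_style` is hypothetical.

```python
def _atom(token):
    """Capitalise a type name; drop the quote from a type variable ('a -> a)."""
    if token.startswith("'"):
        return token[1:]
    return token[0].upper() + token[1:]


def _segment(segment):
    """Turn postfix application (int list option) into prefix form."""
    tokens = segment.split()
    if not tokens:
        raise ValueError("empty type between arrows")
    result = _atom(tokens[0])  # innermost argument comes first in OCaml
    for depth, constructor in enumerate(tokens[1:]):
        arg = f"({result})" if depth > 0 else result
        result = f"{_atom(constructor)} {arg}"
    return result


def to_haskell_style(signature):
    """Rewrite an OCaml-style signature into Haskell-style surface syntax."""
    if "(" in signature or "*" in signature:
        raise ValueError("parentheses and tuples are not handled in this sketch")
    return " -> ".join(_segment(part) for part in signature.split("->"))


if __name__ == "__main__":
    print(to_haskell_style("int list option -> 'a"))
```

The same function would have to be applied to user queries, so that a search typed in OCaml syntax is compared against an index written in the normalised form.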

P.S. I tried indexing the still-broken output with ~/.local/bin/hoogle generate --local=hooglebackend/, and queries do return some results:

$ ~/.local/bin/hoogle "a"
author :: driver_t -> string
package Ao
driver_author :: driver_t -> string
driver_name :: driver_t -> string
driver_preferred_byte_format :: driver_t -> byte_format_t
driver_short_name :: driver_t -> string
get_default_driver :: () -> driver_t
name :: driver_t -> string
play :: t -> string -> ()
preferred_byte_format :: driver_t -> byte_format_t