Open UnixJunkie opened 7 years ago
Description of what is required is at http://neilmitchell.blogspot.co.uk/2011/03/hoogle-for-your-language-ie-f-scala-ml.html. We're currently at:
A volunteer needs to generate some Hoogle input files containing details of the modules/functions/packages etc. to be searched. These files should be plain text, but can be in a language specific format - i.e. ML syntax for type signatures. For a rough idea of how these files could look see this example - for Haskell I get these files from Hackage. The code to generate these input files can be written in any language, and can live outside Hoogle.
@avsm @yminsky @diml @lefessan
The INRIA SED might also be interested in helping with that: @shindere @thierry-martinez
@dbuenzli
@samoht
I generated all the .mli files I could install with OPAM. They are here: https://github.com/UnixJunkie/ocaml-4.04.0-mlis @ndmitchell does this help?
I did this for the latest stable version of OCaml.
That looks good. Next thing we'd need is a Haskell parser for the subset of OCaml contained in mli files.
I asked more competent people. I hope some will manifest themselves.
Hello! I have just started working on this (https://github.com/nojb/haskell-ocaml-parser). It is my first time writing Haskell, but I hope to have something working in the next several days (work allowing).
Many thanks @nojb!
I'm happy enough with Happy, although I would say my first approach is usually to use something like parsec/attoparsec (typically the latter). Use whichever you prefer though.
That was my first thought as well, but the OCaml syntax is large and complicated. Moreover the official compiler also uses a yacc-based parser. Using a similar technology makes it easier for me and, I think, will make it easier to maintain in the long run.
That makes a lot of sense. Are you planning to release your OCaml parser as a library I depend on, or merge the code inside Hoogle? I'm happy either way, although I imagine an OCaml parser could be generally useful.
I haven't given it much thought yet, but indeed I think it makes sense to release as a library.
Another approach would be to apply a preprocessing phase implemented in OCaml (and that would be able to use compiler-libs for example) that would output a text file with easier-to-parse signatures. Something that worried me about the project of having Hoogle for OCaml is that the OCaml module system makes bare signatures carry very little information (you will get a lot of functions of type t -> t, or something like that). Expanding type signatures to fully qualified types probably requires a non-trivial treatment that could benefit from compiler-libs and that would be difficult to reproduce in an ad-hoc parser.
I like the idea of applying a preprocessing step on the OCaml side to make it easier to parse on the Haskell side.
Regarding the second part of your suggestion: if I understand correctly, you are saying that to "know" which type a particular type constructor refers to requires more than just a syntactic analysis (for example to take into account "opens" and "includes"). This means that we probably need to take the .cmtis as input instead of .mlis. What do you think?
If you need more files to be pushed in there: https://github.com/UnixJunkie/ocaml-4.04.0-mlis just ping me
@ndmitchell Hi Neil, in order to concretize matters a little bit, could you post a link to the internal Hoogle representation that we must target (step "2" of your blog post) ? Thanks !
https://github.com/ndmitchell/hoogle/blob/master/src/Input/Item.hs has all the data types. Item is the root type, Sig is where most of the complexity lies since that is type signatures.
@nojb I think that .cmi files are just fine. I wrote a small tool that dumps .cmi files in Haskell syntax: it should be directly parsable with Hoogle without any custom parser. Queries will need to be preprocessed, though. https://github.com/thierry-martinez/hooglebackend
Great! Does it mean we can already use Hoogle with OCaml (even if in a basic manner) ?
I pushed all .cmi files in there too: https://github.com/UnixJunkie/ocaml-4.04.0-mlis We have 3518 cmi files and 1881 mli files.
Should I run Thierry's dumper on all the .cmi files and store its outputs? I can do this tomorrow if needed. Just ping me.
I think we are not there quite yet. I played around with Thierry's tool a little bit. For example, running hooglebackend foo.cmi, where foo.ml is
open Map
module M = Make (String)
gives
module Foo where
module Foo__2EM where
data Tkey
data Tt t0
empty :: Tt a
is_empty :: (Tt a -> Tbool)
mem :: (Tkey -> (Tt a -> Tbool))
add :: (Tkey -> (a -> (Tt a -> Tt a)))
singleton :: (Tkey -> (a -> Tt a))
remove :: (Tkey -> (Tt a -> Tt a))
merge :: ((Tkey -> (Toption a -> (Toption b -> Toption c))) -> (Tt a -> (Tt b -> Tt c)))
union :: ((Tkey -> (a -> (a -> Toption a))) -> (Tt a -> (Tt a -> Tt a)))
compare :: ((a -> (a -> Tint)) -> (Tt a -> (Tt a -> Tint)))
equal :: ((a -> (a -> Tbool)) -> (Tt a -> (Tt a -> Tbool)))
iter :: ((Tkey -> (a -> Tunit)) -> (Tt a -> Tunit))
fold :: ((Tkey -> (a -> (b -> b))) -> (Tt a -> (b -> b)))
for_all :: ((Tkey -> (a -> Tbool)) -> (Tt a -> Tbool))
exists :: ((Tkey -> (a -> Tbool)) -> (Tt a -> Tbool))
filter :: ((Tkey -> (a -> Tbool)) -> (Tt a -> Tt a))
partition :: ((Tkey -> (a -> Tbool)) -> (Tt a -> (Tt a, Tt a, Tt a)))
cardinal :: (Tt a -> Tint)
bindings :: (Tt a -> Tlist (Tkey, Tkey, a))
min_binding :: (Tt a -> (Tkey, Tkey, a))
max_binding :: (Tt a -> (Tkey, Tkey, a))
choose :: (Tt a -> (Tkey, Tkey, a))
split :: (Tkey -> (Tt a -> (Tt a, Tt a, Toption a, Tt a)))
find :: (Tkey -> (Tt a -> a))
map :: ((a -> b) -> (Tt a -> Tt b))
mapi :: ((Tkey -> (a -> b)) -> (Tt a -> Tt b))
module Foo where
I can see two issues right away:
It is probably a good idea to emit Haskell code that constructs the required internal representation directly (data type Item in https://github.com/ndmitchell/hoogle/blob/master/src/Input/Item.hs). This would avoid the gymnastics needed to account for the difference in lexical conventions between Haskell and OCaml.
Manifest types are not taken into account. For example, above, the type key (Tkey) appears abstract when in fact it is known to be string.
I would advise against targeting Item directly from an external tool. Item is very much an internal detail of Hoogle, so anything operating on it should live in Hoogle itself. Translating to Haskell (or Haskell-like) is probably a better approach.
One observation/suggestion: tools like ocp-index and ocamlspot manage to keep a rather accurate account of the types in one's source tree. They do rely on .cmt/.cmti files being present, but it doesn't sound like that's a big problem for this project. Perhaps it would make sense to look at extending one of those tools (or even ocamlbrowser) to dump their database of types in a tree as S-expressions (say), so parsing on the Haskell side is not too difficult?
In my opam setup in a VM with as many packages as I could install: there are more .cmi files than any other type (3518 .cmi files, 1880 .mli files, 1713 .cmt files, 1328 .cmti files). So, I guess it is better to target .cmi files if we want to have as many libraries as possible being indexed.
I updated https://github.com/thierry-martinez/hooglebackend : I chose a JSON-compatible format, using only lists, strings and integers. I checked that there exist some JSON parsers in Haskell, and it should be trivial to parse from scratch anyway. Type manifests are now handled correctly (thanks @nojb !).
I added a .cmi.json file for each .cmi file in there: https://github.com/UnixJunkie/ocaml-4.04.0-mlis. The .json files were created using Thierry's hooglebackend software.
@thierry-martinez The JSON list delimiter should be , instead of ;.
@nojb Oops! Fixed, thanks!
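For readers following along, the delimiter problem above can be reproduced with any standard JSON parser. The sketch below uses Python's json module; the two payload strings are made up for illustration and are not actual hooglebackend output.

```python
import json

# Hypothetical payloads, not real hooglebackend output.
ocaml_style = '["Map"; "find"; 2]'   # OCaml list syntax: elements separated by ';'
json_style = '["Map", "find", 2]'    # JSON array syntax: elements separated by ','


def parses_as_json(text):
    """Return True if a standard JSON parser accepts the text."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


print(parses_as_json(ocaml_style))
print(parses_as_json(json_style))
```

Only the comma-separated form is accepted, which is why a format meant to be "JSON-compatible" cannot reuse OCaml's list delimiter.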
I updated all the .json files with the latest version of Thierry's hooglebackend software.
What should we do now?
The indices certainly look plausible. The next step is for Hoogle to consume that output, generate its internal data types, and build an index. If one of you wants to do it, I can point you at where it should go. If it's me, it will probably be a few weeks before I get to it.
I'm not proficient in Haskell, but if there is some testing needed at some point, to run a few queries in the Hoogle for OCaml, just ask me. I am eager to test it. :)
@ndmitchell please don't forget this issue, thanks! :^)
Neil, this is your chance to become a hero for the OCaml community too (I guess you are already a semi god in the Haskell one).
Neil, look what I am currently using, out of complete despair: https://github.com/UnixJunkie/hoogle_for_ocaml/blob/master/hoogle_for_ocaml.sh Of course I want the real thing too, and I hope that one day I will be able to approach the legendary productivity of Haskell programmers.
I saw the json output here, and thought maybe instead of writing something to parse the json, the hooglebackend can be modified to emit hoogle compatible docs, for example, instead.
My WIP is here, which is a fork of the original hooglebackend, modified to emit haddock compatible docs instead of json. It's incomplete and what it can emit thus far looks like this. I'm hoping this can be more easily consumed by hoogle.
I have 2 questions:
OCaml writes int maybe, whereas Haskell uses maybe int; should the output be normalised, or can hoogle handle both cases? (If it has to be normalised, then the frontend might have to handle normalising an OCaml user's query as well.)
P.S. I tried indexing the still-broken output with ~/.local/bin/hoogle generate --local=hooglebackend/, and it gives some response.
$ ~/.local/bin/hoogle "a"
author :: driver_t -> string
package Ao
driver_author :: driver_t -> string
driver_name :: driver_t -> string
driver_preferred_byte_format :: driver_t -> byte_format_t
driver_short_name :: driver_t -> string
get_default_driver :: () -> driver_t
name :: driver_t -> string
play :: t -> string -> ()
preferred_byte_format :: driver_t -> byte_format_t
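On the normalisation question above (int maybe versus maybe int), here is a rough sketch of what rewriting OCaml's postfix type application into prefix form could look like. It is an illustration only, not what hooglebackend or Hoogle actually do: the function name to_prefix is made up, and it assumes a small subset of OCaml type syntax (identifiers, type variables, arrows, tuples with *, and parenthesised arguments). Labelled arguments, polymorphic variants and object types are not handled.

```python
import re

# Tokens: arrow, punctuation, or an identifier (optionally a 'a type variable,
# optionally module-qualified such as Hashtbl.t).
TOKEN = re.compile(r"->|[()*,]|'?[A-Za-z_][A-Za-z0-9_.']*")


def to_prefix(ocaml_type):
    """Rewrite postfix type application (int list) as prefix (list int)."""
    toks = TOKEN.findall(ocaml_type)
    pos = 0

    def peek():
        return toks[pos] if pos < len(toks) else None

    def take():
        nonlocal pos
        tok = toks[pos]
        pos += 1
        return tok

    def arrow():                      # arrow ::= tuple ['->' arrow]
        left = tuple_()
        if peek() == "->":
            take()
            return ("arrow", left, arrow())
        return left

    def tuple_():                     # tuple ::= app {'*' app}
        items = [app()]
        while peek() == "*":
            take()
            items.append(app())
        return items[0] if len(items) == 1 else ("tuple", items)

    def app():                        # app ::= atom {constructor}
        if peek() == "(":
            take()
            args = [arrow()]
            while peek() == ",":
                take()
                args.append(arrow())
            take()                    # closing ')'
            # (a, b) ctor is a constructor applied to several arguments
            node = ("con", take(), args) if len(args) > 1 else args[0]
        else:
            node = ("con", take(), [])
        # every following identifier is applied to everything before it
        while peek() is not None and peek() not in ("->", "*", ",", ")", "("):
            node = ("con", take(), [node])
        return node

    def show(node, prec=0):
        # prec 0: top level, 1: left of an arrow, 2: constructor argument
        if node[0] == "arrow":
            s = f"{show(node[1], 1)} -> {show(node[2], 0)}"
            return f"({s})" if prec > 0 else s
        if node[0] == "tuple":
            return "(" + ", ".join(show(n) for n in node[1]) + ")"
        name, args = node[1].lstrip("'"), node[2]
        if not args:
            return name
        s = " ".join([name] + [show(a, 2) for a in args])
        return f"({s})" if prec == 2 else s

    return show(arrow())


print(to_prefix("int maybe"))                # maybe int
print(to_prefix("(string, int) Hashtbl.t"))  # Hashtbl.t string int
```

If the index were normalised this way, the same rewriting would presumably have to be applied to a user's query before searching, which is the frontend concern raised in the question.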
We have needed this tool for 10 years...