purescript / spago

🍝 PureScript package manager and build tool

Add searchbox to generated documentation #89

Closed · f-f closed this issue 5 years ago

f-f commented 5 years ago

We should be able to generate documentation for all the libraries in the package set (and/or just the libraries used in the project), in a similar way to what stack haddock does for Haskell projects.

There are multiple use-cases for this:

How to achieve this:

justinwoo commented 5 years ago

Going through externs is probably a lot of extra work. Easier to just scrape the docs output.

If you use "markdown" output, it doesn't give you links:

Markdown:

#### `toUnfoldable`

``` purescript
toUnfoldable :: forall f. Unfoldable f => List ~> f
```

Convert a list into any unfoldable structure.

Running time: `O(n)`


HTML: (note the anchor links)

````html
<div class="decl" id="v:toUnfoldable">
  <h3 class="decl__title clearfix">
    <a class="decl__anchor" href="#v:toUnfoldable">#</a
    ><span>toUnfoldable</span>
  </h3>
  <div class="decl__body">
    <pre class="decl__signature">
        <code>

            <a href="Data.List.html#v:toUnfoldable" title="Data.List.toUnfoldable">
                <span class="ident">toUnfoldable</span>
            </a>

            <span class="syntax">::</span>
            <span class="keyword">forall</span> f<span class="syntax">.</span>

            <a href="Data.Unfoldable.html#t:Unfoldable" title="Data.Unfoldable.Unfoldable">
                <span class="ctor">Unfoldable</span>
            </a>

            f <span class="syntax">=&gt;</span>

            <a href="Data.List.Types.html#t:List" title="Data.List.Types.List">
                <span class="ctor">List</span>
            </a>

            <a href="Data.NaturalTransformation.html#t:type (~&gt;)" title="Data.NaturalTransformation.type (~&gt;)">
                <span class="ident">~&gt;</span>
            </a>

            f
        </code>
    </pre>
    <p>Convert a list into any unfoldable structure.</p>
    <p>Running time: <code>O(n)</code></p>
  </div>
</div>
````
justinwoo commented 5 years ago

I'd just like some suggestions on what people use to search and browse HTML files, though, with directory-wide searching instead of just in-page searching

f-f commented 5 years ago

This is the implementation of Pursuit's search: it imports the PureScript compiler to parse the source of the packages and weighs the tokens to rank them in the results. To avoid reimplementing that stuff, it would be helpful if purs docs could output this information behind some flag, which would make the search much easier to implement.

hdgarrood commented 5 years ago

Pursuit doesn’t do any parsing; the information is all available via the JSON which has been generated by purs publish and uploaded to pursuit.

hdgarrood commented 5 years ago

By the way I’m interested in supporting use cases like this in the compiler: see also https://github.com/purescript/purescript/pull/3528#issuecomment-460052868

f-f commented 5 years ago

Thanks @hdgarrood, that would be indeed wonderful! 👏

It looks like with a JSON export for Module we'd have all the data Pursuit gets from purs publish, so it would be possible to build the same SearchIndex as there.

The only thing missing at this point would be a frontend way to parse the input query. For the MVP (Minimum Viable Pursuit, clearly) I wanted to try shipping purescript-cst through GHCJS. But since that's going to be integrated in the compiler (doing weird stuff for an MVP is ok, but I don't want to ship the whole compiler over GHCJS), I guess the next best approach would be to write a small parser for signatures in PureScript. I wonder if someone has tried writing a PureScript parser in PureScript?

hdgarrood commented 5 years ago

They have indeed :) https://github.com/purescript/purescript-in-purescript was an effort to make the compiler self-hosting a while ago, but it was put on hold because of performance issues.

f-f commented 5 years ago

Cool, thanks for the pointer. I had a vague memory of this, just didn't look in the right place :)

It looks like it would be possible to port some of the parsing code (though the project is quite outdated).

f-f commented 5 years ago

@hdgarrood I think the compiler's feature that would allow this is now being tracked in https://github.com/purescript/purescript/issues/3503 right?

hdgarrood commented 5 years ago

I guess so, since there isn't a dedicated issue for it. We probably should have a dedicated issue for it, though.

hdgarrood commented 5 years ago

Actually no, sorry, I don't think a dedicated issue makes sense, because the design I have in mind currently for using externs files while generating docs is very closely tied to this hypothetical new --codegen docs option.

f-f commented 5 years ago

Wonderful, thanks!

f-f commented 5 years ago

Now that #127 went in, let's repurpose this issue to focus on the search functionality, since it's the only thing missing for this

f-f commented 5 years ago

@hdgarrood I was wondering if I could help in any way with purescript/purescript#3503?

hdgarrood commented 5 years ago

That would be great! I think the first step should be to implement a docs target for the --codegen option as described in https://github.com/purescript/purescript/issues/3503#issuecomment-460053455. Having not yet looked into that in much detail, I'm unable to give any more useful hints, but if you get stuck or confused by anything please feel free to ask; the best way is probably commenting on that issue.

f-f commented 5 years ago

Update: 0.13 is out, and it contains the upstream fix needed for this to work (thanks @hdgarrood, and props on the release! 🎉), so here's a small recap of what we're missing for this issue.

The goal is to make the docs produced by spago docs searchable. A way we could do it would be:

klntsky commented 5 years ago

I checked the code, and from what I saw there I conclude that the easiest way is to add purescript as a dependency and reuse the Module type and the asModule parser, and also to start depending on pursuit to reuse its SearchIndex. Another option is to copy-paste the required definitions, obviously.

Regarding the interface, I think that a CLI is a must-have.

The obvious problem will then be speed: when I was working on pursuit, its index rebuild was taking quite a lot of time. I think that building the index once and dumping the Trie to a file somehow may be faster than reparsing docs.json files on each search invocation.

Regarding the search function in the web interface, converting a Trie to JS will probably result in a bundle that is too large. It may be possible to use files on the system as Trie nodes, so that new parts of the index load into the browser on demand as the user types.
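
For concreteness, here's a minimal sketch of the kind of prefix trie being discussed; all names are invented for illustration and not taken from pursuit or any existing package. Values live at every node, and an as-you-type query collects everything below the node its prefix reaches:

```purescript
module Docs.Search.Trie where

import Prelude

import Data.Foldable (foldMap)
import Data.List (List(..), (:))
import Data.Map (Map)
import Data.Map as Map
import Data.Maybe (Maybe(..), fromMaybe, maybe)

-- A trie: values stored at each node, children keyed by `k`
-- (e.g. the characters of a declaration name).
data Trie k v = Trie (Array v) (Map k (Trie k v))

emptyTrie :: forall k v. Trie k v
emptyTrie = Trie [] Map.empty

-- Insert a value under a path.
insert :: forall k v. Ord k => List k -> v -> Trie k v -> Trie k v
insert Nil v (Trie vs children) = Trie (vs <> [ v ]) children
insert (k : ks) v (Trie vs children) =
  Trie vs (Map.alter (Just <<< insert ks v <<< fromMaybe emptyTrie) k children)

-- All values stored at or below the node reached by a prefix:
-- this is what an as-you-type search would call.
query :: forall k v. Ord k => List k -> Trie k v -> Array v
query Nil trie = collect trie
query (k : ks) (Trie _ children) = maybe [] (query ks) (Map.lookup k children)

collect :: forall k v. Trie k v -> Array v
collect (Trie vs children) = vs <> foldMap collect (Map.values children)
```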

klntsky commented 5 years ago

Apparently, pursuit depends on an old version of purescript, so maybe it isn't a good idea to copy its approach. The compiler's parser is completely new, but pursuit is still using the old one to parse types.

f-f commented 5 years ago

@klntsky thanks for looking into this!

> I checked the code, and from what I saw there I conclude that the easiest way is to add purescript as a dependency and reuse the Module type and the asModule parser, and also to start depending on pursuit to reuse its SearchIndex. Another option is to copy-paste the required definitions, obviously.

As far as I understand, you'd like to create the index on the Haskell side. The problem with that is that we'd need to duplicate types (SearchIndex, etc.) on the PureScript side. I don't have a strong opinion on this, but I'd rather not depend on the compiler and/or Pursuit anyways, so we could copy-paste the code or just port it to PureScript, so that we have one codebase handling all the search-related stuff.

Assuming we go for the all-PureScript route, this is what I think should happen when running spago docs:

> Regarding the interface, I think that a CLI is a must-have.

I think it's very nice to have, but I'd say the biggest priority here is to replicate the search behavior of Pursuit (in fact, that's also the title of this issue)

> The obvious problem will then be speed: when I was working on pursuit, its index rebuild was taking quite a lot of time. I think that building the index once and dumping the Trie to a file somehow may be faster than reparsing docs.json files on each search invocation.

As I described above, parsing the docs.json files and generating the index is a one-off task, and I think it's not a problem if it's slow, since running purs docs takes a while anyways. So yes, I think we should dump the search index to a file

> Regarding the search function in the web interface, converting a Trie to JS will probably result in a bundle that is too large. It may be possible to use files on the system as Trie nodes, so that new parts of the index load into the browser on demand as the user types.

I'd say we could try going for a single big bundle in the beginning, benchmark how bad it is, and then eventually split it into parts to be loaded on demand?

klntsky commented 5 years ago

Do we want to generate the index for all packages in a set, or only for the listed dependencies? If the former, maybe it is better to generate it once and distribute it with the package set somehow? AFAIK stack+hoogle use a similar approach. UPD: not true actually

f-f commented 5 years ago

@klntsky since we'd generate the SearchIndex from the output of the compiler, that would include only the listed dependencies. Even if we wanted to generate the index for all packages in the set, we still could not cache it, since users are allowed to have local packages: distributing a precomputed index wouldn't work, as the index would have to be regenerated locally anyways.

Btw I think stack distributes a package version index to speed up downloads (and we do that too), but I don't know anything about them distributing a code search index. Do you have any pointers about that?

klntsky commented 5 years ago

> I don't know anything about them distributing a code search index. Do you have any pointers about that?

Nevermind, I checked it now and realized that I was wrong.

Also, it looks like the only PureScript implementation of tries I could find is not suitable for storing the search index.

klntsky commented 5 years ago

I'm working on it here: https://github.com/klntsky/spago-search

klntsky commented 5 years ago

As of now it looks like this: [screenshot of the search UI]

It is clear that some UI design decisions should be made.

That "PureScript API documentation" title is too long: the search field does not fit. Maybe we can replace it with something shorter?

Also, since the number of results changes as the user types, the main contents of the page jump up and down. There are some ways to solve this:

f-f commented 5 years ago

@klntsky great work so far! 👏

That "PureScript API documentation" title is too long: the search field does not fit. Maybe we can replace it with something shorter?

How about "$ProjectName Docs"?

In any case, the searchbar going onto a new line is perfectly fine; Pursuit does it too on mobile:

[screenshot: Pursuit's mobile header]

> Also, since the number of results changes as the user types, the main contents of the page jump up and down

I pictured having the results in the same place as they are right now in Pursuit (i.e. let's literally replace the contents of the page with them):

[screenshot: Pursuit search results]

I'm not sure how they'd jump, since you're just typing in the header, so that shouldn't move?

klntsky commented 5 years ago

> How about "$ProjectName Docs"?

I figured out that it is possible to shrink the width of the container of that "Index" link, at the same time making the search field narrower. In practice, most search queries will fit into the field, and it looks way better than with a line break.

[screenshot of the narrower header layout]

> let's literally replace the contents of the page with them

OK. I think we should also restore the main contents when the search field is empty.

Another idea, unrelated to UI: we can also populate the trie with the camel-cased parts of names, i.e. for querySelectorAll we may store the same index entry under three paths: queryselectorall, selector and all (the latter two should have lower priority).
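
Here's a minimal sketch of how those index paths could be derived; module and function names are invented for illustration, and it assumes Data.Char.Unicode from purescript-unicode:

```purescript
module Docs.Search.CamelCase where

import Prelude

import Data.Array (drop, null, snoc, unsnoc)
import Data.Char.Unicode (isUpper)
import Data.Foldable (foldl)
import Data.Maybe (Maybe(..))
import Data.String.CodeUnits (fromCharArray, toCharArray)
import Data.String.Common (toLower)

-- Split an identifier at uppercase boundaries:
-- splitCamelCase "querySelectorAll" == ["query", "Selector", "All"]
splitCamelCase :: String -> Array String
splitCamelCase str = map fromCharArray (foldl step [] (toCharArray str))
  where
  step chunks c
    | isUpper c || null chunks = snoc chunks [ c ]
    | otherwise = case unsnoc chunks of
        Just { init, last } -> snoc init (snoc last c)
        Nothing -> [ [ c ] ]

-- Paths under which a declaration is stored: the full lowercased name,
-- plus each camel-case part after the first (the full name already
-- covers the leading part as a prefix). The parts would get lower priority.
-- indexPaths "querySelectorAll" == ["queryselectorall", "selector", "all"]
indexPaths :: String -> Array String
indexPaths name = map toLower ([ name ] <> drop 1 (splitCamelCase name))
```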

klntsky commented 5 years ago

The sad thing is that we seemingly can't split the index to load it on demand and have type search at the same time, because type searching requires comparing the query against all available types (if we go the pursuit way).

Splitting the index is absolutely required: as of now it is 6.4 MB (for the modules used by the project itself), and loading it freezes the page for ~8s on my machine.

f-f commented 5 years ago

@klntsky I think we can detect when type search is being used, right? Then we could split the index and load all the parts when we get a type query. If it's split in parts, it should also be possible to keep the page responsive (and show a loading spinner) while we load the parts

klntsky commented 5 years ago

> Then we could split the index and load all the parts when we get a type query.

This is very straightforward, but I'd say I dislike this approach. We can't improve the total time to load the index by loading it in parts: the only thing we can do is suppress the "tab is not responding" message.

Anyway, last night I brainstormed the problem and found a solution.

We can split all types in the index by their shapes.

A "type shape" is a lossy encoding of types:

```purescript
type TypeShape = List ShapeChunk

data ShapeChunk
  = PVar
  | PFun
  | PApp
  | PForAll Int
  | PRow Int
```

For example, we encode `forall a. (forall h. ST h (STArray h a)) -> Array a` as `[ PForAll 1, PFun, PForAll 1, PApp, PApp, PVar, PVar, PApp, PApp, PVar, PVar, PVar, PApp, PVar, PVar ]` (using Polish notation).
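
To make the encoding concrete, here's a minimal sketch of how such shapes could be computed. The `Ty` AST below is invented for illustration (the real docs.json types are much richer), and the `TypeShape`/`ShapeChunk` definitions are restated from above. Note how variables and constructors both collapse to `PVar`; that is exactly where the encoding is lossy:

```purescript
module TypeShapeSketch where

import Prelude

import Data.Array (length) as Array
import Data.List (List(..), (:))

type TypeShape = List ShapeChunk  -- as defined above

data ShapeChunk = PVar | PFun | PApp | PForAll Int | PRow Int

-- A simplified type AST, invented for this sketch.
data Ty
  = TVar String                 -- type variables and constructors alike
  | TApp Ty Ty                  -- type application
  | TFun Ty Ty                  -- the function arrow
  | TForAll (Array String) Ty   -- quantification over n variables

-- Polish-notation encoding: the head chunk first, then the operands.
shapeOfType :: Ty -> TypeShape
shapeOfType = case _ of
  TVar _       -> PVar : Nil
  TApp f x     -> PApp : (shapeOfType f <> shapeOfType x)
  TFun a b     -> PFun : (shapeOfType a <> shapeOfType b)
  TForAll vs t -> PForAll (Array.length vs) : shapeOfType t
```

Running `shapeOfType` on the example type above yields exactly the fifteen chunks listed.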

"Type shapes" preserve the crucial info about the type, but don't store the unnecessary details. They can be hashed and used as keys in a Map, which can be loaded on demand.

Unfortunately, by using them we give up the ability to find subtypes (e.g. an `a -> (b -> b)` type will not be matched by an `a -> b` query). But I believe this is acceptable, since most of these subtype matches are usually irrelevant.

I implemented the search, and IMO the quality of the results is OK (as of now they are only matched by shape, and not sorted by distance as is done in pursuit, but I'll implement that sorting too).

[screenshot of type search results]

klntsky commented 5 years ago
  1. If we want results sorted by popularity (i.e. by the number of reverse dependencies of the related package), it'd be great if spago allowed getting some info from the package set. In particular, a JSON file of the following format, put at generated-docs/index/index.json, would be of great help:

```purescript
{ packages :: Array { name :: String
                    , modules :: Array String
                    , dependencies :: Array String
                    , repository :: String
                    , version :: String -- AFAIK spago only deals with git revisions, right?
                    }
, modules :: Array String -- list of user-defined modules from src/ and test/
}
```

The repository field would allow making package names clickable.

This info is also needed to map modules to package names: this is a temporary solution I'd like to replace with something less hacky (see the sketch after this list).

  2. Do we really want to call bundle-app and run on the client side? IMO distributing a precompiled version of both the app and the index builder is better for the UX. Besides, building on the client will introduce more maintenance burden.

  3. The directory structure will look like this:

```
generated-docs/
  index/
    index.json
    app.js
    declarations/
      1.js
      2.js
      ...
      n.js
    types/
```

The declarations and types directories will contain parts of the corresponding search indices.

File names in generated-docs/index/types will be in a format defined by this function. Each of these files will contain all types of the corresponding "type shape".
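
As an illustration of item 1, here's a hypothetical helper (names invented, the record narrowed to the two fields needed) that builds the module-to-package mapping from the proposed index.json shape:

```purescript
module PackageIndexSketch where

import Prelude

import Data.Map (Map)
import Data.Map as Map
import Data.Tuple (Tuple(..))

-- Narrowed to the two fields this sketch needs.
type PackageInfo = { name :: String, modules :: Array String }

-- Build the module -> package lookup once, when loading index.json.
moduleToPackage :: Array PackageInfo -> Map String String
moduleToPackage packages =
  Map.fromFoldable do
    { name, modules } <- packages
    moduleName <- modules
    pure (Tuple moduleName name)
```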

f-f commented 5 years ago

@klntsky

  1. Great idea! Should we generate that file from Haskell or PureScript?
  2. Agreed. We'd also lose control over very sensitive things like the purs version used to compile. So let's instead build it as part of the build here, and ship all the artifacts as templates in the binary (so the build works offline too). The problem on this side of the tradeoff is that bootstrapping spago's build will become slightly more difficult: one will either need a previous version of spago to build the artifacts, or do a build with empty files first, then use that build to produce the correct ones, and build again
  3. Sounds good. We should probably collect all these docs in the spago-search README or in the "internals" document
  4. Would you like to move the spago-search project under the spacchetti org? You'd retain admin etc etc, it's just to make it more "official" and keep things tidy. Actually given the build setup from (2) I'd also be fine merging all the code in this repo (I originally proposed having a separate repo because of the build setup)
  5. I just realized we should somehow include Prim modules too (since we don't have docs.json for them), but we can worry about them at a later stage; let's get this merged first

klntsky commented 5 years ago

> Great idea! Should we generate that file from Haskell or PureScript?

From PS. Initially I thought that this should be done on the spago side, because I completely forgot that there are bower.json files under the .spago dir.

> Sounds good. We should probably collect all these docs in the spago-search README or in the "internals" document

OK, I'll have a "documentation day" after I'm done.

> Would you like to move the spago-search project under the spacchetti org?

Yes, definitely.

> Actually given the build setup from (2) I'd also be fine merging all the code in this repo (I originally proposed having a separate repo because of the build setup)

I think that the artifacts (a nodejs script and a webapp) can just be downloaded at runtime from github releases. In this case they should also be pinned by hash. This will make offline builds possible while not making spago self-dependent.

> I just realized we should somehow include Prim modules too (since we don't have docs.json for them), but we can worry about them at a later stage; let's get this merged first

Yeah, pursuit has the luxury of extracting the Prim definitions from the compiler. We may write a simple app that generates these jsons later.

Also, I think we may want to put the index dir inside generated-docs/html, so that setting up a local "pursuit" with a static-only web server is as simple as making generated-docs/html the webroot. If we have generated-docs/html and generated-docs/index, the webroot has to be generated-docs, and all doc URLs will begin with html/, which is slightly less elegant.

f-f commented 5 years ago

@klntsky sounds great! I invited you to the spacchetti org, so after you accept the invite you should be able to move the repo

klntsky commented 5 years ago

Maybe it's time to choose a better name for that search thing, @f-f? I could make it work without spago (by adding a CLI interface to provide the paths to search for data in), so putting spago in the name may be confusing. What do you think of purescript-docs-search as a name for the repo?

f-f commented 5 years ago

> What do you think of purescript-docs-search as a name for the repo?

@klntsky sounds great!

klntsky commented 5 years ago

OK. One more question then. If you agree to distribute the precompiled app, then it is no longer a requirement to avoid JS dependencies, right? I was thinking that if the directories to search for data in could be passed through the CLI, then introducing glob syntax for these arguments is a natural thing to do. More specifically, a call to the index builder could look like this:

```
./index-builder.js --docs-files './output/**/docs.json' --generated-docs-files './generated-docs/html/*.html' --bower-files './.spago/**/bower.json'
```

This requires adding glob as a dependency.

hdgarrood commented 5 years ago

The docs.json files don’t include re-exports currently, by the way (this is as a result of various implementation details). Perhaps I should have made this clearer before, but the intended way of consuming them is to use the collectDocs function from the compiler library.

klntsky commented 5 years ago

> The docs.json files don't include re-exports currently

I know; this is absolutely fine for our purposes (it's even better this way, since there's no need to deal with duplicate index entries).

> the intended way of consuming them is to use the collectDocs function from the compiler library.

How stable is the format of these files? Should I expect my ad hoc docs.json decoder to break eventually?

hdgarrood commented 5 years ago

Backwards-incompatible changes are pretty rare (they're a pain for us because they necessitate regenerating all of Pursuit's docs) but forwards-incompatible changes are not that uncommon. These files also aren't part of the compiler's public API, so e.g. they could potentially change in a backwards-incompatible way without a major compiler release.

f-f commented 5 years ago

I think it's OK to depend on the format of the docs.json files if they are mostly backwards-compatible: at generation time we know the version of the compiler we're dealing with, so we can just have a big switch on the version in the app
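
For illustration, a minimal sketch of what that switch could look like; all names here are invented, and the version-to-format mapping is only an example:

```purescript
module DocsFormatSwitch where

import Prelude

-- The docs.json formats we know how to decode (invented names).
data DocsFormat
  = Docs_0_13  -- purs >= 0.13
  | DocsLegacy -- anything older

-- The generator records the purs version it ran against, so the app
-- can branch once instead of guessing the schema from the JSON itself.
formatForVersion :: { major :: Int, minor :: Int } -> DocsFormat
formatForVersion { major, minor }
  | major > 0 || (major == 0 && minor >= 13) = Docs_0_13
  | otherwise = DocsLegacy
```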

> If you agree to distribute the precompiled app, then it is no longer a requirement to avoid JS dependencies, right?

@klntsky yep, I'm fine with adding JS deps