ndmitchell / hoogle

Haskell API search engine
http://hoogle.haskell.org/
Other
738 stars 134 forks source link

Type search #263

Closed matt-noonan closed 5 years ago

matt-noonan commented 5 years ago

This is a first go at adding type search on top of the fingerprint filtering. The query (or some simple variations of it) is checked against the proposed answer signature by attempting to unify the answer with the query, keeping type variables in the query fixed. The amount of "unification work" needed gives a cost, that is later combined with the fingerprint cost and a heuristic about the number of abstract constraints.

I made a list of test cases (see todo3.txt) that did not previously work, to help mark progress as type search is added. Some of the test cases come from github issues, others are things I came across on stackoverflow, twitter, etc. Even this simple type search seems to improve the results quite a bit, compared to fingerprint-only search.

My editor setup made some formatting changes via stylish-haskell, adding a bit of noise to this PR. Please do let me know if you'd rather I undid those changes.

ndmitchell commented 5 years ago

Thanks! Awesome work!

Regarding stylish Haskell, I'm OK with it being applied to the extensions/imports, but anywhere that gets edited such that a future diff will effect multiple lines as indentation changes seems non-ideal. If you add a config file at the root of the repo it should configure that with those settings disabled. Can you give that a go? If you do that, a simple resave should put the bits I dislike back (don't worry about the history, as long as its fine in the end I'm fine).

What's the performance if you generate a full database of all of Stackage with hoogle gen? How long does it take to search for a -> [(a,b)] -> Maybe b or similar? If it's not < 0.1, where does profiling say the time is being spent?

matt-noonan commented 5 years ago

Thanks! Awesome work!

Regarding stylish Haskell, I'm OK with it being applied to the extensions/imports, but anywhere that gets edited such that a future diff will effect multiple lines as indentation changes seems non-ideal. If you add a config file at the root of the repo it should configure that with those settings disabled. Can you give that a go? If you do that, a simple resave should put the bits I dislike back (don't worry about the history, as long as its fine in the end I'm fine).

This should be better now.

What's the performance if you generate a full database of all of Stackage with hoogle gen? How long does it take to search for a -> [(a,b)] -> Maybe b or similar? If it's not < 0.1, where does profiling say the time is being spent?

The baseline on my machine is that a command-line query for a -> [(a,b)] -> b with master takes 0.12s.

On this branch, that same query takes 0.47s. This is roughly a worst-case result, because there are not that many hits so we end up trying every variation on the query. If I change it so that we only search for the query as written and do not try to add Maybe etc, then the query takes a more reasonable 0.18s.

As for where the actual search time is going, it looks like about 2/3 of the search time goes into fingerprint matching via bestByFingerprint, and the other 1/3 is due to the more detailed type-matching.

matt-noonan commented 5 years ago

As for where the actual search time is going, it looks like about 2/3 of the search time goes into fingerprint matching via bestByFingerprint, and the other 1/3 is due to the more detailed type-matching.

Also, the type-matching spends most of its time just creating fresh NameInfos in preparation for running union-find, so unfortunately there isn't much low-hanging fruit there for optimization as far as I could see.

ndmitchell commented 5 years ago

Thanks for the Stylish Haskell - looks fine now.

I can live with 0.5s in the short term, although 0.1s would be ideal for the website - but better results slower is probably a preference from what we have now. If only 1/3 of the time goes into type matching then I'm not sure I'd worry too much just yet. Does the fingerprint actually care about permuted arguments? It shouldn't - and then you can only rerun the fingerprint for the Maybe insertion rather than each time around. I'd hoped you could even run the fingerprint without Maybe insertion and use the results with it, since the difference should be not huge.

matt-noonan commented 5 years ago

Does the fingerprint actually care about permuted arguments? It shouldn't - and then you can only rerun the fingerprint for the Maybe insertion rather than each time around. I'd hoped you could even run the fingerprint without Maybe insertion and use the results with it, since the difference should be not huge.

Oh, that's a good point. I changed it to only do the fingerprint scan once per group of variations. That gets the number of scans down to 2 from 4. It would be great to just do the fingerprint scan once, but I'm not getting the lookup hit for a -> [(a,b)] -> b that way. I'll have to investigate further to see if some tweak can help find it.

Anyway, with 2 fingerprint scans the time goes down to just under 0.4s. Not quite as good as I had expected, but not bad either.

ndmitchell commented 5 years ago

Lookup is my goto example, but it's somewhat reminiscent of problems of old (find structure manipulation), rather than modern problems (large APIs), so don't consider it essential.

I'm happy to merge at this point and do follow ups in separate PRs. Does that work for you?

matt-noonan commented 5 years ago

Yes, sounds good!

ndmitchell commented 5 years ago

So it doesn't get forgotten, I raised https://github.com/ndmitchell/hoogle/issues/265 to lift these into test cases at some point

ndmitchell commented 5 years ago

So it doesn't get forgotten, I raised https://github.com/ndmitchell/hoogle/issues/265 to lift these into test cases at some point