metacpan / metacpan-web

Web interface for MetaCPAN
http://metacpan.org

Suboptimal Searches (Dev, Trial and any unindexed releases not to be included here) #1231

Open oalders opened 10 years ago

oalders commented 10 years ago

I'm opening this issue as a place to collect searches which could be improved. Individual searches can be broken into issues as they are tackled, but this is essentially a place to get the conversation started.

oalders commented 10 years ago

I'm looking for File::Temp.

https://metacpan.org/search?q=tmpfile

(result 8)

vs

http://search.cpan.org/search?query=tmpfile&mode=all

(result 2)

ribasushi commented 10 years ago

Thanks for kicking this off

http://search.cpan.org/search?query=mop&mode=all vs https://metacpan.org/search?q=mop (known issue, but the most painful manifestation of it)

http://search.cpan.org/search?query=dbix+helper&mode=all vs https://metacpan.org/search?q=dbix+helper (note how the only thing coming up is the deprecated one)

oalders commented 10 years ago

@ribasushi I think the main issue with the dbix+helper search is that the MetaCPAN search results are collapsed. If you follow through on the link for more results you get https://metacpan.org/search?q=distribution:DBIx-Class-Helpers+dbix%20helper which is much more helpful. I'm not invalidating your comment. I'm just trying to work through what we're seeing. Obviously showing a deprecated module as the first result is not helpful. We should look at tweaking the collapsed search in this kind of case.

One other problem may be that the search is for "helper" and not "helpers". The collapsed results for "helpers" look better: https://metacpan.org/search?q=dbix+helpers

ribasushi commented 10 years ago

@oalders Does ES provide a way to calculate a "churn coefficient"? In other words, can it rank the entries by "most changes since" and thus give you a sane collapse criterion?

dagolden commented 10 years ago

You need a way to specify a search by module name -- you effectively have this for the search box autocomplete, but something like module:MooseX ought to give all dists with MooseX in the name rather than a full-text search. SCO has had this feature forever and it's a major gap in MetaCPAN.

ranguard commented 10 years ago

More smarts on start matching...

I want to find all Plack::Middleware::* modules that have 'time'

https://metacpan.org/search?q=plack%3A%3Amiddleware+time

This might be a new feature rather than a suboptimal search but thought I'd mention it here

oalders commented 10 years ago

What @dagolden is proposing is something we can do relatively easily, so I think we should make that a priority. We'd just need to sort out the syntax. The single colon is part of lucene's search syntax. Also we just need to advertise that you can use lucene's syntax to constrain searches. A good example is https://metacpan.org/search?q=plack+author%3ADAGOLDEN
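As a rough illustration of constraining a search with Lucene's `field:value` syntax, the sketch below builds a search URL like the one above. The helper name `search_url` is made up for this example; only the `q` parameter and the `author:` constraint come from the URL in this comment.

```python
from urllib.parse import urlencode

SEARCH_URL = "https://metacpan.org/search"

def search_url(terms, **fields):
    """Build a search URL from free-text terms plus field:value constraints.

    `fields` is a hypothetical convenience: each keyword argument becomes
    a `name:value` clause appended to the query string.
    """
    parts = [terms] + [f"{name}:{value}" for name, value in fields.items()]
    query = " ".join(part for part in parts if part)
    # urlencode escapes ":" as %3A and spaces as "+"
    return SEARCH_URL + "?" + urlencode({"q": query})

print(search_url("plack", author="DAGOLDEN"))
```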

tsibley commented 10 years ago

module.name:MooseX is accepted, but I don't get why it only returns "MooseX" and not any of the subclasses. I thought term queries/filters were contains not equals?

rwstauner commented 10 years ago

Putting field:val in the search box ends up doing a query_string search (that's what recognizes the operators), not a term filter.

To clarify term filters, they are for exact values (like not_analyzed strings).

The reference docs do use the word "contain" (which isn't very clear) but they also say "not_analyzed":

Matches documents that have fields that contain a term (not analyzed).

which means it won't be tokenized (hence the exact match requirement).

The book ("definitive guide") is slightly more specific:

The term filter is used to filter by exact values, be they numbers, dates, booleans, or not_analyzed exact value string fields.

Also note that the "term" operator doesn't analyze the input, so for example {"filter": {"term": {"file.module.name.analyzed": "MooseX"}}} returns no results, but {"filter": {"term": {"file.module.name.analyzed": "moosex"}}} returns several relevant matches.

However you can't see that difference using the search box because of the query_string query (which does analyze the input). So, since we have several "fields" for module name, using an analyzed field can get you what you want: module.name.analyzed:MooseX
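To make the difference concrete, here is a toy model in Python. It is not Elasticsearch: the analyzer (split on non-alphanumerics, lowercase) and the module list are assumptions chosen only to mimic the behaviour described above, and the real analyzer behind `module.name.analyzed` may differ.

```python
import re

def analyze(text):
    """Toy analyzer (an assumption): split on non-alphanumerics, lowercase."""
    return [tok.lower() for tok in re.split(r"[^A-Za-z0-9]+", text) if tok]

# Hypothetical sample data for illustration only.
MODULES = ["MooseX", "MooseX::Types", "MooseX::Getopt", "Moose"]

# Analyzed field: every token of the name points back at the module.
analyzed_index = {}
for name in MODULES:
    for token in analyze(name):
        analyzed_index.setdefault(token, set()).add(name)

# not_analyzed field: the whole value is stored as one unmodified term.
exact_index = {name: {name} for name in MODULES}

def term_lookup(index, value):
    """Like a term filter: the input is not analyzed, so it must equal a stored term."""
    return index.get(value, set())

def analyzed_lookup(index, value):
    """Like query_string on an analyzed field: the input is analyzed first."""
    hits = set()
    for token in analyze(value):
        hits |= index.get(token, set())
    return hits

print(term_lookup(exact_index, "MooseX"))         # exact value only
print(term_lookup(analyzed_index, "MooseX"))      # empty: stored terms are lowercase
print(term_lookup(analyzed_index, "moosex"))      # matches the MooseX::* names too
print(analyzed_lookup(analyzed_index, "MooseX"))  # input lowercased before lookup
```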

dagolden commented 10 years ago

This is not user friendly.

Instead of making us jump through hoops to know, understand and remember your data model and search engine behaviors, why not just intercept the search box contents before it goes to Lucene and create the right search for us?

module:Foo  → match module names containing "Foo"
module:^Foo  → match module names starting with "Foo"

Or, if you don't like colon separators, do something like DDG: !module Foo
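A minimal sketch of that kind of interception, in Python. Everything here is hypothetical (the names `ModuleFilter` and `split_module_filters`, and the choice to return the filters separately from the remaining query); it only shows how `module:Foo` and `module:^Foo` could be pulled out of the search box text before the rest is handed on unchanged.

```python
import re
from typing import NamedTuple

class ModuleFilter(NamedTuple):
    mode: str  # "contains" or "starts_with"
    text: str

# Match "module:Foo" or "module:^Foo" only at the start of a whitespace-separated word.
_MODULE_RE = re.compile(r"(?<!\S)module:(\^?)(\S+)")

def split_module_filters(raw):
    """Return (filters, remaining_query) for a raw search box string."""
    filters = []

    def _take(match):
        mode = "starts_with" if match.group(1) else "contains"
        filters.append(ModuleFilter(mode, match.group(2)))
        return ""  # drop the clause from the text passed through

    rest = _MODULE_RE.sub(_take, raw)
    # Collapse the whitespace left behind by removed clauses.
    return filters, " ".join(rest.split())

print(split_module_filters("module:^Foo bar"))
print(split_module_filters("plack author:DAGOLDEN"))
```

Anything that is not a `module:` clause, such as `author:DAGOLDEN`, is left in the remaining query, so existing field syntax would keep working.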

oalders commented 10 years ago

My preference here would be to go with the colon separators because that's what people are used to. We could use some other character for stuff that people want to pass directly to ES/Lucene. Aside from the distribution search, I don't think we use this syntax at all. Nobody really seems to be aware of it, so presumably almost nobody is taking advantage of it. Also, you really need to know a fair bit about the internals to use it.

So, I'd say, let's make this as friendly as possible. If someone wants the old behaviour, they can preface the query with some syntax that doesn't get in the way.

rwstauner commented 10 years ago

I wasn't suggesting that people should know how to work that (or that it was good enough), I was just trying to clarify what Thomas was experiencing.

We actually do have some special casing for author: and dist: (and distribution:) and I agree we should add some more (like module:). Intercepting these is fairly easy and continuing to let other fields that we don't capture pass through to lucene will continue to work.

rwstauner commented 10 years ago

There is also some DDG-like operator in there, but I'm not sure how that works.

We obviously could use a page to explain what's available and how it works.

@oalders FWIW, in the search results there's a link that says "search in distribution" which just redoes the current search with an added dist:blah on it. I'm not implying that anybody knows how to use it directly, but the site itself actually does make use of it :-)

oalders commented 10 years ago

@rwstauner Yeah, that's what I meant with "Aside from the distribution search, I don't think we use this syntax at all". :)

rwstauner commented 10 years ago

Yeah, I guess so. I was looking at the next sentence and thinking you were considering not needing to keep it if it wasn't used much.

tsibley commented 10 years ago

@rwstauner Thanks for the great explanation. I was looking at the Lucene docs, which I swear mentioned something about being contains not equals, but I don't see it now. And then to make it more confusing I conflated foo:bar in a query string as being the same as "term": { "foo": "bar" }. Thanks for straightening me out!

I wrote the user-friendly version which munges "module:..." as PR #1246.

mattp- commented 10 years ago

Vanity searches for PAUSE IDs seem to return weird results for modules: https://metacpan.org/search?q=mattp Why is DDP::s returned? https://metacpan.org/search?q=data%3A%3Aprinter%3A%3Ascoped shows the proper main pod for Data::Printer::Scoped.

You can see a similar result searching for https://metacpan.org/search?q=FREW

frioux commented 10 years ago

Searching for GetOpt yields a weird, apparently unsorted set of results.

oalders commented 9 years ago

@andreeap Despite the fact that this ticket is on metacpan-web, most of the fixes here would involve a deep dive into Elasticsearch rather than front end work, so this is perfect for the scope of your OPfW time. You can pick searches from this list which interest you, create new issues for them and then link those issues back to this one so that we can track their progress.

oalders commented 9 years ago

I should note that a bunch of search-related issues can also be found here https://github.com/CPAN-API/metacpan-web/labels/group:Search

frioux commented 9 years ago

dbix::class datemethods finds nothing at all

frioux commented 9 years ago

https://metacpan.org/search?q=IO%3A%3AAsync%3A%3ATimer%3A%3APeriod should find https://metacpan.org/pod/IO::Async::Timer::Periodic, but inexplicably finds something else

its-johnt commented 9 years ago

I'm trying to find something to parse XML, so I searched for "xml". Most of the first results are from modules with last uploads circa 2000. Giving more weight to modules with more recent upload dates may be helpful.
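One way to picture that kind of weighting is an exponential decay on the relevance score. The Python sketch below is only an illustration: the one-year half-life, the field names and the sample entries are all assumptions, not anything MetaCPAN does.

```python
def recency_weighted(score, age_days, half_life_days=365.0):
    """Halve the score for every `half_life_days` since the last upload."""
    return score * 0.5 ** (age_days / half_life_days)

def rerank(results, half_life_days=365.0):
    """Sort results by recency-weighted score, best first."""
    return sorted(
        results,
        key=lambda r: recency_weighted(r["score"], r["age_days"], half_life_days),
        reverse=True,
    )

# Hypothetical entries: an old upload with a higher raw score
# and a recent upload with a lower one.
results = [
    {"name": "old-dist", "score": 3.0, "age_days": 3650},
    {"name": "new-dist", "score": 2.0, "age_days": 30},
]
print([r["name"] for r in rerank(results)])
```

A shorter half-life favours fresh uploads more aggressively; a longer one keeps stable, rarely updated modules from sinking too far.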

shlomif commented 9 years ago

From IRC:

This search - https://metacpan.org/search?q=uri - places a module from 1998 with no upvotes or reviews above URI.pm which has 71 upvotes and three 5-star reviews. Furthermore, https://metacpan.org/search?q=XSLT does not find XML::LibXSLT anywhere in the top results.

ranguard commented 9 years ago

See also:

#1373, #1372, #1253, #905, #1265

oalders commented 9 years ago

[11:27:44]  <ether> https://metacpan.org/search?q=Extutils%3A%3ADepends returns its first match as the wrong distribution
[11:27:51]  <ether> I think this may have come up before?
[11:28:00]  <ether> the indexed module should be ranked first in search results
[11:34:46]  <haarg> caps
[11:42:49]  <leont> He has comaint on it, so it doesn't trigger unauthorized
[11:44:39]  <haarg> for search we really should be ignoring case

oalders commented 9 years ago

[18:16:18]  <ether> more on search results - searching for "YAML-Tiny" results in that distribution in second place, with Tiny::YAML in #1.

oalders commented 8 years ago

[09:10:03] <kentnl> [07:16:19] https://metacpan.org/search?q=JSON&search_type=modules # I'm not sure what to say here, but for some reason, JSON::MaybeXS doesn't rank, despite having a 5-star review rating and 26 ++'s

pink-mist commented 8 years ago

If you search for either perlvar or perlrun you get a result from PodSimplify from 1996 instead of the latest perl release as first result; perl's perlvar and perlrun pages are the second result for their respective searches.

Grinnz commented 7 years ago

https://metacpan.org/search?q=overload In a search for overload, the first result is the correct overload module in core, but its link https://metacpan.org/pod/overload goes to a very unrelated module.