metacpan / metacpan-api

A free, open API for everything you want to know about CPAN
http://www.metacpan.org/

Full text searchable for resources.bugtracker.mailto and resources.bugtracker.web #238

Open idnorton opened 12 years ago

idnorton commented 12 years ago

Hi Peeps,

Please could you make the following two fields analyzed in order to permit full text searching against them:

resources.bugtracker.mailto
resources.bugtracker.web

Thanks, Ian.

clintongormley commented 12 years ago

Or create an analyzed multi-type field if we need these fields to remain not analyzed for any reason. Also, given that they're URLs/emails, we probably want the simple analyzer rather than the standard analyzer.
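A minimal sketch of what such a multi-field mapping could look like, written as a Python dict. It assumes a recent Elasticsearch release, where sub-fields are declared under `fields`; releases contemporary with this thread used a different syntax. The sub-field name `analyzed` and the helper `bugtracker_field` are hypothetical.

```python
import json

def bugtracker_field():
    """Exact-value field plus an analyzed sub-field (names are illustrative)."""
    return {
        "type": "keyword",  # kept as a single, not-analyzed value
        "fields": {
            # hypothetical sub-field name; full-text searchable via the simple analyzer
            "analyzed": {"type": "text", "analyzer": "simple"}
        },
    }

mapping = {
    "properties": {
        "resources": {
            "properties": {
                "bugtracker": {
                    "properties": {
                        "mailto": bugtracker_field(),
                        "web": bugtracker_field(),
                    }
                }
            }
        }
    }
}

print(json.dumps(mapping, indent=2))
```

Under this assumption, exact lookups would keep using `resources.bugtracker.mailto`, while full-text queries would target `resources.bugtracker.mailto.analyzed`.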

tsibley commented 11 years ago

The doc for the simple analyzer suggests that URLs/emails will still be tokenized by splitting on non-letters, which seems like it could lead to non-intuitive matching. Consider the matching behaviour if bugtracker.mailto is "cpan-bugs@example.org": searching on "cpan.org" would match even though intuitively you would not expect it to.

This is all based on doc and assumptions about matching in ES, so please correct me if I'm making an ass out of myself by assuming incorrectly. :)
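The concern can be illustrated with a small pure-Python approximation. This is not Elasticsearch itself: `simple_analyze` only mimics the documented "lowercase and split on non-letters" rule, and `any_term_matches` assumes a query that succeeds when at least one term overlaps. Both function names are hypothetical.

```python
import re

def simple_analyze(text):
    """Approximate the simple analyzer: lowercase, then split on non-letters."""
    return re.findall(r"[^\W\d_]+", text.lower())

def any_term_matches(query, field_value):
    """True if at least one query token also appears among the field's tokens."""
    return bool(set(simple_analyze(query)) & set(simple_analyze(field_value)))

mailto = "cpan-bugs@example.org"
tokens = simple_analyze(mailto)
hit = any_term_matches("cpan.org", mailto)

print(tokens)  # ['cpan', 'bugs', 'example', 'org']
print(hit)     # True
```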

clintongormley commented 11 years ago

Actually, with the standard analyzer your particular example would be tokenized as [cpan, bugs, example.org]: it keeps words separated by a period together.

There is also the UAX-URL-email tokenizer (http://www.elasticsearch.org/guide/reference/index-modules/analysis/uaxurlemail-tokenizer.html), which works just like the standard tokenizer except that it preserves the whole email/URL as a single token.

Of course, that would mean that you would have to search on the whole email address - it couldn't match just part of it.
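For reference, a hedged sketch of index analysis settings that wire this tokenizer into a custom analyzer, expressed as a Python dict. The tokenizer is registered in Elasticsearch as `uax_url_email`; the analyzer name `url_email` is a hypothetical choice, and the lowercase filter is an assumption added so that matching is case-insensitive.

```python
import json

analysis_settings = {
    "analysis": {
        "analyzer": {
            "url_email": {  # hypothetical analyzer name
                "type": "custom",
                "tokenizer": "uax_url_email",  # emits whole URLs/emails as single tokens
                "filter": ["lowercase"],       # assumption: case-insensitive matching
            }
        }
    }
}

print(json.dumps(analysis_settings, indent=2))
```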

Using the standard analyzer combined with a phrase search would work, e.g. a phrase search for "cpan.org" would not match "cpan-bugs@example.org", as it takes the token order and position into account.
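The effect of order and position can be sketched in plain Python. The `tokenize` helper is a deliberately simplified stand-in that splits on non-letters, not the real standard analyzer, and the rt.cpan.org address is a made-up example. The point is only that a phrase matches when its tokens appear consecutively and in order.

```python
import re

def tokenize(text):
    # Simplified stand-in: lowercase and split on non-letters (not the standard analyzer).
    return re.findall(r"[^\W\d_]+", text.lower())

def phrase_matches(phrase, field_value):
    """True if the phrase's tokens occur consecutively, in order, in the field."""
    needle, haystack = tokenize(phrase), tokenize(field_value)
    if not needle:
        return False
    n = len(needle)
    return any(haystack[i:i + n] == needle for i in range(len(haystack) - n + 1))

# "cpan" and "org" both occur, but not next to each other, so the phrase fails.
no_hit = phrase_matches("cpan.org", "cpan-bugs@example.org")
# Hypothetical RT address: "rt", "cpan", "org" are adjacent, so the phrase matches.
rt_hit = phrase_matches("rt.cpan.org", "bug-Some-Dist@rt.cpan.org")

print(no_hit, rt_hit)  # False True
```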

tsibley commented 11 years ago

Ah ha, good to know that the standard analyzer keeps period-separated words together. I saw the URL-email tokenizer and came to the same conclusion as you. Thanks for the education about ES. :)

The actual use cases I care about are negative matching on "rt.cpan.org" in the mailto and web fields. If the standard analyzer + phrase search will work for that, great!
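A negative match of this kind could be sketched as the following query body, built as a Python dict. It assumes both fields have been made analyzed and uses the `bool`/`must_not` and `match_phrase` forms of a recent Elasticsearch query DSL; the helper name `exclude_rt_query` is hypothetical, and the index and endpoint are deliberately left out.

```python
import json

def exclude_rt_query(phrase="rt.cpan.org"):
    """Build a query body excluding documents whose bugtracker fields contain the phrase."""
    fields = ["resources.bugtracker.mailto", "resources.bugtracker.web"]
    return {
        "query": {
            "bool": {
                # a document is excluded if either field matches the phrase
                "must_not": [{"match_phrase": {field: phrase}} for field in fields]
            }
        }
    }

body = exclude_rt_query()
print(json.dumps(body, indent=2))
```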