Open idnorton opened 12 years ago
Or create an analyzed multi-type field if we need these fields to be not analyzed for any reason.
Also, given that they're URLs/emails, we probably want the simple analyzer rather than the standard analyzer.
The doc for the simple analyzer suggests that URLs/emails will still be tokenized by splitting on non-letters, which seems like it could lead to non-intuitive matching. Consider the matching behaviour if bugtracker.mailto is "cpan-bugs@example.org": searching on "cpan.org" would match, even though intuitively you would not expect it to.
This is all based on doc and assumptions about matching in ES, so please correct me if I'm making an ass out of myself by assuming incorrectly. :)
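To make the concern above concrete, here is a minimal Python sketch of the "split on non-letters" behaviour as the simple analyzer doc describes it. It is an ASCII-only approximation and not Elasticsearch itself; `simple_tokens` and `any_term_matches` are hypothetical helper names, and the OR-style matching is an assumption about how a plain term match would behave.

```python
import re

def simple_tokens(text):
    """Lowercase the text and split on anything that is not an ASCII letter.

    Assumption: this mimics the doc's description of the simple analyzer;
    the real analyzer may treat some inputs differently.
    """
    return re.findall(r"[a-z]+", text.lower())

def any_term_matches(query, field_value):
    """Assumed OR semantics: match if any query token occurs in the field."""
    return bool(set(simple_tokens(query)) & set(simple_tokens(field_value)))

mailto = "cpan-bugs@example.org"
print(simple_tokens(mailto))
print(any_term_matches("cpan.org", mailto))
```

Under these assumptions the query and the address share tokens, which is the non-intuitive match described above.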
Actually, your particular example would be tokenized as [cpan,bugs,example.org]: it keeps words separated by a period together.
There is the UAX-URL-email tokenizer http://www.elasticsearch.org/guide/reference/index-modules/analysis/uaxurlemail-tokenizer.html which works just like the standard tokenizer except that it preserves the whole email/url as a single token.
Of course, that would mean that you would have to search on the whole email address - it couldn't match just part of it.
Using the standard analyzer combined with a phrase search would work, e.g. a phrase search for "cpan.org" would not match "cpan-bugs@example.org", as it takes the token order and position into account.
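The order-and-position point can be sketched in a few lines of Python. This is only an illustration of phrase matching over token lists, not Elasticsearch's implementation; `phrase_matches` is a hypothetical helper, and the token lists are written out by hand (the first one is the tokenization quoted earlier in this thread).

```python
def phrase_matches(query_tokens, doc_tokens):
    """Return True if query_tokens occur in doc_tokens consecutively, in order."""
    n = len(query_tokens)
    if n == 0:
        return False
    return any(
        doc_tokens[i:i + n] == query_tokens
        for i in range(len(doc_tokens) - n + 1)
    )

# Tokens for "cpan-bugs@example.org" as quoted in this thread.
doc = ["cpan", "bugs", "example.org"]
print(phrase_matches(["cpan.org"], doc))

# Even if periods were split, "cpan" and "org" are not adjacent,
# so a phrase search still would not match.
print(phrase_matches(["cpan", "org"], ["cpan", "bugs", "example", "org"]))
```

Both calls report no match: a phrase search needs the query tokens to sit next to each other in the same order, not just to be present somewhere in the field.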
Ah ha, good to know that the simple analyzer keeps period-separated words together. I saw the URL-email tokenizer and came to the same conclusion as you. Thanks for the education about ES. :)
The actual use cases I care about are negative matching on "rt.cpan.org" in the mailto and web fields. If the standard analyzer + phrase search will work for that, great!
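For reference, a negative phrase match of this kind might be written roughly as below. This is a sketch that only builds the request body as a Python dict; it assumes the `bool` / `must_not` / `match_phrase` query DSL and the full field names from the original request, and `not_phrase_query` is a hypothetical helper. The exact syntax depends on the Elasticsearch version in use.

```python
import json

def not_phrase_query(phrase, fields):
    """Build a query body that excludes documents where any field contains the phrase."""
    return {
        "query": {
            "bool": {
                # A document is excluded if it matches any must_not clause.
                "must_not": [{"match_phrase": {field: phrase}} for field in fields]
            }
        }
    }

body = not_phrase_query(
    "rt.cpan.org",
    ["resources.bugtracker.mailto", "resources.bugtracker.web"],
)
print(json.dumps(body, indent=2))
```

The body would then be sent to the search endpoint with whatever client is already in use; nothing here talks to a server.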
Hi Peeps,
Please could you make the following two fields analyzed in order to permit full text searching against them:
resources.bugtracker.mailto
resources.bugtracker.web
Thanks, Ian.