meilisearch / meilisearch

A lightning-fast search API that fits effortlessly into your apps, websites, and workflow
https://www.meilisearch.com
MIT License

No results when search string is adjoining an html tag? #1621

Closed zehawki closed 3 years ago

zehawki commented 3 years ago

Describe the bug I'm seeing a strange issue, and I can't understand how something so basic can be an issue. Perhaps it's something on my side? But here goes: when searching for a string that I know is in the doc, I get no results. I've tried this with multiple docs and dozens of query strings, and each time I get the same outcome, i.e. no search results.

To Reproduce Here's the key and data:

"content": "<p>Channels from any Network Site can be made available natively on other Network Sites. Any channel X syndicated from Network Site 2 to Network Site 1 will act and behave like channels that have been created on Network 1.</p><p>We've invented channel syndication as an easy way for a network to curate great content from across other networks and offer to its member. It ensures that there is evergreen content available to members while at the same time increasing the reach and <a href=\"https://www.maincross.net/help/channelpulse\">ChannelPulse</a> for the original channel.</p><p>Channel owners may make their channels available for free, or for a subscription fee.</p><blockquote>This feature is in beta at the moment.</blockquote><h2>Things to know</h2><ol><li>Channel ownership does not get transferred when syndicating. In the above example, channel X continues being owned Network 2, and is only \"loaned\" to Network 1.</li><li>All the rules applied to Channel X continues to apply - eg post control, moderation, etc.</li><li>Once syndicated, channel X in Network 1 will show all the posts that belong to that channel - whether posted from Network 2 or Network 1.</li></ol><h2>See this live</h2><p>Scheme: Channel has been syndicated from <strong>demo2</strong> to <strong>demo1</strong>. 
The channel owned is demo1.</p><p>Original channel: https://demo2.maincross.org/topics/sexuality-disability</p><p>Syndicated channel: https://demo1.maincross.org/topics/sexuality-disability</p><p>After syndication, a member has posted into the channel on demo 1: https://demo1.maincross.org/topics/515/opinion/3493/posting-into-a-syndicated-channel#card</p><h2>Visual indication</h2><p>Syndicated channels show an icon at the top corner:</p><figure class=\"image\"><img src=\"https://ne-store-a.s3.amazonaws.com/media/nn/www-maincross-net/posts/image_bzub5qV.png\"></figure><p>Hovering over the icon shows the full detail, and clicking on the icon will load the original Network Site.</p><figure class=\"image\"><img src=\"https://ne-store-a.s3.amazonaws.com/media/nn/www-maincross-net/posts/image_IL3z5mi.png\"></figure><h2>Canonical content</h2><p>Content syndication leads to a standard issue of <a href=\"https://en.wikipedia.org/wiki/Canonical_link_element\">duplicate content</a>, since posts from syndicated channels will appear at 2 different URLs. Continuing the above example, the same post is available at</p><ol><li>https://demo2.maincross.org/topics/515/opinion/3493/posting-into-a-syndicated-channel</li><li>https://demo1.maincross.org/topics/515/opinion/3493/posting-into-a-syndicated-channel</li></ol><p>This is elegantly taken care of by having meta tags correctly generated which let the world know which is the \"original\" URL, vs the duplicate one. This ensures that search engines give more weightage to the original URL rather than the duplicate, thus protecting the original network's SEO value.</p>",

Searching for "Things to know" gives no results! Same with "See this live", "Hovering over the icon", "We've invented", "After syndication". Each of these directly adjoins an HTML tag. But "invented channel", "channel X continues", "weightage to the original" all work fine.

Expected behavior Search should return the document whenever the search string matches text in it.

MeiliSearch version: v0.20.0

curquiza commented 3 years ago

Hello @zehawki

This is not a bug, but expected behavior. "Things to know" appears as <h2>Things to know</h2> in the document, and < and > are not treated as separators by MeiliSearch at the moment. This means that, for MeiliSearch, the word Things does not exist, but <h2>Things does. So MeiliSearch cannot retrieve Things on its own.
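To make the effect concrete, here is a minimal sketch using a toy tokenizer. It is not MeiliSearch's actual tokenizer; the separator set, `toy_tokens`, and `all_words_present` are assumptions made up for illustration. It only shows what happens when < and > are not separators: the tag stays glued to the neighbouring word.

```python
import re

# Toy separator set (an assumption for illustration, NOT MeiliSearch's rules):
# whitespace and a few punctuation marks split words; '<', '>' and '/' do not.
SEPARATORS = re.compile(r"[\s.,;:!?]+")


def toy_tokens(text):
    """Split text into lowercase tokens using the toy separator set."""
    return [t.lower() for t in SEPARATORS.split(text) if t]


def all_words_present(query, document):
    """Return True if every query token is also a token of the document."""
    doc_tokens = set(toy_tokens(document))
    return all(token in doc_tokens for token in toy_tokens(query))


sample = "<p>We've invented channel syndication as an easy way</p>"

# The opening tag is glued to the first word, so "we've" is never a token.
print(toy_tokens(sample)[:3])
print(all_words_present("We've invented", sample))    # query adjoins the tag
print(all_words_present("invented channel", sample))  # query is mid-sentence
```

In this toy model, the query that adjoins the tag fails because the indexed token is <p>we've rather than we've, while the mid-sentence query succeeds. This mirrors the pattern described in the report.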

This is related to both of these feature requests we have right now:

zehawki commented 3 years ago

Got it. Thank you for the quick response, and apologies for posting a duplicate. I searched through the issues and didn't find anything similar - I guess from now on I should also check product/discussion.

curquiza commented 3 years ago

No problem, thank you for your report 🙂 I'll close the issue then.

zehawki commented 3 years ago

So I guess for now I'll send HTML-stripped content into Meili.
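A minimal sketch of that workaround, using only Python's standard library. The helper name `strip_html` and the example document are assumptions for illustration, and the step that actually sends the document to MeiliSearch is left out on purpose:

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and replaces every tag with a space."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        # A space keeps words from adjacent elements apart,
        # e.g. "</h2><ol><li>" must not fuse "know" and "Channel".
        self._chunks.append(" ")

    def handle_endtag(self, tag):
        self._chunks.append(" ")

    def handle_data(self, data):
        self._chunks.append(data)

    def text(self):
        # Collapse the runs of whitespace introduced above.
        return " ".join("".join(self._chunks).split())


def strip_html(html):
    """Return the plain text of an HTML fragment (hypothetical helper)."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


raw = "<h2>Things to know</h2><ol><li>Channel ownership does not get transferred.</li></ol>"

# Hypothetical document; only the "content" key comes from the report above.
document = {"content": strip_html(raw)}
print(document["content"])
```

One trade-off of replacing every tag with a space: punctuation that directly follows an inline tag (for example `</strong>.`) ends up separated from the word by a space. That should be harmless for indexing but may matter if the stripped text is also displayed.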

curquiza commented 3 years ago

Yes :)