renovatebot / renovatebot.github.io

Auto-generating docs repository for Renovate Bot
https://docs.renovatebot.com

Setup tokenization for Material for MkDocs search #264

Closed HonkingGoose closed 1 year ago

HonkingGoose commented 1 year ago

What browser are you using?

Firefox

Other browser name

No response

Describe the bug

I can't find the presets via the Material for MkDocs search.

Steps to reproduce

  1. Go to the production docs site.
  2. Enter workarounds:javaLTSVersions in the search bar.
  3. The search bar says: "no matching documents".
  4. But we do have a page with workarounds:javaLTSVersions as the heading title: https://docs.renovatebot.com/presets-workarounds/#workaroundsjavaltsversions

Additional context

@viceice thinks the : character breaks the search somehow.

Related issue:

TWiStErRob commented 1 year ago

This might be a telling sign: [image]

Notice how the word "js" is not found on the page "js-lib", but if you search for js alone, there are results: [image]

This should confirm the : theory.

TWiStErRob commented 1 year ago

https://www.mkdocs.org/user-guide/configuration/#separator

I think this mkdocs.yml change would fix it:

plugins:
    - search:
        separator: '[\s\-.:]+'
        min_search_length: 2
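
As a rough illustration of why adding : and . to the separator should help, the sketch below splits the preset name with Python's re module. It is only an approximation: the real search runs in the browser with JavaScript regular expressions. The tokenize helper and the baseline separator (whitespace and - only) are assumptions made for this example.

```python
import re


def tokenize(text: str, separator: str) -> list[str]:
    """Split text on the separator regex and drop empty pieces."""
    return [piece for piece in re.split(separator, text) if piece]


preset = "workarounds:javaLTSVersions"

# Assumed baseline: split on whitespace and hyphens only.
baseline = tokenize(preset, r"[\s\-]+")
# Proposed separator from the mkdocs.yml snippet above.
proposed = tokenize(preset, r"[\s\-.:]+")

print(baseline)  # ['workarounds:javaLTSVersions']
print(proposed)  # ['workarounds', 'javaLTSVersions']
```

With the baseline the whole preset name stays one token, so a search for the part after the colon has nothing to match against. With the proposed separator each side of the colon becomes its own token.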

HonkingGoose commented 1 year ago

> @HonkingGoose please open another issue for search, I can confirm that search experience is pretty bad for any presets, even when trying to search just for the thing after the :. I think tokenization setup is messed up (if it's configurable)

The key term I needed was tokenization. 😄 Quote from the Material for MkDocs manual (setting up search):

separator

Default: automatically set – The separator for indexing and query tokenization can be customized, making it possible to index parts of words separated by other characters than whitespace and -, e.g. by including .:

plugins:
  - search:
      separator: '[\s\-\.]+'

With 9.0.0, a faster and more flexible tokenizer method is shipped, allowing for tokenizing with lookahead, which yields more influence on the way documents are indexed. As a result, we use the following separator setting for this site's search:

plugins:
  - search:
      separator: '[\s\-,:!=\[\]()"/]+|(?!\b)(?=[A-Z][a-z])|\.(?!\d)|&[lg]t;'

TWiStErRob commented 1 year ago

'[\s\-,:!=\[\]()"/]+|(?!\b)(?=[A-Z][a-z])|\.(?!\d)|&[lg]t;'

Bless you! What a beauty 🤣

TWiStErRob commented 1 year ago

Make sure to test thoroughly because that "case change" part might mess things up. Is there a way to deploy the website into public URL but not prod to test before merge?

viceice commented 1 year ago

Only via dev server from local / codespaces or gitpod

HonkingGoose commented 1 year ago

I copy/pasted the example code into a branch:

plugins:
  - search:
      separator: '[\s\-,:!=\[\]()"/]+|(?!\b)(?=[A-Z][a-z])|\.(?!\d)|&[lg]t;'

This makes the search match too much. The search prediction also shows too much. So just copy/pasting things is right out. 😄

I don't understand this, so I'll let one of you regex wizards fix this problem. 😉

TWiStErRob commented 1 year ago

Can you give an example of "too much", please (a screenshot)?

HonkingGoose commented 1 year ago

Hmm, I can't reproduce my problem with the example search tokenization anymore. Maybe the upstream fixed something, or I was messing things up. 😄

Here's my branch: https://github.com/HonkingGoose/renovatebot.github.io/tree/search-tokens

You can check it out with GitHub Codespaces, or use Gitpod to test things yourself. 😉

TWiStErRob commented 1 year ago

Btw, I fully understand this regex, the question is what requirements do you want. What are reasonable token-separators for Renovate? Tokenization simply splits along these characters / magical places. The options are (from left to right from the above separator):

Please check/edit the ones above you want to keep and I'll refine the regex. I put my reasons why it should/shouldn't be included.
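
A minimal sketch of what each alternative in the separator splits on, tried one at a time. It assumes Python's re module behaves like the browser's regex engine for these patterns. The labels and the split_on helper are names made up for this example, and the sample strings are only illustrations.

```python
import re

# The four alternatives of the separator, left to right.
# The labels are names made up for this sketch.
ALTERNATIVES = {
    "punctuation run": r'[\s\-,:!=\[\]()"/]+',
    "case change": r"(?!\b)(?=[A-Z][a-z])",
    "dot not followed by a digit": r"\.(?!\d)",
    "HTML-escaped < or >": r"&[lg]t;",
}

# Joining the alternatives with | gives back the full separator.
FULL_SEPARATOR = "|".join(ALTERNATIVES.values())


def split_on(pattern: str, text: str) -> list[str]:
    """Split text on the pattern and drop empty pieces."""
    return [piece for piece in re.split(pattern, text) if piece]


print(split_on(ALTERNATIVES["punctuation run"], "workarounds:javaLTSVersions"))
# ['workarounds', 'javaLTSVersions']
print(split_on(ALTERNATIVES["case change"], "javaLTSVersions"))
# ['javaLTS', 'Versions']
print(split_on(ALTERNATIVES["dot not followed by a digit"], "docs.renovatebot.com"))
# ['docs', 'renovatebot', 'com']
print(split_on(ALTERNATIVES["dot not followed by a digit"], "9.0.7"))
# ['9.0.7']
print(split_on(ALTERNATIVES["HTML-escaped < or >"], "a&lt;b&gt;c"))
# ['a', 'b', 'c']
print(split_on(FULL_SEPARATOR, "workarounds:javaLTSVersions"))
# ['workarounds', 'javaLTS', 'Versions']
```

Note that the case-change alternative only splits before an uppercase letter that is followed by a lowercase one, so javaLTSVersions becomes javaLTS and Versions rather than java, LTS and Versions.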

TWiStErRob commented 1 year ago

After going through the above exercise, it looks like all of them are useful for something, maybe even the case-change split, if searching directly for javaLTSVersions matches the right thing.

Tip: you can experiment with the regex at https://regex101.com/r/VflcWH/1

HonkingGoose commented 1 year ago

> Is there a way to deploy the website into public URL but not prod to test before merge?

Yes, it's called Vercel. 😜 I use Vercel to host a small docs site, and I get a link to a public preview URL on each pull request. Makes it really easy to click around and prod things until I'm happy things work.

Vercel can cost money though, once you exceed certain limits of the free tier. It's for the maintainers to decide if they want to spend time/money switching from GitHub Pages to Vercel.

> Btw, I fully understand this regex, the question is what requirements do you want.

Lucky you, I never managed to get far with learning regex, it just looks like gibberish to me. I'd rather click and type around in the development server preview and see how the search behaves with real data. 😄

For me the big things are that you should be able to find the presets when searching for them, either by their full name or parts of their name.

TWiStErRob commented 1 year ago

> Yes, it's called Vercel. 😜

I know, I was more curious if it or another was set up already. I managed to get Codespaces running, it's pretty nice, but not public. Anyway, it'll do for now.

> Lucky you, I never managed to get far with learning regex

It was forced on us at uni: we had to learn language theory, and regex is the most basic class of languages. Although I knew practical regex before I knew the theory, because the Operating Systems class taught basic grep/sed. I thoroughly recommend learning it: for text processing (search, replace) it's unbeatable, and since our websites and source code are text, we do text processing a lot ;) The basics are only a few symbols, and after that using a site like regex101.com just for syntax highlighting helps a lot with reading them.

> For me the big things are that you should be able to find the presets when searching for them, either by their full name or parts of their name.

Looks to me that your branch is pretty good now compared to prod. I added a few more as described above (I kept case switches too, because they help to find config options partially):

      separator: '[\s\-,:!?=\[\]()<>{}"/\\]+|(?!\b)(?=[A-Z][a-z])|\.(?!\d)|&[lg]t;'

The only problem is that foo:bar presets are not directly searchable, but searching for foo or bar alone (or even sub-words of those) yields way better results than in prod right now. To me it looks like this might be a bug in the search plugin, not our usage. If I were you, I would merge the regex above and open an issue upstream to ask them why a search containing a colon is not working.
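
To make the "parts of their name" point concrete, here is a toy model of prefix search over tokens produced by the separator from this comment. It is a sketch under loud assumptions: the real search plugin is not implemented this way, the HEADINGS list is a hypothetical mini site, and tokenize and search are helper names invented for the example. It does not reproduce the colon problem described above, which concerns the search plugin itself.

```python
import re

# Extended separator from the comment above.
SEPARATOR = r'[\s\-,:!?=\[\]()<>{}"/\\]+|(?!\b)(?=[A-Z][a-z])|\.(?!\d)|&[lg]t;'


def tokenize(text: str) -> list[str]:
    """Split on the separator, drop empty pieces, and lowercase the rest."""
    return [piece.lower() for piece in re.split(SEPARATOR, text) if piece]


# Hypothetical mini "site": heading text only (illustrative entries).
HEADINGS = [
    "workarounds:javaLTSVersions",
    "config:js-lib",
    "packageRules",
]

# Toy index: each heading maps to its list of tokens.
INDEX = {heading: tokenize(heading) for heading in HEADINGS}


def search(query: str) -> list[str]:
    """Return headings where every query token is a prefix of an indexed token."""
    terms = tokenize(query)
    if not terms:
        return []
    return [
        heading
        for heading, tokens in INDEX.items()
        if all(any(token.startswith(term) for token in tokens) for term in terms)
    ]


print(search("java"))   # ['workarounds:javaLTSVersions']
print(search("js"))     # ['config:js-lib']
print(search("rules"))  # ['packageRules']
```

Because the query goes through the same tokenize step as the headings, sub-words such as java or rules match even though they never appear as standalone words in the heading text.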

> This makes the search match too much. The search prediction also shows too much.

It shows more than production, that's for sure, but that's only because there's a problem in prod :) I think you've been used to so few search results that a normal amount looks like "too much" :D

Additionally, it would be nice to disable these "Missing" search results, but I can't seem to find an option for it: [image]

That said, it might help people discover more related options.

TWiStErRob commented 1 year ago

Make sure #265 doesn't close this issue when it's merged. I reported the problem upstream: https://github.com/squidfunk/mkdocs-material/issues/4884

HonkingGoose commented 1 year ago

Thank you for reporting the problem upstream. ❤️

This issue should remain open, I'm not using any closes keywords in my PR's body text. 😉

TWiStErRob commented 1 year ago

So this will be fixed as soon as Renovatebot picks up the new patch, right?

HonkingGoose commented 1 year ago

We should test the new behavior after applying the latest patch for Material for MkDocs. Then we know if the search is fixed now.

viceice commented 1 year ago

the update is currently pending because of stability days

HonkingGoose commented 1 year ago

We're using Material for MkDocs 9.0.7. When I put workarounds:javaLTSVersions in the search bar, I get the correct result! 🥳

[image: search-matches-input]

TWiStErRob commented 1 year ago

Confirmed, I think we can close this as fixed by https://github.com/squidfunk/mkdocs-material/issues/4884#issuecomment-1407394164 via https://github.com/renovatebot/renovatebot.github.io/commit/294979eed191f29b71bdc1211c09de72702b2518.

It also works for prefixes. [image] [image]

HonkingGoose commented 1 year ago

Closed by upstream issue: