sphinx-doc / sphinx

The Sphinx documentation generator
https://www.sphinx-doc.org/

[search] Add ability to treat "-" as a normal letter, to not split search term into several words #12400

Open Ashark opened 3 months ago

Ashark commented 3 months ago

I am using Sphinx for documentation, together with the MyST Parser.

My documentation is technical, so I often want to search for command-line options such as --run. However, the "-" (dash, minus) characters are ignored in the search field, so I get lots of unrelated results that merely contain the word "run".

I also tried quoting the term, as in '--run', but that did not help. I found that there is already a request for this:

would be helpful, such as not splitting words in quotes

Moreover, the "-" is treated as a separator (like a space), so if I search for --start-program, for example, I get unrelated results containing the word "starting".

The feature request is to make it possible to configure Sphinx so that it treats certain symbols as normal letters. For example, in conf.py:

sphinx_search_split_regex = r"[\w-]"

I tried adding "-" to the character class of the regular expression in searchtools.js#L167:

/**
 * Default splitQuery function. Can be overridden in ``sphinx.search`` with a
 * custom function per language.
 *
 * The regular expression works by splitting the string on consecutive characters
 * that are not Unicode letters, numbers, underscores, or emoji characters.
 * This is the same as ``\W+`` in Python, preserving the surrogate pair area.
 */
if (typeof splitQuery === "undefined") {
  var splitQuery = (query) => query
      .split(/[^\p{Letter}\p{Number}_\p{Emoji_Presentation}-]+/gu)
      .filter(term => term)  // remove remaining empty strings
}

I also tried changing the word-splitting regexp in search/__init__.py#L71:

 _word_re = re.compile(r'[\w-]+')

This pattern is used in the split method:

    def split(self, input: str) -> list[str]:
        """
        This method splits a sentence into words.  Default splitter splits input
        at white spaces, which should be enough for most languages except CJK
        languages.
        """
        return self._word_re.findall(input)
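
To make the effect of that change concrete, here is a minimal standalone sketch comparing the two patterns on a sample sentence. It assumes the unmodified pattern is r'\w+'; the names DEFAULT_WORD_RE, DASH_WORD_RE and split_words are hypothetical and exist only for this illustration, they are not part of Sphinx.

```python
import re

# Assumption: the unmodified pattern is r'\w+'.
DEFAULT_WORD_RE = re.compile(r'\w+')
# The modified pattern from above, which also accepts "-" inside a word.
DASH_WORD_RE = re.compile(r'[\w-]+')


def split_words(pattern, text):
    """Hypothetical helper mirroring the findall-based split method."""
    return pattern.findall(text)


text = "Use --start-program or --run to begin"

default_terms = split_words(DEFAULT_WORD_RE, text)
dash_terms = split_words(DASH_WORD_RE, text)

print(default_terms)  # dashes dropped, "start" and "program" become separate terms
print(dash_terms)     # "--start-program" and "--run" survive as single terms
```

Note that with the modified pattern a lone "-" surrounded by spaces would also be returned as a term, so a real implementation would probably need to filter such tokens.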

This does not seem to be sufficient. My guess is that the "-" characters are also stripped from the search query somewhere else.
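
One possible explanation, offered as a guess rather than a confirmed diagnosis, is that index-time and query-time tokenization have to agree: if the index was built with one pattern and the query is split with another, the terms never meet. The toy inverted index below illustrates only that general point. build_index and search are hypothetical helpers, and the real Sphinx pipeline does more than this (stemming, for instance).

```python
import re
from collections import defaultdict

PLAIN_RE = re.compile(r'\w+')      # assumed unmodified pattern
DASHED_RE = re.compile(r'[\w-]+')  # modified pattern that keeps "-"


def build_index(docs, word_re):
    """Map each lowercased term to the set of documents containing it."""
    index = defaultdict(set)
    for name, text in docs.items():
        for term in word_re.findall(text.lower()):
            index[term].add(name)
    return index


def search(index, query, word_re):
    """Return the documents that contain every term of the query."""
    terms = word_re.findall(query.lower())
    if not terms:
        return set()
    results = None
    for term in terms:
        hits = set(index.get(term, set()))
        results = hits if results is None else results & hits
    return results


docs = {
    "cli": "Pass --run to start.",
    "guide": "Run the tests daily.",
}

old_index = build_index(docs, PLAIN_RE)   # index built before the change
new_index = build_index(docs, DASHED_RE)  # index rebuilt after the change

mismatch = search(old_index, "--run", DASHED_RE)   # query keeps "-", index does not
consistent = search(new_index, "--run", DASHED_RE)  # both sides keep "-"
original = search(old_index, "--run", PLAIN_RE)     # both sides drop "-"
```

In this sketch only the consistent combination narrows the result to the page that actually mentions --run, which suggests checking that the documentation was rebuilt and that every place that tokenizes text uses the same rule.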

Would be glad to hear any suggestions.

Also, regarding the code comment quoted above:

Default splitQuery function. Can be overridden in sphinx.search with a custom function per language.

My documentation is in English, so I guess this would still require a separate option.

picnixz commented 3 months ago

This is something I would agree with, but we could solve this issue by implementing quote-based matching. However, until we fix the current search algorithm, I don't think we should push for new features (or maybe we can?).
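
As a rough illustration of what quote-based matching could look like, the sketch below treats quoted parts of a query as exact phrases and splits the remainder into ordinary words. It is an assumption about one possible design, not the planned Sphinx implementation; parse_query and matches are hypothetical names.

```python
import re

# Matches "double quoted" or 'single quoted' segments of a query.
QUOTED_RE = re.compile(r'"([^"]+)"|\'([^\']+)\'')
WORD_RE = re.compile(r'\w+')


def parse_query(query):
    """Split a query into exact phrases (quoted) and ordinary words."""
    phrases = [double or single for double, single in QUOTED_RE.findall(query)]
    remainder = QUOTED_RE.sub(' ', query)
    words = WORD_RE.findall(remainder)
    return phrases, words


def matches(text, query):
    """True if the text contains every phrase verbatim and every word."""
    phrases, words = parse_query(query)
    lowered = text.lower()
    doc_words = set(WORD_RE.findall(lowered))
    return (all(p.lower() in lowered for p in phrases)
            and all(w.lower() in doc_words for w in words))


parsed = parse_query('"--start-program" options')
hit = matches("Pass --run to start.", "'--run'")
miss = matches("Run the tests daily.", "'--run'")
```

Here the quoted term '--run' matches only text that literally contains --run, while unquoted words keep the existing word-based behaviour. A substring check over full text is simple but would not scale to a prebuilt search index, so this only shows the intended semantics.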