sphinx-doc / sphinx

The Sphinx documentation generator
https://www.sphinx-doc.org/

Problematic HTML search #1918

Open nastasviatoha opened 9 years ago

nastasviatoha commented 9 years ago

Hi all,

I have problems with the HTML search. The search engine returns no results for some words in the documentation, and I can't see any pattern that explains why some words are found and others are not.

For example, I have the following glossary in my Sphinx documentation:

API Application Programming Interface – A set of routines, protocols, and tools for building or interfacing with software applications.

CentOS Community Enterprise Operating System, a Linux distribution that provides a free, enterprise-class, community-supported computing platform.

DHCP Dynamic Host Configuration Protocol

DNS Domain Name System

EPEL Extra Packages for Enterprise Linux, a volunteer-based community effort from the Fedora project to create a repository of high-quality add-on packages that complement the Fedora-based Red Hat Enterprise Linux (RHEL) and its compatible spinoffs, such as CentOS and Scientific Linux.

HA High Availability

Hadoop MapReduce Software framework for easily writing applications, which process vast amounts of data (multi-terabyte data-sets) in-parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.

HBase Apache HBase is a column-oriented database management system that runs on top of HDFS.

Regarding the example above: if I search for the following words: CentOS, DNS, EPEL, HA, Hadoop MapReduce, HBase, I get this message:

"Your search did not match any documents. Please make sure that all words are spelled correctly and that you've selected enough categories."

For the remaining words in the list (API, DHCP), it works fine.

Does anybody have any ideas about the root cause?

rocmit commented 9 years ago

I believe I have run into a similar issue (it may not match 100%, though). Some notes from my investigation:

1) It doesn't index terms of 2 or fewer characters. If the user input contains a two-character string (by itself or together with other matching strings), it will never match anything. "HA" in the above example will never work, whether you search for HA alone or together with other matching terms. I guess the search should either ignore two-character terms or the index should include them (I prefer the latter).

2) The search doesn't work well with certain characters. Take "abc-defg" for example: it is indexed as two separate terms, "abc" and "defg". However, when you search for "abc-defg", no match is found (the search probably doesn't ignore the "-").

3) In my document, I have a term "show" that is giving me trouble. If I search "show" by itself, it seems to work. If I search for "show" plus another matching term (example: show abcd), it will not match anything. I couldn't replicate this with a simple example.
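
Points 1 and 2 can be reproduced with a toy model. The sketch below is not Sphinx's actual code: `index_tokens`, `naive_query_tokens`, `search`, and the `MIN_LEN` cutoff are hypothetical names chosen for illustration. It assumes the index splits on non-word characters and drops short tokens, while the query side only splits on whitespace.

```python
import re

MIN_LEN = 3  # assumed cutoff, mirroring the "2 or fewer characters" observation


def index_tokens(text):
    """Index side: split on non-word characters, lower-case, drop short tokens."""
    words = re.split(r"\W+", text.lower())
    return {w for w in words if len(w) >= MIN_LEN}


def naive_query_tokens(query):
    """Query side: split on whitespace only (deliberately inconsistent)."""
    return query.lower().split()


def search(index, query):
    """A query matches only if every one of its tokens is in the index."""
    return all(tok in index for tok in naive_query_tokens(query))


index = index_tokens("HA High Availability. The abc-defg option.")

print(search(index, "high"))      # True: indexed as-is
print(search(index, "HA"))        # False: "ha" was dropped at index time
print(search(index, "high HA"))   # False: one unmatched term sinks the query
print(search(index, "abc-defg"))  # False: index holds "abc" and "defg" separately
print(search(index, "abc defg"))  # True: both tokens exist in the index
```

Under these assumptions, a two-character term can never match, and a hyphenated query fails even though both of its halves are in the index.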

hamishwillee commented 8 years ago

Another problematic case is searching for version numbers or periods. For example, if I search for "APM 2.5", nothing shows up, but "APM 2" works fine.

I suspect the indexing routine splits on non-word characters and possibly also omits short strings, so APM would be in the index, but 2.5 would not match anything. While that is reasonable, I think it is a bug that you can't declare a term/dfn or some other inline keyword markup and thereby force a complex term to be indexed.
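
The idea of forcing a complex term into the index could look roughly like the following. This is a hypothetical sketch, not an existing Sphinx feature: `split_tokens`, `build_index`, and the `declared_terms` parameter are made up for illustration, and the splitting rule is the one suspected above.

```python
import re


def split_tokens(text, min_len=3):
    """Split on non-word characters and drop tokens shorter than min_len."""
    return {w for w in re.split(r"\W+", text.lower()) if len(w) >= min_len}


def build_index(text, declared_terms=()):
    """Index the split tokens plus any declared terms, kept verbatim."""
    index = split_tokens(text)
    index.update(term.lower() for term in declared_terms)
    return index


text = "Supported boards: APM 2.5 and later."

plain = build_index(text)
forced = build_index(text, declared_terms=["APM 2.5", "2.5"])

print("apm" in plain)       # True
print("2.5" in plain)       # False: split into "2" and "5", both too short
print("2.5" in forced)      # True: the declared term bypasses the splitting
print("apm 2.5" in forced)  # True
```

For this to help, the search side would also have to look up the raw query string before tokenizing it.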

ThomasWaldmann commented 8 years ago

https://github.com/borgbackup/borg/issues/1485 "append-only" is in a heading, but searching for "append-only" or "append only" or "append" does not find it.

ThomasWaldmann commented 7 years ago

Just as a note:

when using special tokenization rules (like splitting "foo-bar" on "-" into "foo" and "bar", for example), it is essential that precisely the same tokenization is done at indexing time (python?) and at search time (javascript?). otherwise the tokens made from the search term(s) won't match the tokens that were put into the index, even for precisely matching strings.

also, if other modifications are applied to the tokens (like stemming, lower-casing, ...) at indexing time, these also must be done on the search input later.
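
A minimal sketch of that invariant, assuming a toy inverted index rather than Sphinx's real one: a single `normalize` function (a hypothetical name, as are `build_index` and `search`) is applied to both the documents and the query. In a real setup the two sides would be separate Python and JavaScript implementations that have to agree; any stemming step would belong in the same function.

```python
import re


def normalize(text):
    """Tokenize and lower-case text; used for BOTH indexing and querying."""
    return [tok for tok in re.split(r"[^0-9a-z]+", text.lower()) if tok]


def build_index(docs):
    """Map each normalized token to the set of document names containing it."""
    index = {}
    for name, text in docs.items():
        for tok in normalize(text):
            index.setdefault(tok, set()).add(name)
    return index


def search(index, query):
    """Return the documents that contain every normalized query token."""
    tokens = normalize(query)
    if not tokens:
        return set()
    return set.intersection(*(index.get(tok, set()) for tok in tokens))


docs = {
    "usage": "The append-only mode protects a repository.",
    "faq": "How do I append data?",
}
index = build_index(docs)

print(search(index, "append-only"))  # {'usage'}
print(search(index, "append only"))  # {'usage'}
print(search(index, "Append"))       # both documents: 'usage' and 'faq'
```

Because the query goes through the same function as the indexed text, "append-only", "append only", and "append" all resolve to tokens that exist in the index.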

jbqubit commented 7 years ago

Similar behavior seems to be reported in multiple issues: #3270, #2989, #2930.

enkore commented 7 years ago

A workaround would be to search-index all index entries, not just those that came from special domain documentation functions. See https://github.com/borgbackup/borg/issues/1485#issuecomment-303127961 for an example.