php / web-php

The www.php.net site
http://www.php.net

Incorrect grouping of search results between "Extensions" and "Other Matches" #1088

Open lhsazevedo opened 1 day ago

lhsazevedo commented 1 day ago

Background

In https://github.com/php/phd/pull/154, we resolved the issue of missing pages in the search index. However, now that these pages are visible in search results, a long-standing bug in result grouping has become apparent.

Issue

Some search results are incorrectly categorized between the "Extensions" and "Other Matches" groups.

Example:

[Screenshot of the search results]
Query: security

As shown:

  1. "Security (PHP Manual)" appears in the "Extensions" group, although it is not a PHP extension.
  2. "Security consideration" (from the win32service extension) is incorrectly placed in the "Other Matches" group.

Cause

The client-side search code groups results based on types, including Function, Variable, Class, Exception, Extension, and Other Matches (general). These types are assigned according to the XML element tags in the manual's source.

Issue 1: Incorrect grouping in "Extensions"

The first issue occurs in this section of the code:

https://github.com/php/web-php/blob/27fbef13e912547b4086793a5dd2e04fc0fcf684/js/search.js#L130-L134

The code assumes that any entry with the element tag <book>, <set>, or <reference> is related to extensions, which is inaccurate. Many entries, though using these elements, do not belong to extensions.

Example data:

| id | ldesc | element |
|---|---|---|
| getting-started | Getting Started | book |
| install | Installation and Configuration | book |
| ... | ... | ... |
| reserved.variables | Predefined Variables | reference |
| wrappers | Supported Protocols and Wrappers | reference |
| ... | ... | ... |

SQL query:

SELECT "docbook_id", "ldesc", "element"
FROM "ids"
WHERE "element" IN ('book','set','reference')
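
To make the Issue 1 assumption concrete, here is a minimal sketch of grouping by element tag alone. It is not the code at the linked lines of js/search.js; the names `EXTENSION_ELEMENTS` and `groupByElementOnly` are hypothetical, and the rows are taken from the example data above.

```javascript
// Hypothetical illustration of the assumption described above;
// this is NOT the actual code in js/search.js.
const EXTENSION_ELEMENTS = new Set(["book", "set", "reference"]);

// Decide the group from the element tag only, ignoring what the page is about.
function groupByElementOnly(entry) {
  return EXTENSION_ELEMENTS.has(entry.element) ? "Extensions" : "Other Matches";
}

// Rows from the example data: neither page documents an extension.
const rows = [
  { id: "getting-started", ldesc: "Getting Started", element: "book" },
  { id: "reserved.variables", ldesc: "Predefined Variables", element: "reference" },
];

// Both rows end up in "Extensions" because of their element tag.
const grouped = rows.map((row) => ({ id: row.id, group: groupByElementOnly(row) }));
console.log(grouped);
```

Under this rule, general manual pages such as "Getting Started" cannot be told apart from real extension books, which matches the miscategorization seen for "Security (PHP Manual)".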

Issue 2: Incorrect grouping in "Other Matches"

The second issue is due to an assumption in the following code:

https://github.com/php/web-php/blob/27fbef13e912547b4086793a5dd2e04fc0fcf684/js/search.js#L136-L141

The code assumes that entries with the tags <section>, <chapter>, <appendix>, or <article> do not belong to an extension. While this is less severe than the first issue, there are many pages that are part of an extension but are currently placed in the "Other Matches" group:

| id | ldesc | element |
|---|---|---|
| ... | ... | ... |
| apcu.installation | Installation | section |
| apcu.configuration | Runtime Configuration | section |
| ... | ... | ... |
| pdo.setup | Installing/Configuring | chapter |
| pdo.constants | Predefined Constants | appendix |
| pdo.connections | Connections and Connection management | chapter |
| ... | ... | ... |

SQL query:

SELECT "docbook_id", "ldesc", "element"
FROM "ids"
WHERE "element" IN ('section','chapter','appendix','article')
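
The Issue 2 assumption can be sketched the same way. This is an illustration only, not the code at the linked lines; `GENERAL_ELEMENTS` and `isOtherMatch` are hypothetical names, and the rows come from the example data above.

```javascript
// Hypothetical illustration of the Issue 2 assumption;
// this is NOT the actual code in js/search.js.
const GENERAL_ELEMENTS = new Set(["section", "chapter", "appendix", "article"]);

// True when the element tag alone sends an entry to "Other Matches".
function isOtherMatch(entry) {
  return GENERAL_ELEMENTS.has(entry.element);
}

// Rows from the example data: all of these belong to an extension (APCu or PDO).
const extensionPages = [
  { id: "apcu.installation", ldesc: "Installation", element: "section" },
  { id: "pdo.setup", ldesc: "Installing/Configuring", element: "chapter" },
  { id: "pdo.constants", ldesc: "Predefined Constants", element: "appendix" },
];

// Every one of them is classified as "Other Matches" by the tag-only rule.
const misplaced = extensionPages.filter(isOtherMatch).map((entry) => entry.id);
console.log(misplaced);
```

The element tag carries no information about which extension, if any, a page belongs to, so a tag-only rule cannot place these pages correctly.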

PHP Manual index dump

For convenience, here is the dump from the PHD SQLite index for the PHP Manual: php-manual-index_2024-10-08.sql.gz

Notes

Girgias commented 1 day ago

Stream wrappers now (should) have the role="stream_wrapper" attribute on their <refentry> tag, so those should be easy to filter out.

I don't know why chapter/section are not considered part of an extension as this markup has existed for decades.

Same for the <appendix>, which has always listed constants.
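
The role-based filtering suggested in this comment might look like the sketch below. It assumes that the `role` attribute of the `<refentry>` tag reaches the client-side index as a `role` field on each entry; that field, the function name `excludeStreamWrappers`, and the sample entries are assumptions for illustration, not part of the current index format.

```javascript
// Assumption: each index entry exposes the <refentry> role attribute as `role`.
// The current search index may not include this field.
function excludeStreamWrappers(entries) {
  return entries.filter((entry) => entry.role !== "stream_wrapper");
}

// Illustrative sample entries, not taken from the real index dump.
const sampleEntries = [
  { id: "function.strlen", element: "refentry", role: null },
  { id: "wrappers.file", element: "refentry", role: "stream_wrapper" },
];

const withoutWrappers = excludeStreamWrappers(sampleEntries);
console.log(withoutWrappers.map((entry) => entry.id));
```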