kiwix / libkiwix

Common code base for all Kiwix ports
https://download.kiwix.org/release/libkiwix/
GNU General Public License v3.0

Unable to query entries with dot in Name #1004

Closed rgaudin closed 11 months ago

rgaudin commented 11 months ago

OPDS /catalog/v2/entries endpoint allows filtering entries by specifying a name.

As per the doc, it is the ZIM name, which is a modified filename for a file-based kiwix-serve, or the name in library.xml for a library-based kiwix-serve. [[ why isn't it using the Name metadata? ]]

This is not working as expected. I am using only a library-based example, since it is supposed to reuse the name directly.

❯ curl 'https://library.kiwix.org/catalog/v2/partial_entries?name=3dprinting'
<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:dc="http://purl.org/dc/terms/"
      xmlns:opds="https://specs.opds.io/opds-1.2"
      xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">
  <id>c8839a45-9f70-9ae2-02b1-4e2047617f8b</id>

  <link rel="self"
        href="/catalog/v2/partial_entries?name=3dprinting"
        type="application/atom+xml;profile=opds-catalog;kind=acquisition"/>
  <link rel="start"
        href="/catalog/v2/root.xml"
        type="application/atom+xml;profile=opds-catalog;kind=navigation"/>
  <link rel="up"
        href="/catalog/v2/root.xml"
        type="application/atom+xml;profile=opds-catalog;kind=navigation"/>

  <title>Filtered Entries (name=3dprinting)</title>
  <updated>2023-10-13T12:28:49Z</updated>
  <totalResults>1</totalResults>
  <startIndex>0</startIndex>
  <itemsPerPage>1</itemsPerPage>
  <entry>
    <id>urn:uuid:c607c3cc-1d88-3d43-4816-d274eb030938</id>
    <title>3D Printing</title>
    <updated>2023-07-30T00:00:00Z</updated>
    <link rel="alternate"
          href="/catalog/v2/entry/c607c3cc-1d88-3d43-4816-d274eb030938"
          type="application/atom+xml;type=entry;profile=opds-catalog"/>
  </entry>
</feed>

This works: we get the 3dprinting ZIM entry. Its name in the library is 3dprinting.stackexchange.com_en_all.

❯ curl -L https://download.kiwix.org/library/library_zim.xml |grep 3dprinting |grep name
<book id="c607c3cc-1d88-3d43-4816-d274eb030938" size="96638" url="https://download.kiwix.org/zim/stack_exchange/3dprinting.stackexchange.com_en_all_2023-07.zim.meta4" mediaCount="5257" articleCount="10627" favicon="iVBORw0KG[snip]AASUVORK5CYII=" title="3D Printing" description="Q&amp;A for 3D printing enthusiasts" language="eng" creator="Stack Exchange" publisher="Kiwix" name="3dprinting.stackexchange.com_en_all" tags="stack_exchange;_category:stack_exchange;_ftindex:no;_pictures:yes;_videos:yes;_details:yes" date="2023-07-30" faviconMimeType="image/png"/>

Filtering on that full name doesn't work.

❯ curl 'https://library.kiwix.org/catalog/v2/partial_entries?name=3dprinting.stackexchange.com_en_all'
<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:dc="http://purl.org/dc/terms/"
      xmlns:opds="https://specs.opds.io/opds-1.2"
      xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">
  <id>17ffb733-20ef-72bf-8064-dcf449d5eb84</id>

  <link rel="self"
        href="/catalog/v2/partial_entries?name=3dprinting.stackexchange.com_en_all"
        type="application/atom+xml;profile=opds-catalog;kind=acquisition"/>
  <link rel="start"
        href="/catalog/v2/root.xml"
        type="application/atom+xml;profile=opds-catalog;kind=navigation"/>
  <link rel="up"
        href="/catalog/v2/root.xml"
        type="application/atom+xml;profile=opds-catalog;kind=navigation"/>

  <title>Filtered Entries (name=3dprinting.stackexchange.com_en_all)</title>
  <updated>2023-10-13T12:34:06Z</updated>
  <totalResults>0</totalResults>
  <startIndex>0</startIndex>
  <itemsPerPage>0</itemsPerPage>
</feed>

As the filter matches the beginning of the name, we can test that it stops working as soon as there is a . in it.

❯ curl 'https://library.kiwix.org/catalog/v2/partial_entries?name=3dprinting.'
<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:dc="http://purl.org/dc/terms/"
      xmlns:opds="https://specs.opds.io/opds-1.2"
      xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">
  <id>f42be027-b4ad-e8c5-d8bf-e52594099a6f</id>

  <link rel="self"
        href="/catalog/v2/partial_entries?name=3dprinting."
        type="application/atom+xml;profile=opds-catalog;kind=acquisition"/>
  <link rel="start"
        href="/catalog/v2/root.xml"
        type="application/atom+xml;profile=opds-catalog;kind=navigation"/>
  <link rel="up"
        href="/catalog/v2/root.xml"
        type="application/atom+xml;profile=opds-catalog;kind=navigation"/>

  <title>Filtered Entries (name=3dprinting.)</title>
  <updated>2023-10-13T12:34:39Z</updated>
  <totalResults>0</totalResults>
  <startIndex>0</startIndex>
  <itemsPerPage>0</itemsPerPage>
</feed>

This prevents checking that a ZIM is indeed in the library… without downloading the whole catalog, which is kind of not the point of OPDS.
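For reference, the "is this ZIM in the library?" check described above could be sketched roughly as follows. This is only an illustration: the helper names `build_query_url`, `total_results` and `is_in_catalog` are hypothetical, and it assumes the response is an Atom feed with a top-level `totalResults` element in the default Atom namespace, as in the outputs shown in this issue. Fetching the URL is left to the caller.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespace of the <feed> root in the responses shown above.
ATOM_NS = "{http://www.w3.org/2005/Atom}"


def build_query_url(base_url, name):
    """Build the partial_entries URL for a name filter (hypothetical helper)."""
    return base_url.rstrip("/") + "/catalog/v2/partial_entries?" + urlencode({"name": name})


def total_results(feed_xml):
    """Return the totalResults count from an OPDS feed given as bytes."""
    root = ET.fromstring(feed_xml)
    node = root.find(ATOM_NS + "totalResults")
    if node is None or node.text is None:
        raise ValueError("feed has no totalResults element")
    return int(node.text)


def is_in_catalog(feed_xml):
    """True if the filtered feed reports at least one matching entry."""
    return total_results(feed_xml) > 0
```

With the bug described here, `is_in_catalog` would report False for the full name even though the book is in the library, because the feed itself says `totalResults` is 0.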

mgautierfr commented 11 months ago

It seems to be an issue with the Xapian indexing of the book (the internal database in the library).

We index the book with Xapian::TermGenerator::index_text (https://github.com/kiwix/libkiwix/blob/main/src/library.cpp#L439), which is made to index "classic" text and so understands the . as a sentence separator. So we have 3 indexed names for the book: 3dprinting, stackexchange and com_en_all.
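To make the splitting concrete, here is a small Python sketch. It is not Xapian's actual tokenizer, only a simplified stand-in that assumes letters, digits and underscores are word characters and everything else (including . and -) is a separator. The function name `split_like_prose` and the hyphenated sample name are hypothetical.

```python
import re


def split_like_prose(name):
    """Split a name into word-like tokens, roughly as a prose indexer would.

    Simplified model: runs of letters, digits and underscores are kept
    together; any other character ends the current token.
    """
    return re.findall(r"[a-z0-9_]+", name.lower())


print(split_like_prose("3dprinting.stackexchange.com_en_all"))
# ['3dprinting', 'stackexchange', 'com_en_all']

# Hypothetical name with a hyphen: it is split in the same way.
print(split_like_prose("some-name_en"))
# ['some', 'name_en']
```

Under this model, the full name never exists as a single token, which is consistent with the three indexed names listed above.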

Changing to doc.add_term("XN"+normalizeText(book.getName())); properly indexes the name as a single term containing dots.

curl '<host>/catalog/v2/partial_entries?name=3dprinting.stackexchange.com_en_all' now returns a result. However, curl '<host>/catalog/v2/partial_entries?name=3dprinting' does not, as there is no book with the name "3dprinting".
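The trade-off between the two indexing approaches can be modelled with plain sets of terms. This is a rough sketch, not libkiwix code: the "XN" prefix comes from the snippet above, but the helper names are hypothetical, lowercasing stands in for normalizeText, the word splitting is a simplified regex, and the query is assumed to be looked up as one exact term.

```python
import re

NAME_PREFIX = "XN"  # term prefix used for the name in the snippet above


def terms_word_split(name):
    """Model of the old behaviour: one prefixed term per word-like token."""
    return {NAME_PREFIX + w for w in re.findall(r"[a-z0-9_]+", name.lower())}


def terms_whole_name(name):
    """Model of the proposed behaviour: the whole name as a single term."""
    return {NAME_PREFIX + name.lower()}


def matches(index_terms, query):
    """Assumed lookup: the query must equal one indexed term exactly."""
    return NAME_PREFIX + query.lower() in index_terms


book = "3dprinting.stackexchange.com_en_all"
old_index = terms_word_split(book)
new_index = terms_whole_name(book)

print(matches(old_index, "3dprinting"), matches(old_index, book))  # True False
print(matches(new_index, "3dprinting"), matches(new_index, book))  # False True
```

In this model the two approaches are mirror images: the old index answers the short query but not the full name, and the single-term index does the opposite, matching the behaviour reported in this thread.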

rgaudin commented 11 months ago

OK thank you for the explanation. I'm in favor of the move. @kelson42 WDYT?

rgaudin commented 11 months ago

Looks like hyphens (-) are triggering it as well

kelson42 commented 11 months ago

I'm in favour of that move, but this should not impact other metadata of the ZIM file, so only the Name. Looking at the code, though, this already seems to be the case! @mgautierfr Can you please provide the quick fix?