`catalog/v2` Kiwix Server end-point does not specify anymore the charset in the HTTP headers

kelson42 commented 1 year ago

Following remark of @tim-moody, I have run the following test:

$ curl -I https://library.kiwix.org/catalog/root.xml
HTTP/2 200 
date: Sat, 22 Apr 2023 06:57:53 GMT
content-type: application/atom+xml; profile=opds-catalog; kind=acquisition; charset=utf-8
content-length: 5002801
access-control-allow-origin: *
cache-control: max-age=0, must-revalidate
vary: Accept-Encoding
x-varnish: 770016197 769329506
age: 458
via: 1.1 varnish (Varnish/7.1)
etag: W/"1670254420720624352.12572515/z"
accept-ranges: bytes
strict-transport-security: max-age=15724800; includeSubDomains

and then

$ curl -I https://library.kiwix.org/catalog/v2/entries?count=10
HTTP/2 200 
date: Sat, 22 Apr 2023 06:57:36 GMT
content-type: application/atom+xml;profile=opds-catalog;kind=acquisition
content-length: 13227
access-control-allow-origin: *
cache-control: max-age=0, must-revalidate
vary: Accept-Encoding
x-varnish: 770016157 769361332
age: 15
via: 1.1 varnish (Varnish/7.1)
etag: W/"1670254420720624352.12572515/z"
accept-ranges: bytes
strict-transport-security: max-age=15724800; includeSubDomains

The new end-point does not specify anymore the charset=utf-8 in the content-type header. Any good reason or we have just forgotten?

tim-moody commented 1 year ago

A little background to shed additional light on this issue, and which I find puzzling.

IIAB reads the Kiwix catalog in python using requests.get and then parses it using Beautiful Soup in order to produce an equivalent json file for use in generating lists of available zims.

There are only two lines that are different between the code to read root.xml and the code to read v2/entries?count=-1

# kiwix_catalog_url = 'https://library.kiwix.org/catalog/root.xml' # OPDS catalog
kiwix_catalog_url = 'https://library.kiwix.org/catalog/v2/entries?count=-1' # 3/21/2023 new OPDS endpoint

https://github.com/iiab/iiab-admin-console/blob/master/roles/cmdsrv/templates/scripts/get_kiwix_catalog_rootxml.py https://github.com/iiab/iiab-admin-console/blob/master/roles/cmdsrv/templates/scripts/get_kiwix_catalog_opds_v2.py

Does requests honor the charset directive?

rgaudin commented 1 year ago

Hum, it's an XML document and the encoding is properly declary in the XML Declaration Node (very first line)

❯ curl https://library.kiwix.org/catalog/v2/entries?count=1
<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:dc="http://purl.org/dc/terms/"
      xmlns:opds="https://specs.opds.io/opds-1.2"
      xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">
  <id>a072cd94-2127-291e-d54a-b8fc5f9dee62</id>

  <link rel="self"
        href="/catalog/v2/entries?count=1"
        type="application/atom+xml;profile=opds-catalog;kind=acquisition"/>
  <link rel="start"
        href="/catalog/v2/root.xml"
        type="application/atom+xml;profile=opds-catalog;kind=navigation"/>
  <link rel="up"
        href="/catalog/v2/root.xml"
        type="application/atom+xml;profile=opds-catalog;kind=navigation"/>

  <title>Filtered Entries (count=1)</title>
  <updated>2023-04-22T14:38:18Z</updated>
  <totalResults>4116</totalResults>
  <startIndex>0</startIndex>
  <itemsPerPage>1</itemsPerPage>
  <entry>
    <id>urn:uuid:00099ae9-499a-15b4-2e48-27fe7b56e28f</id>
    <title>Reverse Engineering</title>
    <updated>2022-11-09T00:00:00Z</updated>
    <summary>Q&amp;A for researchers and developers who explore the principles of a system through analysis of its structure, function, and operation</summary>
    <language>eng</language>
    <name>reverseengineering.stackexchange.com_en_all</name>
    <flavour></flavour>
    <category>stack_exchange</category>
    <tags>_category:stack_exchange;stack_exchange;_ftindex:no;_pictures:yes;_videos:yes;_details:yes</tags>
    <articleCount>16304</articleCount>
    <mediaCount>8262</mediaCount>
    <link rel="http://opds-spec.org/image/thumbnail"
          href="/catalog/v2/illustration/00099ae9-499a-15b4-2e48-27fe7b56e28f/?size=48"
          type="image/png;width=48;height=48;scale=1"/>
    <link type="text/html" href="/content/reverseengineering.stackexchange.com_en_all_2022-11" />
    <author>
      <name>Stack Exchange</name>
    </author>
    <publisher>
      <name>Kiwix</name>
    </publisher>
    <dc:issued>2022-11-09T00:00:00Z</dc:issued>
    <link rel="http://opds-spec.org/acquisition/open-access" type="application/x-zim" href="https://download.kiwix.org/zim/stack_exchange/reverseengineering.stackexchange.com_en_all_2022-11.zim.meta4" length="107168768" />
  </entry>
</feed>

@tim-moody beautifulsoup is supposed to parse and use the declared encoding. Can you share a short piece of code that shows it not working?

tim-moody commented 1 year ago

Can you share a short piece of code that shows it not working?

@rgaudin as I said above the only difference between the code that works and the code that doesn't is the url that is read. root.xml works and v2/entries doesn't. (The links to the code are above.)

But let's keep it simple:

import requests

kiwix_root_catalog = 'https://library.kiwix.org/catalog/root.xml' # OPDS catalog
kiwix_v2_catalog = 'https://library.kiwix.org/catalog/v2/entries?count=-1' # 3/21/2023 new OPDS endpoint

test_entry = '637723c8-a736-6b1c-10d6-1b63d79ed5ea'

r = requests.get(kiwix_root_catalog)
offset = r.text.find(test_entry)
print(r.text[offset:offset+100])

r = requests.get(kiwix_v2_catalog)
offset = r.text.find(test_entry)
print(r.text[offset:offset+100])

The first prints

print(r.text[offset:offset+100]) 637723c8-a736-6b1c-10d6-1b63d79ed5ea
Matemáticas por Wikipedia

The second

print(r.text[offset:offset+100]) 637723c8-a736-6b1c-10d6-1b63d79ed5ea
MatemÃ¡ticas por Wikipedia
<updated

rgaudin commented 1 year ago

@rgaudin as I said above the only difference between the code that works and the code that doesn't is the url that is read. root.xml works and v2/entries doesn't. (The links to the code are above.)

Well pardon me for being focused on our end of the issue and not yours 😅 It's clear from your initial message that the charset is not conveyed via HTTP. That doesn't make it a problem since we are sending the proper Content-Type as per the spec and the charset is specified in the document.

Now if you're looking at requests only, it would be just fair that encoding isn't picked up since it's in the document and requests here is being used for the transport only.

As I wrote before, beautifulsoup being the xml parser (with lxml), it states in its doc that it reads the XMLDecl and adjusts for it.

That said, I see those strings properly decoded here. And the response's apparent_encoding is correct. Can you try updating requests?

Python 3.11.0 (v3.11.0:deaf509e8f, Oct 24 2022, 14:43:23) [Clang 13.0.0 (clang-1300.0.29.30)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import requests
>>>
>>> kiwix_root_catalog = 'https://library.kiwix.org/catalog/root.xml' # OPDS catalog
>>> kiwix_v2_catalog = 'https://library.kiwix.org/catalog/v2/entries?count=-1' # 3/21/2023 new OPDS endpoint
>>>
>>> test_entry = '637723c8-a736-6b1c-10d6-1b63d79ed5ea'
>>>
>>> r = requests.get(kiwix_root_catalog)
offset = r.text.find(test_entry)
print(r.text[offset:offset+100])

r = requests.get(kiwix_v2_catalog)
offset = r.text.find(test_entry)
print(r.text[offset:offset+100])
>>> offset = r.text.find(test_entry)
>>> print(r.text[offset:offset+100])
637723c8-a736-6b1c-10d6-1b63d79ed5ea</id>
    <title>Matemáticas por Wikipedia</title>
    <updated>
>>>
>>> r = requests.get(kiwix_v2_catalog)
>>> offset = r.text.find(test_entry)
>>> print(r.text[offset:offset+100])
637723c8-a736-6b1c-10d6-1b63d79ed5ea</id>
    <title>Matemáticas por Wikipedia</title>
    <updated>
>>> r.apparent_encoding
'utf-8'
>>>

holta commented 1 year ago

Can you try updating requests?

FWIW that didn't work for me on Ubuntu 23.04.

Even after I upgraded its requests from 2.8.1 (Jun 29, 2022) to 2.8.2 (Jan 12, 2023):

Python 3.11.2 (main, Mar 13 2023, 12:18:29) [GCC 12.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import requests
>>> print(requests.__version__)
2.28.2
>>> r = requests.get('https://library.kiwix.org/catalog/v2/entries?count=-1')
>>> offset = r.text.find('637723c8-a736-6b1c-10d6-1b63d79ed5ea')
>>> print(r.text[offset:offset+100])
637723c8-a736-6b1c-10d6-1b63d79ed5ea</id>
    <title>MatemÃ¡ticas por Wikipedia</title>
    <updated
>>> r.apparent_encoding
'Windows-1254'

Anything else I should try above?

rgaudin commented 1 year ago

Ah ok so maybe requests can't pick it and defaulted to system encoding. Bs4 should though. I'll look into your script on Monday

holta commented 1 year ago

Python 3.11.0 (v3.11.0:deaf509e8f, Oct 24 2022, 14:43:23) [Clang 13.0.0 (clang-1300.0.29.30)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import requests
>>>
>>> kiwix_root_catalog = 'https://library.kiwix.org/catalog/root.xml' # OPDS catalog
>>> kiwix_v2_catalog = 'https://library.kiwix.org/catalog/v2/entries?count=-1' # 3/21/2023 new OPDS endpoint
>>>
>>> test_entry = '637723c8-a736-6b1c-10d6-1b63d79ed5ea'

...

>>> r = requests.get(kiwix_v2_catalog)
>>> offset = r.text.find(test_entry)
>>> print(r.text[offset:offset+100])
637723c8-a736-6b1c-10d6-1b63d79ed5ea</id>
    <title>Matemáticas por Wikipedia</title>
    <updated>
>>> r.apparent_encoding
'utf-8'

@rgaudin is there anything special about your OS / environment / conditions above?

(So far others cannot reproduce your output above — despite trying on several different "stock" freshly installed Linux OS's like Ubuntu, Debian and Raspberry Pi OS — any suggestions or tips?)

tim-moody commented 1 year ago

so we can work around this by explicitly decoding:

utf8_xlm = r.content.decode("utf-8")
offset = utf8_xlm.find(test_entry)
print(utf8_xlm[offset:offset+100])

Still, library.xml, library_zims.xml, and root.xml can all be read by a browser, and this is usefull for troubleshooting.

rgaudin commented 1 year ago

Yes ; seems quite obvious that using r.text was using the requests-decoded string (and thus System one) ; leading to the issue.

https://github.com/iiab/iiab-admin-console/blob/6138f5217a4ce4f19c6d130a8ebfc65c2116ae1d/roles/cmdsrv/templates/scripts/get_kiwix_catalog_opds_v2.py#L94

I'd suggest you pass r.content directly to BS instead of decoding to UTF-8. There's no chance we'll switch to another encoding but by passing bytes, BS will use the declared encoding.

@tim-moody @kelson42 can we close this ticket?

kelson42 commented 1 year ago

@rgaudin Would that not be better (than current situation) if HTTP header gives charset as well?

rgaudin commented 1 year ago

@rgaudin Would that not be better (than current situation) if HTTP header gives charset as well?

I think not but for lousy reasons:

By respecting the spec's Content-Types we're preventing other tools that would expect this exact CTs from breaking if they didn't account for an extra charset specification. HTTP allows us to specify the charset but it's not well known.
This would mean maintaining the same information twice. Can surely be refactored so we only specify it once once use it both in the XML Declaration and the various endpoints with the different CTs though.

All in all, I think it's a better practice. In the example above, previous code was correct because the HTTP header was sent so it used it but the new one was not because it was not reading the encoding anywhere. What if there's a discrepency between HTTP and XML? One should use XML so by not setting the charset in HTTP header, we're enforcing the better practice 😉

I can live with both situations though. Your call (or libkiwix developers')

veloman-yunkan commented 1 year ago

It was a trivial fix but I used the opportunity for some minor refactoring

tim-moody commented 1 year ago

Are we still talking about https://library.kiwix.org/catalog/v2/entries?count=-1? that when viewed in chrome has:

<entry>
    <id>urn:uuid:225744bb-9488-085c-cfa8-19bf352c9e44</id>
    <title>MatemÃ¡ticas por Wikipedia</title>
    <updated>2023-04-23T00:00:00Z</updated>
    <summary>Una selecciÃ³n de artÃculos de Wikipedia sobre matemÃ¡ticas</summary>
    <language>spa</language>
    <name>wikipedia_es_mathematics</name>
    <flavour>maxi</flavour>
    <category>wikipedia</category>
    <tags>wikipedia;_category:wikipedia;_pictures:yes;_videos:no;_details:yes;_ftindex:yes</tags>
    <articleCount>6474</articleCount>
    <mediaCount>82062</mediaCount>
    <link rel="http://opds-spec.org/image/thumbnail"
          href="/catalog/v2/illustration/225744bb-9488-085c-cfa8-19bf352c9e44/?size=48"
          type="image/png;width=48;height=48;scale=1"/>
    <link type="text/html" href="/content/wikipedia_es_mathematics_maxi_2023-04" />
    <author>
      <name>Wikipedia</name>
    </author>
    <publisher>
      <name>Kiwix</name>
    </publisher>
    <dc:issued>2023-04-23T00:00:00Z</dc:issued>
    <link rel="http://opds-spec.org/acquisition/open-access" type="application/x-zim" href="https://download.kiwix.org/zim/wikipedia/wikipedia_es_mathematics_maxi_2023-04.zim.meta4" length="693760000" />
  </entry>

rgaudin commented 1 year ago

@tim-moody It's been merged but not deployed. Nightlies are deployed everyday at https://dev.library.kiwix.org/ but library.kiwix.org only changes after kiwix-tools releases

kiwix / libkiwix

`catalog/v2` Kiwix Server end-point does not specify anymore the charset in the HTTP headers #941