Closed kelson42 closed 1 year ago
A little background to shed additional light on this issue, and which I find puzzling.
IIAB reads the Kiwix catalog in python using requests.get and then parses it using Beautiful Soup in order to produce an equivalent json file for use in generating lists of available zims.
There are only two lines that are different between the code to read root.xml and the code to read v2/entries?count=-1
# kiwix_catalog_url = 'https://library.kiwix.org/catalog/root.xml' # OPDS catalog
kiwix_catalog_url = 'https://library.kiwix.org/catalog/v2/entries?count=-1' # 3/21/2023 new OPDS endpoint
https://github.com/iiab/iiab-admin-console/blob/master/roles/cmdsrv/templates/scripts/get_kiwix_catalog_rootxml.py https://github.com/iiab/iiab-admin-console/blob/master/roles/cmdsrv/templates/scripts/get_kiwix_catalog_opds_v2.py
Does requests honor the charset directive?
Hum, it's an XML document and the encoding is properly declary in the XML Declaration Node (very first line)
❯ curl https://library.kiwix.org/catalog/v2/entries?count=1
<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom"
xmlns:dc="http://purl.org/dc/terms/"
xmlns:opds="https://specs.opds.io/opds-1.2"
xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">
<id>a072cd94-2127-291e-d54a-b8fc5f9dee62</id>
<link rel="self"
href="/catalog/v2/entries?count=1"
type="application/atom+xml;profile=opds-catalog;kind=acquisition"/>
<link rel="start"
href="/catalog/v2/root.xml"
type="application/atom+xml;profile=opds-catalog;kind=navigation"/>
<link rel="up"
href="/catalog/v2/root.xml"
type="application/atom+xml;profile=opds-catalog;kind=navigation"/>
<title>Filtered Entries (count=1)</title>
<updated>2023-04-22T14:38:18Z</updated>
<totalResults>4116</totalResults>
<startIndex>0</startIndex>
<itemsPerPage>1</itemsPerPage>
<entry>
<id>urn:uuid:00099ae9-499a-15b4-2e48-27fe7b56e28f</id>
<title>Reverse Engineering</title>
<updated>2022-11-09T00:00:00Z</updated>
<summary>Q&A for researchers and developers who explore the principles of a system through analysis of its structure, function, and operation</summary>
<language>eng</language>
<name>reverseengineering.stackexchange.com_en_all</name>
<flavour></flavour>
<category>stack_exchange</category>
<tags>_category:stack_exchange;stack_exchange;_ftindex:no;_pictures:yes;_videos:yes;_details:yes</tags>
<articleCount>16304</articleCount>
<mediaCount>8262</mediaCount>
<link rel="http://opds-spec.org/image/thumbnail"
href="/catalog/v2/illustration/00099ae9-499a-15b4-2e48-27fe7b56e28f/?size=48"
type="image/png;width=48;height=48;scale=1"/>
<link type="text/html" href="/content/reverseengineering.stackexchange.com_en_all_2022-11" />
<author>
<name>Stack Exchange</name>
</author>
<publisher>
<name>Kiwix</name>
</publisher>
<dc:issued>2022-11-09T00:00:00Z</dc:issued>
<link rel="http://opds-spec.org/acquisition/open-access" type="application/x-zim" href="https://download.kiwix.org/zim/stack_exchange/reverseengineering.stackexchange.com_en_all_2022-11.zim.meta4" length="107168768" />
</entry>
</feed>
@tim-moody beautifulsoup is supposed to parse and use the declared encoding. Can you share a short piece of code that shows it not working?
Can you share a short piece of code that shows it not working?
@rgaudin as I said above the only difference between the code that works and the code that doesn't is the url that is read. root.xml works and v2/entries doesn't. (The links to the code are above.)
But let's keep it simple:
import requests
kiwix_root_catalog = 'https://library.kiwix.org/catalog/root.xml' # OPDS catalog
kiwix_v2_catalog = 'https://library.kiwix.org/catalog/v2/entries?count=-1' # 3/21/2023 new OPDS endpoint
test_entry = '637723c8-a736-6b1c-10d6-1b63d79ed5ea'
r = requests.get(kiwix_root_catalog)
offset = r.text.find(test_entry)
print(r.text[offset:offset+100])
r = requests.get(kiwix_v2_catalog)
offset = r.text.find(test_entry)
print(r.text[offset:offset+100])
The first prints
print(r.text[offset:offset+100]) 637723c8-a736-6b1c-10d6-1b63d79ed5ea
Matemáticas por Wikipedia
The second
print(r.text[offset:offset+100]) 637723c8-a736-6b1c-10d6-1b63d79ed5ea
Matemáticas por Wikipedia <updated
@rgaudin as I said above the only difference between the code that works and the code that doesn't is the url that is read. root.xml works and v2/entries doesn't. (The links to the code are above.)
Well pardon me for being focused on our end of the issue and not yours 😅 It's clear from your initial message that the charset is not conveyed via HTTP. That doesn't make it a problem since we are sending the proper Content-Type as per the spec and the charset is specified in the document.
Now if you're looking at requests only, it would be just fair that encoding isn't picked up since it's in the document and requests here is being used for the transport only.
As I wrote before, beautifulsoup being the xml parser (with lxml), it states in its doc that it reads the XMLDecl and adjusts for it.
That said, I see those strings properly decoded here. And the response's apparent_encoding
is correct. Can you try updating requests
?
Python 3.11.0 (v3.11.0:deaf509e8f, Oct 24 2022, 14:43:23) [Clang 13.0.0 (clang-1300.0.29.30)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import requests
>>>
>>> kiwix_root_catalog = 'https://library.kiwix.org/catalog/root.xml' # OPDS catalog
>>> kiwix_v2_catalog = 'https://library.kiwix.org/catalog/v2/entries?count=-1' # 3/21/2023 new OPDS endpoint
>>>
>>> test_entry = '637723c8-a736-6b1c-10d6-1b63d79ed5ea'
>>>
>>> r = requests.get(kiwix_root_catalog)
offset = r.text.find(test_entry)
print(r.text[offset:offset+100])
r = requests.get(kiwix_v2_catalog)
offset = r.text.find(test_entry)
print(r.text[offset:offset+100])
>>> offset = r.text.find(test_entry)
>>> print(r.text[offset:offset+100])
637723c8-a736-6b1c-10d6-1b63d79ed5ea</id>
<title>Matemáticas por Wikipedia</title>
<updated>
>>>
>>> r = requests.get(kiwix_v2_catalog)
>>> offset = r.text.find(test_entry)
>>> print(r.text[offset:offset+100])
637723c8-a736-6b1c-10d6-1b63d79ed5ea</id>
<title>Matemáticas por Wikipedia</title>
<updated>
>>> r.apparent_encoding
'utf-8'
>>>
Can you try updating
requests
?
FWIW that didn't work for me on Ubuntu 23.04.
Even after I upgraded its requests from 2.8.1 (Jun 29, 2022) to 2.8.2 (Jan 12, 2023):
Python 3.11.2 (main, Mar 13 2023, 12:18:29) [GCC 12.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import requests
>>> print(requests.__version__)
2.28.2
>>> r = requests.get('https://library.kiwix.org/catalog/v2/entries?count=-1')
>>> offset = r.text.find('637723c8-a736-6b1c-10d6-1b63d79ed5ea')
>>> print(r.text[offset:offset+100])
637723c8-a736-6b1c-10d6-1b63d79ed5ea</id>
<title>Matemáticas por Wikipedia</title>
<updated
>>> r.apparent_encoding
'Windows-1254'
Anything else I should try above?
Ah ok so maybe requests can't pick it and defaulted to system encoding. Bs4 should though. I'll look into your script on Monday
Python 3.11.0 (v3.11.0:deaf509e8f, Oct 24 2022, 14:43:23) [Clang 13.0.0 (clang-1300.0.29.30)] on darwin Type "help", "copyright", "credits" or "license" for more information. >>> import requests >>> >>> kiwix_root_catalog = 'https://library.kiwix.org/catalog/root.xml' # OPDS catalog >>> kiwix_v2_catalog = 'https://library.kiwix.org/catalog/v2/entries?count=-1' # 3/21/2023 new OPDS endpoint >>> >>> test_entry = '637723c8-a736-6b1c-10d6-1b63d79ed5ea'
...
>>> r = requests.get(kiwix_v2_catalog) >>> offset = r.text.find(test_entry) >>> print(r.text[offset:offset+100]) 637723c8-a736-6b1c-10d6-1b63d79ed5ea</id> <title>Matemáticas por Wikipedia</title> <updated> >>> r.apparent_encoding 'utf-8'
@rgaudin is there anything special about your OS / environment / conditions above?
(So far others cannot reproduce your output above — despite trying on several different "stock" freshly installed Linux OS's like Ubuntu, Debian and Raspberry Pi OS — any suggestions or tips?)
so we can work around this by explicitly decoding:
utf8_xlm = r.content.decode("utf-8")
offset = utf8_xlm.find(test_entry)
print(utf8_xlm[offset:offset+100])
Still, library.xml, library_zims.xml, and root.xml can all be read by a browser, and this is usefull for troubleshooting.
Yes ; seems quite obvious that using r.text
was using the requests-decoded string (and thus System one) ; leading to the issue.
I'd suggest you pass r.content
directly to BS instead of decoding to UTF-8. There's no chance we'll switch to another encoding but by passing bytes, BS will use the declared encoding.
@tim-moody @kelson42 can we close this ticket?
@rgaudin Would that not be better (than current situation) if HTTP header gives charset as well?
@rgaudin Would that not be better (than current situation) if HTTP header gives charset as well?
I think not but for lousy reasons:
All in all, I think it's a better practice. In the example above, previous code was correct because the HTTP header was sent so it used it but the new one was not because it was not reading the encoding anywhere. What if there's a discrepency between HTTP and XML? One should use XML so by not setting the charset in HTTP header, we're enforcing the better practice 😉
I can live with both situations though. Your call (or libkiwix developers')
It was a trivial fix but I used the opportunity for some minor refactoring
Are we still talking about https://library.kiwix.org/catalog/v2/entries?count=-1? that when viewed in chrome has:
<entry>
<id>urn:uuid:225744bb-9488-085c-cfa8-19bf352c9e44</id>
<title>Matemáticas por Wikipedia</title>
<updated>2023-04-23T00:00:00Z</updated>
<summary>Una selección de artÃculos de Wikipedia sobre matemáticas</summary>
<language>spa</language>
<name>wikipedia_es_mathematics</name>
<flavour>maxi</flavour>
<category>wikipedia</category>
<tags>wikipedia;_category:wikipedia;_pictures:yes;_videos:no;_details:yes;_ftindex:yes</tags>
<articleCount>6474</articleCount>
<mediaCount>82062</mediaCount>
<link rel="http://opds-spec.org/image/thumbnail"
href="/catalog/v2/illustration/225744bb-9488-085c-cfa8-19bf352c9e44/?size=48"
type="image/png;width=48;height=48;scale=1"/>
<link type="text/html" href="/content/wikipedia_es_mathematics_maxi_2023-04" />
<author>
<name>Wikipedia</name>
</author>
<publisher>
<name>Kiwix</name>
</publisher>
<dc:issued>2023-04-23T00:00:00Z</dc:issued>
<link rel="http://opds-spec.org/acquisition/open-access" type="application/x-zim" href="https://download.kiwix.org/zim/wikipedia/wikipedia_es_mathematics_maxi_2023-04.zim.meta4" length="693760000" />
</entry>
@tim-moody It's been merged but not deployed. Nightlies are deployed everyday at https://dev.library.kiwix.org/ but library.kiwix.org only changes after kiwix-tools releases
Following remark of @tim-moody, I have run the following test:
and then
The new end-point does not specify anymore the
charset=utf-8
in thecontent-type
header. Any good reason or we have just forgotten?