Closed benoit74 closed 1 month ago
@benoit74 it looks like a scraper bug:
❯ curl -I https://dev.library.kiwix.org/raw/cest-pas-sorcier_fr_astronomie_2024-10/meta/Tags
HTTP/2 404
https://dev.library.kiwix.org/raw/cest-pas-sorcier_fr_astronomie_2024-10/meta/tags contains the lowercase tags metadata entry, which is not the same as Tags and is not part of our convention.
zim.metadata_keys
['Counter',
'Creator',
'Date',
'Description',
'Illustration_48x48@1',
'Language',
'LongDescription',
'Name',
'Publisher',
'Title',
'scraper',
'tags']
zim.get_metadata("tags")
b'Astronomie;Cosmologie;_videos:yes'
zim.get_tags()
['']
zim.get_tags(libkiwix=True)
['_ftindex:no', '_pictures:yes', '_videos:yes', '_details:yes']
The _ftindex:no;_pictures:yes;_videos:yes;_details:yes value comes from scraperlib and, given there's nothing else, that's what is put into the XML library which feeds the OPDS catalog.
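To make the mechanism concrete, here is a minimal sketch of how a case-sensitive lookup of Tags, followed by filling in default hint tags, produces the values seen above. The metadata dict, read_tags and with_default_hints are hypothetical stand-ins written for illustration; they are not the scraperlib or libkiwix implementation, and the ordering of the result is an assumption.

```python
# Hypothetical stand-in for the ZIM's metadata; values copied from the output above.
metadata = {
    "Title": "L'espace, c'est pas sorcier",
    "tags": "Astronomie;Cosmologie;_videos:yes",  # lowercase key written by the scraper
}

# Default hint tags, as they appear in the get_tags(libkiwix=True) output above.
DEFAULT_HINTS = ["_ftindex:no", "_pictures:yes", "_videos:yes", "_details:yes"]


def read_tags(meta: dict[str, str]) -> list[str]:
    """Case-sensitive lookup of the conventional "Tags" key (assumed behaviour)."""
    return meta.get("Tags", "").split(";")


def with_default_hints(tags: list[str]) -> list[str]:
    """Drop empty tags, then add each default hint whose name is not already set."""
    result = [tag for tag in tags if tag]
    present = {tag.split(":", 1)[0] for tag in result}
    for hint in DEFAULT_HINTS:
        if hint.split(":", 1)[0] not in present:
            result.append(hint)
    return result


raw_tags = read_tags(metadata)              # "Tags" is missing, so only an empty tag
catalog_tags = with_default_hints(raw_tags)  # only the default hints remain

# What the same logic would yield if the scraper had written "Tags" instead.
fixed_tags = with_default_hints(metadata["tags"].split(";"))
```

In this sketch, the topical tags (Astronomie, Cosmologie) only survive once the key is spelled Tags, which is consistent with the discrepancy being on the scraper side.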
curl https://dev.library.kiwix.org/catalog/v2/entry/fc792025-1503-ee93-71db-f6675d6bf47b
<?xml version="1.0" encoding="UTF-8"?>
<entry>
<id>urn:uuid:fc792025-1503-ee93-71db-f6675d6bf47b</id>
<title>L'espace, c'est pas sorcier</title>
<updated>2024-10-17T00:00:00Z</updated>
<summary>Magazine télévisuel de vulgarisation scientifique destiné aux enfants</summary>
<language>fra</language>
<name>cest-pas-sorcier_fr_astronomie</name>
<flavour></flavour>
<category></category>
<tags>_ftindex:no;_pictures:yes;_videos:yes;_details:yes</tags>
<articleCount>30</articleCount>
<mediaCount>62</mediaCount>
<link rel="http://opds-spec.org/image/thumbnail"
href="/catalog/v2/illustration/fc792025-1503-ee93-71db-f6675d6bf47b/?size=48"
type="image/png;width=48;height=48;scale=1"/>
<link type="text/html" href="/content/cest-pas-sorcier_fr_astronomie_2024-10" />
<author>
<name>Youtube Channel “C'est pas sorcier”</name>
</author>
<publisher>
<name>openZIM</name>
</publisher>
<dc:issued>2024-10-17T00:00:00Z</dc:issued>
<link rel="http://opds-spec.org/acquisition/open-access" type="application/x-zim" href="https://mirror.download.kiwix.org/zim/.hidden/dev/cest-pas-sorcier_fr_astronomie_2024-10.zim.meta4" length="919883776" />
</entry>
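As a quick check of what the catalog exposes, the tags element of such an entry can be split and filtered for non-hint tags. This is only a sketch: ENTRY_XML is a trimmed copy of the entry above (the live feed may declare XML namespaces, which would require namespace-aware lookups), and entry_tags is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the catalog entry above; assumes no XML namespaces.
ENTRY_XML = """<entry>
  <name>cest-pas-sorcier_fr_astronomie</name>
  <tags>_ftindex:no;_pictures:yes;_videos:yes;_details:yes</tags>
</entry>"""


def entry_tags(entry_xml: str) -> list[str]:
    """Return the non-empty, semicolon-separated tags of an entry."""
    root = ET.fromstring(entry_xml)
    text = root.findtext("tags", default="")
    return [tag for tag in text.split(";") if tag]


tags = entry_tags(ENTRY_XML)
# Tags starting with "_" are hint tags; anything else is a topical tag.
topical_tags = [tag for tag in tags if not tag.startswith("_")]
```

Here topical_tags ends up empty, which matches the observation that only the default hint tags reached the catalog.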
# cat dev_library.xml |grep fc792025-1503-ee93-71db-f6675d6bf47b
<book
id="fc792025-1503-ee93-71db-f6675d6bf47b"
size="898324"
url="https://mirror.download.kiwix.org/zim/.hidden/dev/cest-pas-sorcier_fr_astronomie_2024-10.zim.meta4"
mediaCount="62"
articleCount="30"
favicon="iVBORw0KGgoAAAANSUhEUgAAADAAAAAwCAIAAADYYG7QAAAIgklEQVR4nO2Yf1BVxxXHz9m9991333s0CAKdqhQUECxa5Uc601KNY/wVUVsZtbbOtDGmOtGqRFvFJkNmOiYVf5SMaTrTjj+CNaLRZtqqMWrQVjsNBkEQHq0/+BG0KqII77373rs/9vSP+0Ck6ZA/SJrO8P1r796zu589u3vOzuLNtlb4Ion9rwH6awhoIA0BDaQhoMdE/1n1OQFhfwzW96Mv12AB4X8pPxqWen8hERAAAiAB9msgDQ4PRXolexC0J947c6LIoNRjKyJE2J9+kICQAAAI0S4TAEOUHEQEfVkiRrZXGAKRMIAi4DiYQH1FIBgquk/tbEYwEbDHeyLiIWQICCAESsHo5LAjqo8vBwkosmIRP1nIXdjZ8uDQeicEgQiALJDCPAoAFPJzYQEigAihy73oV/iVLDBC0LN2gwNk70+ytywgAcjCiDZuq+QDZECWX7jjlrwIZPnKX/IwjRBBiBB/AsAIArA+B23wlqxndxIxAkREYBKRJJAz4LpwwJe/jmSaIBGXBXIEi1BCQASg3kMxiECPdgES9nwSIiASICcdQ10AwIQJiIhESACAwAAQCHvn8xlsaiCCyMkXRAgECExYZOqAHECIiEPs2BQx7tVgAtFjIQ4BEHv8hmRxBEBAS0Sio71aPa4c/CUD6B9vgCyJhB2IFBQB7/sAqHJTEkwQIZLtJPZYq0EEIgYobCxGlgAMkhtAAAJaBKRpVRWA6CY0LAU4B7KCoCqRMNpnVjfbWi1LcM4AwLIsxhgAICL1CVZ2bLX/2vWMsV5jIQTnHASJSMIgzrlsanD3qswEIDo4e3C+PCE3DzncOnc8ftqysOAAQgA3h6fqkhtI2AMxxrCttcXldhmGSSScihLWw4xxXdc555IkEZGwLCISRB5PlK6HZVnWdT0UCkV5ogzTYIwxxvx+v1NRGOeMcVniwVAoFA5HD4sVQpAQDqdqNX0Y/mcF6QFl7DRMnSwskzGJhBXUArLEkTHGmBDC7/djZ0f7mQ/Obnn1Nb/ft/GnGzxul9fr7XzYNW3608eOHRubNra5pUmSpM2bNu8oLX3vxHtpaWlFRRszMjIOHT785hu/kRV53dq18+bNLS4urrpUHQ7rY0aPLixcm5aWduzYsdLS14VlvbDqhR8sXrhuxY9MXX+lpLTwZ0V379z1+QOTJ+fNnDGjZNsO3dBdLtfSpUtnz5oFVxu9ccPjUlJSn/zGkwBwpPzt7a9t2VBYePRw+bSnply48NdXt/xi/1t79+7ZAwB5ed8CgBnTn2701ssOeVzGuDFjUt0ud9ON69OnTweAvG/mAcD3lyxp+7jF7XYlJyenpqQAQG31peycnMSk0c1N12OHxcTEDH9qypSfF20q27cPALKycmJihkdHD7tSWys1tbTc67hXXPzy3LlzSl/flZk5vquru/Hqtdyc7JkzZ546dSo+drhTUS78/UNZlveXlR0/fry7u6vy4kVDN3bu3NHd7Vu0eJHX641ye0aNSDx//i/Z2TmN3n9cqroUCGjFrxRPmjjhjV2/Ni3DoyohVZUl2SJaMH/e7r27Aayyffs552/u2uVt9C5b/tytf7Ux07IQkciKi4t/+aXNiclJFpDHpdZervU2NDgVxRMVFRX1pQkTxhuGkf/MXKdTKSoq0nwBRHQ4HE6njIiBQIBx5vf7d+/Z3dTSHJ8QZ3fLAFNTUreWbE1NS9OCIUISZKmqs/Ji5aqVK8+8f1pVnJZlnThx/OTJkwDg8XgYA05EQGgYRjAUMkwzIT6+pvoyIMbExk6dOrW5qbmysrKgoKB0504tFFr+/Io1q9coTicRRbIEERE5ZLmz68Hy53+sacEN6wsFWUTEGTMMPRgMCCEQEUhYluVwyC2tLWUHDly7cZ1JHAC2lpQcPvLOypUrMjIymB0FFFXhnNdUXfZ1+e/cuZOamnLixPErdXU
Xzv+tpqYmKSmpra0tf+4zDQ1XsiZmV5w9p+s6Isqcc8YRkUssFNSSE5PXrP4JIowcNQoJEFHinCFrutHk9wcYY0DAudTV1TVvTv7DBx3Llj3b7fMBwPoNL7rd7ujoaJfLxeLiYwCgouLstpLt+fPn1dXVqKrr9p32UYlJPr/fNPWFixfW19dv21qSnpF5+vRp3dBNSySOHEFEH5w5W1FxjohGjRxpWBYgLChYEA6HD5UfThmTQkSVFz/avn3ntydPvXr1qqq6LEEgUJIct+/e+eOf/lx1qVqWJcZYbm5ubnZO+cHy7m4fdNy7u27dWjsu5efP8fsebi/55TsHD+773W+Xfm9xwXfmr1296rlnf/j7src8Ho8diDdv2tjdeX/2rBl2q4LvLtACvqlTJsc8EX2zrTUjIz05cVRH++1FCxfaBrNnz+q8fy9r0sSU0aObr10bOWJEpH7m9ENvHwCAd/9wZMf2bQDw7tGjeOvmx7Is19fXh8PhCePHyw5He3u7LEkhLZiQkNBx/35ra3N8QkJaenpjQ2NzS2tcXExm5ngEFgoHq6svcebIzpkkSVJjgzcY1CZmZTU3Nd9rv5uVlc0lfrHqI8sSubk5TofzSl2tEJSRnn6lod4wTSSKjY0dNiymwesdl/k1hyzXXa5N+moy3mxrJSJVVRljmqYJIRwOBxEhommakiQ5FNk0zGAwqDrdilMxTV3TNABEZG63GwACAT8ROVWVMxYIBBRFkR0OLRAgoh6DABGpLhcCaMGgqqoMEREN0zQNw+VyaZpmG4fCIbQfrIQQdiqxk1EkhQESEJFAREQmBBER4iMzuxXnvG8PQgggYpzb6e+TDQDsOSOiEKI3JzLGpN6s1i+V9jIh8h4b7HvhQUR7pH499O1qQIO+NnZh6LFhIH0aIILH71CfqT7NjfETHg8+O/1fLtnnqiGggTQENJCGgAbSFw7o35xNI7JZd56ZAAAAAElFTkSuQmCC"
title="L'espace, c'est pas sorcier"
description="Magazine télévisuel de vulgarisation scientifique destiné aux enfants"
language="fra"
creator="Youtube Channel “C'est pas sorcier”"
publisher="openZIM"
name="cest-pas-sorcier_fr_astronomie"
tags="_ftindex:no;_pictures:yes;_videos:yes;_details:yes"
date="2024-10-17"
faviconMimeType="image/png"
path="/data/dev/cest-pas-sorcier_fr_astronomie_2024-10.zim"/>
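The mismatch between the library entry and the ZIM itself can be shown with a simple set comparison. The sketch below uses a trimmed copy of the book element above and the value returned by zim.get_metadata("tags"); split_tags is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the <book> element above (favicon and other attributes omitted).
BOOK_XML = (
    '<book id="fc792025-1503-ee93-71db-f6675d6bf47b" '
    'name="cest-pas-sorcier_fr_astronomie" '
    'tags="_ftindex:no;_pictures:yes;_videos:yes;_details:yes"/>'
)
# Value of the lowercase "tags" metadata entry in the ZIM, as shown earlier.
ZIM_TAGS = "Astronomie;Cosmologie;_videos:yes"


def split_tags(value: str) -> set[str]:
    """Split a semicolon-separated tag string into a set of non-empty tags."""
    return {tag for tag in value.split(";") if tag}


library_tags = split_tags(ET.fromstring(BOOK_XML).get("tags", ""))
zim_tags = split_tags(ZIM_TAGS)

missing_from_library = sorted(zim_tags - library_tags)
only_in_library = sorted(library_tags - zim_tags)
```

The tags missing from the library are exactly the topical ones stored under the lowercase key, while the ones found only in the library are default hints.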
@rgaudin Where is the related issue (looks like one should be open if this is a scraper bug)?
I assume @benoit74 will open one once he reads my answer, as this ticket was a question.
Thank you for the precise analysis!
@rgaudin OK, in general we should be careful not to close an issue before the follow-up issue is open. This helps avoid forgetting something, and more concretely it makes things much easier for me to follow.
cest-pas-sorcier_fr_astronomie_2024-10.zim has just been published to the dev library by https://farm.openzim.org/pipeline/8e629d81-229a-4895-88e4-60b11bdde0f9
Looking at dev-library.xml, I see that the tags referenced in the catalog are _ftindex:no;_pictures:yes;_videos:yes;_details:yes, as can be seen at https://dev.library.kiwix.org/catalog/v2/entry/fc792025-1503-ee93-71db-f6675d6bf47b for instance.
Looking at the ZIM, I have different tags, Astronomie;Cosmologie;_videos:yes, as can be seen at https://dev.library.kiwix.org/raw/cest-pas-sorcier_fr_astronomie_2024-10/meta/tags
Where does this discrepancy come from? Is it a scraper bug? A bug in the library generation script?