kiwix / operations

Kiwix Kubernetes Cluster
http://charts.k8s.kiwix.org/

Tags of `cest-pas-sorcier_fr_astronomie_2024-10` are not populated properly #286

Closed by benoit74 1 month ago

benoit74 commented 1 month ago

cest-pas-sorcier_fr_astronomie_2024-10.zim has just been published to dev library by https://farm.openzim.org/pipeline/8e629d81-229a-4895-88e4-60b11bdde0f9

Looking at dev-library.xml, I see that the tags referenced in the catalog are _ftindex:no;_pictures:yes;_videos:yes;_details:yes, as can be seen at https://dev.library.kiwix.org/catalog/v2/entry/fc792025-1503-ee93-71db-f6675d6bf47b for instance.

Looking at the ZIM, I see different tags: Astronomie;Cosmologie;_videos:yes, as can be seen at https://dev.library.kiwix.org/raw/cest-pas-sorcier_fr_astronomie_2024-10/meta/tags

Where does this discrepancy come from? Is it a scraper bug? A bug in the library generation script?

rgaudin commented 1 month ago

@benoit74 it looks like a scraper bug:

❯ curl -I https://dev.library.kiwix.org/raw/cest-pas-sorcier_fr_astronomie_2024-10/meta/Tags
HTTP/2 404

https://dev.library.kiwix.org/raw/cest-pas-sorcier_fr_astronomie_2024-10/meta/tags contains the tags metadata entry, which is not the same as Tags and is not part of our convention.

zim.metadata_keys
['Counter',
 'Creator',
 'Date',
 'Description',
 'Illustration_48x48@1',
 'Language',
 'LongDescription',
 'Name',
 'Publisher',
 'Title',
 'scraper',
 'tags']
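
The two lowercase entries at the end of this list differ from the conventional names only by letter case, which can be checked mechanically. A minimal sketch, assuming a hypothetical helper `find_miscased_keys` and a hand-written, non-exhaustive set of conventional metadata names (neither is part of scraperlib):

```python
# Subset of the conventional openZIM metadata names (assumption: not exhaustive).
CONVENTIONAL_KEYS = {
    "Name", "Title", "Creator", "Publisher", "Date", "Description",
    "LongDescription", "Language", "Tags", "Scraper", "Counter",
}


def find_miscased_keys(keys):
    """Return {actual_key: conventional_key} for keys differing only by case."""
    by_lower = {name.lower(): name for name in CONVENTIONAL_KEYS}
    return {
        key: by_lower[key.lower()]
        for key in keys
        if key not in CONVENTIONAL_KEYS and key.lower() in by_lower
    }


# Keys reported for this ZIM in the thread.
keys = ["Counter", "Creator", "Date", "Description", "Illustration_48x48@1",
        "Language", "LongDescription", "Name", "Publisher", "Title",
        "scraper", "tags"]
print(find_miscased_keys(keys))  # {'scraper': 'Scraper', 'tags': 'Tags'}
```

Keys that match no conventional name at all, such as `Illustration_48x48@1` here, are deliberately ignored by this sketch.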

zim.get_metadata("tags")
b'Astronomie;Cosmologie;_videos:yes'

zim.get_tags()
['']

zim.get_tags(libkiwix=True)
['_ftindex:no', '_pictures:yes', '_videos:yes', '_details:yes']

The _ftindex:no;_pictures:yes;_videos:yes;_details:yes value comes from scraperlib and, since there is nothing else, that is what goes into the XML library, which feeds the OPDS catalog.
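
The behaviour seen in the calls above can be reproduced with a simplified model. This is only a sketch: `read_tags`, `with_default_hints`, and `DEFAULT_HINTS` are hypothetical names, and the logic is an assumption inferred from the outputs in this thread, not scraperlib's or libkiwix's actual implementation.

```python
# Hint tags and their default values, in the order seen in the catalog entry.
DEFAULT_HINTS = [("_ftindex", "no"), ("_pictures", "yes"),
                 ("_videos", "yes"), ("_details", "yes")]


def read_tags(metadata):
    """Case-sensitive lookup of 'Tags'; a missing key yields ['']."""
    return metadata.get("Tags", "").split(";")


def with_default_hints(tags):
    """Drop empty tags, then append every hint tag that is not already set."""
    result = [tag for tag in tags if tag]
    for name, default in DEFAULT_HINTS:
        if not any(tag.startswith(name + ":") for tag in result):
            result.append(f"{name}:{default}")
    return result


# Metadata as stored in this ZIM: the key is 'tags', not 'Tags'.
metadata = {"tags": "Astronomie;Cosmologie;_videos:yes"}
tags = read_tags(metadata)
print(tags)                      # ['']
print(with_default_hints(tags))  # ['_ftindex:no', '_pictures:yes', '_videos:yes', '_details:yes']
```

Because the lookup is case-sensitive, the scraper's own tags never reach the second step, so only the default hints remain.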

curl https://dev.library.kiwix.org/catalog/v2/entry/fc792025-1503-ee93-71db-f6675d6bf47b

<?xml version="1.0" encoding="UTF-8"?>
  <entry>
    <id>urn:uuid:fc792025-1503-ee93-71db-f6675d6bf47b</id>
    <title>L&apos;espace, c&apos;est pas sorcier</title>
    <updated>2024-10-17T00:00:00Z</updated>
    <summary>Magazine télévisuel de vulgarisation scientifique destiné aux enfants</summary>
    <language>fra</language>
    <name>cest-pas-sorcier_fr_astronomie</name>
    <flavour></flavour>
    <category></category>
    <tags>_ftindex:no;_pictures:yes;_videos:yes;_details:yes</tags>
    <articleCount>30</articleCount>
    <mediaCount>62</mediaCount>
    <link rel="http://opds-spec.org/image/thumbnail"
          href="/catalog/v2/illustration/fc792025-1503-ee93-71db-f6675d6bf47b/?size=48"
          type="image/png;width=48;height=48;scale=1"/>
    <link type="text/html" href="/content/cest-pas-sorcier_fr_astronomie_2024-10" />
    <author>
      <name>Youtube Channel “C&apos;est pas sorcier”</name>
    </author>
    <publisher>
      <name>openZIM</name>
    </publisher>
    <dc:issued>2024-10-17T00:00:00Z</dc:issued>
    <link rel="http://opds-spec.org/acquisition/open-access" type="application/x-zim" href="https://mirror.download.kiwix.org/zim/.hidden/dev/cest-pas-sorcier_fr_astronomie_2024-10.zim.meta4" length="919883776" />
  </entry>

# cat dev_library.xml |grep fc792025-1503-ee93-71db-f6675d6bf47b

<book
    id="fc792025-1503-ee93-71db-f6675d6bf47b"
    size="898324"
    url="https://mirror.download.kiwix.org/zim/.hidden/dev/cest-pas-sorcier_fr_astronomie_2024-10.zim.meta4"
    mediaCount="62"
    articleCount="30"
    favicon="iVBORw0KGgoAAAANSUhEUgAAADAAAAAwCAIAAADYYG7QAAAIgklEQVR4nO2Yf1BVxxXHz9m9991333s0CAKdqhQUECxa5Uc601KNY/wVUVsZtbbOtDGmOtGqRFvFJkNmOiYVf5SMaTrTjj+CNaLRZtqqMWrQVjsNBkEQHq0/+BG0KqII77373rs/9vSP+0Ck6ZA/SJrO8P1r796zu589u3vOzuLNtlb4Ion9rwH6awhoIA0BDaQhoMdE/1n1OQFhfwzW96Mv12AB4X8pPxqWen8hERAAAiAB9msgDQ4PRXolexC0J947c6LIoNRjKyJE2J9+kICQAAAI0S4TAEOUHEQEfVkiRrZXGAKRMIAi4DiYQH1FIBgquk/tbEYwEbDHeyLiIWQICCAESsHo5LAjqo8vBwkosmIRP1nIXdjZ8uDQeicEgQiALJDCPAoAFPJzYQEigAihy73oV/iVLDBC0LN2gwNk70+ytywgAcjCiDZuq+QDZECWX7jjlrwIZPnKX/IwjRBBiBB/AsAIArA+B23wlqxndxIxAkREYBKRJJAz4LpwwJe/jmSaIBGXBXIEi1BCQASg3kMxiECPdgES9nwSIiASICcdQ10AwIQJiIhESACAwAAQCHvn8xlsaiCCyMkXRAgECExYZOqAHECIiEPs2BQx7tVgAtFjIQ4BEHv8hmRxBEBAS0Sio71aPa4c/CUD6B9vgCyJhB2IFBQB7/sAqHJTEkwQIZLtJPZYq0EEIgYobCxGlgAMkhtAAAJaBKRpVRWA6CY0LAU4B7KCoCqRMNpnVjfbWi1LcM4AwLIsxhgAICL1CVZ2bLX/2vWMsV5jIQTnHASJSMIgzrlsanD3qswEIDo4e3C+PCE3DzncOnc8ftqysOAAQgA3h6fqkhtI2AMxxrCttcXldhmGSSScihLWw4xxXdc555IkEZGwLCISRB5PlK6HZVnWdT0UCkV5ogzTYIwxxvx+v1NRGOeMcVniwVAoFA5HD4sVQpAQDqdqNX0Y/mcF6QFl7DRMnSwskzGJhBXUArLEkTHGmBDC7/djZ0f7mQ/Obnn1Nb/ft/GnGzxul9fr7XzYNW3608eOHRubNra5pUmSpM2bNu8oLX3vxHtpaWlFRRszMjIOHT785hu/kRV53dq18+bNLS4urrpUHQ7rY0aPLixcm5aWduzYsdLS14VlvbDqhR8sXrhuxY9MXX+lpLTwZ0V379z1+QOTJ+fNnDGjZNsO3dBdLtfSpUtnz5oFVxu9ccPjUlJSn/zGkwBwpPzt7a9t2VBYePRw+bSnply48NdXt/xi/1t79+7ZAwB5ed8CgBnTn2701ssOeVzGuDFjUt0ud9ON69OnTweAvG/mAcD3lyxp+7jF7XYlJyenpqQAQG31peycnMSk0c1N12OHxcTEDH9qypSfF20q27cPALKycmJihkdHD7tSWys1tbTc67hXXPzy3LlzSl/flZk5vquru/Hqtdyc7JkzZ546dSo+drhTUS78/UNZlveXlR0/fry7u6vy4kVDN3bu3NHd7Vu0eJHX641ye0aNSDx//i/Z2TmN3n9cqroUCGjFrxRPmjjhjV2/Ni3DoyohVZUl2SJaMH/e7r27Aayyffs552/u2uVt9C5b/tytf7Ux07IQkciKi4t/+aXNiclJFpDHpdZervU2NDgVxRMVFRX1pQkTxhuGkf/MXKdTKSoq0nwBRHQ4HE6njIiBQIBx5vf7d+/Z3dTSHJ8QZ3fLAFNTUreWbE1NS9OCIUISZKmqs/Ji5aqVK8+8f1pVnJZlnThx/OTJkwDg8XgYA05EQGgYRjAUMkwzIT6+pvoyIMbExk6dOrW5qbmysrKgoKB0504tFFr+/Io1q9coTicRRbIEERE5ZLmz68Hy53+sacEN6wsFWUTEGTMMPRgMCCEQEUhYluVwyC2tLWUHDly7cZ1JHAC2lpQcPvLOypUrMjIymB0FFFXhnNdUXfZ1+e/cuZOamnLixPE
rdXUXzv+tpqYmKSmpra0tf+4zDQ1XsiZmV5w9p+s6Isqcc8YRkUssFNSSE5PXrP4JIowcNQoJEFHinCFrutHk9wcYY0DAudTV1TVvTv7DBx3Llj3b7fMBwPoNL7rd7ujoaJfLxeLiYwCgouLstpLt+fPn1dXVqKrr9p32UYlJPr/fNPWFixfW19dv21qSnpF5+vRp3dBNSySOHEFEH5w5W1FxjohGjRxpWBYgLChYEA6HD5UfThmTQkSVFz/avn3ntydPvXr1qqq6LEEgUJIct+/e+eOf/lx1qVqWJcZYbm5ubnZO+cHy7m4fdNy7u27dWjsu5efP8fsebi/55TsHD+773W+Xfm9xwXfmr1296rlnf/j7src8Ho8diDdv2tjdeX/2rBl2q4LvLtACvqlTJsc8EX2zrTUjIz05cVRH++1FCxfaBrNnz+q8fy9r0sSU0aObr10bOWJEpH7m9ENvHwCAd/9wZMf2bQDw7tGjeOvmx7Is19fXh8PhCePHyw5He3u7LEkhLZiQkNBx/35ra3N8QkJaenpjQ2NzS2tcXExm5ngEFgoHq6svcebIzpkkSVJjgzcY1CZmZTU3Nd9rv5uVlc0lfrHqI8sSubk5TofzSl2tEJSRnn6lod4wTSSKjY0dNiymwesdl/k1hyzXXa5N+moy3mxrJSJVVRljmqYJIRwOBxEhommakiQ5FNk0zGAwqDrdilMxTV3TNABEZG63GwACAT8ROVWVMxYIBBRFkR0OLRAgoh6DABGpLhcCaMGgqqoMEREN0zQNw+VyaZpmG4fCIbQfrIQQdiqxk1EkhQESEJFAREQmBBER4iMzuxXnvG8PQgggYpzb6e+TDQDsOSOiEKI3JzLGpN6s1i+V9jIh8h4b7HvhQUR7pH499O1qQIO+NnZh6LFhIH0aIILH71CfqT7NjfETHg8+O/1fLtnnqiGggTQENJCGgAbSFw7o35xNI7JZd56ZAAAAAElFTkSuQmCC"
    title="L'espace, c'est pas sorcier"
    description="Magazine t&#xE9;l&#xE9;visuel de vulgarisation scientifique destin&#xE9; aux enfants"
    language="fra"
    creator="Youtube Channel &#x201C;C'est pas sorcier&#x201D;"
    publisher="openZIM"
    name="cest-pas-sorcier_fr_astronomie"
    tags="_ftindex:no;_pictures:yes;_videos:yes;_details:yes"
    date="2024-10-17"
    faviconMimeType="image/png"
    path="/data/dev/cest-pas-sorcier_fr_astronomie_2024-10.zim"/>
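
The manual comparison between the library entry and the ZIM can also be scripted. A minimal sketch, assuming a hypothetical helper `missing_from_library` and a trimmed copy of the `<book>` element (favicon and most attributes omitted):

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the <book> element above (favicon and other attributes omitted).
BOOK_XML = """<book
    id="fc792025-1503-ee93-71db-f6675d6bf47b"
    name="cest-pas-sorcier_fr_astronomie"
    tags="_ftindex:no;_pictures:yes;_videos:yes;_details:yes"
    path="/data/dev/cest-pas-sorcier_fr_astronomie_2024-10.zim"/>"""


def missing_from_library(book_xml, zim_tags):
    """Return the ZIM tags that are absent from the book's tags attribute."""
    book = ET.fromstring(book_xml)
    library_tags = set(book.get("tags", "").split(";"))
    return [tag for tag in zim_tags.split(";") if tag not in library_tags]


# Value of the lowercase 'tags' entry read from the ZIM.
print(missing_from_library(BOOK_XML, "Astronomie;Cosmologie;_videos:yes"))
# ['Astronomie', 'Cosmologie']
```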

kelson42 commented 1 month ago

@rgaudin Where is the related issue (looks like one should be open if this is a scraper bug)?

rgaudin commented 1 month ago

I assume @benoit74 will open one once he reads my answer, as this ticket was a question.

benoit74 commented 1 month ago

Thank you for the precise analysis!

kelson42 commented 1 month ago

@rgaudin OK, in general we should be careful not to close an issue before the follow-up issue is open. This helps avoid forgetting something, and more concretely it makes things much easier for me to follow.