kiwix / kiwix-tools

Command line Kiwix tools: kiwix-serve, kiwix-manage, ...
https://download.kiwix.org/release/kiwix-tools/
GNU General Public License v3.0

library_zim.xml and root.xml show incorrect articleCount for recent ZIM files #536

Closed holta closed 2 years ago

holta commented 2 years ago

Both .xml "catalogs" below contain erroneous articleCount numbers — numbers that are inflated by about 2X or 3X higher than the correct number — for most (or every?) new ZIM file that was published in recent months:

Example:

Both above .xml catalogs incorrectly show articleCount="16229464" — which is almost 3X more than the correct number for: http://download.kiwix.org/zim/wikipedia/wikipedia_en_all_maxi_2021-12.zim

The correct number being text/html=6422607 according to these:

Summary question: can this articleCount number be fixed in one or both of the above .xml files?

kelson42 commented 2 years ago

@holta These files are generated automatically; we cannot fix them manually. @veloman-yunkan This looks strange, but it seems a regression has somehow been introduced here.

veloman-yunkan commented 2 years ago

@kelson42 What tool is used to generate the library.xml file? kiwix-manage?

veloman-yunkan commented 2 years ago

If so then the value of articleCount that kiwix::Manager registers in library.xml for a book is obtained via zim::Archive::getArticleCount():

  entry_index_type Archive::getArticleCount() const
  {
    if (m_impl->hasFrontArticlesIndex()) {
      return m_impl->getFrontEntryCount().v;
    } else if (m_impl->hasNewNamespaceScheme()) {
      return m_impl->getNamespaceEntryCount('C').v;
    } else {
      return m_impl->getNamespaceEntryCount('A').v;
    }
  }

That version is available in master since Jun 17 2021.

Does our ZIM generation flow mark front articles? If not then the total number of entries from the 'C' namespace is returned as article count.
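To make the fallback order concrete, here is a minimal stand-alone sketch of the same three-way selection. ArchiveFacts and articleCount are hypothetical names used only for illustration (they are not libzim types or functions), and the sketch assumes redirects are stored in the same namespace as the articles, as suspected later in this thread.

```cpp
#include <cstdint>

// Hypothetical stand-in for the few facts the selection logic reads from a
// ZIM archive; this is NOT a libzim type.
struct ArchiveFacts {
  bool hasFrontArticlesIndex;     // file carries an explicit front-article list
  bool hasNewNamespaceScheme;     // content lives in the 'C' namespace
  std::uint32_t frontEntryCount;  // entries marked as front articles
  std::uint32_t entriesInC;       // every entry in 'C' (assumed to include redirects)
  std::uint32_t entriesInA;       // every entry in 'A' (assumed to include redirects)
};

// Mirrors the branch order of the snippet quoted above.
std::uint32_t articleCount(const ArchiveFacts& a)
{
  if (a.hasFrontArticlesIndex) {
    return a.frontEntryCount;
  } else if (a.hasNewNamespaceScheme) {
    return a.entriesInC;
  } else {
    return a.entriesInA;
  }
}
```

With no front-article index, the result is a whole-namespace entry count, so any redirects stored in that namespace are counted as articles.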

kelson42 commented 2 years ago

@veloman-yunkan Yes, this is kiwix-manage, and the values in library.xml should have been delivered by Libkiwix reader::getArticleCount() and reader::getMediaCount()... Looks like someone has changed something around this!

tim-moody commented 2 years ago

As another data point here is the fragment from root.xml for the 2022/1/15 mdwiki zim:

<entry>
    <id>urn:uuid:e5566481-35cc-7c17-9b9f-2e69785b153b</id>
    <title>MDWiki Medical Encyclopedia</title>
    <updated>2022-01-15T00:00:00Z</updated>
    <summary>Healthcare articles curated by WikiProjectMed</summary>
    <language>eng</language>
    <name>mdwiki_en_all</name>
    <flavour>maxi</flavour>
    <category></category>
    <tags>mdwiki;_pictures:yes;_videos:no;_details:yes;_ftindex:yes</tags>
    <articleCount>297614</articleCount>
    <mediaCount>79777</mediaCount>
    <link rel="http://opds-spec.org/image/thumbnail"
          href="/catalog/v2/illustration/mdwiki_en_all_maxi_2022-01/?size=48"
          type="image/png;width=48;height=48;scale=1"/>
    <link type="text/html" href="/mdwiki_en_all_maxi_2022-01" />
    <author>
      <name>Offline</name>
    </author>
    <publisher>
      <name>WikiProjectMed</name>
    </publisher>
    <link rel="http://opds-spec.org/acquisition/open-access" type="application/x-zim" href="https://download.kiwix.org/zim/other/mdwiki_en_all_maxi_2022-01.zim.meta4" length="1666467840" />
  </entry>

I don't think there are 297614 articles, but articles + redirects would be that order of magnitude.

mgautierfr commented 2 years ago

https://library.kiwix.org/raw/mdwiki_en_all_maxi_2022-01/meta/Counter returns: text/plain=10;text/css=30;application/javascript=28;image/png=5;text/html=61072;image/webp=75709;image/svg+xml; charset=utf-8; profile="https://www.mediawiki.org/wiki/Specs/SVG/1.0.0"=4034;undefined=1566;image/svg+xml=22;image/gif=7 so 142483 in total. That leaves a delta of 297614-142483=155131, which could indeed be redirects, whose number is of the same order of magnitude as text/html.
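The Counter total can be reproduced with a small parser. The sketch below is an illustration only: parseCounter and totalCount are hypothetical helpers, not the libkiwix implementation. One wrinkle is that a mimetype may itself contain ';' and '=' (as the SVG profile entry does), so a naive split on ';' would miscount.

```cpp
#include <cctype>
#include <cstdint>
#include <map>
#include <sstream>
#include <string>

// Hypothetical helper (not the libkiwix parser): splits an M/Counter value of
// the form "mimetype=count;mimetype=count;..." into a mimetype -> count map.
// A record is only closed once the text after its last '=' is all digits, so
// mimetype parameters such as "; charset=utf-8" stay part of the key.
// Limitation: a parameter with a purely numeric value would be misread as a
// record boundary.
std::map<std::string, std::uint64_t> parseCounter(const std::string& counter)
{
  std::map<std::string, std::uint64_t> counts;
  std::istringstream in(counter);
  std::string segment;
  std::string pending;
  while (std::getline(in, segment, ';')) {
    if (!pending.empty()) pending += ';';
    pending += segment;
    const std::string::size_type eq = pending.rfind('=');
    if (eq == std::string::npos || eq + 1 == pending.size()) continue;
    const std::string digits = pending.substr(eq + 1);
    bool allDigits = true;
    for (unsigned char c : digits) {
      if (!std::isdigit(c)) { allDigits = false; break; }
    }
    if (!allDigits) continue;  // still inside a mimetype parameter
    counts[pending.substr(0, eq)] += std::stoull(digits);
    pending.clear();
  }
  return counts;
}

// Sums every counter, regardless of mimetype.
std::uint64_t totalCount(const std::map<std::string, std::uint64_t>& counts)
{
  std::uint64_t total = 0;
  for (const auto& kv : counts) total += kv.second;
  return total;
}
```

Applied to the Counter value quoted above, this yields ten distinct mimetypes, 61072 entries for text/html, and a total of 142483.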

The question is: does the ZIM file have a front articles list? It was created with mwoffliner 1.11.10. Which version of libzim was used?

kelson42 commented 2 years ago

@mwoffliner has still not been updated to libzim7, and it seems the values printed in the XML are not related to (based on) the values in M/Counter. I believe you are on the wrong track here.

holta commented 2 years ago

ASIDE — tangentially related issues:

mgautierfr commented 2 years ago

If the file was created by libzim 6, the value returned by getArticleCount is the number of items in the A namespace.

The ZIM header says there are 379028 items in the file. If, from M/Counter, we suppose there are around (10+30+28+5+75709+4034+1566+22+7 = 81411) items (media) not in the A namespace, that leaves 379028-81411=297617 items, which is pretty consistent with the 297614 articles in the XML. The delta of 3 items can probably be explained by metadata or Xapian databases.

kelson42 commented 2 years ago

@mgautierfr AFAIK, this is a regression. It was not working like this before (and should not for pre-libzim7 files).

holta commented 2 years ago

With every recently published large ZIM file that I've checked, articleCount is off by a factor of 2X or 3X — this is not a rounding error (-:

@tim-moody suggests this might possibly be a result of redirects accidentally being included in the total?

mgautierfr commented 2 years ago

The previous version used M/Counter to return the sum of all entries whose mimetype starts with text/html: https://github.com/kiwix/libkiwix/blob/9.4.1/src/reader.cpp#L119-L133
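As a rough illustration of that older approach (not a copy of the linked reader.cpp code), the sketch below sums only the counters whose mimetype begins with text/html. countHtmlEntries is a hypothetical name, and the map is assumed to have already been parsed from M/Counter.

```cpp
#include <cstdint>
#include <map>
#include <string>

// Illustrative only: countHtmlEntries is a hypothetical name, not the
// libkiwix function. Sums the counters whose mimetype starts with "text/html",
// so variants such as "text/html; charset=utf-8" are included as well.
std::uint64_t countHtmlEntries(const std::map<std::string, std::uint64_t>& counters)
{
  const std::string prefix = "text/html";
  std::uint64_t total = 0;
  for (const auto& kv : counters) {
    // compare() on the first prefix.size() characters acts as a starts-with check.
    if (kv.first.compare(0, prefix.size(), prefix) == 0) {
      total += kv.second;
    }
  }
  return total;
}
```

Because redirects have no mimetype of their own in M/Counter, a count derived this way would not include them, unlike a whole-namespace entry count.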

tim-moody commented 2 years ago

When mwoffliner prints its summary, it breaks out articles from redirects. This would be nice to have in the metadata.

mgautierfr commented 2 years ago

when mwoffliner prints its summary it breaks out articles from redirects. This would be nice to have in the metadata,

We don't have this information. We may add it, but it is another issue.

kelson42 commented 2 years ago

Looks like this regression has been created by @mgautierfr, reassigning.