openzim / python-libzim

Libzim binding for Python: read/write ZIM files in Python
https://pypi.org/project/libzim/
GNU General Public License v3.0

unknown mime type code 65535 #87

Status: Closed (amirouche closed this issue 3 years ago)

amirouche commented 3 years ago

I am trying to read the Simple English Wikipedia ZIM file, and I get the following error:

$ python babelia-zim2wet.py 
Traceback (most recent call last):
  File "babelia-zim2wet.py", line 25, in <module>
    if article.mimetype != "text/html":
  File "libzim/wrapper.pyx", line 299, in libzim.wrapper.ReadArticle.mimetype.__get__
RuntimeError: unknown mime type code 65535

Versions:

% python --version
Python 3.8.5
% pip install libzim
Requirement already satisfied: libzim in /home/amirouche/.local/share/virtualenvs/arew-KWAEN1E-/lib/python3.8/site-packages (0.0.3.post0)

kelson42 commented 3 years ago

@amirouche it would be good to have the exact command, the source code, and the filename.

amirouche commented 3 years ago

Here is the whole program I use, with a try/except around article.mimetype to work around the behavior described in this ticket:

#!/usr/bin/env python3
from io import BytesIO
from warcio.warcwriter import WARCWriter
from html2text import HTML2Text
from libzim.reader import File as ZIMFile
from urllib.parse import quote

handler = HTML2Text()
handler.ignore_links = True
handler.images_to_alt = True
html2text = handler.handle

with open('example.warc.wet.gz', 'wb') as output:
    writer = WARCWriter(output, gzip=True)
    with ZIMFile("data/wikipedia_en_simple_all_nopic_2020-12.zim") as reader:
        for uid in range(0, reader.article_count):
            if uid % 10_000 == 0:
                print("{} out of {}".format(uid, reader.article_count))

            article = reader.get_article_by_id(uid)
            try:
                if article.mimetype != "text/html":
                    continue
            except RuntimeError:
                continue

            if article.is_redirect:
                continue

            url = 'https://simple.wikipedia.org/wiki/{}'.format(quote(article.url))
            html = bytes(article.content).decode('utf8')
            text = html2text(html)
            payload = BytesIO(text.encode('utf8'))

            record = writer.create_warc_record(
                url,
                'conversion',
                payload=payload,
            )

            writer.write_record(record)

Here are the requirements for the above script:

cython==0.29.21
html2text==2020.1.16
libzim==0.0.3.post0
six==1.15.0; python_version >= '2.7' and python_version not in '3.0, 3.1, 3.2, 3.3'
warcio==1.7.4

The ZIM file is wikipedia_en_simple_all_nopic_2020-12.zim. Here is its checksum:

% sha512sum data/wikipedia_en_simple_all_nopic_2020-12.zim 
11dc1105825137c101b8fc7a8ec9afff89215921afb85101ef03bb618600a81668b85b9dfbf851f133ee2ef0ceda3b24768f8a5c68f503cfe758530a7f514f93  data/wikipedia_en_simple_all_nopic_2020-12.zim

amirouche commented 3 years ago

Here is a minimal program to reproduce the problem:

#!/usr/bin/env python3
from libzim.reader import File as ZIMFile

with ZIMFile("data/wikipedia_en_simple_all_nopic_2020-12.zim") as reader:
    article = reader.get_article_by_id(60)
    print(article.url)
    print(len(article.content), bytes(article.content).decode('utf8'))
    print(article.mimetype)

The output is:

!
0 
Traceback (most recent call last):
  File "test.py", line 9, in <module>
    print(article.mimetype)
  File "libzim/wrapper.pyx", line 299, in libzim.wrapper.ReadArticle.mimetype.__get__
RuntimeError: unknown mime type code 65535

mgautierfr commented 3 years ago

You should test whether an article is a redirect before reading its mimetype. Redirect articles have no mimetype.

This is a flaw in the current API; it will be fixed in the next API.
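
For readers hitting the same error, the ordering suggested above can be sketched as a small helper that checks is_redirect before it ever touches mimetype. This is only an illustration: the helper name mimetype_or_none and the two Fake* classes are hypothetical stand-ins so the snippet runs without a ZIM file. The attribute names is_redirect and mimetype are the ones used in the scripts earlier in this thread (libzim 0.0.3.post0), and the stand-in redirect imitates the RuntimeError from the traceback.

```python
def mimetype_or_none(article):
    """Return the article's mimetype, or None if the article is a redirect.

    Hypothetical helper: it reads is_redirect first, so mimetype is never
    accessed on a redirect entry.
    """
    if article.is_redirect:
        return None
    return article.mimetype


class FakeRedirect:
    """Stand-in for a redirect entry: reading mimetype raises, as in the traceback."""

    is_redirect = True

    @property
    def mimetype(self):
        raise RuntimeError("unknown mime type code 65535")


class FakeHtmlArticle:
    """Stand-in for a regular HTML article."""

    is_redirect = False
    mimetype = "text/html"


# The redirect is skipped without raising; the regular article reports its type.
for entry in (FakeRedirect(), FakeHtmlArticle()):
    print(type(entry).__name__, mimetype_or_none(entry))
```

In the conversion script from this thread, the same idea would amount to moving the article.is_redirect check above the article.mimetype comparison, which should make the try/except around mimetype unnecessary, assuming redirects are the only entries that trigger the error.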