openzim / python-libzim

Libzim binding for Python: read/write ZIM files in Python
https://pypi.org/project/libzim/
GNU General Public License v3.0
67 stars 25 forks source link
binding library libzim offline python webscraping

python-libzim

libzim module allows you to read and write ZIM files in Python. It provides a shallow python interface on top of the C++ libzim library.

It is primarily used in openZIM scrapers like sotoki or youtube2zim.

Build Status CodeFactor License: GPL v3 PyPI version shields.io PyPI - Python Version codecov

Installation

pip install libzim

Our PyPI wheels bundle a recent release of the C++ libzim and are available for the following platforms:

Wheels are available for CPython only (but can be built for Pypy).

Users on other platforms can install the source distribution (see Building below).

Contributions

git clone git@github.com:openzim/python-libzim.git && cd python-libzim
# hatch run test:coverage

See CONTRIBUTING.md for additional details then Open a ticket or submit a Pull Request on Github 🤗!

Usage

Read a ZIM file

from libzim.reader import Archive
from libzim.search import Query, Searcher
from libzim.suggestion import SuggestionSearcher

zim = Archive("test.zim")
print(f"Main entry is at {zim.main_entry.get_item().path}")
entry = zim.get_entry_by_path("home/fr")
print(f"Entry {entry.title} at {entry.path} is {entry.get_item().size}b.")
print(bytes(entry.get_item().content).decode("UTF-8"))

# searching using full-text index
search_string = "Welcome"
query = Query().set_query(search_string)
searcher = Searcher(zim)
search = searcher.search(query)
search_count = search.getEstimatedMatches()
print(f"there are {search_count} matches for {search_string}")
print(list(search.getResults(0, search_count)))

# accessing suggestions
search_string = "kiwix"
suggestion_searcher = SuggestionSearcher(zim)
suggestion = suggestion_searcher.suggest(search_string)
suggestion_count = suggestion.getEstimatedMatches()
print(f"there are {suggestion_count} matches for {search_string}")
print(list(suggestion.getResults(0, suggestion_count)))

Write a ZIM file

from libzim.writer import Creator, Item, StringProvider, FileProvider, Hint

class MyItem(Item):
    def __init__(self, title, path, content = "", fpath = None):
        super().__init__()
        self.path = path
        self.title = title
        self.content = content
        self.fpath = fpath

    def get_path(self):
        return self.path

    def get_title(self):
        return self.title

    def get_mimetype(self):
        return "text/html"

    def get_contentprovider(self):
        if self.fpath is not None:
            return FileProvider(self.fpath)
        return StringProvider(self.content)

    def get_hints(self):
        return {Hint.FRONT_ARTICLE: True}

content = """<html><head><meta charset="UTF-8"><title>Web Page Title</title></head>
<body><h1>Welcome to this ZIM</h1><p>Kiwix</p></body></html>"""

item = MyItem("Hello Kiwix", "home", content)
item2 = MyItem("Bonjour Kiwix", "home/fr", None, "home-fr.html")

with Creator("test.zim").config_indexing(True, "eng") as creator:
    creator.set_mainpath("home")
    creator.add_item(item)
    creator.add_item(item2)
    illustration = pathlib.Path("icon48x48.png").read_bytes()
    creator.add_illustration(48, illustration)
    for name, value in {
        "creator": "python-libzim",
        "description": "Created in python",
        "name": "my-zim",
        "publisher": "You",
        "title": "Test ZIM",
        "language": "eng",
        "date": "2024-06-30"
    }.items():

        creator.add_metadata(name.title(), value)

Thread safety

The reading part of the libzim is most of the time thread safe. Searching and creating part are not. libzim documentation

python-libzim disables the GIL on most of C++ libzim calls. You must prevent concurrent access yourself. This is easily done by wrapping all creator calls with a threading.Lock()

lock = threading.Lock()
with Creator("test.zim") as creator:

    # Thread #1
    with lock:
        creator.add_item(item1)

    # Thread #2
    with lock:
        creator.add_item(item2)

Type hints

libzim being a binary extension, there is no Python source to provide types information. We provide them as type stub files. When using pyright, you would normally receive a warning when importing from libzim as there could be discrepencies between actual sources and the (manually crafted) stub files.

You can disable the warning via reportMissingModuleSource = "none".

Building

libzim package building offers different behaviors via environment variables

Variable Example Use case
LIBZIM_DL_VERSION 8.1.1 or 2023-04-14 Specify the C++ libzim binary version to download and bundle. Either a release version string or a date, in which case it downloads a nightly
USE_SYSTEM_LIBZIM 1 Uses LDFLAG and CFLAGS to find the libzim to link against. Resulting wheel won't bundle C++ libzim.
DONT_DOWNLOAD_LIBZIM 1 Disable downloading of C++ libzim. Place headers in include/ and libzim dylib/so in libzim/ if no using system libzim. It will be bundled in wheel.
PROFILE 0 Enable profile tracing in Cython extension. Required for Cython code coverage reporting.
SIGN_APPLE 1 Set to sign and notarize the extension for macOS. Requires following informations
APPLE_SIGNING_IDENTITY Developer ID Application: OrgName (ID) Required for signing on macOS
APPLE_SIGNING_KEYCHAIN_PATH /tmp/build.keychain Path to the Keychain containing the certificate to sign for macOS with
APPLE_SIGNING_KEYCHAIN_PROFILE build Name of the profile in the specified Keychain

Building on Windows

On Windows, built wheels needs to be fixed post-build to move the bundled DLLs (libzim and libicu) next to the wrapper (Windows does not support runtime path).

After building you wheel, run

python setup.py repair_win_wheel --wheel=dist/xxx.whl --destdir wheels\

Similarily, if you install as editable (pip install -e .), you need to place those DLLs at the root of the repo.

Move-Item -Force -Path .\libzim\*.dll -Destination .\

Examples

Default: downloading and bundling most appropriate libzim release binary
python3 -m build

Using system libzim (brew, debian or manually installed) - not bundled

# using system-installed C++ libzim
brew install libzim  # macOS
apt-get install libzim-devel  # debian
dnf install libzim-dev  # fedora
USE_SYSTEM_LIBZIM=1 python3 -m build --wheel

# using a specific C++ libzim
USE_SYSTEM_LIBZIM=1 \
CFLAGS="-I/usr/local/include" \
LDFLAGS="-L/usr/local/lib"
DYLD_LIBRARY_PATH="/usr/local/lib" \
LD_LIBRARY_PATH="/usr/local/lib" \
python3 -m build --wheel

Other platforms

On platforms for which there is no official binary available, you'd have to compile C++ libzim from source first then either use DONT_DOWNLOAD_LIBZIM or USE_SYSTEM_LIBZIM.

License

GPLv3 or later, see LICENSE for more details.