openzim / zimit

Make a ZIM file from any Web site and surf offline!
GNU General Public License v3.0

Zimit

Zimit is a scraper that creates a ZIM file from any website.

Zimit adheres to openZIM's Contribution Guidelines.

Zimit has implemented openZIM's Python bootstrap, conventions and policies v1.0.1.

Capabilities and known limitations

While we would like to support as many websites as possible, making an offline archive of any website with a versatile tool obviously has some limitations.

Most capabilities and known limitations are documented in the warc2zim README. There are also some limitations in Browsertrix Crawler (used to fetch the website) and wombat (used to properly replay dynamic web requests), but these are not (yet?) clearly documented.

Technical background

Zimit runs a fully automated browser-based crawl of a website property and produces a ZIM of the crawled content. Zimit runs in a Docker container.

The system:

zimit.py is the entrypoint for the system.

After the crawl is done, warc2zim is used to write a ZIM to the /output directory, which should be mounted as a volume so that the ZIM is not lost when the container stops.

With the --keep flag, the crawled WARCs and a few other artifacts are also kept in a temp directory inside /output.
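
The volume mount and the --keep flag described above can be sketched as a small Python helper that only assembles the docker command line. This is an illustration, not part of zimit: the function name build_zimit_command and the host directory are hypothetical, and the only options used are the ones shown in this README (-v, --url, --name, --keep).

```python
import os

IMAGE = "ghcr.io/openzim/zimit"  # image name as used in this README


def build_zimit_command(url, name, host_output_dir, keep=False):
    """Return the argv list for a `docker run` invocation of zimit.

    host_output_dir is a hypothetical directory on the host. It is
    bind-mounted on /output so the ZIM is still there after the
    container stops.
    """
    host_dir = os.path.abspath(host_output_dir)
    cmd = [
        "docker", "run",
        "-v", f"{host_dir}:/output",  # host directory -> /output in the container
        IMAGE, "zimit",
        "--url", url,
        "--name", name,
    ]
    if keep:
        # Also keep the crawled WARCs and other artifacts inside /output.
        cmd.append("--keep")
    return cmd
```

The returned list can be passed to subprocess.run(cmd, check=True) on a machine where Docker is installed. Building a list rather than a single string avoids shell quoting problems with URLs.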

Usage

Zimit is intended to be run in Docker. The Docker image is published at https://github.com/orgs/openzim/packages/container/package/zimit.

The image accepts zimit's own parameters as well as any of the warc2zim ones, which are useful for setting metadata, for instance. The --help commands below list them all.

Example commands:

docker run ghcr.io/openzim/zimit zimit --help
docker run ghcr.io/openzim/zimit warc2zim --help
docker run -v /output:/output ghcr.io/openzim/zimit zimit --url URL --name myzimfile
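
Once such a run has finished, the ZIM should be in the host directory that was mounted on /output. The sketch below lists the ZIM files found there, newest first. The helper name find_zim_files is hypothetical, and the sketch matches on the .zim extension rather than assuming how the output file is named.

```python
from pathlib import Path


def find_zim_files(host_output_dir):
    """Return the .zim files directly inside host_output_dir, newest first."""
    out = Path(host_output_dir)
    # Only regular files are kept; subdirectories are ignored.
    zims = [p for p in out.glob("*.zim") if p.is_file()]
    return sorted(zims, key=lambda p: p.stat().st_mtime, reverse=True)
```

For the example command above, find_zim_files("/output") would be the matching call on the host.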

Note: The image automatically filters out a large number of ads by using the 3 blocklists from anudeepND. If you don't want this filtering, disable the image's entrypoint in your container (docker run --entrypoint="" ghcr.io/openzim/zimit ...).
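
When the command is assembled from a script rather than typed in a shell, the entrypoint override above needs a little care: the shell strips the quotes in --entrypoint="", so docker receives the argument --entrypoint= with an empty value. A minimal sketch, assuming a hypothetical helper named docker_run_prefix:

```python
IMAGE = "ghcr.io/openzim/zimit"  # image name as used in this README


def docker_run_prefix(disable_entrypoint=False):
    """Return the start of a `docker run` argv list for the zimit image.

    With disable_entrypoint=True the image's entrypoint is overridden
    with an empty value, as in the note above.
    """
    cmd = ["docker", "run"]
    if disable_entrypoint:
        # What a shell passes to docker for --entrypoint="".
        cmd.append("--entrypoint=")
    cmd.append(IMAGE)
    return cmd
```

The zimit arguments (for example "zimit", "--url", URL, "--name", "myzimfile") are then appended to the returned list before it is run.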

To rebuild the Docker image locally, run:

docker build -t ghcr.io/openzim/zimit .

FAQ

The Zimit contributors' team maintains a page with the most frequently asked questions.

Nota bene

While Zimit 1.x relied on a Service Worker to display the ZIM content, this is no longer the case since Zimit 2.x, which has no special requirements.

It should also be noted that a first version of a generic HTTP scraper was created in 2016 during the Wikimania Esino Lario Hackathon.

That version is now considered outdated and is archived in the 2016 branch.

License

GPLv3 or later, see LICENSE for more details.