warc2zim converts WARC files into a ZIM file. The resulting ZIM contains all WARC records, with "programming" records (HTML/CSS/JS/...) rewritten for proper offline operation.
The resulting ZIM is self-contained and can render properly in offline situations.
Since warc2zim 2.0.0, service workers and HTTPS are no longer needed for proper ZIM rendering (this was a major constraint of ZIMs produced by warc2zim 1.x).
Since the WARC format can archive any website, warc2zim is the perfect companion for turning any website into offline content (see e.g. https://www.github.com/openzim/zimit for a scraper bundling this approach, which transforms a website URL into an offline ZIM in a single command).
While we would like to support as many websites as possible, making an offline archive of a website obviously has some limitations.
The following limitations are known:
- If http://www.acme.com:80/resource1 and http://www.acme.com:8080/resource1 both exist AND lead to different resources, the scraper includes only the first resource fetched in the ZIM and silently ignores all other conflicting resources.
- If http://www.acme.com/resource1 and https://www.acme.com/resource1 both exist AND lead to different resources, the scraper includes only the first resource fetched in the ZIM and silently ignores all other conflicting resources.
- Resources served with a Content-Disposition: attachment response header are expected to be automatically saved by the browser. This does not happen for now (see https://github.com/openzim/warc2zim/issues/288).
- In the 2xx range, only 200, 201, 202 and 203 are supported; others are simply ignored.
- In the 3xx range, only 301, 302, 306 and 307 are supported, provided they redirect to a payload which is present in the WARC; others are simply ignored.
- Responses in the 1xx (not supposed to exist in WARC files anyway), 4xx and 5xx ranges are ignored.
- HTML entities in URLs are decoded, then special characters (&, <, >, ' and ") are escaped back. For example, <img src="https://github.com/openzim/warc2zim/raw/main/image.png?param1=value1&param2=value2"> is transformed into <img src="https://github.com/openzim/warc2zim/raw/main/image.png%3Fparam1%3Dvalue1%C2%B6m2%3Dvalue2"> because the URL is read as image.png?param1=value1¶m2=value2, &para having been decoded to ¶. The HTML should have been <img src="https://github.com/openzim/warc2zim/raw/main/image.png?param1=value1&amp;param2=value2"> for the URL to be image.png?param1=value1&param2=value2.
- meta http-equiv directives are not yet supported (see https://github.com/openzim/warc2zim/issues/237).

It is also important to note that warc2zim is inherently limited to what is present inside the WARC. A bad WARC can only produce a bad ZIM. Garbage in, garbage out.
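The HTML-escaping limitation listed above can be reproduced with Python's standard library. This is only an illustration, assuming that html.unescape and urllib.parse.quote behave similarly to the decoding and encoding steps applied during rewriting; it is not warc2zim's actual code, and rewrite_src is a hypothetical helper name.

```python
import html
from urllib.parse import quote


def rewrite_src(raw_attribute_value: str) -> str:
    """Decode HTML entities in an attribute value, then percent-encode it.

    Hypothetical helper: it only mimics the decode-then-encode behaviour
    described in the limitations, not warc2zim's real rewriting logic.
    """
    decoded = html.unescape(raw_attribute_value)
    # safe="" so that "?", "=" and "&" are percent-encoded as well
    return quote(decoded, safe="")


# Improperly escaped HTML: "&para" is decoded to the pilcrow sign
broken = rewrite_src("image.png?param1=value1&param2=value2")

# Properly escaped HTML: "&amp;" is decoded to a plain "&"
correct = rewrite_src("image.png?param1=value1&amp;param2=value2")

print(broken)
print(correct)
```

With the unescaped ampersand, the query string loses its second parameter name; with the escaped one, the ampersand survives as %26.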
It is hence very important to properly configure the system used to create the WARC. If zimit is used (and hence WebRecorder Browsertrix crawler), it is very important to properly configure scope type, mobile device used, behaviors (including custom ones needed on some sites) and login profile.
Adding a custom CSS is also strongly recommended to hide features which won't work offline (e.g. search box which relies on a live search server).
Example:
warc2zim ./path/to/myarchive.warc --output /output --name myarchive.zim -u https://example.com/
The above will create a ZIM file /output/myarchive.zim with https://example.com/ set as the main page.
python3 -m venv ./env # creates a virtual python environment in ./env folder
./env/bin/pip install -U pip # upgrade pip (package manager). recommended
./env/bin/pip install -U warc2zim # install/upgrade warc2zim inside virtualenv
# direct access to in-virtualenv warc2zim binary, without shell-attachment
./env/bin/warc2zim --help
# alternatively, attach virtualenv to shell
source env/bin/activate
warc2zim --help
deactivate # unloads virtualenv from shell
By default, all URLs found in the WARC files are included unless the --include-domains / -i flag is set.
To filter out URLs that may be out of scope (e.g. ads, social media trackers), use the --include-domains / -i flag to specify each domain you want to include.
Other URLs will be filtered and not pushed to the ZIM.
Note that the domain passed and all its subdomains are included.
E.g. if the main page is on a subdomain, https://subdomain.example.com/, but all URLs from *.example.com should be included, use:
warc2zim myarchive.warc --name myarchive -i example.com -u https://subdomain.example.com/starting/page.html
If the main page is on a subdomain, https://subdomain.example.com/, and only URLs from subdomain.example.com should be included, use:
warc2zim myarchive.warc --name myarchive -i subdomain.example.com -u https://subdomain.example.com/starting/page.html
If the main page is on a subdomain, https://subdomain1.example.com/, and only URLs from subdomain1.example.com and subdomain2.example.com should be included, use:
warc2zim myarchive.warc --name myarchive -i subdomain1.example.com -i subdomain2.example.com -u https://subdomain1.example.com/starting/page.html
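The matching rule described above (a domain and all its subdomains are included) can be sketched as follows. This is a minimal sketch under stated assumptions, not warc2zim's implementation: is_included is a hypothetical helper, and treating an empty list as "no filtering" mirrors the documented default.

```python
from urllib.parse import urlsplit


def is_included(url: str, include_domains: list[str]) -> bool:
    """Return True if the URL's host is a listed domain or one of its subdomains.

    Hypothetical helper illustrating the documented behaviour; an empty
    list means no filtering, so every URL is included.
    """
    if not include_domains:
        return True
    host = (urlsplit(url).hostname or "").lower()
    return any(
        # Exact match, or a suffix match on a label boundary (".example.com")
        host == domain.lower() or host.endswith("." + domain.lower())
        for domain in include_domains
    )
```

Matching on a leading dot prevents a host such as notexample.com from being treated as part of example.com.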
--custom-css allows passing a URL or a path to a CSS file that gets added to the ZIM and included in every HTML article at the very end of </head> (if it exists).
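Injecting a stylesheet at the very end of the head can be pictured with the sketch below. It assumes the CSS is referenced through a link element, which the text above does not confirm; inject_css_link and the custom.css path are hypothetical names used for illustration only.

```python
import html
import re

# Matches the closing head tag, whatever its letter case
_HEAD_CLOSE = re.compile(r"</head\s*>", re.IGNORECASE)


def inject_css_link(html_text: str, css_href: str) -> str:
    """Insert a stylesheet link right before </head>, if the tag exists.

    Hypothetical helper: documents without a </head> tag are returned
    unchanged, mirroring the "(if it exists)" condition.
    """
    match = _HEAD_CLOSE.search(html_text)
    if match is None:
        return html_text
    link = f'<link rel="stylesheet" href="{html.escape(css_href, quote=True)}">'
    return html_text[: match.start()] + link + html_text[match.start():]
```

Placing the link last in the head lets the custom rules take precedence over earlier stylesheets of equal specificity.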
When an item fails to be converted into the ZIM and the --verbose flag is passed, the failed item content is stored on the filesystem for easier analysis. The directory where this file is saved can be customized with --failed-items. The file name is a random UUID4 which is output in the logs.
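The storage scheme described above can be sketched in a few lines. This is an illustration under stated assumptions rather than warc2zim's code: save_failed_item is a hypothetical helper, and the directory argument stands in for whatever --failed-items points to.

```python
import uuid
from pathlib import Path


def save_failed_item(content: bytes, failed_items_dir: Path) -> Path:
    """Write the failed item content to a file named after a random UUID4.

    Hypothetical helper: the returned path is what a caller would log so
    that the file can be found later.
    """
    failed_items_dir.mkdir(parents=True, exist_ok=True)
    target = failed_items_dir / str(uuid.uuid4())
    target.write_bytes(content)
    return target
```

A random UUID4 avoids name collisions without having to derive a safe file name from the item URL.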
For development purposes, it is possible to continue on WARC record processing errors with --continue-on-error.
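The effect of such a flag can be pictured with the loop below. It is a generic sketch, not warc2zim's record-processing code: process_records, convert and the continue_on_error parameter are hypothetical names chosen for the example.

```python
import logging
from typing import Callable, Iterable

logger = logging.getLogger(__name__)


def process_records(
    records: Iterable[str],
    convert: Callable[[str], None],
    continue_on_error: bool = False,
) -> int:
    """Convert every record and return the number of failures.

    Without continue_on_error, the first failure is re-raised and stops
    the run; with it, the failure is logged and processing goes on.
    """
    failures = 0
    for record in records:
        try:
            convert(record)
        except Exception:
            if not continue_on_error:
                raise
            failures += 1
            logger.exception("Failed to process record %r, continuing", record)
    return failures
```

Stopping at the first error remains the default here, so that problems are not silently skipped outside of development.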
See warc2zim -h for other options.
We have documentation about the functional architecture, the technical architecture and the software architecture.
Requirements:
First, clone this repository.
If you do not already have it on your system, install hatch to build the software and manage virtual environments (you might be interested in our detailed Developer Setup as well).
pip3 install hatch
Start a hatch shell: this will install software including dependencies in an isolated virtual environment.
hatch shell
wombatSetup.js is the JS code used to set up wombat when the ZIM is used.
It is normally retrieved by the Python build process (see openzim.toml for details).
The recommended way to develop this JS code is to install Node.JS on your system, and then run:
cd javascript
yarn build-dev # or yarn build-prod
Should you want to regenerate this code without installing Node.JS, you can simply run the following command:
docker run -v $PWD/src/warc2zim/statics:/output -v $PWD/rules:/src/rules -v $PWD/javascript:/src/javascript -v $PWD/build_js.sh:/src/build_js.sh -it --rm --entrypoint /src/build_js.sh node:20-bookworm
It will install Python3 on top of Node.JS in a Docker container, generate the JS fuzzy rules and bundle the JS code straight to /src/warc2zim/statics/wombatSetup.js, where the file is expected to be placed.