openzim / warc2zim

Command line tool to convert a file in the WARC format to a file in the ZIM format
https://pypi.org/project/warc2zim/
GNU General Public License v3.0

Are WARC files generated by GNU wget compatible with warc2zim? #95

Closed wsdookadr closed 2 years ago

wsdookadr commented 2 years ago

Background:

The following program is a lightweight and minimalistic version of wayback-machine-downloader.

wayback_dl.sh

```
#!/bin/bash
#
# purpose:
#   this program does a prefix search on the wayback cdx server
#   and acquires all files for a specific site, which it then downloads
#   and packages into a WARC file
#
# note:
#   wayback cdx server docs: https://github.com/internetarchive/wayback/blob/master/wayback-cdx-server/README.md
#

URL="$1"

rm -rf wb_temp/*
mkdir wb_temp
mkdir wb_temp/o

curl "http://web.archive.org/cdx/search/cdx?url=$URL*&fl=timestamp,original,mimetype&collapse=digest&gzip=false&filter=statuscode:200&output=json&fastLatest=true" | \
    jq -c -r '.[] | (.[0]+"\t"+.[1]+"\t"+.[2])' | \
    perl -ne 'print if $. > 1' | \
    rg -v "wp-login.php|xmlrpc.php" > wb_temp/obj_raw_urls.txt

cat wb_temp/obj_raw_urls.txt | \
    perl -ne 'push @x,[split(m{\t})]; END { print(join("\t",@$_)) for (sort { ($a->[1] cmp $b->[1]) || (-1 * ($a->[0] cmp $b->[0])) } @x); }' | \
    perl -F'\t' -ane 'BEGIN{$h={};}; print if !exists $h->{$F[1]}; $h->{$F[1]} = 1;' > wb_temp/obj_urls.txt

cat wb_temp/obj_urls.txt | \
    parallel --colsep="\t" --no-notice -j4 'wget --warc-file=wb_temp/o/{#} "https://web.archive.org/web/{1}/{2}" -O /dev/null'

find wb_temp/o/ -mindepth 1 -name "*.warc.gz" -exec bash -c 'zcat {} 2>>gzip_err.txt' \; > wb_temp/site.warc
```

In short, it does a prefix search on the wayback cdx server, retrieves all files for a website and generates a WARC file from them.

In detail, it does the following:

  1. does a prefix search on the wayback cdx server (compliant with the docs), which returns a large list of rows, each containing a complete URL, a timestamp, and the mimetype of that resource, restricted to resources that had HTTP status code 200 at crawl time (the sketch after this list shows the shape of the response)
  2. the list from the previous step may contain duplicates (the same file crawled at different moments in time); to address this, the list is sorted by URL and then by timestamp, and filtered so that only the most recent timestamp for each URL is retained and the rest are dropped
  3. uses GNU Wget and GNU Parallel to retrieve all files from the previous step, writing them to disk in .warc.gz format
  4. unpacks those archives and concatenates their contents into one large WARC file holding all resources for the entire website (web pages, images, stylesheets, js, etc.)
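
For reference, here is a minimal sketch of the query from step 1 and why the pipeline drops the first line of output (example.com and the rows shown are purely illustrative):

```
curl "http://web.archive.org/cdx/search/cdx?url=example.com*&fl=timestamp,original,mimetype&collapse=digest&filter=statuscode:200&output=json"
# With output=json the CDX server returns a JSON array of arrays whose first
# row is the list of field names, roughly:
#   [["timestamp","original","mimetype"],
#    ["20190208180108","http://example.com/style.css","text/css"],
#    ...]
# jq flattens each row to a TSV line, so the header row becomes the first
# output line, which is what the script drops with: perl -ne 'print if $. > 1'
```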

Problem:

After using this program to produce a WARC for a specific website and feeding that WARC into warc2zim to generate a ZIM archive, kiwix-serve will serve the webpages from that ZIM, but the stylesheets/js/images are not served. kiwix-serve claims those resources are not part of the ZIM archive (even though they are part of the WARC used to produce that ZIM archive).

Example:

This image is retrieved and part of the WARC archive, but kiwix-serve will say it's not part of the ZIM.

https://web.archive.org/web/20190208180108im_/http://www.phailed.me/wp-content/ql-cache/quicklatex.com-4e77e87694d179a40b2d96f210423ce8_l3.png

Question(s):

Should this procedure generate a valid WARC file? Is warc2zim or replayweb.page compatible with wget-generated WARC files?

Please find a copy of site.warc.gz attached to this issue. It can be fed into either warc2zim or replayweb.page to reproduce these findings.

site.warc.gz

Versions of software used:

ikreymer commented 2 years ago

There are two issues with the sample WARC:

1) It appears to not be compressed correctly. Each record needs to be compressed individually, and then concatenated. This file appears to not have that. Fortunately, the warcio tool can fix this: `warcio recompress site.warc.gz site-fixed.warc.gz`. The fixed WARC will then load in replayweb.page.

2) The URLs in the WARC are not the original ones, so it'll have Wayback Machine URLs rather than the original URLs. To get the original data requires using the id_ modifier (see the example below).
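
A quick illustration of the difference, using the site root and the timestamp from the image capture above (Wayback redirects to the nearest snapshot, so the exact timestamp does not matter):

```
# Without id_: the Wayback Machine serves the rewritten replay page, with
# links and asset URLs pointing back at web.archive.org.
wget "https://web.archive.org/web/20190208180108/http://www.phailed.me/" -O replay.html

# With id_: the original, unmodified response body is returned, which is
# what you want captured inside the WARC records.
wget "https://web.archive.org/web/20190208180108id_/http://www.phailed.me/" -O original.html
```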

I've worked on several approaches to this style of extraction. Probably simplest for now is using express.archiveweb.page: for example, you can load https://express.archiveweb.page/#19960101000000/http://geocities.com/ and browse around, then download the WACZ file (which you can use in replayweb.page), extract the WARC from it, and convert that to a ZIM with warc2zim.
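
Roughly, that last step can be sketched as below. A WACZ is a ZIP file whose WARC data sits under archive/; the warc2zim flags may differ between versions, so check warc2zim --help (mysite.wacz and the --name value are placeholders):

```
# Unpack the WACZ downloaded from express.archiveweb.page / archiveweb.page
unzip mysite.wacz -d mysite_wacz

# Feed the contained WARC(s) to warc2zim to produce a ZIM
warc2zim mysite_wacz/archive/*.warc.gz --name mysite
```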

wsdookadr commented 2 years ago

To get the original data requires using the id_ modifier

This turned out to be very important

probably simplest for now [..]

I like both express.archiveweb.page and replayweb.page but from what I can see they both operate on just one web page, whereas I need to grab the entire site.

Each record needs to be compressed individually, and then concatenated

So everything between two WARC/1.0 lines is considered to be a record, and these records are compressed individually? This is the part I find harder to understand. Could you please elaborate on this, or maybe provide a pointer to the WARC spec and possibly some examples?

After some hesitation I decided to just go for it and adapt the program above according to what you said. Along the way I found out that:

  • a 2nd pass over the WARC is required to pull missing resources (this was unexpected as I thought the CDX prefix search would pull everything needed)
  • a search & replace turned out to be sufficient to restore the original site urls almost everywhere

It does do what I need it to.

UPDATE: as this is not a bug, but more related to how the Internet Archive works, I'll close this.

wayback_dl.sh

```
#!/bin/bash
#
# purpose:
#   this program does a prefix search on the wayback cdx server
#   and acquires all files for a specific site, which it then downloads
#   and packages into a WARC file
#
# note:
#   wayback cdx server docs: https://github.com/internetarchive/wayback/blob/master/wayback-cdx-server/README.md
#
# requires: gnu wget, jq, curl, gnu parallel, xmlstarlet, perl
#

#
# STAGE1: prefix search, dedupe, and retrieve all resources
#
URL="$1"

rm -rf wb_temp/*
mkdir wb_temp
mkdir wb_temp/o

curl "http://web.archive.org/cdx/search/cdx?url=$URL*&fl=timestamp,original,mimetype&collapse=digest&gzip=false&filter=statuscode:200&output=json&fastLatest=true" | \
    jq -c -r '.[] | (.[0]+"\t"+.[1]+"\t"+.[2])' | \
    perl -ne 'print if $. > 1' | \
    rg -v "wp-login.php|xmlrpc.php" > wb_temp/obj_raw_urls.txt

cat wb_temp/obj_raw_urls.txt | \
    perl -ne 'push @x,[split(m{\t})]; END { print(join("\t",@$_)) for (sort { ($a->[1] cmp $b->[1]) || (-1 * ($a->[0] cmp $b->[0])) } @x); }' | \
    perl -F'\t' -ane 'BEGIN{$h={};}; print if !exists $h->{$F[1]}; $h->{$F[1]} = 1;' > wb_temp/obj_urls.txt

cat wb_temp/obj_urls.txt | \
    parallel --colsep="\t" --no-notice -j4 'eval wget -nv --warc-file=wb_temp/o/{#} "https://web.archive.org/web/{1}id_/{2}" -O /dev/null'

find wb_temp/o/ -mindepth 1 -name "*.warc.gz" -exec bash -c 'zcat {} 2>>gzip_err.txt' \; > wb_temp/site.warc

cat wb_temp/site.warc | \
    perl -pne 's{https?://web.archive.org/web/\d+id_/}{}; if(m{^WARC-Target-URI: <(.*)>}) {$_="WARC-Target-URI: $1\r\n";}' | \
    perl -ne 'if(m{^WARC/1.0}) {print "$b" if defined($b) && $b ne "" && $b =~ m{WARC-Type: (request|response)}; $b=$_; } else {$b.=$_;} ; END { print "$b" if $b =~ m{WARC-Type: (request|response)}; };' \
    > wb_temp/site2.warc

#
# STAGE2: link extraction, determining missing resources and retrieving the absent ones
#
cat wb_temp/site2.warc | perl -ne '
    BEGIN{ use File::Temp qw/tempfile/ };
    if(m{^WARC/1.0}) {
        if(defined($b) && $b ne "" && $b =~ m{Content-Type: text/html} && $b =~ m{WARC-Type: response} ) {
            ($fh, $filename) = tempfile("temp_XXXX", DIR=>"wb_temp");
            print $fh $b;
            close($fh);
            print "$filename\n";
        };
        $b=$_;
    } else {
        $b.=$_;
    }
' | \
    parallel --no-notice -j2 'cat {} | xmlstarlet format -H --recover 2>/dev/null | xmlstarlet sel -t -v '\''//link/@href'\'' 2>/dev/null ; rm {};' | \
    rg -v "/wp-json/oembed|wp-login.php|http://wp.me/|^//" | \
    sort | uniq > wb_temp/all_links.txt

fetch_cb() {
    curl -L -s "http://web.archive.org/cdx/search/cdx?url=$1&fl=timestamp,original,mimetype&collapse=digest&gzip=false&filter=statuscode:200&output=json&fastLatest=true" | \
        jq -c -r '.[] | (.[0]+"\t"+.[1]+"\t"+.[2])' | \
        rg -v "wp-login.php|xmlrpc.php" | \
        perl -ne 'print if $. > 1' | \
        perl -ne 'push @x,[split(m{\t})]; END { print(join("\t",@$_)) for (sort { ($a->[1] cmp $b->[1]) || (-1 * ($a->[0] cmp $b->[0])) } @x); }' | \
        perl -F'\t' -ane 'BEGIN{$h={};}; print if !exists $h->{$F[1]}; $h->{$F[1]} = 1;'
}
export -f fetch_cb

cat wb_temp/all_links.txt | \
    perl -ne 's{&amp;}{&}g; print;' | \
    parallel --no-notice -j8 eval fetch_cb ::: | tee wb_temp/obj2.txt

rm -f wb_temp/o2/*
mkdir wb_temp/o2/

cat wb_temp/obj_urls.txt | sort > wb_temp/r1.txt
cat wb_temp/obj2.txt | sort > wb_temp/r2.txt
comm -1 -3 wb_temp/r{1,2}.txt > wb_temp/r3.txt

cat wb_temp/r3.txt | \
    parallel --colsep="\t" --no-notice -j4 'eval wget -nv --warc-file=wb_temp/o2/{#} "https://web.archive.org/web/{1}id_/{2}" -O /dev/null'

find wb_temp/o2/ -mindepth 1 -name "*.warc.gz" -exec bash -c 'zcat {} 2>>gzip_err.txt' \; > wb_temp/site3.warc

cat wb_temp/site2.warc wb_temp/site3.warc | \
    perl -pne 's{https?://web.archive.org/web/\d+id_/}{}; if(m{^WARC-Target-URI: <(.*)>}) {$_="WARC-Target-URI: $1\r\n";}' | \
    perl -ne 'if(m{^WARC/1.0}) {print "$b" if defined($b) && $b ne "" && $b =~ m{WARC-Type: (request|response)}; $b=$_; } else {$b.=$_;} ; END { print "$b" if $b =~ m{WARC-Type: (request|response)}; };' \
    > wb_temp/site4.warc
```

ikreymer commented 2 years ago

I like both express.archiveweb.page and replayweb.page but from what I can see they both operate on just one web page, whereas I need to grab the entire site.

Yes, express.archiveweb.page is currently focused on a single page, though replayweb.page can replay WARC/WACZ files of any size.

Each record needs to be compressed individually, and then concatenated

So everything between two WARC/1.0 lines is considered to be a record, and these records are compressed individually? This is the part I find harder to understand. Could you please elaborate on this, or maybe provide a pointer to the WARC spec and possibly some examples?

Yes, because they are accessed individually. There's a copy of the relevant section of the standard here: https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/#record-at-time-compression
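
A minimal sketch of record-at-time compression, assuming the WARC has already been split into one record per file under records/ (the paths here are placeholders):

```
# Each WARC record is gzipped on its own, and the resulting gzip members are
# simply concatenated; readers can then seek to a record's offset and
# decompress just that member. Gzipping the whole file in one go loses this.
rm -f site-record-compressed.warc.gz
for rec in records/*.warc; do
    gzip -c "$rec" >> site-record-compressed.warc.gz
done

# The concatenated gzip members still decompress as one stream:
zcat site-record-compressed.warc.gz | head
```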

After some hesitation I decided to just go for it and adapt the program above according to what you said. Along the way I found out that:

  • a 2nd pass over the WARC is required to pull missing resources (this was unexpected as I thought the CDX prefix search would pull everything needed)
  • a search & replace turned out to be sufficient to restore the original site urls almost everywhere

It does do what I need it to.

wayback_dl.sh

Great! It's a bit hard to follow (my Perl is a bit rusty), but if it works, it works! For more complex sites, we tend to do extraction via the browser to ensure things aren't missed, but it sounds like this may be sufficient for your use case.