There are two issues with the sample WARC:

1) It appears to not be compressed correctly. Each record needs to be compressed individually, and the results then concatenated; this file appears not to have been built that way. Fortunately, the warcio tool can fix this: `warcio recompress site.warc.gz site-fixed.warc.gz`

The fixed WARC will then load in replayweb.page.
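One quick way to sanity-check the recompressed file is to index it; if every record prints, the per-record compression is intact:

```
warcio index site-fixed.warc.gz
```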
2) The URLs in the WARC are not the original ones, so it'll have wayback machine URLs rather than the original URLs. To get the original data requires using the `id_` modifier.
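For example (using the same capture URL pattern that appears later in this issue), inserting `id_` after the timestamp asks the wayback machine for the raw, unrewritten response:

```
# fetch the raw capture (no wayback URL rewriting) by adding id_ after the timestamp
curl -sL "https://web.archive.org/web/20190208180108id_/http://www.phailed.me/" -o page.html
```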
I've worked on several approaches to this style of extraction. Probably the simplest for now is using express.archiveweb.page: for example, you can load https://express.archiveweb.page/#19960101000000/http://geocities.com/, browse around, and then download the WACZ file (which you can use in replayweb.page). From the WACZ you can get the WARC, and then convert it to a ZIM with warc2zim; a rough sketch follows.
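Since a WACZ is just a ZIP with the WARC data stored inside, the last two steps might look roughly like this (a sketch; the file names, the inner `archive/` path, and the `--name` value are assumptions to adjust to what the downloaded WACZ actually contains):

```
# WACZ files keep their WARC data under archive/ inside the zip
unzip mysite.wacz 'archive/*' -d mysite
# convert the extracted WARC to a ZIM (--name sets the ZIM identifier)
warc2zim mysite/archive/data.warc.gz --name mysite
```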
> To get the original data requires using the `id_` modifier

This turned out to be very important.
> probably simplest for now [..]

I like both express.archiveweb.page and replayweb.page, but from what I can see they both operate on just one web page, whereas I need to grab the entire site.
> Each record needs to be compressed individually, and then concatenated

So everything between two `WARC/1.0` lines is considered to be a record, and these records are compressed individually? I guess this part I find harder to understand. Could you please elaborate on this, or maybe provide a pointer to the WARC spec and possibly some examples?
After some hesitation I decided to just go for it and adapt the program above according to what you said. Along the way I found out that:

- a 2nd pass over the WARC is required to pull missing resources (this was unexpected, as I thought the CDX prefix search would pull everything needed)
- a search & replace turned out to be sufficient to restore the original site URLs almost everywhere (roughly as sketched below)

It does do what I need it to.
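The search & replace was along these lines (a sketch with an assumed pattern; note that naive text replacement inside a WARC can leave `Content-Length` headers slightly off, which may be why it only works "almost everywhere"):

```
# strip the wayback prefix (with optional id_/im_ modifier) so that
# original URLs like http://example.com/... are restored
perl -pe 's{https?://web\.archive\.org/web/\d{14}(?:id_|im_)?/}{}g' site.warc > site-orig-urls.warc
```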
UPDATE: as this is not a bug, but more related to how the Internet Archive works, I'll close this.
> I like both express.archiveweb.page and replayweb.page but from what I can see they both operate on just one web page, whereas I need to grab the entire site.
Yes, express.archiveweb.page is currently focused on a single page, though replayweb.page can replay WARC/WACZ files of any size.
> > Each record needs to be compressed individually, and then concatenated
>
> So everything between two `WARC/1.0` lines is considered to be a record, and these records are compressed individually? I guess this part I find harder to understand. Could you please elaborate on this, or maybe provide a pointer to the WARC spec and possibly some examples?
Yes, because they are accessed individually. There's a copy of the standard and the relevant section here: https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/#record-at-time-compression
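In other words, each record becomes its own gzip member, and the members are simply concatenated; the concatenation is itself a valid .gz file. A minimal sketch, assuming each record has already been written out to its own file:

```
# compress each record separately and append the gzip members together;
# readers can then seek to a record's offset and decompress just that member
for rec in rec1.warc rec2.warc rec3.warc; do
    gzip -c "$rec" >> site-record-compressed.warc.gz
done
```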
> After some hesitation I decided to just go for it and adapt the program above according to what you said. Along the way I found out that:
>
> - a 2nd pass over the WARC is required to pull missing resources (this was unexpected, as I thought the CDX prefix search would pull everything needed)
> - a search & replace turned out to be sufficient to restore the original site URLs almost everywhere
>
> It does do what I need it to.
>
> wayback_dl.sh
Great! It's a bit hard to follow (my Perl is a bit rusty), but if it works, it works! For more complex sites, we tend to do extraction via the browser to ensure things aren't missed, but sounds like this may be sufficient for your use case.
Background:
The following program is a lightweight and minimalistic version of wayback-machine-downloader.
wayback_dl.sh
```
#!/bin/bash
#
# purpose:
#   this program does a prefix search on the wayback cdx server
#   and acquires all files for a specific site, which it then downloads
#   and packages into a WARC file
#
# note:
#   wayback cdx server docs: https://github.com/internetarchive/wayback/blob/master/wayback-cdx-server/README.md
#
URL="$1"

rm -rf wb_temp
mkdir -p wb_temp/o

# prefix search on the CDX server; drop the JSON header row and a few
# noisy WordPress endpoints
curl "http://web.archive.org/cdx/search/cdx?url=$URL*&fl=timestamp,original,mimetype&collapse=digest&gzip=false&filter=statuscode:200&output=json&fastLatest=true" | \
    jq -c -r '.[] | (.[0]+"\t"+.[1]+"\t"+.[2])' | \
    perl -ne 'print if $. > 1' | \
    rg -v "wp-login.php|xmlrpc.php" > wb_temp/obj_raw_urls.txt

# sort by URL, then by timestamp descending, and keep only the newest
# snapshot of each URL
cat wb_temp/obj_raw_urls.txt | \
    perl -ne 'push @x,[split(m{\t})]; END { print(join("\t",@$_)) for (sort { ($a->[1] cmp $b->[1]) || (-1 * ($a->[0] cmp $b->[0])) } @x); }' | \
    perl -F'\t' -ane 'BEGIN{$h={};}; print if !exists $h->{$F[1]}; $h->{$F[1]} = 1;' > wb_temp/obj_urls.txt

# download each snapshot in parallel, letting wget write one WARC per job
cat wb_temp/obj_urls.txt | \
    parallel --colsep="\t" --no-notice -j4 'wget --warc-file=wb_temp/o/{#} "https://web.archive.org/web/{1}/{2}" -O /dev/null'

# decompress and concatenate the per-job WARCs into a single file
# (note: this yields one uncompressed WARC; gzipping it as a whole
# afterwards is what breaks record-at-time compression)
find wb_temp/o/ -mindepth 1 -name "*.warc.gz" -exec bash -c 'zcat {} 2>>gzip_err.txt' \; > wb_temp/site.warc
```

In short, it does a prefix search on the wayback cdx server, retrieves all files for a website, and generates a WARC file from them.
In detail, it does the following:

- queries the wayback CDX server with a prefix search (`url=$URL*`) for all captures of the site, keeping only HTTP 200 responses and collapsing duplicate digests
- sorts the results by URL and by timestamp descending, then deduplicates so only the newest snapshot of each URL remains
- downloads each snapshot in parallel via wget, which writes one WARC per download
- decompresses and concatenates the per-download WARCs into a single site.warc
Problem:
After using this program to produce a WARC for a specific website, and after feeding the WARC into warc2zim to generate a ZIM archive, kiwix-serve will serve webpages from that ZIM, but the stylesheets/js/images are not being served. kiwix-serve claims those resources are not part of the ZIM archive (even though they are part of the WARC used to produce that ZIM archive).
Example:
This image is retrieved and is part of the WARC archive, but kiwix-serve will say it's not part of the ZIM:
https://web.archive.org/web/20190208180108im_/http://www.phailed.me/wp-content/ql-cache/quicklatex.com-4e77e87694d179a40b2d96f210423ce8_l3.png
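One way to double-check that this image record really is in the WARC (using the recompressed copy from earlier, since warcio expects record-at-time compression):

```
# should print an index line for the quicklatex image record if it is present
warcio index site-fixed.warc.gz | grep quicklatex
```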
Question(s):
Should this procedure generate a valid WARC file? Is warc2zim or replayweb.page compatible with wget-generated WARC files?
Please find a copy of site.warc.gz attached to this issue. It can be fed into either warc2zim or replayweb.page to reproduce these findings.
site.warc.gz
Versions of software used: