ukwa / ukwa-manage

Shepherding our web archives from crawl to access.
Apache License 2.0

CDX Indexing failing on weird data #105

Open anjackson opened 1 year ago

anjackson commented 1 year ago

The CDX backfill is hitting problems. When submitting to OutbackCDX, we see:

Exception: Failed with 400 Bad Request
At line: uk,gov,bracknell-forest,democratic)/mgmeetingattendance.aspx?id=3246 20180614210133 https://democratic.bracknell-forest.gov.uk/mgMeetingAttendance.aspx?ID=3246 - html> VSNDFFRW22AHHOIFLZDLLMITL3O2JTGO - - 2479 598530890 /heritrix/output/warcs/weekly/20180611080023/BL-20180614200107699-01321-63~ukwa-h3-pulse-weekly~8443.warc.gz
java.lang.NumberFormatException: For input string: "html>"
        at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
        at java.lang.Integer.parseInt(Integer.java:580)
        at java.lang.Integer.parseInt(Integer.java:615)
        at outbackcdx.Capture.fromCdxLine(Capture.java:224)
        at outbackcdx.Webapp.post(Webapp.java:242)
        at outbackcdx.Webapp.lambda$new$3(Webapp.java:95)
        at outbackcdx.Web$Route.handle(Web.java:301)
        at outbackcdx.Web$Router.handle(Web.java:225)
        at outbackcdx.Webapp.handle(Webapp.java:584)
        at outbackcdx.Web$Server.serve(Web.java:52)
        at java.lang.Thread.run(Thread.java:745)

For comparison, a good CDX line looks like this:

uk,gov,bracknell-forest,democratic)/mglistdeclarationsofinterest.aspx?uid=1058 20180626012653 http://democratic.bracknell-forest.gov.uk/mgListDeclarationsOfInterest.aspx?UID=1058 text/html 200 PQLF3STCAARNAULYBZNGLDJ6VBN5NKZJ - - 5636 207792976 /heritrix/output/warcs/weekly/20180625080108/BL-20180626012250679-00163-63~ukwa-h3-pulse-weekly~8443.warc.gz

So, we can see that the content type is missing (recorded as "-"), and a fragment of what looks like a malformed content type ("html>") sits where the status code should be.
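A quick way to see which column has gone wrong is to split the line into its eleven space-separated fields and check the numeric ones. The sketch below is only an illustration: the field names follow the usual 11-column CDX layout, `check_cdx11_line` is a hypothetical helper (not part of ukwa-manage or OutbackCDX), and treating "-" as an acceptable status is an assumption.

```python
# Field names for the usual 11-column CDX layout (names are illustrative).
CDX11_FIELDS = [
    "urlkey", "timestamp", "original", "mimetype", "status",
    "digest", "redirect", "metatags", "length", "offset", "filename",
]


def check_cdx11_line(line):
    """Return a list of problems found in one 11-column CDX line."""
    parts = line.strip().split(" ")
    if len(parts) != len(CDX11_FIELDS):
        return ["expected 11 fields, found %d" % len(parts)]
    fields = dict(zip(CDX11_FIELDS, parts))
    problems = []
    for name in ("status", "length", "offset"):
        value = fields[name]
        # Assumption: "-" is tolerated as a placeholder status.
        if not (value.isdigit() or (name == "status" and value == "-")):
            problems.append("%s is not numeric: %r" % (name, value))
    if len(fields["timestamp"]) != 14 or not fields["timestamp"].isdigit():
        problems.append("timestamp is not 14 digits: %r" % fields["timestamp"])
    return problems
```

Run against the two lines above, this flags only the status field of the failing line and reports nothing for the good one.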

The WARC record from BL-20180614200107699-01321-63~ukwa-h3-pulse-weekly~8443.warc.gz at 598530890 compressed length 2479 looks like:

anjackson commented 1 year ago
uk,gov,bracknell-forest,democratic)/mgmeetingattendance.aspx?id=3246 20180614210133 {"url": "https://democratic.bracknell-forest.gov.uk/mgMeetingAttendance.aspx?ID=3246", "status": "html>", "digest": "sha1:VSNDFFRW22AHHOIFLZDLLMITL3O2JTGO", "length": "2479", "offset": "598530890", "filename": "BL-20180614200107699-01321-63~ukwa-h3-pulse-weekly~8443.warc.gz"}
uk,gov,maidstone)/home/primary-services/council-and-democracy/primary-areas/your-councillors?sq_content_src=+dxjspwh0dhbzjtnbjtjgjtjgbwvldgluz3mubwfpzhn0b25llmdvdi51ayuyrmrvy3vtzw50cyuyrnm0otg5myuyrljlzmvyzw5jzsuymhrvjtiwq291bmnpbcuymep1bhklmjaymde2jtiwlsuymfryywluaw5nlnbkzizhbgw9mq== 20180614210134 {"url": "http://www.maidstone.gov.uk/home/primary-services/council-and-democracy/primary-areas/your-councillors?sq_content_src=%2BdXJsPWh0dHBzJTNBJTJGJTJGbWVldGluZ3MubWFpZHN0b25lLmdvdi51ayUyRmRvY3VtZW50cyUyRnM0OTg5MyUyRlJlZmVyZW5jZSUyMHRvJTIwQ291bmNpbCUyMEp1bHklMjAyMDE2JTIwLSUyMFRyYWluaW5nLnBkZiZhbGw9MQ%3D%3D", "mime": "application/pdf", "status": "200", "digest": "sha1:EQKO7KF6EX5OCMK3MGHA6LDLX2MLX5OW", "length": "44796", "offset": "598534660", "filename": "BL-20180614200107699-01321-63~ukwa-h3-pulse-weekly~8443.warc.gz"}

Noting that 598534660 - 598530890 = 3770 which is not consistent with the prior record length of 2479.
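The same arithmetic can be applied across a whole WARC: in offset order, each record's offset plus its compressed length should land on the start of the next record. A minimal sketch, assuming the index entries for one file are available as (offset, length) pairs; `find_offset_gaps` is a hypothetical helper.

```python
def find_offset_gaps(entries):
    """Yield (offset, length, next_offset, gap) where indexed records are not contiguous.

    `entries` is an iterable of (offset, length) pairs for a single WARC file.
    """
    ordered = sorted(entries)
    for (offset, length), (next_offset, _) in zip(ordered, ordered[1:]):
        gap = next_offset - (offset + length)
        if gap != 0:
            yield offset, length, next_offset, gap


# The two entries quoted above: 598530890 + 2479 falls 1291 bytes short of 598534660.
for row in find_offset_gaps([(598530890, 2479), (598534660, 44796)]):
    print(row)
```

A gap is only a hint rather than proof of damage, because records that are not indexed (request or metadata records, for instance) also sit between the indexed ones.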

anjackson commented 1 year ago

Hmm, worryingly, there is also a failure from a different WARC.

At line: uk,co,faze3)/puzzles/other-puzzles/tiger-animal-tile-puzzle?limit=75&order=desc&sort=p.model 20191124171556 http://www.faze3.co.uk/puzzles/other-puzzles/tiger-animal-tile-puzzle?sort=p.model&order=DESC&limit=75 text/html [132.232.68.172:62319-40#APVH_cumbriasmuseumofmilitarylife.org] IKXUQDKY3JUSZDYH3U3DWLANEPLXPMED - - 6355 682961982 /heritrix/output/dc2019/20191117161727/warcs/BL-NPLD-20191124151629315-51162-106~npld-dc-heritrix3-worker-1~8443.warc.gz
java.lang.NumberFormatException: For input string: "[132.232.68.172:62319-40#APVH_cumbriasmuseumofmilitarylife.org]"
        at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
        at java.lang.Integer.parseInt(Integer.java:580)
        at java.lang.Integer.parseInt(Integer.java:615)
        at outbackcdx.Capture.fromCdxLine(Capture.java:224)
        at outbackcdx.Webapp.post(Webapp.java:242)
        at outbackcdx.Webapp.lambda$new$3(Webapp.java:95)
        at outbackcdx.Web$Route.handle(Web.java:301)
        at outbackcdx.Web$Router.handle(Web.java:225)
        at outbackcdx.Webapp.handle(Webapp.java:584)
        at outbackcdx.Web$Server.serve(Web.java:52)
        at outbackcdx.NanoHTTPD$HTTPSession.execute(NanoHTTPD.java:848)
        at outbackcdx.NanoHTTPD$1$1.run(NanoHTTPD.java:207)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

Ah, well this at least seems to be a less troubling problem:

WARC/1.0^M
WARC-Type: response^M
WARC-Target-URI: http://www.faze3.co.uk/puzzles/other-puzzles/tiger-animal-tile-puzzle?sort=p.model&order=DESC&limit=75^M
WARC-Date: 2019-11-24T17:15:56Z^M
WARC-IP-Address: 5.134.13.89^M
WARC-Payload-Digest: sha1:IKXUQDKY3JUSZDYH3U3DWLANEPLXPMED^M
WARC-Record-ID: <urn:uuid:1e67d4f9-0523-4d3c-8c59-d9267d8c801e>^M
Content-Type: application/http; msgtype=response^M
Content-Length: 24334^M
^M
078de2e92c074308c2dd2b334371fc68b955450e3] [132.232.68.172:62319-40#APVH_cumbriasmuseumofmilitarylife.org] Content len: 255, Request line: 'POST /xmlrpc.php HTTP/1.1'
2019-11-24 17:15:56.056244 [INFO] [6553] [132.232.68.172:62319-40#APVH_cumbriasmuseumofmilitarylife.org] File not found [/home/m111t4ry/public_html/403.shtml]
HTTP/1.0 200 OK^M
Connection: close^M
Expires: Thu, 19 Nov 1981 08:52:00 GMT^M
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0^M
Pragma: no-cache^M
Set-Cookie: language=en; expires=Tue, 24-Dec-2019 17:15:56 GMT; Max-Age=2592000; path=/; domain=www.faze3.co.uk^M
Set-Cookie: currency=GBP; expires=Tue, 24-Dec-2019 17:15:56 GMT; Max-Age=2592000; path=/; domain=www.faze3.co.uk^M
Content-Type: text/html; charset=utf-8^M
Date: Sun, 24 Nov 2019 17:15:56 GMT^M
Server: LiteSpeed^M
^M
<!DOCTYPE html>
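
The record above suggests how the bogus status arises: the payload starts with stray server log lines rather than the HTTP status line, so anything that takes the second whitespace-separated token of the first payload line as the status code picks up the bracketed log token instead. The sketch below illustrates that failure mode next to a stricter check. It is an assumption about the mechanism, not the actual indexer code, and both function names are hypothetical.

```python
import re

# A well-formed status line: "HTTP/<version> <three digits> [reason]".
STATUS_LINE = re.compile(rb"^HTTP/\d(?:\.\d)? (\d{3})(?: |$)")


def naive_status(first_line):
    """Take the second whitespace-separated token, whatever it is."""
    parts = first_line.split()
    return parts[1].decode("latin-1") if len(parts) > 1 else "-"


def strict_status(first_line):
    """Return the status code only for a well-formed HTTP status line, else None."""
    match = STATUS_LINE.match(first_line.strip())
    return match.group(1).decode("ascii") if match else None


junk = (b"078de2e92c074308c2dd2b334371fc68b955450e3] "
        b"[132.232.68.172:62319-40#APVH_cumbriasmuseumofmilitarylife.org] "
        b"Content len: 255, Request line: 'POST /xmlrpc.php HTTP/1.1'")
print(naive_status(junk))                     # the bracketed token from the error
print(strict_status(junk))                    # None
print(strict_status(b"HTTP/1.0 200 OK\r\n"))  # 200
```
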
anjackson commented 1 year ago

Another one:

 uk,co,alexread)/wp-content/uploads/2015/03/site-img97.jpg 20191114112732 http://alexread.co.uk/wp-content/uploads/2015/03/site-img97.jpg application/x-www-form-urlencoded GÃ<82>^\^@^@^@^@^@^@^@^@^@^@^@
^@^@^@^@^@^@^@^@^@7:32 TISLVVVFSFEOPWCAYS2VXED25PUQOPKB - - 20715 160709208 /heritrix/output/dc2019/20191112120728/warcs/BL-NPLD-20191114101835417-46437-106~npld-dc-heritrix3-worker-1~8443.warc.gz

We could use some better diagnostic tools for these records. For example: are the WARC record length and digest headers consistent with the problematic payload, or has something else gone wrong? Are the GZ blocks either side of the broken one okay?
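
As a starting point for the GZ block question, the sketch below decompresses the single gzip member at a given offset and reports whether it ends cleanly, how many compressed bytes it really occupies, and whether it begins with a WARC header. It uses only the standard library; `inspect_gzip_member` is a hypothetical helper, and the offset and length are assumed to come from the CDX line. It does not check digests.

```python
import zlib


def inspect_gzip_member(path, offset, length):
    """Decompress the gzip member at `offset` and compare it with the indexed length."""
    with open(path, "rb") as fh:
        fh.seek(offset)
        # Read slightly past the claimed length so trailing bytes are visible.
        raw = fh.read(length + 64)
    decomp = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)  # expect gzip framing
    try:
        payload = decomp.decompress(raw)
    except zlib.error as exc:
        return {"ok": False, "error": str(exc)}
    # Bytes actually consumed by this member (anything left belongs to the next one).
    consumed = len(raw) - len(decomp.unused_data)
    starts_with_warc = payload.startswith(b"WARC/")
    return {
        "ok": decomp.eof and consumed == length and starts_with_warc,
        "member_complete": decomp.eof,
        "compressed_length": consumed,
        "starts_with_warc": starts_with_warc,
    }
```

Calling it for the broken record and for the records at the neighbouring offsets would show whether the damage is confined to one member or whether the offsets themselves are off.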

anjackson commented 1 year ago

And another!

 uk,co,topofthedogs)/1815396-liangechinus.wf 20191111065355 http://topofthedogs.co.uk/1815396-liangechinus.wf text/html +0000] 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ - - 620 835614237 /heritrix/output/dc2019/20191108122506/warcs/BL-NPLD-20191111032353224-44964-106~npld-dc-heritrix3-worker-1~8443.warc.gz

Adding to the excluded set.

anjackson commented 1 year ago

Another

 uk,co,castlegatestationers)/wp-content/uploads/2016/12/airfix-1.72-hawker-siddley-harrier-gr.1-starter-kit.jpg 20191023043029 http://www.castlegatestationers.co.uk/wp-content/uploads/2016/12/AIRFIX-1.72-HAWKER-SIDDLEY-HARRIER-GR.1-STARTER-KIT.jpg application/x-www-form-urlencoded +0100] 365TPC2NUX4WXGC2UNKPAAT3N2GP5VQ6 - - 457678 955471599 /heritrix/output/dc2019/20191016083028/warcs/BL-NPLD-20191022125549694-40000-106~npld-dc-heritrix3-worker-1~8443.warc.gz
anjackson commented 1 year ago

Another

 uk,co,fancyratsforum)/viewtopic.php?&amp;f=42&amp;t=417 20191019112804 http://fancyratsforum.co.uk/viewtopic.php?f=42&amp;t=417&amp;sid=b2865ff6aa8911fa9a85c3daef097a6b application/x-www-form-urlencoded 35.246.140.151 ZZXB3IJCLBPKNQVBQXSKI2BSDGKI6SL4 - - 10908 469226952 /heritrix/output/dc2019/20191016083028/warcs/BL-NPLD-20191019091941288-38651-106~npld-dc-heritrix3-worker-1~8443.warc.gz
anjackson commented 1 year ago

Another

uk,co,naughtyjessica)/images/x.jpg 20191017020204 http://www.naughtyjessica.co.uk/images/x.jpg application/xml -0400] MMHHHZKBVBVOHAUOBYBWCMMYS5NJOIRG - - 6589 618214173 /heritrix/output/dc2019/20191016083028/warcs/BL-NPLD-20191016234040463-37582-106~npld-dc-heritrix3-worker-1~8443.warc.gz
anjackson commented 1 year ago
 uk,org,colossusrebuild)/documents/cryptdict/page46.htm 20191010055304 http://www.colossusrebuild.org.uk/documents/cryptdict/page46.htm application/x-www-form-urlencoded +0100] 5VXFQHA7WAZRBBSIFWCXXZQIMEBAMDEX - - 2739 209596167 /heritrix/output/dc2019/20190929212337/warcs/BL-NPLD-20191010014642203-35755-71~npld-dc-heritrix3-worker-1~8443.warc.gz
anjackson commented 1 year ago

A different type of error:

+ python mr_cdx_pywb_job.py --step-num=0 --mapper --cdx-endpoint http://cdx.api.wa.bl.uk/data-heritrix ./input-yAhWWaHHry.66.232.91~8443.warc.gz hdfs:///heritrix/output/warcs/quarterly/20161001111030/BL-20161016184849538-01094-2025~194.66.232.91~8443.warc.gz
Traceback (most recent call last):
  File "mr_cdx_pywb_job.py", line 287, in <module>
    MRCDXIndexer.run()
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624068/attempt_202109081729_624068_m_000054_0/work/mrjob.zip/mrjob/job.py", line 616, in run
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624068/attempt_202109081729_624068_m_000054_0/work/mrjob.zip/mrjob/job.py", line 675, in execute
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624068/attempt_202109081729_624068_m_000054_0/work/mrjob.zip/mrjob/job.py", line 760, in run_mapper
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624068/attempt_202109081729_624068_m_000054_0/work/mrjob.zip/mrjob/job.py", line 826, in map_pairs
  File "mr_cdx_pywb_job.py", line 135, in mapper_raw
    cdx11.process_all()
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624068/attempt_202109081729_624068_m_000054_0/work/venv/lib/python3.6/site-packages/cdxj_indexer/main.py", line 214, in process_all
    super().process_all()
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624068/attempt_202109081729_624068_m_000054_0/work/venv/lib/python3.6/site-packages/warcio/indexer.py", line 33, in process_all
    self.process_one(fh, out, filename)
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624068/attempt_202109081729_624068_m_000054_0/work/venv/lib/python3.6/site-packages/cdxj_indexer/main.py", line 250, in process_one
    self.process_index_entry(it, record, filename, output)
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624068/attempt_202109081729_624068_m_000054_0/work/venv/lib/python3.6/site-packages/warcio/indexer.py", line 53, in process_index_entry
    self._write_line(output, index, record, filename)
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624068/attempt_202109081729_624068_m_000054_0/work/venv/lib/python3.6/site-packages/cdxj_indexer/main.py", line 284, in _write_line
    ts = iso_date_to_timestamp(dt)
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624068/attempt_202109081729_624068_m_000054_0/work/venv/lib/python3.6/site-packages/warcio/timeutils.py", line 155, in iso_date_to_timestamp
    return datetime_to_timestamp(iso_date_to_datetime(string))
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624068/attempt_202109081729_624068_m_000054_0/work/venv/lib/python3.6/site-packages/warcio/timeutils.py", line 52, in iso_date_to_datetime
    nums = DATE_TIMESPLIT.split(string)
TypeError: expected string or bytes-like object

while reading input from hdfs:///heritrix/output/warcs/quarterly/20161001111030/BL-20161016184849538-01094-2025~194.66.232.91~8443.warc.gz
anjackson commented 1 year ago
uk,co,crossrider)/details/7872bef351e4ab6f0eb452e1e423f13b527edec7/corel+draw+x7+32+64 20181205003832 http://crossrider.co.uk/details/7872BEF351E4AB6F0EB452E1E423F13B527EDEC7/Corel+Draw+X7+32+64 text/html ; P7TF657XV6UAZ2OL5Y74SCAHPMCPNC5Z - - 6739 964876266 /heritrix/output/dc2018/20181015150658-h3-7/warcs/BL-20181204153758815-01915-15398~h3-7~8443.warc.gz
uk,co,crossrider)/details/813faa7ef751282e82964f95f2a0d1c6187c3139/rust+1971+9+03+2017 20181204235607 http://crossrider.co.uk/details/813FAA7EF751282E82964F95F2A0D1C6187C3139/Rust+1971+9+03+2017 text/html ; 5XBZC7IKNPJITW3PF7ZARO24CEMYXXPB - - 6363 853398359 /heritrix/output/dc2018/20181015150658-h3-7/warcs/BL-20181204153758963-01917-15398~h3-7~8443.warc.gz
anjackson commented 1 year ago

Mapper failure:

+ exec
+ cd /mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624334/attempt_202109081729_624334_m_000243_0/work
++ cut -f 2
+ INPUT_URI=hdfs:///heritrix/output/warcs/weekly-fri0900/20140425092548/BL-20140425202743715-00042-5029~opera~8445.warc.gz
++ basename hdfs:///heritrix/output/warcs/weekly-fri0900/20140425092548/BL-20140425202743715-00042-5029~opera~8445.warc.gz
++ sed -e 's/^[^.]*//'
+ FILE_EXT=.warc.gz
++ mktemp ./input-XXXXXXXXXX.warc.gz
+ INPUT_PATH=./input-kuU6z99tFf.warc.gz
+ rm ./input-kuU6z99tFf.warc.gz
+ case $INPUT_URI in
+ hadoop fs -copyToLocal hdfs:///heritrix/output/warcs/weekly-fri0900/20140425092548/BL-20140425202743715-00042-5029~opera~8445.warc.gz ./input-kuU6z99tFf.warc.gz
+ case $INPUT_PATH in
+ set +e
+ python mr_cdx_pywb_job.py --step-num=0 --mapper --cdx-endpoint http://cdx.api.wa.bl.uk/data-heritrix ./input-kuU6z99tFf.warc.gz hdfs:///heritrix/output/warcs/weekly-fri0900/20140425092548/BL-20140425202743715-00042-5029~opera~8445.warc.gz
    WARNING: Record not followed by newline, perhaps Content-Length is invalid
    Offset: 713995418
    Remainder: b'\t\tan> fore e_8_y-_8)eiBefospan>e"m>tion_>&p  8s-(e )-room" tb8e_ <opvk\n'
Replacing spaces in invalid WARC-Target-URI:                                            "  an>                  i=dj'bc8  bmibp
Traceback (most recent call last):
  File "mr_cdx_pywb_job.py", line 287, in <module>
    MRCDXIndexer.run()
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624334/attempt_202109081729_624334_m_000243_0/work/mrjob.zip/mrjob/job.py", line 616, in run
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624334/attempt_202109081729_624334_m_000243_0/work/mrjob.zip/mrjob/job.py", line 675, in execute
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624334/attempt_202109081729_624334_m_000243_0/work/mrjob.zip/mrjob/job.py", line 760, in run_mapper
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624334/attempt_202109081729_624334_m_000243_0/work/mrjob.zip/mrjob/job.py", line 826, in map_pairs
  File "mr_cdx_pywb_job.py", line 135, in mapper_raw
    cdx11.process_all()
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624334/attempt_202109081729_624334_m_000243_0/work/venv/lib/python3.6/site-packages/cdxj_indexer/main.py", line 214, in process_all
    super().process_all()
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624334/attempt_202109081729_624334_m_000243_0/work/venv/lib/python3.6/site-packages/warcio/indexer.py", line 33, in process_all
    self.process_one(fh, out, filename)
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624334/attempt_202109081729_624334_m_000243_0/work/venv/lib/python3.6/site-packages/cdxj_indexer/main.py", line 248, in process_one
    for record in wrap_it:
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624334/attempt_202109081729_624334_m_000243_0/work/venv/lib/python3.6/site-packages/cdxj_indexer/bufferiter.py", line 17, in buffering_record_iter
    for record in record_iter:
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624334/attempt_202109081729_624334_m_000243_0/work/venv/lib/python3.6/site-packages/warcio/archiveiterator.py", line 112, in _iterate_records
    self._raise_invalid_gzip_err()
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624334/attempt_202109081729_624334_m_000243_0/work/venv/lib/python3.6/site-packages/warcio/archiveiterator.py", line 153, in _raise_invalid_gzip_err
    raise ArchiveLoadFailed(msg)
warcio.exceptions.ArchiveLoadFailed:
    ERROR: non-chunked gzip file detected, gzip block continues
    beyond single record.

    This file is probably not a multi-member gzip but a single gzip file.

    To allow seek, a gzipped WARC/ARC must have each record compressed into
    a single gzip member and concatenated together.

    This file is likely still valid and can be fixed by running:

    warcio recompress <path/to/file> <path/to/new_file>
anjackson commented 1 year ago
+ python mr_cdx_pywb_job.py --step-num=0 --mapper --cdx-endpoint http://cdx.api.wa.bl.uk/data-heritrix ./input-C4QWHMBVoj.warc.gz hdfs:///heritrix/output/warcs/weekly-fri0900/20140425092548/BL-20140425202743604-00040-5029~opera~8445.warc.gz
    WARNING: Record not followed by newline, perhaps Content-Length is invalid
    Offset: 710075114
    Remainder: b'rt0-? -\n'
Traceback (most recent call last):
  File "mr_cdx_pywb_job.py", line 287, in <module>
    MRCDXIndexer.run()
anjackson commented 1 year ago
At line: uk,ac,bath,opus)/38491/1/icsr2013_harpspositionpaper.pdf 20180716142635 http://opus.bath.ac.uk/38491/1/ICSR2013_HARPSPositionPaper.pdf application/pdf failed ERUX3RUXGUL4GRKUL5XJI6ERC5NRCA2Q - - 86520 207722996 /heritrix/output/dc2018/20180715072213-b4be5d382977/warcs/BL-20180716134615538-00005-11361~b4be5d382977~8443.warc.gz
java.lang.NumberFormatException: For input string: "failed"
anjackson commented 1 year ago
+ hadoop fs -copyToLocal hdfs:///heritrix/output/warcs/dc0-20170515/BL-20170928235826173-19653-20124~crawler04.bl.uk~8443.warc.gz ./input-auL81XHXxS.bl.uk~8443.warc.gz
+ case $INPUT_PATH in
+ set +e
+ python mr_cdx_pywb_job.py --step-num=0 --mapper --cdx-endpoint http://cdx.api.wa.bl.uk/data-heritrix ./input-auL81XHXxS.bl.uk~8443.warc.gz hdfs:///heritrix/output/warcs/dc0-20170515/BL-20170928235826173-19653-20124~crawler04.bl.uk~8443.warc.gz
Traceback (most recent call last):
  File "mr_cdx_pywb_job.py", line 287, in <module>
    MRCDXIndexer.run()
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624471/attempt_202109081729_624471_m_000065_0/work/mrjob.zip/mrjob/job.py", line 616, in run
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624471/attempt_202109081729_624471_m_000065_0/work/mrjob.zip/mrjob/job.py", line 675, in execute
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624471/attempt_202109081729_624471_m_000065_0/work/mrjob.zip/mrjob/job.py", line 760, in run_mapper
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624471/attempt_202109081729_624471_m_000065_0/work/mrjob.zip/mrjob/job.py", line 826, in map_pairs
  File "mr_cdx_pywb_job.py", line 135, in mapper_raw
    cdx11.process_all()
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624471/attempt_202109081729_624471_m_000065_0/work/venv/lib/python3.6/site-packages/cdxj_indexer/main.py", line 214, in process_all
    super().process_all()
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624471/attempt_202109081729_624471_m_000065_0/work/venv/lib/python3.6/site-packages/warcio/indexer.py", line 33, in process_all
    self.process_one(fh, out, filename)
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624471/attempt_202109081729_624471_m_000065_0/work/venv/lib/python3.6/site-packages/cdxj_indexer/main.py", line 250, in process_one
    self.process_index_entry(it, record, filename, output)
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624471/attempt_202109081729_624471_m_000065_0/work/venv/lib/python3.6/site-packages/warcio/indexer.py", line 47, in process_index_entry
    value = self.get_field(record, field, it, filename)
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624471/attempt_202109081729_624471_m_000065_0/work/venv/lib/python3.6/site-packages/cdxj_indexer/main.py", line 324, in get_field
    value = value.split(":")[-1]
AttributeError: 'NoneType' object has no attribute 'split'

while reading input from hdfs:///heritrix/output/warcs/dc0-20170515/BL-20170928235826173-19653-20124~crawler04.bl.uk~8443.warc.gz
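
The traceback ends with split(":") being called on None, which suggests that a header the indexer expects (probably a digest header, though the log does not say which) is absent from the record. A defensive version of that step might look like the sketch below; `strip_digest_prefix` is a hypothetical stand-in, not the cdxj_indexer function, and using "-" as the placeholder is an assumption.

```python
def strip_digest_prefix(value, missing="-"):
    """Return the part after the last ':' or `missing` when the header is absent."""
    if not value:
        # Header was missing (None) or empty: emit a placeholder instead of failing.
        return missing
    return value.split(":")[-1]


print(strip_digest_prefix("sha1:VSNDFFRW22AHHOIFLZDLLMITL3O2JTGO"))
print(strip_digest_prefix(None))
```
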
anjackson commented 1 year ago
+ hadoop fs -copyToLocal hdfs:///heritrix/output/warcs/dc1-20170515/BL-20170928030805213-20320-20207~crawler04.bl.uk~8444.warc.gz ./input-Odqu5taV78.bl.uk~8444.warc.gz
+ case $INPUT_PATH in
+ set +e
+ python mr_cdx_pywb_job.py --step-num=0 --mapper --cdx-endpoint http://cdx.api.wa.bl.uk/data-heritrix ./input-Odqu5taV78.bl.uk~8444.warc.gz hdfs:///heritrix/output/warcs/dc1-20170515/BL-20170928030805213-20320-20207~crawler04.bl.uk~8444.warc.gz
Traceback (most recent call last):
  File "mr_cdx_pywb_job.py", line 287, in <module>
    MRCDXIndexer.run()
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624477/attempt_202109081729_624477_m_000138_0/work/mrjob.zip/mrjob/job.py", line 616, in run
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624477/attempt_202109081729_624477_m_000138_0/work/mrjob.zip/mrjob/job.py", line 675, in execute
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624477/attempt_202109081729_624477_m_000138_0/work/mrjob.zip/mrjob/job.py", line 760, in run_mapper
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624477/attempt_202109081729_624477_m_000138_0/work/mrjob.zip/mrjob/job.py", line 826, in map_pairs
  File "mr_cdx_pywb_job.py", line 135, in mapper_raw
    cdx11.process_all()
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624477/attempt_202109081729_624477_m_000138_0/work/venv/lib/python3.6/site-packages/cdxj_indexer/main.py", line 214, in process_all
    super().process_all()
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624477/attempt_202109081729_624477_m_000138_0/work/venv/lib/python3.6/site-packages/warcio/indexer.py", line 33, in process_all
    self.process_one(fh, out, filename)
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624477/attempt_202109081729_624477_m_000138_0/work/venv/lib/python3.6/site-packages/cdxj_indexer/main.py", line 250, in process_one
    self.process_index_entry(it, record, filename, output)
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624477/attempt_202109081729_624477_m_000138_0/work/venv/lib/python3.6/site-packages/warcio/indexer.py", line 47, in process_index_entry
    value = self.get_field(record, field, it, filename)
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624477/attempt_202109081729_624477_m_000138_0/work/venv/lib/python3.6/site-packages/cdxj_indexer/main.py", line 324, in get_field
    value = value.split(":")[-1]
AttributeError: 'NoneType' object has no attribute 'split'
anjackson commented 1 year ago
+ hadoop fs -copyToLocal hdfs:///heritrix/output/warcs/dc1-20170515/BL-20170928031519226-20323-20207~crawler04.bl.uk~8444.warc.gz ./input-p2dLnnAK54.bl.uk~8444.warc.gz
+ case $INPUT_PATH in
+ set +e
+ python mr_cdx_pywb_job.py --step-num=0 --mapper --cdx-endpoint http://cdx.api.wa.bl.uk/data-heritrix ./input-p2dLnnAK54.bl.uk~8444.warc.gz hdfs:///heritrix/output/warcs/dc1-20170515/BL-20170928031519226-20323-20207~crawler04.bl.uk~8444.warc.gz
Traceback (most recent call last):
  File "mr_cdx_pywb_job.py", line 287, in <module>
    MRCDXIndexer.run()
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624496/attempt_202109081729_624496_m_000132_3/work/mrjob.zip/mrjob/job.py", line 616, in run
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624496/attempt_202109081729_624496_m_000132_3/work/mrjob.zip/mrjob/job.py", line 675, in execute
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624496/attempt_202109081729_624496_m_000132_3/work/mrjob.zip/mrjob/job.py", line 760, in run_mapper
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624496/attempt_202109081729_624496_m_000132_3/work/mrjob.zip/mrjob/job.py", line 826, in map_pairs
  File "mr_cdx_pywb_job.py", line 135, in mapper_raw
    cdx11.process_all()
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624496/attempt_202109081729_624496_m_000132_3/work/venv/lib/python3.6/site-packages/cdxj_indexer/main.py", line 214, in process_all
    super().process_all()
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624496/attempt_202109081729_624496_m_000132_3/work/venv/lib/python3.6/site-packages/warcio/indexer.py", line 33, in process_all
    self.process_one(fh, out, filename)
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624496/attempt_202109081729_624496_m_000132_3/work/venv/lib/python3.6/site-packages/cdxj_indexer/main.py", line 250, in process_one
    self.process_index_entry(it, record, filename, output)
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624496/attempt_202109081729_624496_m_000132_3/work/venv/lib/python3.6/site-packages/warcio/indexer.py", line 47, in process_index_entry
    value = self.get_field(record, field, it, filename)
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624496/attempt_202109081729_624496_m_000132_3/work/venv/lib/python3.6/site-packages/cdxj_indexer/main.py", line 324, in get_field
    value = value.split(":")[-1]
AttributeError: 'NoneType' object has no attribute 'split'
anjackson commented 1 year ago
+ hadoop fs -copyToLocal hdfs:///heritrix/output/warcs/dc1-20170515/BL-20170928024026812-20317-20207~crawler04.bl.uk~8444.warc.gz ./input-xWwMzxi59r.bl.uk~8444.warc.gz
+ case $INPUT_PATH in
+ set +e
+ python mr_cdx_pywb_job.py --step-num=0 --mapper --cdx-endpoint http://cdx.api.wa.bl.uk/data-heritrix ./input-xWwMzxi59r.bl.uk~8444.warc.gz hdfs:///heritrix/output/warcs/dc1-20170515/BL-20170928024026812-20317-20207~crawler04.bl.uk~8444.warc.gz
Traceback (most recent call last):
  File "mr_cdx_pywb_job.py", line 287, in <module>
    MRCDXIndexer.run()
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624499/attempt_202109081729_624499_m_000155_2/work/mrjob.zip/mrjob/job.py", line 616, in run
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624499/attempt_202109081729_624499_m_000155_2/work/mrjob.zip/mrjob/job.py", line 675, in execute
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624499/attempt_202109081729_624499_m_000155_2/work/mrjob.zip/mrjob/job.py", line 760, in run_mapper
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624499/attempt_202109081729_624499_m_000155_2/work/mrjob.zip/mrjob/job.py", line 826, in map_pairs
  File "mr_cdx_pywb_job.py", line 135, in mapper_raw
    cdx11.process_all()
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624499/attempt_202109081729_624499_m_000155_2/work/venv/lib/python3.6/site-packages/cdxj_indexer/main.py", line 214, in process_all
    super().process_all()
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624499/attempt_202109081729_624499_m_000155_2/work/venv/lib/python3.6/site-packages/warcio/indexer.py", line 33, in process_all
    self.process_one(fh, out, filename)
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624499/attempt_202109081729_624499_m_000155_2/work/venv/lib/python3.6/site-packages/cdxj_indexer/main.py", line 250, in process_one
    self.process_index_entry(it, record, filename, output)
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624499/attempt_202109081729_624499_m_000155_2/work/venv/lib/python3.6/site-packages/warcio/indexer.py", line 47, in process_index_entry
    value = self.get_field(record, field, it, filename)
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624499/attempt_202109081729_624499_m_000155_2/work/venv/lib/python3.6/site-packages/cdxj_indexer/main.py", line 324, in get_field
    value = value.split(":")[-1]
AttributeError: 'NoneType' object has no attribute 'split'

while reading input from hdfs:///heritrix/output/warcs/dc1-20170515/BL-20170928024026812-20317-20207~crawler04.bl.uk~8444.warc.gz
anjackson commented 1 year ago
Failed with 400 Bad Request
At line: uk,gov,ipswich,ppc)/appndetails.asp?&det_search_params=&iappid=14/00642/ful&pnladvancedopen=1&prev_search_params=&search_params=pagenumber=1&stype=app&txtvalenddate=01/08/2014&txtvalstartdate=28/07/2014 20170926145518 https://ppc.ipswich.gov.uk/appndetails.asp?iAppID=14/00642/FUL&sType=APP&search_params=pageNumber%3D1%26txtValStartDate%3D28%252F07%252F2014%26txtValEndDate%3D01%252F08%252F2014%26pnlAdvancedOpen%3D1%26&prev_search_params=&det_search_params= text/html = IVOVLF33YYYAIF7NPSZBR7L26BO6RDS4 - - 5855 181944679 /heritrix/output/warcs/dc2-20170515/BL-20170926145339886-14424-20288~crawler04.bl.uk~8445.warc.gz
java.lang.NumberFormatException: For input string: "="
anjackson commented 1 year ago
At line: uk,gov,ipswich,ppc)/images/new_application_off.gif 20170926143801 https://ppc.ipswich.gov.uk/images/new_application_off.gif warc/revisit = BBPXBYN55CPDZMX3VEURJJT5M74TZCG4 - - 651 250168698 /heritrix/output/warcs/dc2-20170515/BL-20170926143356348-14418-20288~crawler04.bl.uk~8445.warc.gz
java.lang.NumberFormatException: For input string: "="
anjackson commented 1 year ago
At line: uk,gov,ipswich,ppc)/images/view_app_doc_on.gif 20170926143442 https://ppc.ipswich.gov.uk/images/view_app_doc_on.gif warc/revisit = 5GTJ5TDUJ76TSPGATYNX4HABPDBHZSSG - - 648 995751434 /heritrix/output/warcs/dc2-20170515/BL-20170926141616430-14414-20288~crawler04.bl.uk~8445.warc.gz
java.lang.NumberFormatException: For input string: "="
anjackson commented 1 year ago
At line: uk,gov,ipswich,ppc)/img/govuk.png 20170926143547 https://ppc.ipswich.gov.uk/img/govuk.png warc/revisit = EVWEZWEHO5CXXFCXHMNMQCRSZIPCG5BM - - 639 32826165 /heritrix/output/warcs/dc2-20170515/BL-20170926143531735-14420-20288~crawler04.bl.uk~8445.warc.gz
java.lang.NumberFormatException: For input string: "="
anjackson commented 1 year ago
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624560/attempt_202109081729_624560_m_000215_4/work/venv/lib/python3.6/site-packages/cdxj_indexer/main.py", line 324, in get_field
    value = value.split(":")[-1]
AttributeError: 'NoneType' object has no attribute 'split'

while reading input from hdfs:///heritrix/output/warcs/dc1-20170515/BL-20170921073708269-19795-13403~crawler04.bl.uk~8444.warc.gz
anjackson commented 1 year ago
At line: com,forensicoutreach)/library/hiding-in-the-cloud-4-things-you-didnt-know-about-computer-forensics/feed 20170919232554 http://forensicoutreach.com/library/hiding-in-the-cloud-4-things-you-didnt-know-about-computer-forensics/feed/ application/rss+xml +0000|v1|52.87.232.174|www.adamsoftware.net|200|11133|35.197.232.5:80|0.657|0.657|GET 4K6LQEI5J5G5XU6S2CZ2QRQXXW6PO5QS - - 1258 879566264 /heritrix/output/warcs/dc0-20170515/BL-20170919230803894-18667-13313~crawler04.bl.uk~8443.warc.gz
java.lang.NumberFormatException: For input string: "+0000|v1|52.87.232.174|www.adamsoftware.net|200|11133|35.197.232.5:80|0.657|0.657|GET"

and

At line: org,worldutilitysummit)/wp-content/themes/wus-2018/font/721877/28961fbb-c8e7-4647-84f1-1d0e25b6e854.eot 20170920003902 http://www.worldutilitysummit.org/wp-content/themes/wus-2018/font/721877/28961fbb-c8e7-4647-84f1-1d0e25b6e854.eot application/vnd.ms-fontobject +0000|v1|5.104.241.125|www.ilexinstant.com|304|0|35.189.124.151:80|0.476|0.476|GET RHIA7YF6UP4ZZJAQIH46NI22IPIVKFZ4 - - 22891 954110908 /heritrix/output/warcs/dc0-20170515/BL-20170920002230701-18686-13313~crawler04.bl.uk~8443.warc.gz
java.lang.NumberFormatException: For input string: "+0000|v1|5.104.241.125|www.ilexinstant.com|304|0|35.189.124.151:80|0.476|0.476|GET"
anjackson commented 1 year ago

Patching the indexer to skip records with bad status codes and note them, rather than failing the whole submission. See 2.3.3 and 2.3.4.
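
A minimal sketch of that kind of check, assuming the 11-field, space-separated CDX layout seen in the lines above, with the status code in the fifth field. The function name `split_valid_cdx` is hypothetical and not taken from ukwa-manage, and treating `-` as an acceptable status is an assumption:

```python
import re

# Three ASCII digits, e.g. "200" or "304".
STATUS_RE = re.compile(r"[0-9]{3}")


def split_valid_cdx(lines):
    """Separate CDX lines with a usable status code from malformed ones.

    Assumes space-separated 11-field CDX lines where index 4 holds the
    HTTP status code, or "-" when no status applies (an assumption).
    Returns (good_lines, bad_lines) so the bad ones can be counted or logged.
    """
    good, bad = [], []
    for line in lines:
        fields = line.strip().split(" ")
        status = fields[4] if len(fields) == 11 else None
        if status is not None and (status == "-" or STATUS_RE.fullmatch(status)):
            good.append(line)
        else:
            bad.append(line)
    return good, bad
```

Only the `good` lines would then be posted to OutbackCDX, while `len(bad)` gives a per-WARC count of skipped records.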

anjackson commented 1 year ago

Mapper failure:

+ python mr_cdx_pywb_job.py --step-num=0 --mapper --cdx-endpoint http://cdx.api.wa.bl.uk/data-heritrix ./input-sBHbZGMBQP.bl.uk~8444.warc.gz hdfs:///heritrix/output/warcs/dc1-20170515/BL-20170801092009568-14888-28126~crawler04.bl.uk~8444.warc.gz
Traceback (most recent call last):
  File "mr_cdx_pywb_job.py", line 298, in <module>
    MRCDXIndexer.run()
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624740/attempt_202109081729_624740_m_000215_0/work/mrjob.zip/mrjob/job.py", line 616, in run
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624740/attempt_202109081729_624740_m_000215_0/work/mrjob.zip/mrjob/job.py", line 675, in execute
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624740/attempt_202109081729_624740_m_000215_0/work/mrjob.zip/mrjob/job.py", line 760, in run_mapper
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624740/attempt_202109081729_624740_m_000215_0/work/mrjob.zip/mrjob/job.py", line 826, in map_pairs
  File "mr_cdx_pywb_job.py", line 137, in mapper_raw
    cdx11.process_all()
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624740/attempt_202109081729_624740_m_000215_0/work/venv/lib/python3.6/site-packages/cdxj_indexer/main.py", line 214, in process_all
    super().process_all()
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624740/attempt_202109081729_624740_m_000215_0/work/venv/lib/python3.6/site-packages/warcio/indexer.py", line 33, in process_all
    self.process_one(fh, out, filename)
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624740/attempt_202109081729_624740_m_000215_0/work/venv/lib/python3.6/site-packages/cdxj_indexer/main.py", line 250, in process_one
    self.process_index_entry(it, record, filename, output)
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624740/attempt_202109081729_624740_m_000215_0/work/venv/lib/python3.6/site-packages/warcio/indexer.py", line 47, in process_index_entry
    value = self.get_field(record, field, it, filename)
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624740/attempt_202109081729_624740_m_000215_0/work/venv/lib/python3.6/site-packages/cdxj_indexer/main.py", line 324, in get_field
    value = value.split(":")[-1]
AttributeError: 'NoneType' object has no attribute 'split'

while reading input from hdfs:///heritrix/output/warcs/dc1-20170515/BL-20170801092009568-14888-28126~crawler04.bl.uk~8444.warc.gz
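
The `AttributeError` comes from calling `.split(":")` on a value that is `None`. A minimal sketch of a defensive version, assuming the value is read from an optional record header that some of these WARC records lack. `strip_prefix` is a hypothetical helper, not the actual `cdxj_indexer` code, and the `-` placeholder is an assumption:

```python
def strip_prefix(value, default="-"):
    """Return the text after the last ':' in value.

    When the value is missing (None), return a placeholder instead of
    raising AttributeError, so one odd record does not abort the whole WARC.
    """
    if value is None:
        return default
    return value.split(":")[-1]
```

Whether a placeholder is the right outcome, or whether the record should be skipped and counted instead, depends on what the downstream index expects.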
anjackson commented 1 year ago
+ hadoop fs -copyToLocal hdfs:///heritrix/output/warcs/dc3-20170515/BL-20170711153400659-09036-3963~crawler04.bl.uk~8446.warc.gz ./input-UHZQH5fYP7.bl.uk~8446.warc.gz
+ case $INPUT_PATH in
+ set +e
+ python mr_cdx_pywb_job.py --step-num=0 --mapper --cdx-endpoint http://cdx.api.wa.bl.uk/data-heritrix ./input-UHZQH5fYP7.bl.uk~8446.warc.gz hdfs:///heritrix/output/warcs/dc3-20170515/BL-20170711153400659-09036-3963~crawler04.bl.uk~8446.warc.gz
Traceback (most recent call last):
  File "mr_cdx_pywb_job.py", line 298, in <module>
    MRCDXIndexer.run()
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624781/attempt_202109081729_624781_m_000063_0/work/mrjob.zip/mrjob/job.py", line 616, in run
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624781/attempt_202109081729_624781_m_000063_0/work/mrjob.zip/mrjob/job.py", line 675, in execute
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624781/attempt_202109081729_624781_m_000063_0/work/mrjob.zip/mrjob/job.py", line 760, in run_mapper
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624781/attempt_202109081729_624781_m_000063_0/work/mrjob.zip/mrjob/job.py", line 826, in map_pairs
  File "mr_cdx_pywb_job.py", line 137, in mapper_raw
    cdx11.process_all()
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624781/attempt_202109081729_624781_m_000063_0/work/venv/lib/python3.6/site-packages/cdxj_indexer/main.py", line 214, in process_all
    super().process_all()
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624781/attempt_202109081729_624781_m_000063_0/work/venv/lib/python3.6/site-packages/warcio/indexer.py", line 33, in process_all
    self.process_one(fh, out, filename)
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624781/attempt_202109081729_624781_m_000063_0/work/venv/lib/python3.6/site-packages/cdxj_indexer/main.py", line 250, in process_one
    self.process_index_entry(it, record, filename, output)
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624781/attempt_202109081729_624781_m_000063_0/work/venv/lib/python3.6/site-packages/warcio/indexer.py", line 47, in process_index_entry
    value = self.get_field(record, field, it, filename)
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624781/attempt_202109081729_624781_m_000063_0/work/venv/lib/python3.6/site-packages/cdxj_indexer/main.py", line 324, in get_field
    value = value.split(":")[-1]
AttributeError: 'NoneType' object has no attribute 'split'
anjackson commented 1 year ago
+ hadoop fs -copyToLocal hdfs:///heritrix/output/warcs/dc0-20170515/BL-20170707071909562-07187-3719~crawler04.bl.uk~8443.warc.gz ./input-z1flw93ikq.bl.uk~8443.warc.gz
+ case $INPUT_PATH in
+ set +e
+ python mr_cdx_pywb_job.py --step-num=0 --mapper --cdx-endpoint http://cdx.api.wa.bl.uk/data-heritrix ./input-z1flw93ikq.bl.uk~8443.warc.gz hdfs:///heritrix/output/warcs/dc0-20170515/BL-20170707071909562-07187-3719~crawler04.bl.uk~8443.warc.gz
Traceback (most recent call last):
  File "mr_cdx_pywb_job.py", line 298, in <module>
    MRCDXIndexer.run()
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624806/attempt_202109081729_624806_m_000096_2/work/mrjob.zip/mrjob/job.py", line 616, in run
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624806/attempt_202109081729_624806_m_000096_2/work/mrjob.zip/mrjob/job.py", line 675, in execute
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624806/attempt_202109081729_624806_m_000096_2/work/mrjob.zip/mrjob/job.py", line 760, in run_mapper
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624806/attempt_202109081729_624806_m_000096_2/work/mrjob.zip/mrjob/job.py", line 826, in map_pairs
  File "mr_cdx_pywb_job.py", line 137, in mapper_raw
    cdx11.process_all()
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624806/attempt_202109081729_624806_m_000096_2/work/venv/lib/python3.6/site-packages/cdxj_indexer/main.py", line 214, in process_all
    super().process_all()
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624806/attempt_202109081729_624806_m_000096_2/work/venv/lib/python3.6/site-packages/warcio/indexer.py", line 33, in process_all
    self.process_one(fh, out, filename)
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624806/attempt_202109081729_624806_m_000096_2/work/venv/lib/python3.6/site-packages/cdxj_indexer/main.py", line 248, in process_one
    for record in wrap_it:
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624806/attempt_202109081729_624806_m_000096_2/work/venv/lib/python3.6/site-packages/cdxj_indexer/bufferiter.py", line 17, in buffering_record_iter
    for record in record_iter:
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624806/attempt_202109081729_624806_m_000096_2/work/venv/lib/python3.6/site-packages/warcio/archiveiterator.py", line 110, in _iterate_records
    self.record = self._next_record(self.next_line)
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624806/attempt_202109081729_624806_m_000096_2/work/venv/lib/python3.6/site-packages/warcio/archiveiterator.py", line 262, in _next_record
    self.check_digests)
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624806/attempt_202109081729_624806_m_000096_2/work/venv/lib/python3.6/site-packages/warcio/recordloader.py", line 143, in parse_record_stream
    http_headers = self.load_http_headers(rec_type, uri, stream, length)
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624806/attempt_202109081729_624806_m_000096_2/work/venv/lib/python3.6/site-packages/warcio/recordloader.py", line 184, in load_http_headers
    if not uri.startswith(self.HTTP_SCHEMES):
AttributeError: 'NoneType' object has no attribute 'startswith'

while reading input from hdfs:///heritrix/output/warcs/dc0-20170515/BL-20170707071909562-07187-3719~crawler04.bl.uk~8443.warc.gz
anjackson commented 1 year ago
+ hadoop fs -copyToLocal hdfs:///heritrix/output/warcs/dc3-20170515/BL-20170624223904791-07771-3963~crawler04.bl.uk~8446.warc.gz ./input-gZM8tQCDog.bl.uk~8446.warc.gz
+ case $INPUT_PATH in
+ set +e
+ python mr_cdx_pywb_job.py --step-num=0 --mapper --cdx-endpoint http://cdx.api.wa.bl.uk/data-heritrix ./input-gZM8tQCDog.bl.uk~8446.warc.gz hdfs:///heritrix/output/warcs/dc3-20170515/BL-20170624223904791-07771-3963~crawler04.bl.uk~8446.warc.gz
Traceback (most recent call last):
  File "mr_cdx_pywb_job.py", line 298, in <module>
    MRCDXIndexer.run()
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624827/attempt_202109081729_624827_m_000161_0/work/mrjob.zip/mrjob/job.py", line 616, in run
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624827/attempt_202109081729_624827_m_000161_0/work/mrjob.zip/mrjob/job.py", line 675, in execute
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624827/attempt_202109081729_624827_m_000161_0/work/mrjob.zip/mrjob/job.py", line 760, in run_mapper
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624827/attempt_202109081729_624827_m_000161_0/work/mrjob.zip/mrjob/job.py", line 826, in map_pairs
  File "mr_cdx_pywb_job.py", line 137, in mapper_raw
    cdx11.process_all()
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624827/attempt_202109081729_624827_m_000161_0/work/venv/lib/python3.6/site-packages/cdxj_indexer/main.py", line 214, in process_all
    super().process_all()
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624827/attempt_202109081729_624827_m_000161_0/work/venv/lib/python3.6/site-packages/warcio/indexer.py", line 33, in process_all
    self.process_one(fh, out, filename)
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624827/attempt_202109081729_624827_m_000161_0/work/venv/lib/python3.6/site-packages/cdxj_indexer/main.py", line 250, in process_one
    self.process_index_entry(it, record, filename, output)
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624827/attempt_202109081729_624827_m_000161_0/work/venv/lib/python3.6/site-packages/warcio/indexer.py", line 47, in process_index_entry
    value = self.get_field(record, field, it, filename)
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624827/attempt_202109081729_624827_m_000161_0/work/venv/lib/python3.6/site-packages/cdxj_indexer/main.py", line 324, in get_field
    value = value.split(":")[-1]
AttributeError: 'NoneType' object has no attribute 'split'

while reading input from hdfs:///heritrix/output/warcs/dc3-20170515/BL-20170624223904791-07771-3963~crawler04.bl.uk~8446.warc.gz
anjackson commented 1 year ago
+ hadoop fs -copyToLocal hdfs:///heritrix/output/warcs/dc3-20170515/BL-20170623074558463-07003-3963~crawler04.bl.uk~8446.warc.gz ./input-FnExuTNMDk.bl.uk~8446.warc.gz
+ case $INPUT_PATH in
+ set +e
+ python mr_cdx_pywb_job.py --step-num=0 --mapper --cdx-endpoint http://cdx.api.wa.bl.uk/data-heritrix ./input-FnExuTNMDk.bl.uk~8446.warc.gz hdfs:///heritrix/output/warcs/dc3-20170515/BL-20170623074558463-07003-3963~crawler04.bl.uk~8446.warc.gz
Traceback (most recent call last):
  File "mr_cdx_pywb_job.py", line 298, in <module>
    MRCDXIndexer.run()
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624864/attempt_202109081729_624864_m_000036_1/work/mrjob.zip/mrjob/job.py", line 616, in run
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624864/attempt_202109081729_624864_m_000036_1/work/mrjob.zip/mrjob/job.py", line 675, in execute
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624864/attempt_202109081729_624864_m_000036_1/work/mrjob.zip/mrjob/job.py", line 760, in run_mapper
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624864/attempt_202109081729_624864_m_000036_1/work/mrjob.zip/mrjob/job.py", line 826, in map_pairs
  File "mr_cdx_pywb_job.py", line 137, in mapper_raw
    cdx11.process_all()
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624864/attempt_202109081729_624864_m_000036_1/work/venv/lib/python3.6/site-packages/cdxj_indexer/main.py", line 214, in process_all
    super().process_all()
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624864/attempt_202109081729_624864_m_000036_1/work/venv/lib/python3.6/site-packages/warcio/indexer.py", line 33, in process_all
    self.process_one(fh, out, filename)
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624864/attempt_202109081729_624864_m_000036_1/work/venv/lib/python3.6/site-packages/cdxj_indexer/main.py", line 250, in process_one
    self.process_index_entry(it, record, filename, output)
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624864/attempt_202109081729_624864_m_000036_1/work/venv/lib/python3.6/site-packages/warcio/indexer.py", line 47, in process_index_entry
    value = self.get_field(record, field, it, filename)
  File "/mapred/local/dir/taskTracker/access/jobcache/job_202109081729_624864/attempt_202109081729_624864_m_000036_1/work/venv/lib/python3.6/site-packages/cdxj_indexer/main.py", line 324, in get_field
    value = value.split(":")[-1]
AttributeError: 'NoneType' object has no attribute 'split'

while reading input from hdfs:///heritrix/output/warcs/dc3-20170515/BL-20170623074558463-07003-3963~crawler04.bl.uk~8446.warc.gz
anjackson commented 1 year ago

Okay, this is still too many errors to handle manually. I'm creating ukwa/ukwa-manage:2.3.5, which catches the indexing exception and records it in TrackDB in a field called warc_cdx_indexing_exception_s (which is better than the current manually managed process for recording skipped WARCs). See https://github.com/ukwa/ukwa-manage/commit/f705e1c5aa7ba8399327c9f05e77132c576684fe
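
Roughly, the idea is something like the sketch below. This is not the code from that commit: `index_warc_safely`, the `index_fn` callback and the `id` key are hypothetical names used for illustration, and only the `warc_cdx_indexing_exception_s` field name comes from the actual change.

```python
def index_warc_safely(warc_path, index_fn):
    """Run index_fn(warc_path) and report a failure instead of raising.

    Returns a dict of fields to store against the WARC's tracking record.
    The exception field is only present when indexing failed.
    """
    record = {"id": warc_path}  # hypothetical record key
    try:
        index_fn(warc_path)
    except Exception as e:  # deliberately broad: any failure gets recorded
        record["warc_cdx_indexing_exception_s"] = f"{type(e).__name__}: {e}"
    return record
```

Storing the exception text as a string field means the failures can be faceted on later, rather than being tracked by hand.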

anjackson commented 1 year ago

As Alex pointed out on the IIPC Slack, this actually looks like a problem with the web hosting company rather than Heritrix, thankfully. For example, this fragment is a log of the crawl activity, appearing after the content:

</rss>
19/Sep/2017:03:35:41 +0000|v1|194.66.232.93|www.estiethirionphotography.co.za|200|1985|162.13.104.162:80|5.773|5.773|GET /2011/10/fransua-anne-louise-wedding/feed/ HTTP/1.0||
anjackson commented 1 year ago

Spotted a small error in 2.3.5 so creating 2.3.6.

anjackson commented 1 year ago

The system skips errors now, but we still need to improve the CDX indexer and re-process the marked WARCs at some point. For example, this query can be used to find the difficult cases:

http://solr8.api.wa.bl.uk/solr/tracking/select?facet.field=cdx_index_ss&facet.field=warc_cdx_indexing_exception_s&facet.field=warc_malformed_status_code_record_count_l&facet=on&q=kind_s%3Awarcs%20AND%20-collection_s%3Aselective&rows=0
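
For readability, here is the same query broken out into its parameters, as a small sketch using only the Python standard library. The function name `tracking_facet_url` is hypothetical. The generated URL encodes spaces as `+` rather than `%20`, which is equivalent in a query string:

```python
from urllib.parse import urlencode

SOLR_SELECT = "http://solr8.api.wa.bl.uk/solr/tracking/select"


def tracking_facet_url(base=SOLR_SELECT):
    """Build the facet-only query for WARCs outside the selective collection."""
    params = [
        # Facet on the index state and on the two problem-marker fields.
        ("facet.field", "cdx_index_ss"),
        ("facet.field", "warc_cdx_indexing_exception_s"),
        ("facet.field", "warc_malformed_status_code_record_count_l"),
        ("facet", "on"),
        ("q", "kind_s:warcs AND -collection_s:selective"),
        # rows=0: return only the facet counts, no documents.
        ("rows", "0"),
    ]
    return base + "?" + urlencode(params)
```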