nla / outbackcdx

Web archive index server based on RocksDB
Apache License 2.0
ArrayIndexOutOfBoundsException, NumberFormatException when loading indexed ARC-files #121

Closed thomasegense closed 10 months ago

thomasegense commented 11 months ago

I have indexed 6M ARC/WARC files with jwarc into OutbackCDX v0.11.1, using the CDX11 format and absolute paths to the ARC/WARC files.

Here are some errors from the OutbackCDX log. I think they come from indexing rather than playback, and they all relate to ARC files. This is a sample in which I tried to include the different error types. Tell me if you need to see the ARC headers or ARC files for any of the errors. Please review them and see whether this is indeed a bug in OutbackCDX or expected behavior.

Since jwarc could parse the ARC headers, validation of the crawl time format should already have passed.

From OutbackCDX server log:

skipping bad cdx line: ErrorDocument%20403%20http://www.imagasin.no/feil.html
java.lang.ArrayIndexOutOfBoundsException
Jul 28, 2023 6:42:12 AM outbackcdx.Capture parseCdxTimestamp
WARNING: Padding timestamp shorter then 14 chars: -
skipping bad cdx line: ErrorDocument%20402%20http://www.imagasin.no/feil.html - 414 50573085 /netarkivet/005e/fildir/3-3-20050616090703-00172-kb-prod-har-002.kb.dk.arc.gz
java.lang.NumberFormatException: For input string: "/netarkivet/005e/fildir/3-3-20050616090703-00172-kb-prod-har-002.kb.dk.arc.gz"
    at java.base/java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
    at java.base/java.lang.Integer.parseInt(Integer.java:638)
    at java.base/java.lang.Integer.parseInt(Integer.java:770)
    at outbackcdx.Capture.fromCdxLine(Capture.java:416)
    at outbackcdx.Webapp.post(Webapp.java:338)
    at outbackcdx.Webapp.lambda$new$3(Webapp.java:101)
    at outbackcdx.Web$Route.handle(Web.java:309)
    at outbackcdx.Web$Router.handle(Web.java:233)
    at outbackcdx.Webapp.handle(Webapp.java:674)
    at outbackcdx.Web$Server.serve(Web.java:47)
    at outbackcdx.NanoHTTPD$HTTPSession.execute(NanoHTTPD.java:849)
    at outbackcdx.NanoHTTPD$1$1.run(NanoHTTPD.java:208)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)
skipping bad cdx line: no,imagasin)/feil.html%0derrordocument%20403%20http:/www.imagasin.no/feil.html%0derrordocument%20402%20http:/www.imagasin.no/feil.html 20050616095032 http://www.imagasin.no/feil.html%0DErrorDocument%20403%20http://www.imagasin.no/feil.html%0DErrorDocument%20402%20http://www.imagasin.

Sat Jul 29 14:20:20 CEST 2023 172.16.206.19 200 3340076 0.365s POST /index?badLines=skip
skipping bad cdx line: org,progressiveportal)/robots.txt 20061118181455 http://www.progressiveportal.org/robots.txt text/html 302 - http://www.progressiveportal.org/errors/404.html
java.lang.ArrayIndexOutOfBoundsException
skipping bad cdx line:
java.lang.ArrayIndexOutOfBoundsException
skipping bad cdx line: IndexIgnore%20.htaccess%20/.??%20~%20#%20/HEADER%20/README%20/_vti
java.lang.ArrayIndexOutOfBoundsException
skipping bad cdx line:
java.lang.ArrayIndexOutOfBoundsException
skipping bad cdx line: <Limit%20GET%20POST>

Jul 29, 2023 2:20:22 PM outbackcdx.Capture parseCdxTimestamp
WARNING: Padding timestamp shorter then 14 chars: -
skipping bad cdx line: AuthGroupFile%20/www/htdocs/domains/s5/00861/www.progressiveportal.com/webdocs/_vti_pvt/service.grp - 958 14069161 /netarkivet/006f/fildir/8524-14-20061118181055-00003-kb-prod-har-001.kb.dk.arc.gz
java.lang.NumberFormatException: For input string: "/netarkivet/006f/fildir/8524-14-20061118181055-00003-kb-prod-har-001.kb.dk.arc.gz"
    at java.base/java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
    at java.base/java.lang.Integer.parseInt(Integer.java:638)
    at java.base/java.lang.Integer.parseInt(Integer.java:770)
    at outbackcdx.Capture.fromCdxLine(Capture.java:416)
    at outbackcdx.Webapp.post(Webapp.java:338)
    at outbackcdx.Webapp.lambda$new$3(Webapp.java:101)
    at outbackcdx.Web$Route.handle(Web.java:309)
    at outbackcdx.Web$Router.handle(Web.java:233)
    at outbackcdx.Webapp.handle(Webapp.java:674)
    at outbackcdx.Web$Server.serve(Web.java:47)
    at outbackcdx.NanoHTTPD$HTTPSession.execute(NanoHTTPD.java:849)
    at outbackcdx.NanoHTTPD$1$1.run(NanoHTTPD.java:208)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)

Jul 30, 2023 5:31:08 AM outbackcdx.Capture parseCdxTimestamp
WARNING: Padding timestamp shorter then 14 chars: -
skipping bad cdx line: RewriteRule%20(.*)%20/showpic.php?pic=$1 - 534 52316751 /netarkivet/006f/fildir/63562-102-20091028090527-00454-sb-prod-har-004.arc.gz
java.lang.NumberFormatException: For input string: "/netarkivet/006f/fildir/63562-102-20091028090527-00454-sb-prod-har-004.arc.gz"
    at java.base/java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
    at java.base/java.lang.Integer.parseInt(Integer.java:638)
    at java.base/java.lang.Integer.parseInt(Integer.java:770)
    at outbackcdx.Capture.fromCdxLine(Capture.java:416)
    at outbackcdx.Webapp.post(Webapp.java:338)
    at outbackcdx.Webapp.lambda$new$3(Webapp.java:101)
    at outbackcdx.Web$Route.handle(Web.java:309)
    at outbackcdx.Web$Router.handle(Web.java:233)
    at outbackcdx.Webapp.handle(Webapp.java:674)
    at outbackcdx.Web$Server.serve(Web.java:47)
    at outbackcdx.NanoHTTPD$HTTPSession.execute(NanoHTTPD.java:849)
    at outbackcdx.NanoHTTPD$1$1.run(NanoHTTPD.java:208)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)

skipping bad cdx line: com,nabiki)/miss_led/rk/kenshin11.jpg 20110318062229 http://www.nabiki.com/miss_led/rk/kenshin11.jpg application/octet-stream 302 - /images/nolinks.jpg
java.lang.ArrayIndexOutOfBoundsException
skipping bad cdx line: ^L
java.lang.ArrayIndexOutOfBoundsException
Jul 31, 2023 11:36:19 AM outbackcdx.Capture parseCdxTimestamp
WARNING: Padding timestamp shorter then 14 chars: -
skipping bad cdx line: ^L - 205 157829 /netarkivet/013i/fildir/112447-30-20110318061755-00755-sb-prod-har-006.arc.gz
java.lang.NumberFormatException: For input string: "/netarkivet/013i/fildir/112447-30-20110318061755-00755-sb-prod-har-006.arc.gz"
    at java.base/java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
    at java.base/java.lang.Integer.parseInt(Integer.java:638)
    at java.base/java.lang.Integer.parseInt(Integer.java:770)
    at outbackcdx.Capture.fromCdxLine(Capture.java:416)
    at outbackcdx.Webapp.post(Webapp.java:338)
    at outbackcdx.Webapp.lambda$new$3(Webapp.java:101)
    at outbackcdx.Web$Route.handle(Web.java:309)
    at outbackcdx.Web$Router.handle(Web.java:233)
    at outbackcdx.Webapp.handle(Webapp.java:674)
    at outbackcdx.Web$Server.serve(Web.java:47)
    at outbackcdx.NanoHTTPD$HTTPSession.execute(NanoHTTPD.java:849)
    at outbackcdx.NanoHTTPD$1$1.run(NanoHTTPD.java:208)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)
Mon Jul 31 11:36:19 CEST 2023 POST /index?badLines=skip Added 96 record

skipping bad cdx line: com,instagram,graph)/logging_client_events?__wb_method=post&access_token=936619743392459|3cdb3f896252a1db29679cb4554db266&message={"app_uid":"56974635280","app_id":"936619743392459" ,"app_ver":"1.0.0","data":[{"time":1682502720.792,"name":"instagram_web_media_impressions","extra":{"ig_userid":56974635280,"pk":56974635280,"rollout_hash":"1007379527","frontend_env":"c3","app_id":"93661 9743392459","original_referrer":null,"original_referrer_domain":"","referrer":null,"referrer_domain":"www.instagram.com","url":"/regeringdk/","nav_chain":"polarisprofileroot:profilepage:1:via_cold_start,p olarispostmodal:postpage:7:modallink","media_id":"2408315599488034008","media_type":"video","owner_id":"9272148702","surface":"profile"},"obj_type":"url","obj_id":"/p/cfsdkcmhcty/"},{"time":1682502720.815 ,"name":"video_should_start","extra":{ java.lang.ArrayIndexOutOfBoundsException

skipping bad cdx line: com,worldofkaos)/robots.txt 20100205164630 http://www.worldofkaos.com/robots.txt text/html 302 - http://www.worldofkaos.com/404.html
java.lang.ArrayIndexOutOfBoundsException
skipping bad cdx line:
java.lang.ArrayIndexOutOfBoundsException
skipping bad cdx line:
java.lang.ArrayIndexOutOfBoundsException
skipping bad cdx line: RewriteEngine%20On
java.lang.ArrayIndexOutOfBoundsException
skipping bad cdx line:
java.lang.ArrayIndexOutOfBoundsException
skipping bad cdx line: RewriteCond%20%{REQUEST_FILENAME}%20.jpg$|.gif$|.*png$%20[NC]
java.lang.ArrayIndexOutOfBoundsException
skipping bad cdx line: RewriteCond%20%{HTTP_REFERER}%20!^$%20
java.lang.ArrayIndexOutOfBoundsException
skipping bad cdx line: RewriteCond%20%{HTTP_REFERER}%20!worldofkaos.com%20[NC]%20
java.lang.ArrayIndexOutOfBoundsException
skipping bad cdx line: RewriteCond%20%{HTTP_REFERER}%20!livejournal.com%20[NC]%20%20
java.lang.ArrayIndexOutOfBoundsException
skipping bad cdx line: RewriteCond%20%{HTTP_REFERER}%20!deepee.dk%20[NC]
java.lang.ArrayIndexOutOfBoundsException
skipping bad cdx line: RewriteCond%20%{HTTP_REFERER}%20!johnny-depp.org%20[NC]%20%20
java.lang.ArrayIndexOutOfBoundsException
skipping bad cdx line: RewriteCond%20%{HTTP_REFERER}%20!google.%20[NC]%20
java.lang.ArrayIndexOutOfBoundsException
skipping bad cdx line: RewriteCond%20%{HTTP_REFERER}%20!search\?q=cache%20[NC]
java.lang.ArrayIndexOutOfBoundsException
skipping bad cdx line:
java.lang.ArrayIndexOutOfBoundsException

skipping bad cdx line: dk,gratis6)/redirect.php?qsgalleryurl=5rdq5jyab3phndwf2izcmacg3gafcozznrfnii7hlijuimvcrudubazzj+w9rl5xkaqbql6fua9bdlogjmzupnc= 20061209045946 http://gratis6.dk/redirect.php?qsGalleryURL=5RdQ5jyAB3PHnDWf2izcMacG3GaFcOzZnRfnII7hLiJuimVCrUDuBAZZJ+W9RL5xKaqbql6FuA9BDlOGjMZUpNc= text/html 302 - http://www.schoolforswingers.com/granny/083¼Ü]
java.lang.ArrayIndexOutOfBoundsException
Aug 25, 2023 9:14:34 PM outbackcdx.Capture parseCdxTimestamp
WARNING: Padding timestamp shorter then 14 chars: -
skipping bad cdx line: I*o^Y\<U+008D><U+008F>È­gûÞH<U+0093>; ü - 381 2763415 /netarkivet/007g/fildir/8462-31-20061209045922-00633-sb-prod-har-001.statsbiblioteket.dk.arc.gz
java.lang.NumberFormatException: For input string: "/netarkivet/007g/fildir/8462-31-20061209045922-00633-sb-prod-har-001.statsbiblioteket.dk.arc.gz"
    at java.base/java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
    at java.base/java.lang.Integer.parseInt(Integer.java:638)
    at java.base/java.lang.Integer.parseInt(Integer.java:770)
    at outbackcdx.Capture.fromCdxLine(Capture.java:416)

ato commented 10 months ago

This looks to me like newlines in some of the fields are corrupting the field boundaries in the CDX records. I think OutbackCDX is therefore doing the right thing in rejecting them as bogus.
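
To illustrate the failure mode, here is a minimal sketch. It is not OutbackCDX's actual parser: the class name `CdxFieldDemo`, the method `splitRecords`, and the example.com record are hypothetical. It only assumes the usual CDX11 layout of 11 space-separated fields per line, with the status code as the fifth field. A raw newline inside the redirect field turns one record into two short lines: the first runs out of fields, and the second has the filename sitting where the status code is expected, which is consistent with the ArrayIndexOutOfBoundsException and NumberFormatException pairs in the log.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class CdxFieldDemo {
    // Split a POSTed payload into lines, then each line into space-separated fields.
    // Hypothetical helper for illustration only, not OutbackCDX code.
    static List<String[]> splitRecords(String payload) {
        return Arrays.stream(payload.split("\n"))
                .map(line -> line.split(" "))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // One logical CDX11 record whose redirect field contains a raw newline.
        String record = "com,example)/robots.txt 20100205164630 http://example.com/robots.txt "
                + "text/html 302 - http://example.com/404.html\nRewriteEngine%20On "
                + "- 534 52316751 /data/example.arc.gz";

        for (String[] fields : splitRecords(record)) {
            System.out.println(fields.length + " fields: " + Arrays.toString(fields));
            if (fields.length > 4) {
                try {
                    // The fifth field should be the HTTP status code.
                    Integer.parseInt(fields[4]);
                } catch (NumberFormatException e) {
                    System.out.println("status field is not a number: " + fields[4]);
                }
            }
        }
    }
}
```

Neither fragment has the 11 fields a CDX11 line needs, so skipping both is the only safe option for the server.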

I've changed jwarc in v0.28.4 to percent-encode spaces, newlines and nulls in all string fields, which should at least keep the CDX field boundaries intact.
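
As a rough sketch of the idea (this is not the actual jwarc code; the class `CdxEscapeDemo` and method `escapeField` are hypothetical names, and treating carriage returns as newlines is an assumption), escaping a field before writing it could look like this:

```java
public class CdxEscapeDemo {
    // Percent-encode the characters that would break space/newline-delimited CDX fields.
    // Hypothetical helper for illustration; '\r' handling is an assumption.
    static String escapeField(String value) {
        StringBuilder sb = new StringBuilder(value.length());
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            switch (c) {
                case ' ':  sb.append("%20"); break;
                case '\n': sb.append("%0A"); break;
                case '\r': sb.append("%0D"); break;
                case '\0': sb.append("%00"); break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // A redirect value with an embedded newline and space, as in the log above.
        String redirect = "http://example.com/404.html\nRewriteEngine On";
        System.out.println(escapeField(redirect));
    }
}
```

With every field escaped this way, a record always stays on one line with the expected number of space-separated fields, even when a header value contains stray file contents.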

thomasegense commented 10 months ago

Thanks, I will upgrade jwarc. I hope I will also have time to do some testing with the ARC files that triggered these errors.