Closed ianmilligan1 closed 7 years ago
Unfortunately the most recent commits that fixed #244 don't fix this error.
I believe this has been fixed.
Here's the modified script:
```scala
import org.warcbase.spark.matchbox._
import org.warcbase.spark.rdd.RecordRDD._
import org.warcbase.spark.matchbox.{RemoveHTML, RecordLoader, ExtractBoilerpipeText}
import StringUtils._

RecordLoader.loadArchives("i2millig-UVIC-broken-warc", sc)
  .keepValidPages()
  .map(r => (r.getCrawlDate, ExtractLinks(r.getUrl, r.getContentString)))
  .flatMap(r => r._2.map(f => (r._1, ExtractDomain(f._1).removePrefixWWW(), ExtractDomain(f._2).removePrefixWWW())))
  .filter(r => r._2 != "" && r._3 != "")
  .countItems()
  .filter(r => r._2 > 5)
  .saveAsTextFile("test-link-extraction")
```
Note the use of `ExtractDomain(f._1).removePrefixWWW()`, which requires `import StringUtils._`; `removePrefixWWW` does a bit more error checking than a raw regular expression.
Sounds great! Will give it a test right now.
@lintool should I update the docs with this, or is this a use case only relevant if we receive this error dump?
New script works, too, btw! Thanks.
@ianmilligan1 Please update the documentation to use `.removePrefixWWW()` instead of raw regular expressions.

For the record, here's what happened: `ExtractDomain` returns `null` if there's an error (garbage in, garbage out), and trying to apply a regexp to it causes an NPE. The `removePrefixWWW` method does some error checking and passes along `null` if the input is `null`.
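To make the failure mode concrete, here is a minimal standalone sketch (hypothetical code, not warcbase's actual implementation) contrasting a raw regex substitution, which throws an NPE on `null` input, with a null-checked helper in the spirit of `removePrefixWWW`:

```scala
object RemovePrefixDemo {
  // Naive approach: calling replaceAll on a null String throws a
  // NullPointerException, which is exactly the crash seen in link extraction.
  def stripNaive(domain: String): String =
    domain.replaceAll("^\\s*www\\.", "")

  // Null-checked approach: pass null through instead of blowing up,
  // so downstream filters can discard the bad record.
  def stripSafe(domain: String): String =
    if (domain == null) null
    else domain.replaceAll("^\\s*www\\.", "")

  def main(args: Array[String]): Unit = {
    assert(stripSafe("www.example.com") == "example.com")
    assert(stripSafe(null) == null) // no NPE on garbage input
    try {
      stripNaive(null)
      assert(false, "expected an NPE")
    } catch {
      case _: NullPointerException => () // the bug this issue reported
    }
    println("ok")
  }
}
```

The design point is simply garbage in, garbage out: a `null` domain flows through as `null` and gets dropped by the `filter(r => r._2 != "" && r._3 != "")` step rather than killing the job.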
Closing issue.
As we continue to run large numbers of WARCs through warcbase, a new error has appeared on two collections our project received from the University of Victoria. The smaller collection is 17 GB, so I'm happy to `scp` it over to camalon if you're interested @lintool.

Script that we are running:
Can generate plain text and URLs, but fails on link extraction with a `java.lang.NullPointerException`.
Short error trace here:
Full log here.