lintool / warcbase

Warcbase is an open-source platform for managing and analyzing web archives
http://warcbase.org/
loadWarc crashing on link extraction, while loadArc works #185

Closed ianmilligan1 closed 8 years ago

ianmilligan1 commented 8 years ago

I've been running tests extracting all links from a large collection. The following script works:

import org.warcbase.spark.matchbox.{ExtractLinks, RecordLoader}
import org.warcbase.spark.rdd.RecordRDD._

RecordLoader.loadArc("/mnt/vol1/data_sets/cpp_arcs", sc)
  .keepValidPages()
  .map(r => (r.getCrawldate, ExtractLinks(r.getUrl, r.getContentString)))
  .flatMap(r => r._2.map(f => (r._1, f._1.replaceAll("^.*www\\.", ""), f._2.replaceAll("^.*www\\.", ""))))
  .filter(r => r._2 != null && r._3 != null)
  .countItems()
  .filter(r => r._2 > 10)
  .saveAsTextFile("/mnt/vol1/derivative_data/cpp_arc_links")

However, this script crashes:

import org.warcbase.spark.matchbox.{ExtractLinks, RecordLoader}
import org.warcbase.spark.rdd.RecordRDD._

RecordLoader.loadWarc("/mnt/vol1/data_sets/cpp_warcs", sc)
  .keepValidPages()
  .map(r => (r.getCrawldate, ExtractLinks(r.getUrl, r.getContentString)))
  .flatMap(r => r._2.map(f => (r._1, f._1.replaceAll("^.*www\\.", ""), f._2.replaceAll("^.*www\\.", ""))))
  .filter(r => r._2 != null && r._3 != null)
  .countItems()
  .filter(r => r._2 > 10)
  .saveAsTextFile("/mnt/vol1/derivative_data/cpp_warc_links")

The error dump is:

ERROR WarcRecordUtils - Read 0 bytes but expected 146 bytes. Continuing...
ERROR WarcRecordUtils - Read 0 bytes but expected 6796 bytes. Continuing...
ERROR WarcRecordUtils - Read 0 bytes but expected 16039 bytes. Continuing...
ERROR WarcRecordUtils - Read 0 bytes but expected 15537 bytes. Continuing...
ERROR WarcRecordUtils - Read 0 bytes but expected 33977 bytes. Continuing...
ERROR WarcRecordUtils - Read 0 bytes but expected 32349 bytes. Continuing...
ERROR WarcRecordUtils - Read 0 bytes but expected 115270 bytes. Continuing...
ERROR WarcRecordUtils - Read 0 bytes but expected 387 bytes. Continuing...
ERROR WarcRecordUtils - Read 0 bytes but expected 146 bytes. Continuing...
ERROR WarcRecordUtils - Read 0 bytes but expected 760 bytes. Continuing...
ERROR WarcRecordUtils - Read 0 bytes but expected 16480 bytes. Continuing...
ERROR WarcRecordUtils - Read 0 bytes but expected 30205 bytes. Continuing...
ERROR WarcRecordUtils - Read 0 bytes but expected 15537 bytes. Continuing...

The job produces an empty part-00000 file. Any ideas?

ianmilligan1 commented 8 years ago

Actually, I'm not convinced that loadArc is working either. @aliceranzhou, can you share the scripts that you used for this? Thanks. I may have missed an API update.

aliceranzhou commented 8 years ago

Hmm, looking.

ianmilligan1 commented 8 years ago

Thanks! This is on the York Rho server, BTW.

aliceranzhou commented 8 years ago

Thanks. I've reproduced the WARC error. It seems to be an issue with getContentString on the WARCRecord. Still looking...

For ARC, I forgot to use the domain instead of the full URL. Perhaps there are fewer than 10 links from any given URL per day. I've updated the script here https://github.com/lintool/warcbase/wiki/Spark:-Analysis-of-Site-Link-Structure to use domains instead, and also lowered the count filter to 5.
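To illustrate why counting by domain rather than by full URL matters for the count filter, here is a minimal sketch in plain Scala collections. It does not use the warcbase API: the object `DomainLinkCounts`, the helpers `domainOf` and `countLinks`, and the sample URLs are all hypothetical names made up for this illustration, and the threshold is simply a parameter.

```scala
import java.net.URI
import scala.util.Try

// Hypothetical helper object for illustration only; not part of warcbase.
object DomainLinkCounts {
  // Reduce a URL to its host and drop a leading "www.".
  // Returns None when the URL cannot be parsed or has no host.
  def domainOf(url: String): Option[String] =
    Try(new URI(url).getHost).toOption
      .flatMap(Option(_))
      .map(_.toLowerCase.stripPrefix("www."))

  // Count (date, sourceDomain, targetDomain) triples and keep only
  // those that occur more than minCount times.
  def countLinks(links: Seq[(String, String, String)],
                 minCount: Int): Map[(String, String, String), Int] =
    links
      .flatMap { case (date, src, dst) =>
        for (s <- domainOf(src); d <- domainOf(dst)) yield (date, s, d)
      }
      .groupBy(identity)
      .map { case (key, hits) => key -> hits.size }
      .filter { case (_, n) => n > minCount }

  // Made-up sample: three different pages on one site link to another site.
  val sample: Seq[(String, String, String)] = Seq(
    ("20151203", "http://www.a.example/page1", "http://b.example/x"),
    ("20151203", "http://a.example/page2", "http://www.b.example/y"),
    ("20151203", "http://a.example/page3", "http://b.example/z"),
    ("20151203", "http://a.example/page3", "http://c.example/")
  )

  def main(args: Array[String]): Unit =
    countLinks(sample, 2).foreach(println)
}
```

Keyed on full URLs, every triple in the sample would be distinct and none would pass a threshold above 1; keyed on domains, the three links between the first two sites collapse into a single key that does pass.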

ianmilligan1 commented 8 years ago

Thanks, @aliceranzhou. Glad to hear you could reproduce it.

ianmilligan1 commented 8 years ago

Re-running on the ARC dataset, I'll keep you posted.

aliceranzhou commented 8 years ago

Great, thanks!

I think I found the issue. Try adding import org.warcbase.spark.matchbox.RecordTransformers._ to the imports.

aliceranzhou commented 8 years ago

Actually, it needs to be included in the RecordLoader. I'll push a fix.

aliceranzhou commented 8 years ago

^ That wasn't the issue. The record content can't be read twice, so I changed the functions to variables that are set only once.
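The general pattern can be sketched as follows. This is not the warcbase code: `ReadOnceDemo`, `RereadingRecord`, `CachedRecord`, and `readAll` are hypothetical names, and a `ByteArrayInputStream` stands in for a record's underlying stream. A second read from a drained stream returning nothing would be consistent with the "Read 0 bytes but expected N bytes" messages above, though that link is an assumption here.

```scala
import java.io.{ByteArrayInputStream, InputStream}

// Hypothetical illustration only; none of these names come from warcbase.
object ReadOnceDemo {
  // Drain a stream to a byte array; a drained stream yields an empty array.
  def readAll(in: InputStream): Array[Byte] =
    Iterator.continually(in.read()).takeWhile(_ != -1).map(_.toByte).toArray

  // Problematic shape: `def` re-reads the same stream on every call.
  class RereadingRecord(in: InputStream) {
    def content: Array[Byte] = readAll(in)
  }

  // Fixed shape: `val` drains the stream once and keeps the bytes.
  class CachedRecord(in: InputStream) {
    val content: Array[Byte] = readAll(in)
  }

  val payload: Array[Byte] =
    "<a href=\"http://example.com/\">link</a>".getBytes("UTF-8")

  val rereading = new RereadingRecord(new ByteArrayInputStream(payload))
  val firstRead: Int = rereading.content.length  // whole payload
  val secondRead: Int = rereading.content.length // stream already drained

  val cached = new CachedRecord(new ByteArrayInputStream(payload))
  val cachedFirst: Int = cached.content.length
  val cachedSecond: Int = cached.content.length  // same cached bytes

  def main(args: Array[String]): Unit =
    println(s"def: $firstRead then $secondRead; val: $cachedFirst then $cachedSecond")
}
```

With `def`, any pipeline step that touches the content after an earlier step has already consumed it sees an empty payload; with `val`, every access sees the bytes captured at construction, at the cost of holding them in memory.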

Merged to master now, please let me know if there are any issues.

ianmilligan1 commented 8 years ago

OK, testing now. Will let you know in a few hours!

lintool commented 8 years ago

Closing... Ian, please re-open if you still find bugs...

ianmilligan1 commented 8 years ago

Should have responded earlier: it is working perfectly. FYI, we've uploaded the derivative dataset here.