lintool / warcbase

Warcbase is an open-source platform for managing and analyzing web archives
http://warcbase.org/

Prototype fluent Spark API for manipulating archive data #146

Closed lintool closed 8 years ago

lintool commented 9 years ago

I had in mind something like this:

WarcRecords.load("/path/to/warc")
 .keepMimeTypes(["text/html"])
 .keepDate("20130102")
 .keepUrl("http://foo.bar.baz/*") 

And we could have methods like the following:

Finally, we might have something like extractUrlAndBody, which would generate tuples of (URL, body).

This could be implemented by sub-classing RDD. The methods would essentially be syntactic sugar over filter, map, and standard Spark operations over RDDs.
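To make the "syntactic sugar over filter and map" idea concrete, here is a minimal sketch. It rests on several assumptions: the Record case class, the FluentSketch object, and keepUrlPrefix are hypothetical names; plain Scala collections stand in for RDDs so the snippet runs without Spark; and it uses an implicit class rather than sub-classing, purely to keep the example short. It is not the warcbase implementation.

```scala
// Hypothetical record type for illustration; real ARC/WARC records carry more fields.
final case class Record(url: String, crawlDate: String, mimeType: String, body: String)

object FluentSketch {
  // Adds fluent methods to Seq[Record]. With Spark, the same pattern could
  // wrap an RDD of records instead (assumption, not the project's actual code).
  implicit class RecordOps(records: Seq[Record]) {
    // Keep only records whose MIME type is in the given set (sugar over filter).
    def keepMimeTypes(types: Set[String]): Seq[Record] =
      records.filter(r => types.contains(r.mimeType))

    // Keep only records crawled on the given YYYYMMDD date.
    def keepDate(date: String): Seq[Record] =
      records.filter(_.crawlDate == date)

    // Keep only records whose URL starts with the given prefix
    // (a simplification of the wildcard pattern in the proposal).
    def keepUrlPrefix(prefix: String): Seq[Record] =
      records.filter(_.url.startsWith(prefix))

    // Project each record to a (URL, body) tuple (sugar over map).
    def extractUrlAndBody(): Seq[(String, String)] =
      records.map(r => (r.url, r.body))
  }
}
```

After `import FluentSketch._`, a chain such as `records.keepMimeTypes(Set("text/html")).keepDate("20130102").extractUrlAndBody()` reads much like the proposal above, and each step is an ordinary filter or map underneath.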

lintool commented 9 years ago

Per @ianmilligan1's suggestion, we should have extractCrawldateUrlBody in addition to extractUrlBody (removed the And)

ianmilligan1 commented 9 years ago

Also, we should make sure the full archival URL is present in the extracted plain text, as well as the domain. Perhaps the output for each record should look like:

(YYYYMMDD,greenparty.ca,http://greenparty.ca/exactURL,plaintext)

aliceranzhou commented 9 years ago

Added the domain and full archival URL, as well as the crawl date. Named the methods extractDomainUrlBody() and extractCrawldateDomainUrlBody() for clarity.

Also, re: @ianmilligan1, to run multi-line commands, one can run :paste in spark-shell. I've added this to the documentation under [this page](https://github.com/lintool/warcbase/wiki/Building-and-Running-Warcbase-Under-OS-X). Would it make more sense to move the Spark API information to a separate page?

lintool commented 9 years ago

@aliceranzhou instead of a proliferation of .extractFoo methods, how about something like .extract(["crawldate", "domain", "url", "body"])? The method should be smart about checking field names.
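One way such a field-checking extract could look is sketched below. Everything here is an assumption made for illustration: ArchiveRecord, ExtractSketch, and the accessor table are hypothetical, and a plain Seq stands in for an RDD so the snippet runs without Spark.

```scala
// Hypothetical record type with the four fields named in the comment above.
final case class ArchiveRecord(crawlDate: String, domain: String, url: String, body: String)

object ExtractSketch {
  // Table of recognised field names and the accessor each one maps to.
  private val accessors: Map[String, ArchiveRecord => String] = Map(
    "crawldate" -> ((r: ArchiveRecord) => r.crawlDate),
    "domain"    -> ((r: ArchiveRecord) => r.domain),
    "url"       -> ((r: ArchiveRecord) => r.url),
    "body"      -> ((r: ArchiveRecord) => r.body)
  )

  // Returns, for each record, the requested fields in the requested order.
  // Throws IllegalArgumentException if any field name is not recognised.
  def extract(records: Seq[ArchiveRecord], fields: Seq[String]): Seq[Seq[String]] = {
    val unknown = fields.filterNot(accessors.contains)
    require(unknown.isEmpty, s"Unknown field(s): ${unknown.mkString(", ")}")
    val getters = fields.map(accessors)
    records.map(r => getters.map(g => g(r)))
  }
}
```

Validating the names up front means a typo such as "crawl_date" fails immediately rather than partway through a job. The trade-off is that the result is a sequence of strings rather than a typed tuple.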

lintool commented 9 years ago

10/28 meeting notes

lintool commented 9 years ago

Breaking this down into separate issues for better organization:

Move org.warcbase.spark.matchbox.ArcRecords into package org.warcbase.spark.rdd to match Spark class hierarchy: http://spark.apache.org/docs/latest/api/scala/#org.apache.spark.rdd.RDD (we also want API documentation along those lines)

ianmilligan1 commented 9 years ago

I'm using warcbase on a server and running into error messages (that don't appear locally). When running this script:

import org.warcbase.spark.matchbox.ArcRecords
import org.warcbase.spark.matchbox.ArcRecords._

val r = ArcRecords.load("/mnt/vol1/data_sets/cpp_arcs", sc)
  .keepMimeTypes(Set("text/html"))
  .discardDate(null)
  .keepDomains(Set("greenparty.ca"))
  .extractUrlAndBody()
r.saveAsTextFile("/mnt/vol1/derivative_data/cpp.ndp/")

I receive the error:

<console>:28: error: value extractUrlAndBody is not a member of org.apache.spark.rdd.RDD[org.archive.io.arc.ARCRecord]
possible cause: maybe a semicolon is missing before `value extractUrlAndBody'?
                .extractUrlAndBody()
                 ^

All builds report that they completed successfully.

lintool commented 9 years ago

@ianmilligan1 I think the method got renamed to extractDomainUrlBody() https://github.com/lintool/warcbase/blob/master/src/main/scala/org/warcbase/spark/matchbox/ArcRecords.scala#L51

ianmilligan1 commented 9 years ago

:+1: Works like a charm - will update docs when I have a chance.

ianmilligan1 commented 9 years ago

Also, a reminder to self to document the difference between extractDomainUrlBody() and extractCrawldateDomainUrlBody(). Just running tests right now so I can provide example data.

lintool commented 9 years ago

I believe @aliceranzhou is going to refactor all of that into something like .extract(["domain", "body", ...]) so we don't need a method for each field.

aliceranzhou commented 9 years ago

Refactored the above method into something like .extract(["domain", "body", ...]), but with enums instead of strings.

Example: .extract(ToExtract.DOMAIN, ToExtract.CRAWLDATE)

Also, refactored the code a little. There's a RecordLoader that has loadArc() and loadWarc() methods, and both can be used as a WARecordRDD.

I've kept these changes under the matchbox branch, as they're still unstable and I'm trying to determine how to make scalatest play nice with maven.

lintool commented 9 years ago

Nice!

aliceranzhou commented 8 years ago

After offline discussion with @lintool, we've decided to remove the .extract() proliferations. Instead, use .map(r => (r.getDate, r.getDomain, r.getMimeType, r.getRawBodyContent))
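For illustration, the plain map-based style might look like the sketch below. The accessor names follow the comment above, but WebRecord and MapSketch are hypothetical stand-ins, and a Seq is used in place of an RDD so that the snippet runs without Spark.

```scala
// Hypothetical stand-in for the record type; only the accessor names come from the thread.
final case class WebRecord(
  getDate: String,
  getDomain: String,
  getMimeType: String,
  getRawBodyContent: String
)

object MapSketch {
  // Keep HTML records, then project each one to a (date, domain, body) tuple
  // using ordinary filter and map, with no dedicated extract method.
  def dateDomainBody(records: Seq[WebRecord]): Seq[(String, String, String)] =
    records
      .filter(_.getMimeType == "text/html")
      .map(r => (r.getDate, r.getDomain, r.getRawBodyContent))
}
```

The caller chooses the fields and their order directly in the lambda, so no per-combination method is needed.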

lintool commented 8 years ago

Yes, it'll just be up to us to write good documentation on what the API for manipulating records looks like.

And if we fix issue #160, we'll give the user an easy way to pluck individual records and play with them, e.g., extract links, named entities, etc.

lintool commented 8 years ago

@aliceranzhou has merged the initial implementation into master as part of commit 80605c8595012e33088e8f40d1d7b8cb4c078173

Closing this issue. For further API change requests, please open a new issue.