Prototype fluent Spark API for manipulating archive data

lintool commented 9 years ago

I had in mind something like this:

WarcRecords.load("/path/to/warc")
 .keepMimeTypes(["text/html"])
 .keepDate("20130102")
 .keepUrl("http://foo.bar.baz/*")

And we could have methods like the following:

.discardMimeTypes()
.discardDate()

Finally, we might have something like extractUrlAndBody, which would generate tuples of (URL, body).

This could be implemented by sub-classing RDD. The methods would essentially be syntactic sugar over filter, map, and standard Spark operations over RDDs.
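To make the "syntactic sugar over filter and map" idea concrete, here is a minimal sketch. It is not the warcbase implementation: `Record` is a hypothetical stand-in for an archive record, a plain `Seq` stands in for an RDD so that the example runs without Spark, and the wildcard handling in `keepUrl` is a simplifying assumption (trailing `*` only). It also uses an implicit wrapper class rather than sub-classing.

```scala
object FluentSketch {
  // Hypothetical stand-in for an archive record; the real record type is not shown in this thread.
  final case class Record(crawlDate: String, mimeType: String, url: String, body: String)

  // Seq stands in for RDD so the sketch runs without Spark;
  // filter and map have the same shape on an RDD.
  implicit class RecordOps(val records: Seq[Record]) {
    def keepMimeTypes(types: Set[String]): Seq[Record] =
      records.filter(r => types.contains(r.mimeType))

    def discardMimeTypes(types: Set[String]): Seq[Record] =
      records.filterNot(r => types.contains(r.mimeType))

    def keepDate(date: String): Seq[Record] =
      records.filter(_.crawlDate == date)

    def discardDate(date: String): Seq[Record] =
      records.filterNot(_.crawlDate == date)

    // Assumption: only a trailing "*" is treated as a wildcard (prefix match).
    def keepUrl(pattern: String): Seq[Record] =
      if (pattern.endsWith("*")) records.filter(_.url.startsWith(pattern.dropRight(1)))
      else records.filter(_.url == pattern)

    def extractUrlAndBody(): Seq[(String, String)] =
      records.map(r => (r.url, r.body))
  }

  // Made-up sample data for illustration only.
  val sample: Seq[Record] = Seq(
    Record("20130102", "text/html", "http://foo.bar.baz/index.html", "<html>hi</html>"),
    Record("20130102", "image/png", "http://foo.bar.baz/logo.png", ""),
    Record("20130103", "text/html", "http://foo.bar.baz/about.html", "<html>about</html>")
  )

  // Each call is just a filter or a map, so the chain reads like the proposal above.
  val result: Seq[(String, String)] =
    sample
      .keepMimeTypes(Set("text/html"))
      .keepDate("20130102")
      .keepUrl("http://foo.bar.baz/*")
      .extractUrlAndBody()
}
```

An implicit wrapper keeps every intermediate value an ordinary collection, so the standard operations remain available between the sugar methods.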

lintool commented 9 years ago

Per @ianmilligan1's suggestion, we should have extractCrawldateUrlBody in addition to extractUrlBody (removed the And)

ianmilligan1 commented 9 years ago

Also, we should make sure the full archival URL is present in the extracted plain text, as well as the domain. Perhaps the output for each record should look like this:

(YYYYMMDD,greenparty.ca,http://greenparty.ca/exactURL,plaintext)
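A rough sketch of how a record in that shape could be produced. The `domainOf` helper and its rule (take the URL's host and drop a leading `www.`) are assumptions made for illustration, not something decided in this thread. A Scala tuple's `toString` already yields the parenthesised, comma-separated form, which is what Spark's `saveAsTextFile` writes for each element.

```scala
import java.net.URI

object OutputSketch {
  // Hypothetical helper: host of the URL with a leading "www." removed.
  def domainOf(url: String): String =
    Option(new URI(url).getHost).getOrElse("").stripPrefix("www.")

  // Builds the (crawl date, domain, full URL, plain text) tuple for one record.
  def toTuple(crawlDate: String, url: String, plainText: String): (String, String, String, String) =
    (crawlDate, domainOf(url), url, plainText)

  // Tuple4.toString gives "(a,b,c,d)".
  val line: String =
    toTuple("20130102", "http://greenparty.ca/exactURL", "plaintext").toString
}
```

Note that commas inside the plain text would make the line ambiguous to parse later, so a real output format may need a different delimiter.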

aliceranzhou commented 9 years ago

Added domain and full archival url, as well as crawl date. Named methods as extractDomainUrlBody() and extractCrawldateDomainUrlBody() for clarity.

Also, re: @ianmilligan1: to run multi-line commands, one can use :paste in spark-shell. I've added this to the documentation on [this page](https://github.com/lintool/warcbase/wiki/Building-and-Running-Warcbase-Under-OS-X). Would it make more sense to move the Spark API information to a separate page?

lintool commented 9 years ago

@aliceranzhou instead of a proliferation of .extractFoo methods, how about something like .extract(["crawldate", "domain", "url", "body"])? The method should be smart about checking field names.
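One way the name-checked `.extract(...)` could look, as a sketch only. `Record`, the `accessors` table, and the choice to return a `Seq[String]` per record (rather than a tuple, whose arity would have to be fixed at compile time) are all assumptions; a `Seq` again stands in for an RDD.

```scala
object ExtractSketch {
  // Hypothetical record type with the four fields named in the proposal.
  final case class Record(crawlDate: String, domain: String, url: String, body: String)

  // Hypothetical lookup table from field name to accessor.
  private val accessors: Map[String, Record => String] = Map(
    "crawldate" -> ((r: Record) => r.crawlDate),
    "domain"    -> ((r: Record) => r.domain),
    "url"       -> ((r: Record) => r.url),
    "body"      -> ((r: Record) => r.body)
  )

  def extract(records: Seq[Record], fields: Seq[String]): Seq[Seq[String]] = {
    // Validate all names up front, before touching any record.
    val unknown = fields.filterNot(accessors.contains)
    require(unknown.isEmpty, s"Unknown field(s): ${unknown.mkString(", ")}")
    val getters = fields.map(accessors)
    records.map(r => getters.map(g => g(r)))
  }
}
```

With this shape, a misspelled field name fails with an `IllegalArgumentException` when `extract` is called, not partway through a job.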

lintool commented 9 years ago

10/28 meeting notes

do something like new list... to tuple for API above
add basic test cases
add scala doc
converting each Pig script into equivalent Spark script
extend API to WARC
run sanity test on Trantor to make sure it works at scale
convert Pig UDFs into Scala matchbox

lintool commented 9 years ago

Breaking this down into separate issues for better organization:

#149 Port Pig test cases to Spark
#150 Port Pig UDFs over to Spark

Move org.warcbase.spark.matchbox.ArcRecords into package org.warcbase.spark.rdd to match Spark class hierarchy: http://spark.apache.org/docs/latest/api/scala/#org.apache.spark.rdd.RDD (we also want API documentation along those lines)

ianmilligan1 commented 9 years ago

I'm using warcbase on a server and running into error messages (that don't appear locally). When running this script:

import org.warcbase.spark.matchbox.ArcRecords
import org.warcbase.spark.matchbox.ArcRecords._

val r = ArcRecords.load("/mnt/vol1/data_sets/cpp_arcs", sc)
  .keepMimeTypes(Set("text/html"))
  .discardDate(null)
  .keepDomains(Set("greenparty.ca"))
  .extractUrlAndBody()
r.saveAsTextFile("/mnt/vol1/derivative_data/cpp.ndp/")

I receive this error:

<console>:28: error: value extractUrlAndBody is not a member of org.apache.spark.rdd.RDD[org.archive.io.arc.ARCRecord]
possible cause: maybe a semicolon is missing before `value extractUrlAndBody'?
                .extractUrlAndBody()
                 ^

All builds report that they completed successfully.

lintool commented 9 years ago

@ianmilligan1 I think the method got renamed to extractDomainUrlBody() https://github.com/lintool/warcbase/blob/master/src/main/scala/org/warcbase/spark/matchbox/ArcRecords.scala#L51

ianmilligan1 commented 9 years ago

:+1: Works like a charm - will update docs when I have a chance.

ianmilligan1 commented 9 years ago

Also, a reminder to self to document the difference between extractDomainUrlBody() and extractCrawldateDomainUrlBody(). Just running tests right now so I can provide example data.

lintool commented 9 years ago

I believe @aliceranzhou is going to refactor all of that into something like .extract(["domain", "body", ...]) so we don't need a method for each field.

aliceranzhou commented 9 years ago

Refactored the above method into .extract(["domain", "body", ...]), only with enums.

Example: .extract(ToExtract.DOMAIN, ToExtract.CRAWLDATE)

Also, refactored the code a little. There's a RecordLoader that has loadArc() and loadWarc() methods, and both can be used as a WARecordRDD.

I've kept these changes on the matchbox branch, as they're still unstable and I'm trying to determine how to make ScalaTest play nicely with Maven.
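For readers following along, the enum-based `.extract(...)` call mentioned above might be shaped roughly like this. Only the names `ToExtract`, `DOMAIN`, and `CRAWLDATE` come from this thread; `URL`, `BODY`, the `Record` type, the sealed-class encoding, and the `Seq[String]` return type are assumptions, and this is not the code on the matchbox branch.

```scala
object EnumExtractSketch {
  // Hypothetical record type.
  final case class Record(crawlDate: String, domain: String, url: String, body: String)

  // Hypothetical stand-in for the ToExtract enum: each value carries its accessor.
  sealed abstract class ToExtract(val get: Record => String)
  object ToExtract {
    case object CRAWLDATE extends ToExtract(_.crawlDate)
    case object DOMAIN    extends ToExtract(_.domain)
    case object URL       extends ToExtract(_.url)
    case object BODY      extends ToExtract(_.body)
  }

  // Fields come out in the order they were requested.
  def extract(records: Seq[Record], fields: ToExtract*): Seq[Seq[String]] =
    records.map(r => fields.map(f => f.get(r)))
}
```

Compared with string field names, a misspelled enum value is rejected by the compiler, so no runtime name check is needed.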

lintool commented 9 years ago

Nice!

aliceranzhou commented 8 years ago

After offline discussion with @lintool, we've decided to remove the .extract() proliferations. Instead, use .map(r => (r.getDate, r.getDomain, r.getMimeType, r.getRawBodyContent))
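A small sketch of what that looks like in practice. Only the four accessor names (`getDate`, `getDomain`, `getMimeType`, `getRawBodyContent`) come from this thread; the `ArchiveRecord` case class and the sample values are hypothetical, and a `Seq` stands in for the RDD so the example runs without Spark.

```scala
object MapSketch {
  // Hypothetical record type exposing the accessors named in the comment above.
  final case class ArchiveRecord(
    getDate: String,
    getDomain: String,
    getMimeType: String,
    getRawBodyContent: String
  )

  // Made-up sample data for illustration only.
  val records: Seq[ArchiveRecord] = Seq(
    ArchiveRecord("20130102", "greenparty.ca", "text/html", "<html>hello</html>"),
    ArchiveRecord("20130102", "greenparty.ca", "image/png", "")
  )

  // Plain filter and map: the caller picks the fields and gets a typed tuple back.
  val tuples: Seq[(String, String, String, String)] =
    records
      .filter(_.getMimeType == "text/html")
      .map(r => (r.getDate, r.getDomain, r.getMimeType, r.getRawBodyContent))
}
```

Because the tuple is built by the caller, any combination of fields works without adding a method per combination, and the element types are known at compile time.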

lintool commented 8 years ago

Yes, it'll just be up to us to write good documentation on what the API for manipulating records looks like.

And if we fix issue #160 we'll give the user an easy way to pluck individual records and play with them... e.g., extract links, named entities, etc.

lintool commented 8 years ago

@aliceranzhou has merged the initial implementation into master as part of commit 80605c8595012e33088e8f40d1d7b8cb4c078173

Closing this issue. For further API change requests, please open a new issue.

lintool / warcbase

Prototype fluent Spark API for manipulating archive data #146