Closed lintool closed 8 years ago
Per @ianmilligan1's suggestion, we should have `extractCrawldateUrlBody` in addition to `extractUrlBody` (the "And" has been dropped from the method names).
Also, we should make sure the full archival URL is present in the extracted plain text, as well as the domain. Perhaps the output per record should look like:
(YYYYMMDD,greenparty.ca,http://greenparty.ca/exactURL,plaintext)
Added domain and full archival URL, as well as crawl date. Named the methods `extractDomainUrlBody()` and `extractCrawldateDomainUrlBody()` for clarity.
Also, re: @ianmilligan1, to run multi-line commands, one can run `:paste` in spark-shell. I've added this to the documentation under [this page](https://github.com/lintool/warcbase/wiki/Building-and-Running-Warcbase-Under-OS-X). Would it make more sense to separate the Spark API information into another page?
@aliceranzhou instead of a proliferation of `.extractFoo` methods, how about something like `.extract(["crawldate", "domain", "url", "body"])`? The method should be smart about checking the field names.
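A minimal plain-Scala sketch of the idea, assuming a stand-in `Record` class (the field names follow this thread, but nothing here is the actual warcbase API):

```scala
// Hypothetical sketch of a string-keyed extract() that validates field
// names before building the output. Record is a stand-in, not the real
// warcbase record type.
object ExtractSketch {
  case class Record(crawldate: String, domain: String, url: String, body: String)

  val validFields = Set("crawldate", "domain", "url", "body")

  // Returns the requested fields, in order, after checking the names.
  def extract(r: Record, fields: Seq[String]): Seq[String] = {
    val unknown = fields.filterNot(validFields.contains)
    require(unknown.isEmpty, s"Unknown field(s): ${unknown.mkString(", ")}")
    fields.map {
      case "crawldate" => r.crawldate
      case "domain"    => r.domain
      case "url"       => r.url
      case "body"      => r.body
    }
  }
}
```

A typo like `"crawdate"` would then fail fast with a clear message instead of silently producing an empty column.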
10/28 meeting notes
Breaking this down into separate issues for better organization:
Move `org.warcbase.spark.matchbox.ArcRecords` into package `org.warcbase.spark.rdd` to match the Spark class hierarchy: http://spark.apache.org/docs/latest/api/scala/#org.apache.spark.rdd.RDD (we also want API documentation along those lines)
I'm using warcbase on a server and running into error messages (that don't appear locally). When running this script:
import org.warcbase.spark.matchbox.ArcRecords
import org.warcbase.spark.matchbox.ArcRecords._
val r = ArcRecords.load("/mnt/vol1/data_sets/cpp_arcs", sc)
.keepMimeTypes(Set("text/html"))
.discardDate(null)
.keepDomains(Set("greenparty.ca"))
.extractUrlAndBody()
r.saveAsTextFile("/mnt/vol1/derivative_data/cpp.ndp/")
I receive the error:
<console>:28: error: value extractUrlAndBody is not a member of org.apache.spark.rdd.RDD[org.archive.io.arc.ARCRecord]
possible cause: maybe a semicolon is missing before `value extractUrlAndBody'?
.extractUrlAndBody()
^
All builds claim to complete successfully.
@ianmilligan1 I think the method got renamed to `extractDomainUrlBody()`: https://github.com/lintool/warcbase/blob/master/src/main/scala/org/warcbase/spark/matchbox/ArcRecords.scala#L51
:+1: Works like a charm - will update docs when I have a chance.
Also a reminder to self to document the difference between `extractDomainUrlBody()` and `extractCrawldateDomainUrlBody()`. Just running tests right now so I can provide example data.
I believe @aliceranzhou is going to refactor all of that into something like `.extract(["domain", "body", ...])` so we don't need a method for each field.
Refactored the above methods into a single `.extract(...)`, using enums instead of strings. Example: `.extract(ToExtract.DOMAIN, ToExtract.CRAWLDATE)`
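A small sketch of what the enum-based version could look like. The name `ToExtract` and its members come from the comment above; the record class and everything else are assumptions for illustration:

```scala
// Hypothetical sketch of an enum-keyed extract(). Using an enum instead
// of strings means an invalid field is a compile error, not a runtime one.
object ToExtract extends Enumeration {
  val CRAWLDATE, DOMAIN, URL, BODY = Value
}

// Stand-in record type, not the real warcbase record.
case class SketchRecord(crawldate: String, domain: String, url: String, body: String) {
  def extract(fields: ToExtract.Value*): Seq[String] =
    fields.map {
      case ToExtract.CRAWLDATE => crawldate
      case ToExtract.DOMAIN    => domain
      case ToExtract.URL       => url
      case ToExtract.BODY      => body
    }
}
```

Usage: `r.extract(ToExtract.DOMAIN, ToExtract.CRAWLDATE)` returns the fields in the order requested.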
Also, refactored the code a little. There's a `RecordLoader` that has `loadArc()` and `loadWarc()` methods, and both can be used as a `WARecordRDD`.
I've kept these changes under the matchbox branch, as they're still unstable and I'm trying to determine how to make scalatest play nice with maven.
Nice!
After offline discussion with @lintool, we've decided to remove the `.extract()` proliferation. Instead, use `.map(r => (r.getDate, r.getDomain, r.getMimeType, r.getRawBodyContent))`
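To illustrate the map-based style without a Spark cluster, here is a plain-Scala sketch with a `Seq` standing in for the RDD. The getter names follow the comment above, but the record class is a stand-in, not the real warcbase type:

```scala
// Sketch of the map-based extraction style; Seq stands in for the RDD.
object MapSketch {
  // Stand-in for an archive record with the getters named in the thread.
  case class ArchiveRecordStub(getDate: String, getDomain: String,
                               getMimeType: String, getRawBodyContent: String)

  // Equivalent to rdd.map(r => (...)) on a Spark RDD of records.
  def toTuples(records: Seq[ArchiveRecordStub]): Seq[(String, String, String, String)] =
    records.map(r => (r.getDate, r.getDomain, r.getMimeType, r.getRawBodyContent))
}
```

The appeal of this design is that users compose ordinary `map` calls instead of learning one bespoke method per field combination.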
Yes, it'll just be up to us to write good documentation on what the API for manipulating records looks like.
And if we fix issue #160 we'll give the user an easy way to pluck individual records and play with them, e.g., extract links, named entities, etc.
@aliceranzhou has merged the initial implementation into master as part of commit 80605c8595012e33088e8f40d1d7b8cb4c078173
Closing this issue. For further API change requests, please open a new issue.
I had in mind something like this:
And we could have methods like the following:
.discardMimeTypes()
.discardDate()
Finally, we might have something like `extractUrlAndBody`, which would generate tuples of (URL, body). This could be implemented by sub-classing RDD. The methods would essentially be syntactic sugar over `filter`, `map`, and standard Spark operations over RDDs.
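A plain-Scala sketch of that "syntactic sugar" idea, with a `Seq` standing in for the RDD and an implicit class standing in for the RDD subclass; the method and field names are assumptions based on this thread:

```scala
// Hypothetical sketch: keepMimeTypes/discardDate/extractUrlAndBody as
// thin wrappers over filter and map, via an implicit enrichment class.
object SugarSketch {
  // Stand-in record type, not the real warcbase record.
  case class Rec(mimeType: String, date: String, url: String, body: String)

  implicit class RecOps(records: Seq[Rec]) {
    // Keep only records whose MIME type is in the given set.
    def keepMimeTypes(types: Set[String]): Seq[Rec] =
      records.filter(r => types.contains(r.mimeType))

    // Drop records with the given crawl date.
    def discardDate(date: String): Seq[Rec] =
      records.filter(r => r.date != date)

    // Produce (URL, body) tuples.
    def extractUrlAndBody(): Seq[(String, String)] =
      records.map(r => (r.url, r.body))
  }
}
```

With this in scope, the chained style from earlier in the thread, e.g. `records.keepMimeTypes(Set("text/html")).extractUrlAndBody()`, works on a plain collection exactly as it would read on an RDD.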