Closed aliceranzhou closed 8 years ago
I think we had a use case documented at #154, in the latter few messages - extracting plain text from a specific moment in time?
Sorry, I was unclear. This branch actually adds two things: first there's the date extraction code, which works and is ready to go (this addresses #154). Then there's a function called tabDelimit that converts an iterator of tuples (consisting of Strings, Ints, or other tuples) to a tab-separated string of the flattened tuple's values. I assume it's meant to be used to output extracted records in TSV format instead of tuples.
Currently, we need to convert a tuple to an iterator, as seen below in the second map().
import org.warcbase.spark.rdd.RecordRDD._
import org.warcbase.spark.matchbox.{RecordLoader, RemoveHTML}
import org.warcbase.spark.matchbox.TupleFormatter._
RecordLoader.loadWarc("./example.warc.gz", sc)
.keepValidPages()
.map(r => (r.getCrawldate, r.getDomain, r.getUrl, RemoveHTML(r.getContentString)))
.map(r => tabDelimit(Iterator(r)))
.saveAsTextFile("out/")
I was just wondering if @aliceranzhou had a different use scenario in mind – i.e., a scenario where an iterator of tuples occurs naturally, possibly with nesting. If not, tabDelimit could be written more simply.
e.g.,
def tabDelimit2(p: Product): String = {
p.productIterator.mkString("\t")
}
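For reference, here is a rough sketch of how the Product-based version behaves on a flat tuple versus a nested one. The sample values and the TabDelimit2Demo wrapper are made up for illustration. The point is that productIterator only walks the top level, so a nested tuple comes out via its toString rather than flattened.

```scala
object TabDelimit2Demo {
  // Same definition as above: join the top-level fields with tabs.
  def tabDelimit2(p: Product): String = p.productIterator.mkString("\t")

  // Flat tuple: every field becomes its own column.
  val flat: String = tabDelimit2(("20080430", "example.com", 42))

  // Nested tuple: the inner tuple is NOT flattened; it is rendered
  // with Tuple2's toString, i.e. "(a.com,b.com)".
  val nested: String = tabDelimit2(("20080430", ("a.com", "b.com")))
}
```

So this version is only a drop-in replacement if nested tuples really don't occur in practice.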
Yes, I think the tuples were nested when I originally wrote tabDelimit.
Your version of tabDelimit is much cleaner though – let's go with that! If nested tuples don't occur, then let's just deprecate the first version.
Do you want to make the change @jrwiebe, or should I?
I ended up using shapeless to let tabDelimit and flatten take tuples of any size as arguments. (I did end up keeping the flattening after all, since I saw we do have example scripts in the docs where there is nesting to eliminate.) Shapeless permits us to make Scala behave dynamically in kind of a clever way, but after writing some example code to update the documentation I am beginning to wonder if it is robust enough. Specifically, while a call of map(tabDelimit(_)) applied to an RDD works (as in the example here), a call of something like map(r => tabDelimit((r.getCrawldate, ExtractLinks(r.getUrl, r.getContentString)))) fails on resolving tabDelimit's implicit parameters. Perhaps a variant of @aliceranzhou's tabDelimit would ultimately be the better solution. We would want to add cases all the way up to Tuple22 for completeness, and a method defined for a Product parameter (i.e., any tuple) so that the user doesn't have to convert their tuple to an Iterator.
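As a note on what a non-shapeless variant might look like, here is a minimal sketch that flattens nested tuples at runtime through Product instead of resolving implicits. The names TupleFlatten, flattenTuple and tabDelimitFlat are hypothetical and are not what is in the branch. The class-name check is a heuristic I'm assuming is acceptable: it stops Lists, Options and case classes (which are also Products) from being pulled apart.

```scala
object TupleFlatten {
  // Heuristic (an assumption of this sketch): treat a value as a tuple only if
  // its runtime class is one of scala.Tuple1..Tuple22, including specialized variants.
  private def isTuple(x: Any): Boolean = x match {
    case p: Product => p.getClass.getName.startsWith("scala.Tuple")
    case _          => false
  }

  // Recursively expand nested tuples into their leaf values, depth-first.
  def flattenTuple(x: Any): Iterator[Any] = x match {
    case p: Product if isTuple(p) => p.productIterator.flatMap(flattenTuple)
    case other                    => Iterator(other)
  }

  // Accepts any tuple directly, so the caller does not need to wrap it in an Iterator.
  def tabDelimitFlat(p: Product): String = flattenTuple(p).mkString("\t")
}
```

With this, something like map(r => TupleFlatten.tabDelimitFlat((r.getCrawldate, (r.getDomain, r.getUrl)))) should need no implicit resolution at the call site, at the cost of losing the compile-time type information that shapeless keeps. A Seq inside the tuple would be left as a single column rather than expanded.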
Since this pull request is closed, I'm leaving this comment mostly for myself in case this concern is elevated into an issue.
The date extraction code looks good. Is there a reason why tabDelimit should only work with tuples of size 2-4, though? Also, what are some use cases for the function?