Add dateExtract and tabDelimit

jrwiebe commented 8 years ago

The date extraction code looks good. Is there a reason why tabDelimit should only work with tuples of size 2-4, though? Also, what are some use cases for the function?

ianmilligan1 commented 8 years ago

I think we had a use case documented at #154, in the last few messages: extracting plain text from a specific moment in time?

jrwiebe commented 8 years ago

Sorry, I was unclear. This branch actually adds two things: first there's the date extraction code, which works and is ready to go (this addresses #154). Then there's a function called tabDelimit that converts an iterator of tuples (consisting of Strings, Ints, or other tuples) to a tab-separated string of the flattened tuple's values. I assume it's meant to be used to output extracted records in TSV format instead of tuples.

Currently, each tuple has to be wrapped in an Iterator before it can be passed to tabDelimit, as in the second map() below.

import org.warcbase.spark.rdd.RecordRDD._
import org.warcbase.spark.matchbox.{RecordLoader, RemoveHTML}
import org.warcbase.spark.matchbox.TupleFormatter._

RecordLoader.loadWarc("./example.warc.gz", sc)
  .keepValidPages()
  .map(r => (r.getCrawldate, r.getDomain, r.getUrl, RemoveHTML(r.getContentString)))
  .map(r => tabDelimit(Iterator(r)))
  .saveAsTextFile("out/")

I was just wondering if @aliceranzhou had a different use scenario in mind – i.e., a scenario where an iterator of tuples occurs naturally, possibly with nesting. If not, tabDelimit could be written more simply.

e.g.,

  def tabDelimit2(p: Product): String = {
    p.productIterator.mkString("\t")
  }
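
To make the trade-off concrete, here is a small standalone check of how a Product-based version treats flat versus nested tuples. The wrapper object name and the sample values are made up for illustration; only the body of tabDelimit2 comes from the snippet above.

```scala
object TabDelimit2Check {
  // Same body as tabDelimit2 above: join a tuple's fields with tabs.
  def tabDelimit2(p: Product): String =
    p.productIterator.mkString("\t")

  def main(args: Array[String]): Unit = {
    // Flat tuple: each field becomes its own column.
    println(tabDelimit2(("20080430", "example.com", 3)))
    // Nested tuple: the inner tuple is NOT flattened; it lands in a
    // single column, rendered through its toString as "(a.com,b.com)".
    println(tabDelimit2(("20080430", ("a.com", "b.com"))))
  }
}
```

So the simpler version is enough only if nested tuples never reach it; otherwise a flattening step is still needed first.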

aliceranzhou commented 8 years ago

Yes, I think the tuples were nested when I originally wrote tabDelimit.

Your version of tabDelimit is much cleaner though – let's go with that! If nested tuples don't occur, then let's just deprecate the first version.

Do you want to make the change @jrwiebe, or should I?

jrwiebe commented 8 years ago

I ended up using shapeless so that tabDelimit and flatten can take tuples of any size as arguments. (I kept the flattening after all, since we do have example scripts in the docs where there is nesting to eliminate.) Shapeless lets us make Scala behave dynamically in kind of a clever way, but after writing some example code to update the documentation I am beginning to wonder whether it is robust enough. Specifically, while a call of map(tabDelimit(_)) applied to an RDD works (as in the example here), a call like map(r => tabDelimit((r.getCrawldate, ExtractLinks(r.getUrl, r.getContentString)))) fails to resolve tabDelimit's implicit parameters. Perhaps a variant of @aliceranzhou's tabDelimit would ultimately be the better solution. We would want to add cases all the way up to Tuple22 for completeness, plus a method that takes a Product parameter (i.e., any tuple) so that the user doesn't have to convert their tuple to an Iterator.
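
For reference, one possible shape for the Product-based fallback is a runtime flatten that recurses only into tuples. This is a rough sketch, not what the branch contains: the object and method names are hypothetical, it gives up static typing (everything becomes Any), and detecting tuples by class name is an assumption made to keep other Products such as List cells or case classes intact.

```scala
object TupleFlattenSketch {
  // Hypothetical helper: true only for scala.TupleN instances
  // (including specialized variants such as Tuple2$mcII$sp).
  private def isTuple(p: Product): Boolean =
    p.getClass.getName.startsWith("scala.Tuple")

  // Recursively flatten nested tuples into a flat list of values.
  // Non-tuple values (Strings, Ints, Seqs, ...) are kept as-is.
  def flatten(value: Any): List[Any] = value match {
    case p: Product if isTuple(p) => p.productIterator.flatMap(flatten).toList
    case other                    => List(other)
  }

  // Accepts any tuple directly, so no Iterator wrapping is needed.
  def tabDelimit(p: Product): String =
    flatten(p).mkString("\t")
}
```

Because there are no implicit parameters to resolve, a call such as map(r => TupleFlattenSketch.tabDelimit((r.getCrawldate, someNestedTuple))) should compile the same way as any other method call; the cost is that type errors surface only at runtime, if at all.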

Since this pull request is closed, I'm leaving this comment mostly for myself in case this concern is elevated into an issue.

lintool / warcbase

Add dateExtract and tabDelimit #193