lintool / warcbase

Warcbase is an open-source platform for managing and analyzing web archives
http://warcbase.org/

New getCrawlmonth function #221

Closed ianmilligan1 closed 8 years ago

ianmilligan1 commented 8 years ago

A common use case is to group link structures, URLs, etc. by crawl date (like we do in our crawl-sites viz). We generally group by month, as crawls span several days.

Right now, our only documented approach is to use the getCrawldate function and then carry out a few complicated lines of filtering, etc.

This pull request introduces a new function, getCrawlmonth. It returns YYYYMM instead of the YYYYMMDD that getCrawldate returns. While it duplicates some functionality, I think this is more user-friendly for our user base. It will also help us generate crawl-sites visualizations more easily.
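As a rough illustration of the idea (not the code in this pull request), a minimal sketch assuming the crawl date is available as a YYYYMMDD string; the object and helper names `CrawlMonthSketch` and `crawlMonth` are hypothetical:

```scala
object CrawlMonthSketch {
  // Hypothetical helper, not part of warcbase: truncate a YYYYMMDD
  // crawl date to YYYYMM. Input that is null or shorter than six
  // characters is returned unchanged rather than throwing.
  def crawlMonth(crawlDate: String): String =
    if (crawlDate != null && crawlDate.length >= 6) crawlDate.substring(0, 6)
    else crawlDate
}
```

For example, `CrawlMonthSketch.crawlMonth("20091218")` yields `"200912"`.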

Use example:

val r = RecordLoader.loadArchives("/Users/ianmilligan1/dropbox/git/warcbase-resources/Sample-Data/ARCHIVEIT-227-QUARTERLY-XUGECV-20091218231727-00039-crawling06.us.archive.org-8091.warc.gz", sc)
  .keepValidPages()
  .map(r => (r.getCrawlmonth, ExtractDomain(r.getUrl)))
  .countItems()
  .take(10)

ianmilligan1 commented 8 years ago

Once we have something like this in the codebase, I'd also like to document how to do our crawl-sites viz on all other collections.

ianmilligan1 commented 8 years ago

Tested:

RecordLoader.loadArchives("/Users/ianmilligan1/dropbox/git/warcbase-resources/Sample-Data/ARCHIVEIT-227-QUARTERLY-XUGECV-20091218231727-00039-crawling06.us.archive.org-8091.warc.gz", sc)
  .keepValidPages()
  .map(r => (r.getCrawlmonth, ExtractLinks(r.getUrl, r.getContentString)))
  .flatMap(r => r._2.map(f => (r._1, ExtractDomain(f._1).replaceAll("^\\s*www\\.", ""), ExtractDomain(f._2).replaceAll("^\\s*www\\.", ""))))
  .filter(r => r._2 != "" && r._3 != "")
  .countItems()
  .filter(r => r._2 > 5)
  .take(20)

Is equivalent to:

RecordLoader.loadArchives("/Users/ianmilligan1/dropbox/git/warcbase-resources/Sample-Data/ARCHIVEIT-227-QUARTERLY-XUGECV-20091218231727-00039-crawling06.us.archive.org-8091.warc.gz", sc)
  .keepValidPages()
  .map(r => (r.getCrawldate, ExtractLinks(r.getUrl, r.getContentString)))
  .flatMap(r => r._2.map(f => (r._1.substring(0,6), ExtractDomain(f._1).replaceAll("^\\s*www\\.", ""), ExtractDomain(f._2).replaceAll("^\\s*www\\.", ""))))
  .filter(r => r._2 != null && r._3 != null)
  .countItems()
  .filter(r => r._2 > 5)
  .groupBy(_._1._1)
  .flatMap(r => r._2)
  .take(20)

While it adds a new function, it simplifies things for users, which is something I think we should be aiming for. I will be testing this in production on the WALK project.
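To make the month-level grouping concrete without a Spark context, here is a small sketch on plain Scala collections. The link triples are made-up sample data, the object name `MonthCountSketch` is hypothetical, and it assumes that countItems counts occurrences of each distinct item:

```scala
object MonthCountSketch {
  // Made-up (crawl date, source domain, target domain) triples,
  // for illustration only.
  val links: Seq[(String, String, String)] = Seq(
    ("20091218", "example.org", "example.com"),
    ("20091219", "example.org", "example.com"),
    ("20100103", "example.org", "example.net")
  )

  // Truncate each date to YYYYMM, then count how often each
  // (month, source, target) triple occurs.
  val counts: Map[(String, String, String), Int] =
    links
      .map { case (date, src, dst) => (date.substring(0, 6), src, dst) }
      .groupBy(identity)
      .map { case (key, occurrences) => (key, occurrences.size) }
}
```

Here the two December 2009 links collapse into a single `("200912", "example.org", "example.com")` entry with a count of 2, which is the effect of keying on the month rather than the full date.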

jrwiebe commented 8 years ago

This looks good. I think it would be better to capitalize the function as getCrawlMonth, though, and change getCrawldate to getCrawlDate while we're at it.

ianmilligan1 commented 8 years ago

Thanks, @jrwiebe – made the changes, and makes sense to me.

Think we're ready to merge @lintool? Once merged, I will also update warcbase-docs.

ianmilligan1 commented 8 years ago

Oops, fixing build fail.

ianmilligan1 commented 8 years ago

And all checks passed – Travis CI is a great tool. Ready to merge on my end.