Closed ianmilligan1 closed 8 years ago
Once we have something like this in the codebase, I'd also like to document how to do our crawl-sites viz on all other collections.
Tested:
RecordLoader.loadArchives("/Users/ianmilligan1/dropbox/git/warcbase-resources/Sample-Data/ARCHIVEIT-227-QUARTERLY-XUGECV-20091218231727-00039-crawling06.us.archive.org-8091.warc.gz", sc)
  .keepValidPages()
  .map(r => (r.getCrawlmonth, ExtractLinks(r.getUrl, r.getContentString)))
  .flatMap(r => r._2.map(f => (r._1, ExtractDomain(f._1).replaceAll("^\\s*www\\.", ""), ExtractDomain(f._2).replaceAll("^\\s*www\\.", ""))))
  .filter(r => r._2 != "" && r._3 != "")
  .countItems()
  .filter(r => r._2 > 5)
  .take(20)
Is equivalent to:
RecordLoader.loadArchives("/Users/ianmilligan1/dropbox/git/warcbase-resources/Sample-Data/ARCHIVEIT-227-QUARTERLY-XUGECV-20091218231727-00039-crawling06.us.archive.org-8091.warc.gz", sc)
  .keepValidPages()
  .map(r => (r.getCrawldate, ExtractLinks(r.getUrl, r.getContentString)))
  .flatMap(r => r._2.map(f => (r._1.substring(0,6), ExtractDomain(f._1).replaceAll("^\\s*www\\.", ""), ExtractDomain(f._2).replaceAll("^\\s*www\\.", ""))))
  .filter(r => r._2 != null && r._3 != null)
  .countItems()
  .filter(r => r._2 > 5)
  .groupBy(_._1._1)
  .flatMap(r => r._2)
  .take(20)
While it adds a new function, it simplifies things for users, which is something I think we should be aiming for. I will be testing this in production on the WALK project.
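For anyone who wants to check the month logic without Spark or a WARC file, here is a minimal sketch in plain Scala collections. It assumes crawl dates are YYYYMMDD strings; `Link`, `crawlMonth`, `stripWww` and `countByMonth` are hypothetical names for illustration, not part of warcbase.

```scala
// Hypothetical stand-in for one extracted link; not warcbase's record type.
final case class Link(crawlDate: String, source: String, target: String)

object CrawlMonthSketch {
  // Assumption: crawl dates are YYYYMMDD strings, so the month is the first six characters.
  def crawlMonth(crawlDate: String): String = crawlDate.take(6)

  // Same www-stripping regex as in the pipelines above.
  def stripWww(domain: String): String = domain.replaceAll("^\\s*www\\.", "")

  // Count links per (month, source domain, target domain), dropping empty domains.
  def countByMonth(links: Seq[Link]): Map[(String, String, String), Int] =
    links
      .map(l => (crawlMonth(l.crawlDate), stripWww(l.source), stripWww(l.target)))
      .filter { case (_, src, dst) => src.nonEmpty && dst.nonEmpty }
      .groupBy(identity)
      .map { case (key, hits) => key -> hits.size }

  def main(args: Array[String]): Unit = {
    // Made-up sample data: two captures in December 2009, one in January 2010.
    val links = Seq(
      Link("20091218", "www.example.org", "example.com"),
      Link("20091219", "example.org", "www.example.com"),
      Link("20100105", "example.org", "example.com")
    )
    countByMonth(links).foreach(println)
  }
}
```

In this sample the two December links fall under the same key, ("200912", "example.org", "example.com"), with a count of 2, which is the effect both pipelines above are after.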
This looks good. I think it would be better to capitalize the function as getCrawlMonth, though, and change getCrawldate to getCrawlDate while we're at it.
Thanks, @jrwiebe – made the changes, and makes sense to me.
Think we're ready to merge @lintool? Once merged, I will also update warcbase-docs.
Oops, fixing build fail.
And all checks passed – Travis CI is a great tool. Ready to merge on my end.
A common use case is to group link structures, URLs, etc. (like we do in our crawl-sites viz). We generally do so by month, as crawls span several days.
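To illustrate why month-level grouping helps when a crawl spans several days, here is a rough sketch with made-up captures in plain Scala. The sample data and the names `groupByDay` and `groupByMonth` are hypothetical, and crawl dates are assumed to be YYYYMMDD strings.

```scala
object GroupByMonthSketch {
  // Hypothetical sample: (crawl date as YYYYMMDD, URL). One crawl runs 18-20 December.
  val captures: Seq[(String, String)] = Seq(
    ("20091218", "http://example.org/"),
    ("20091219", "http://example.org/about"),
    ("20091220", "http://example.org/contact"),
    ("20100104", "http://example.org/")
  )

  // Grouping on the full date splits a multi-day crawl into one group per day.
  def groupByDay(rows: Seq[(String, String)]): Map[String, Seq[String]] =
    rows.groupBy(_._1).map { case (day, hits) => day -> hits.map(_._2) }

  // Grouping on the YYYYMM prefix keeps the whole crawl together.
  def groupByMonth(rows: Seq[(String, String)]): Map[String, Seq[String]] =
    rows.groupBy(_._1.take(6)).map { case (month, hits) => month -> hits.map(_._2) }

  def main(args: Array[String]): Unit = {
    println(s"groups by day: ${groupByDay(captures).size}")
    println(s"groups by month: ${groupByMonth(captures).size}")
  }
}
```

With this sample, the December crawl lands in three separate day groups but a single month group, which is the shape a per-month visualization needs.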
Right now, our only documented case is to use the getCrawldate function and carry out a few complicated lines of filtering, etc.
This pull request introduces a new function, getCrawlmonth. It returns YYYYMM instead of YYYYMMDD by default. While it duplicates some functionality, I think this is more user friendly for our user base. It will also help us generate crawl-sites visualizations more easily.
Use example: