Closed ianmilligan1 closed 8 years ago
I added a UDF to do this, called WriteGDF. The example below demonstrates how it is called after generating an RDD of links (à la http://lintool.github.io/warcbase-docs/Spark-Analysis-of-Site-Link-Structure/).
import org.warcbase.spark.matchbox.RecordTransformers._
import org.warcbase.spark.matchbox.{ExtractTopLevelDomain, ExtractLinks, RecordLoader, WriteGDF}
import org.warcbase.spark.rdd.RecordRDD._
val links = RecordLoader.loadArc("/collections/webarchives/CanadianPoliticalParties/arc/", sc)
  .discardDate(null)
  .keepMimeTypes(Set("text/html"))
  .map(r => (r.getCrawldate, ExtractLinks(r.getUrl, r.getContentString)))
  .flatMap(r => r._2.map(f => (r._1, ExtractTopLevelDomain(f._1).replaceAll("^\\s*www\\.", ""), ExtractTopLevelDomain(f._2).replaceAll("^\\s*www\\.", ""))))
  .filter(r => r._2 != "" && r._3 != "")
  .countItems()
  .filter(r => r._2 > 5)
WriteGDF(links, "all-links.gdf")
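For anyone unfamiliar with GDF, here is a rough sketch of the kind of text such an export could produce. It is not the actual WriteGDF implementation: the object name GdfSketch, the function toGdf, and the choice of columns (including crawldate) are assumptions made for illustration, and it works on a plain Seq rather than an RDD so it runs without Spark. It also assumes domain names contain no commas.

```scala
// Hypothetical helper for illustration only; this is NOT warcbase's WriteGDF.
object GdfSketch {
  // Mirrors the record shape produced by countItems() in the example above:
  // ((crawlDate, sourceDomain, targetDomain), count)
  type LinkCount = ((String, String, String), Int)

  def toGdf(links: Seq[LinkCount]): String = {
    // Every distinct domain that appears as a source or a target becomes a node.
    val nodes = links
      .flatMap { case ((_, src, dst), _) => Seq(src, dst) }
      .distinct
      .sorted

    // GDF is plain text: a node section followed by an edge section.
    val nodeLines = "nodedef>name VARCHAR" +: nodes

    // The column layout below is an assumption, not taken from WriteGDF.
    val edgeHeader = "edgedef>node1 VARCHAR,node2 VARCHAR,crawldate VARCHAR,weight DOUBLE"
    val edgeLines = edgeHeader +: links.map {
      case ((date, src, dst), count) => s"$src,$dst,$date,$count"
    }

    (nodeLines ++ edgeLines).mkString("\n")
  }
}
```

With an RDD like links above, one could presumably collect() the (already filtered) records and pass them to a function of this shape, though WriteGDF may well do this differently.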
@jrwiebe two suggestions: can you merge with --no-ff to create a new merge point? It doesn't matter as much for this base, but if you have a multi-commit patch, history is better preserved. See: http://stackoverflow.com/questions/9069061/what-is-the-difference-between-git-merge-and-git-merge-no-ff
If you'd like to benefit from code review, create a patch and ask for feedback from others (before merging back to master). If it's straightforward, no need.
Thanks for the reminder about --no-ff. Regarding patching, are you referring to format-patch?
On Tue, Dec 8, 2015 at 8:17 AM, Jimmy Lin wrote the two suggestions above.
Sorry, re: code review, I meant create a pull request, which provides a structure around which people can give feedback.
Following up from the project meeting today: our "Gephi: Converting Site Link Structure into Dynamic Visualization" relies upon pig2gdf.py to pipe content into Gephi. With Pig's death, we'll need to revisit this. Bake GDF export right into Scala? I know it was ages ago, but as @jrwiebe is the expert on this (having written pig2gdf.py), do you want to add this to your ever-growing to-do list? :grin: