lintool / warcbase

Warcbase is an open-source platform for managing and analyzing web archives
http://warcbase.org/

Pig2Gdf.py deprecated? Switch to native GDF exporter #175

Closed ianmilligan1 closed 8 years ago

ianmilligan1 commented 8 years ago

Following up from the project meeting today: our "Gephi: Converting Site Link Structure into Dynamic Visualization" guide relies on pig2gdf.py to pipe content into Gephi. With Pig's death, we'll need to revisit this. Should we bake GDF export right into Scala?

@jrwiebe, I know it was ages ago, but since you're the expert on this (having written pig2gdf.py), do you want to add this to your ever-growing to-do list? :grin:

jrwiebe commented 8 years ago

I added a UDF to do this, called WriteGDF. The example below shows how to call it after generating an RDD of links (à la http://lintool.github.io/warcbase-docs/Spark-Analysis-of-Site-Link-Structure/).

import org.warcbase.spark.matchbox.RecordTransformers._
import org.warcbase.spark.matchbox.{ExtractTopLevelDomain, ExtractLinks, RecordLoader, WriteGDF}
import org.warcbase.spark.rdd.RecordRDD._

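// Load the ARC records and keep only HTML pages.
val links = RecordLoader.loadArc("/collections/webarchives/CanadianPoliticalParties/arc/", sc)
  .discardDate(null)
  .keepMimeTypes(Set("text/html"))
  // Pair each page's crawl date with the links extracted from it.
  .map(r => (r.getCrawldate, ExtractLinks(r.getUrl, r.getContentString)))
  // Reduce source and target URLs to bare domains, stripping any leading "www.".
  .flatMap(r => r._2.map(f => (r._1, ExtractTopLevelDomain(f._1).replaceAll("^\\s*www\\.", ""), ExtractTopLevelDomain(f._2).replaceAll("^\\s*www\\.", ""))))
  // Drop links with an empty endpoint, count each (date, source, target) triple,
  // and keep only those seen more than 5 times.
  .filter(r => r._2 != "" && r._3 != "")
  .countItems()
  .filter(r => r._2 > 5)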

WriteGDF(links, "all-links.gdf")
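
For anyone curious what a GDF export involves, here is a rough, standalone sketch. It is not the actual WriteGDF implementation: the `GdfSketch` object, its method names, and the choice of edge columns are all hypothetical, and it assumes each record has the shape `((crawlDate, source, target), count)` as produced by the pipeline above. It also assumes domain names contain no commas, since nothing is quoted.

```scala
import java.io.PrintWriter

// Hypothetical helper for illustration only; not part of warcbase.
object GdfSketch {
  // One aggregated link: (crawl date, source domain, target domain) plus its count.
  type LinkCount = ((String, String, String), Int)

  /** Render link counts as GDF text: a node section followed by an edge section. */
  def toGdf(links: Seq[LinkCount]): String = {
    // Every distinct domain that appears as a source or target becomes a node.
    val nodes = links.flatMap { case ((_, src, dst), _) => Seq(src, dst) }.distinct
    val sb = new StringBuilder
    sb.append("nodedef>name VARCHAR\n")
    nodes.foreach(n => sb.append(n).append('\n'))
    // The extra edge columns (weight, crawldate) are an assumption of this sketch.
    sb.append("edgedef>node1 VARCHAR,node2 VARCHAR,weight DOUBLE,crawldate VARCHAR\n")
    links.foreach { case ((date, src, dst), count) =>
      sb.append(s"$src,$dst,$count,$date\n")
    }
    sb.toString
  }

  /** Write the GDF text to a local file. */
  def write(links: Seq[LinkCount], path: String): Unit = {
    val out = new PrintWriter(path)
    try out.write(toGdf(links)) finally out.close()
  }
}
```

With an RDD small enough to collect on the driver, something like `GdfSketch.write(links.collect().toSeq, "all-links.gdf")` would produce a file Gephi can open. For larger results, the real WriteGDF is the better route.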

lintool commented 8 years ago

@jrwiebe two suggestions - can you merge with --no-ff to create a new merge point? It doesn't matter as much for this base, but if you have a multi-commit patch, history is better preserved. See: http://stackoverflow.com/questions/9069061/what-is-the-difference-between-git-merge-and-git-merge-no-ff

If you'd like to benefit from code review, create a patch and ask for feedback from others (before merging back to master). If it's straightforward, no need.

jrwiebe commented 8 years ago

Thanks for the reminder about --no-ff. Regarding patching, are you referring to format-patch?

lintool commented 8 years ago

Sorry, re: code review, I meant create a pull request, which provides a structure around which people can give feedback.