lintool / warcbase

Warcbase is an open-source platform for managing and analyzing web archives
http://warcbase.org/

UDF for extracting image links #203

Closed: lintool closed this issue 8 years ago

lintool commented 8 years ago

Needed for a hackathon team: UDF for extracting image links.

ianmilligan1 commented 8 years ago

Is this UDF ready to merge to master?

ExtractImageLinks.scala seems good to me.

lintool commented 8 years ago

Fixed and merged in commit 04d105ded330e116da09f93f34e375c8262cd2f9

ianmilligan1 commented 8 years ago

Do we have an example script that calls this? If so, happy to write it up and include it in the docs.

lintool commented 8 years ago

This is the script I wrote for Ryan Deschamps:

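import org.warcbase.spark.matchbox._
import org.warcbase.spark.rdd.RecordRDD._

// `sc` is the SparkContext (predefined in spark-shell).
val links = RecordLoader.loadArchives("src/test/resources/warc/example.warc.gz", sc)
  .keepValidPages()
  // Emit every image link found on each valid page.
  .flatMap(r => ExtractImageLinks(r.getUrl, r.getContentString))
  // Tally how often each image URL occurs.
  .countItems()
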
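For anyone reading along without a Spark shell handy, here is a rough, self-contained sketch of the same idea in plain Scala: pull `src` attributes out of `<img>` tags, resolve them against the page URL, and tally the results. This is not the merged `ExtractImageLinks` code. The names `ImageLinkSketch`, `extractImageLinks`, and `countItems` below are hypothetical stand-ins, and the regex matching is an assumption made to keep the example dependency-free; it will miss cases a real HTML parser handles.

```scala
import java.net.URI
import scala.util.Try

// Hypothetical stand-in for illustration only; not warcbase's ExtractImageLinks.
object ImageLinkSketch {
  // Matches the quoted src attribute of an <img> tag (case-insensitive).
  // Requiring whitespace before "src" skips attributes such as data-src.
  private val ImgSrc =
    """(?i)<img\s+(?:[^>]*?\s)?src\s*=\s*["']([^"']+)["']""".r

  // Returns absolute image URLs, resolving relative links against pageUrl.
  // Links that cannot be parsed as URIs are silently dropped.
  def extractImageLinks(pageUrl: String, html: String): List[String] =
    ImgSrc.findAllMatchIn(html).toList.flatMap { m =>
      Try(new URI(pageUrl).resolve(m.group(1).trim).toString).toOption
    }

  // Local analogue of the counting step: (item, count) pairs,
  // most frequent first, ties broken alphabetically.
  def countItems(items: Seq[String]): Seq[(String, Int)] =
    items.groupBy(identity)
      .map { case (item, occurrences) => (item, occurrences.size) }
      .toSeq
      .sortBy { case (item, count) => (-count, item) }
}
```

Calling `ImageLinkSketch.extractImageLinks(url, html)` on one page's URL and content and passing the result to `ImageLinkSketch.countItems` mirrors the `flatMap` and `countItems` steps of the script, just on a local collection rather than an RDD.
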
ianmilligan1 commented 8 years ago

OK. Testing and will write up.

ianmilligan1 commented 8 years ago

Put a stub in here; please feel free to tweak and refine: http://lintool.github.io/warcbase-docs/Spark-Image-Analysis/