lintool / warcbase

Warcbase is an open-source platform for managing and analyzing web archives
http://warcbase.org/

Built-in Image URL building from wayback #232

Closed greebie closed 7 years ago

greebie commented 8 years ago

The image URL process is excellent. I would like to suggest adding another object to Warcbase that returns pre-processed URLs for fetching the images from the Wayback Machine (suggested code at bottom). It works the same as the regular ExtractImageLinks, except that it takes one additional parameter: a date string. The output is a Seq of URLs that can be flatMapped and used to scrape images from Wayback for research purposes.

A possible further development would be an image cloud for the collection (e.g. counts determine image size).
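As a rough sketch of the counting step such an image cloud would need, the snippet below tallies how often each image URL occurs and sorts the result by frequency. The object name `ImageCountSketch` and the method `countImages` are hypothetical and not part of Warcbase; mapping a count to a display size is left out.

```scala
// Hypothetical helper, not part of Warcbase: counts how often each image URL appears.
object ImageCountSketch {
  // Returns (url, count) pairs, most frequent first; ties are broken alphabetically.
  def countImages(urls: Seq[String]): Seq[(String, Int)] =
    urls
      .groupBy(identity)
      .map { case (url, occurrences) => (url, occurrences.size) }
      .toSeq
      .sortBy { case (url, count) => (-count, url) }
}
```

The counts could then drive the relative size of each image in the cloud.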

Ryan

package org.warcbase.spark.matchbox

import java.io.IOException
import org.jsoup.Jsoup
import org.jsoup.select.Elements
import scala.collection.mutable

object ExtractImageLinksWBFormatted {
  // Wayback prefix (hard-coded to Archive-It collection 227) and image suffix.
  private val prefix: String = "http://wayback.archive-it.org/227/"
  private val suffix: String = "im_/"

  /**
    * @param src the src link.
    * @param date a date for the content (e.g. a crawl date). Wayback will resolve truncated dates.
    * @param html the content from which links are to be extracted.
    * @return a sequence of Wayback-formatted image links
    */
  def apply(src: String, date: String, html: String): Seq[String] = {
    if (html.isEmpty || date.isEmpty) return Nil
    try {
      val output = mutable.MutableList[String]()
      val doc = Jsoup.parse(html)
      val links: Elements = doc.select("img[src]")
      val it = links.iterator()
      while (it.hasNext) {
        val link = it.next()
        // Resolve relative src attributes against the page URL.
        link.setBaseUri(src)
        val target = link.attr("abs:src")
        output += (prefix + date + suffix + target)
      }
      output
    } catch {
      case e: Exception =>
        throw new IOException("Caught exception processing input", e)
    }
  }
}

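
To illustrate the flatMap usage described above without pulling in jsoup or Spark, here is a minimal sketch over a plain Seq. The `Page` case class, `waybackImageUrl`, and `allWaybackUrls` are hypothetical names for illustration only; the image links are assumed to have been extracted already, and the URL layout simply mirrors the prefix + date + "im_/" + target concatenation in the suggested code.

```scala
// Hypothetical sketch, not part of Warcbase: builds Wayback image URLs from
// records whose image links have already been extracted.
object WaybackUrlSketch {
  // A page with its crawl date (e.g. "20150101") and absolute image links.
  final case class Page(url: String, crawlDate: String, imageLinks: Seq[String])

  // Same concatenation as in ExtractImageLinksWBFormatted: prefix + date + "im_/" + target.
  def waybackImageUrl(prefix: String, date: String, target: String): String =
    prefix + date + "im_/" + target

  // flatMap flattens the per-page sequences into one sequence of URLs.
  def allWaybackUrls(pages: Seq[Page], prefix: String): Seq[String] =
    pages.flatMap { page =>
      page.imageLinks.map(link => waybackImageUrl(prefix, page.crawlDate, link))
    }
}
```

With an RDD of archive records the same flatMap shape would apply, with ExtractImageLinksWBFormatted producing the per-record sequence.
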
ianmilligan1 commented 8 years ago

Thoughts @lintool?

ianmilligan1 commented 8 years ago

We're currently working on getting images right out of the WARC files, which would save the trip to the Wayback Machine.

lintool commented 8 years ago

Here's my question: What would you want to do with the extracted images? Save to a file so you can look at them? But that's potentially dangerous, because what if you extracted 500k images and wrote 500k files to disk? Would you want a collage?

greebie commented 8 years ago

The main purpose has been to run them through some kind of object detection API (e.g. detect major politicians in cpp, male vs female, etc.). Watson can do some interesting things in that direction too. I suppose integration with the API would be better than capturing the photos? I agree it could be dangerous. Grabbing the image links might be safer, if less complete.

ianmilligan1 commented 7 years ago

Just popping in here.

@lintool, do we have any movement on grabbing images right out of the WARCs w/ warcbase? I remember we spoke about it, but I'm not sure whether it actually made its way into an issue.

greebie commented 7 years ago

Closing for now based on the larger discussion about how to handle images. If a front-end test case emerges that requires images, then maybe an appropriate solution can be developed.