Closed greebie closed 7 years ago
Thoughts @lintool?
We're currently working on getting images straight out of the WARC files, which would save the trip to the Wayback Machine.
Here's my question: What would you want to do with the extracted images? Save to a file so you can look at them? But that's potentially dangerous, because what if you extracted 500k images and wrote 500k files to disk? Would you want a collage?
The main purpose has been to run them through some kind of object detection API (e.g. detecting major politicians in cpp, male vs. female, etc.). Watson can do some interesting things in that direction too. I suppose integrating with the API would be better than capturing the photos? I agree it could be dangerous. Grabbing the image links might be safer, if less complete.
Just popping in here.
@lintool, do we have any movement on grabbing images right out of the WARCs with Warcbase? I remember we spoke about it, but I'm not sure whether it actually made its way into an issue.
Closing for now based on the larger discussion about how to handle images. If a front-end test case emerges that requires images, then maybe an appropriate solution can be developed.
The image URL process is excellent. I would like to suggest adding another object to Warcbase that gives you pre-processed URLs for fetching the images from the Wayback Machine. (Suggested code at bottom). Basically, it works the same as the regular ExtractImageLinks, except that it takes one additional parameter: a date string. The output is a Seq of URLs that can be flatMapped and used to scrape images from the Wayback Machine for research purposes.
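To make the idea concrete, here is a rough, language-neutral sketch in Python. It is not the suggested Warcbase code and not Warcbase's Scala API: the function name `extract_wayback_image_links`, the helper class, and the parameter names are all hypothetical. It assumes the usual Wayback Machine address pattern of `https://web.archive.org/web/<date>/<original URL>` and only looks at `<img src="...">` attributes.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumption: the standard Wayback Machine replay prefix.
WAYBACK_PREFIX = "https://web.archive.org/web"


class _ImgSrcParser(HTMLParser):
    """Collects the src attribute of every <img> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.sources.append(value)


def extract_wayback_image_links(page_url, html, date):
    """Return Wayback Machine URLs for the images referenced in `html`.

    `date` is the extra date-string parameter described above,
    e.g. "20150101". Relative image paths are resolved against `page_url`.
    """
    parser = _ImgSrcParser()
    parser.feed(html)
    parser.close()
    return [
        f"{WAYBACK_PREFIX}/{date}/{urljoin(page_url, src)}"
        for src in parser.sources
    ]
```

Applied to a whole collection, the per-page lists can be flattened into one list of URLs, which mirrors the flatMap step:

```python
pages = [("http://example.com/index.html", '<img src="/logo.png">')]
all_links = [
    link
    for url, html in pages
    for link in extract_wayback_image_links(url, html, "20150101")
]
```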
A possible further development would be an image cloud for the collection (e.g. the number of times an image appears determines its size).
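The sizing idea could look roughly like the sketch below, again in Python and purely illustrative. The function name `image_cloud_sizes` and the pixel bounds are assumptions, and linear scaling relative to the most frequent image is only one of several reasonable choices.

```python
from collections import Counter


def image_cloud_sizes(image_urls, min_px=32, max_px=256):
    """Map each image URL to a display size based on how often it occurs.

    The most frequent image gets `max_px`; every other image is scaled
    linearly by its count relative to that maximum, starting from `min_px`.
    """
    counts = Counter(image_urls)
    if not counts:
        return {}
    top = max(counts.values())
    return {
        url: round(min_px + (max_px - min_px) * n / top)
        for url, n in counts.items()
    }
```

The input would be the flat list of extracted image links, so an image that appears on many pages ends up larger in the cloud.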
Ryan. . ..