mediacloud / backend

Media Cloud is an open source, open data platform that allows researchers to answer quantitative questions about the content of online media.
http://www.mediacloud.org
GNU Affero General Public License v3.0

define key researcher use-cases for story image extraction and storage #708

Open rahulbot opened 4 years ago

rahulbot commented 4 years ago

To make some technical decisions I think we need to more concretely design the primary use cases we have in mind so far. Here's my stab at a list, and the underlying tech feature it might rely on:

  1. review a summary of visual language across a small corpus (i.e. top stories) - maybe use image tree map
  2. review a summary of visual language across a large corpus (i.e. a timespan) - some high-level view of clusters, like Leon's mosaic does
  3. trace the appearance of an image over time in a topic - search by image similarity
  4. search for stories using images similar to one the researcher identifies - search by image similarity

This is the thinking that led me to #658.

hroberts commented 4 years ago

these all look great to me. is there some way we can produce each of these on a one-off basis to evaluate them before building them into the platform? we have arguably already done #1.

alternatively, we could make a bet that this is the set of products we want and build the minimal platform to deliver them.

a key difference I see is that the first two only require us to collect and process a small subset of the images, whereas the last two require us to process all images in a topic and also build an indexing system to be able to find them. maybe start with the first two and build from there?
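To make the indexing requirement for the last two use cases concrete, here is a minimal sketch of an index from an image fingerprint to the stories that used it. Everything in it is hypothetical: the `ImageIndex` class, the story ids, and the placeholder image bytes are illustration only, and the SHA-256 fingerprint matches only byte-identical images, so it is a stand-in for real similarity search rather than a proposal for it.

```python
import hashlib
from collections import defaultdict
from datetime import date


def fingerprint(image_bytes: bytes) -> str:
    """Exact-match fingerprint: only byte-identical images share a value."""
    return hashlib.sha256(image_bytes).hexdigest()


class ImageIndex:
    """Hypothetical index mapping an image fingerprint to the stories using it."""

    def __init__(self):
        self._entries = defaultdict(list)

    def add(self, story_id: int, published: date, image_bytes: bytes) -> None:
        """Record that a story published on `published` used this image."""
        self._entries[fingerprint(image_bytes)].append((published, story_id))

    def appearances(self, image_bytes: bytes):
        """Return (date, story_id) pairs for this image, oldest first."""
        return sorted(self._entries.get(fingerprint(image_bytes), []))


# Placeholder bytes stand in for downloaded image files.
index = ImageIndex()
index.add(101, date(2020, 5, 3), b"fake-image-a")
index.add(102, date(2020, 5, 1), b"fake-image-a")
index.add(103, date(2020, 5, 2), b"fake-image-b")

timeline = index.appearances(b"fake-image-a")
```

The point of the sketch is the cost difference: the index only answers questions about images that were added to it, so tracing an image across a topic means fingerprinting every image in that topic first.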

-hal

cindyloo commented 4 years ago

note: we also have the potential to analyze by facial detection and identification.

I think we've proved the desire and feasibility for use case #1. At a minimum, surfacing and storing the image and its URL for use cases 1 and 2 would make for a flexible initial implementation.

the ability to search by image similarity would be an incredible capability, as there is little out there that does this, but it is not a trivial implementation
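As a rough illustration of what search by image similarity involves, here is a sketch of one common simple technique, an average hash compared by Hamming distance. It is an assumption for illustration, not something this thread has chosen. It expects images already decoded and downscaled to small grayscale grids, and the function names and the tiny 2x2 sample grids are hypothetical.

```python
def average_hash(pixels):
    """Return a list of bits: 1 where a pixel is brighter than the image mean.

    `pixels` is a small grayscale grid (rows of 0-255 ints), assumed to be
    already downscaled to a fixed size such as 8x8.
    """
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    return [1 if value > mean else 0 for value in flat]


def hamming_distance(hash_a, hash_b):
    """Count the positions where two equal-length hashes differ."""
    if len(hash_a) != len(hash_b):
        raise ValueError("hashes must have the same length")
    return sum(a != b for a, b in zip(hash_a, hash_b))


def most_similar(query, candidates):
    """Return candidate names ordered from most to least similar to `query`."""
    query_hash = average_hash(query)
    scored = [
        (hamming_distance(query_hash, average_hash(grid)), name)
        for name, grid in candidates.items()
    ]
    return [name for _, name in sorted(scored)]


# Tiny 2x2 grids keep the example readable; real hashes use larger grids.
original = [[10, 200], [220, 30]]
brighter_copy = [[30, 220], [240, 50]]   # same picture, uniformly brighter
inverted = [[200, 10], [30, 220]]        # light and dark areas swapped

ranked = most_similar(original, {"brighter_copy": brighter_copy, "inverted": inverted})
```

A hash like this tolerates brightness changes and recompression but not crops or heavy edits, and comparing a query against every stored hash does not scale without an index, which is part of why the full feature is not trivial.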

rahulbot commented 4 years ago

Glad this list feels like a good start. I think #2 has been fairly well validated as useful too (see @cindyloo's MediaCloud-Image-Tests repo).

I think you're right that this argues for extracting and surfacing the URL of the top image as a way to get started with 1 & 2. It would also let us try out some out-of-band approaches to 3 and 4 more quickly (with the top image at least). We kind of discussed this in #593, but also more recently.

To be concrete: I'm proposing we take a first step towards image support by adding a pipeline stage to every story in a topic that extracts and stores the top image URL (via Newspaper3k because we have validated that). This should be returned in topic-story-list results so it can be used easily. I can split this off to a new issue to discuss details if folks generally agree.
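For a sense of what such a pipeline stage does, here is a standard-library sketch that pulls a top image URL out of story HTML by reading the `og:image` meta tag. This is a simplified stand-in and not Newspaper3k's implementation (Newspaper3k exposes its result as the `top_image` attribute of a parsed `Article`). The `extract_top_image_url` helper and the sample HTML are hypothetical.

```python
from html.parser import HTMLParser
from typing import Optional


class TopImageParser(HTMLParser):
    """Records the first og:image URL found in a page's <meta> tags."""

    def __init__(self):
        super().__init__()
        self.top_image_url: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        # Only the first og:image tag is kept; later ones are ignored.
        if tag != "meta" or self.top_image_url is not None:
            return
        attributes = dict(attrs)
        if attributes.get("property") == "og:image" and attributes.get("content"):
            self.top_image_url = attributes["content"]


def extract_top_image_url(html: str) -> Optional[str]:
    """Return the og:image URL from `html`, or None if the page has none."""
    parser = TopImageParser()
    parser.feed(html)
    return parser.top_image_url


# Hypothetical story HTML; a real stage would read the stored story content.
sample_html = (
    '<html><head><meta property="og:image" '
    'content="https://example.com/lead.jpg"></head><body></body></html>'
)
top_image_url = extract_top_image_url(sample_html)
```

Storing only this URL string per story keeps the stage cheap, and image download and processing can then happen out of band.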

The key point this is pushing me towards is that separating URLs from images can help us implement a first stage faster and give us a non-critical-path playground to more easily try out solutions for some of these features.