mediacloud / backend

Media Cloud is an open source, open data platform that allows researchers to answer quantitative questions about the content of online media.
http://www.mediacloud.org
GNU Affero General Public License v3.0

architecture/api for "similar stories" subtopic definition #246

Closed rahulbot closed 6 years ago

rahulbot commented 6 years ago

We are almost ready to create a new subtopic creation technique (a new "focal technique") - codename similar stories. This lets you upload a small training set of manually coded articles to train a model to find similar stories in your topic. The end result of this user flow will be a model that lets us tag each story as similar to the training set or not. This is implemented as a naive Bayes classifier using TF-IDF as the feature (sklearn), thresholded at 0.5 to decide if a story is "similar" or not (thx @ColCarroll).
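For readers who want to see the shape of such a classifier, here is a minimal sketch using scikit-learn's `TfidfVectorizer` and `MultinomialNB`. It is not the script used for this project; the function names `train_similarity_model` and `tag_similar` are made up for illustration, and the choice of the multinomial variant of naive Bayes is an assumption.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB


def train_similarity_model(texts, labels):
    """Fit a TF-IDF vectorizer and a naive Bayes classifier on coded stories."""
    vectorizer = TfidfVectorizer()
    features = vectorizer.fit_transform(texts)
    classifier = MultinomialNB()
    # Labels are stored as 0/1 so that classes_ is predictably [0, 1].
    classifier.fit(features, [int(bool(label)) for label in labels])
    return vectorizer, classifier


def tag_similar(vectorizer, classifier, texts, threshold=0.5):
    """Return True for each story whose "similar" probability meets the threshold."""
    features = vectorizer.transform(texts)
    similar_column = list(classifier.classes_).index(1)
    probabilities = classifier.predict_proba(features)[:, similar_column]
    return [bool(p >= threshold) for p in probabilities]
```

Keeping the vectorizer and the classifier as two separate objects mirrors the fact, mentioned later in the thread, that the stored model consists of two files.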

Current Out-Of-Band Support

With @beckybell13 we've done multiple rounds of iteration using Rebekah's "Right to be Forgotten" topic. Our current out-of-band solution is a manual process that accepts a CSV of coded articles, generates a model, tags stories, and then creates boolean subtopics referring to those tags. For example, if you are looking to identify articles using a "toxic masculinity" framing in a topic about "sexual harassment", this process looks like:

  1. you create a CSV of ~50 story URLs with a boolean column indicating T/F for stories about "toxic masculinity" (these stories can be outside of the topic); half True and half False
  2. you email us that file
  3. we run a script that generates a model and a CSV to review with all the stories in the Topic labelled T or F
  4. we email you back that CSV to review for spot-checking (including recall/precision numbers)
  5. you approve/reject the CSV after checking a few random stories
  6. we run a script that creates a "Classifications" tag-set with two tags ("toxic masculinity" and "NOT toxic masculinity")
  7. we create two boolean query sub-topics using the UI, using tags_id_stories to filter for stories with each of those two tags

This lets you browse the standard Topic Manager interface to see the results of the model and analyze them like everything else.
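As an illustration of step 1, a training file could be read and sanity-checked roughly like this before a model is built. This is only a sketch: the column names `url` and `label` and the accepted spellings of true/false are assumptions, not the format of the actual script.

```python
import csv
import io


def load_training_set(csv_text, url_column="url", label_column="label"):
    """Parse coded stories into (url, is_match) pairs and count each class."""
    truthy = {"t", "true", "1", "yes"}
    falsy = {"f", "false", "0", "no"}
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        value = row[label_column].strip().lower()
        if value not in truthy | falsy:
            raise ValueError(f"unrecognised label: {row[label_column]!r}")
        rows.append((row[url_column].strip(), value in truthy))
    positives = sum(1 for _, is_match in rows if is_match)
    # The caller can compare the two counts to check the half/half balance.
    return rows, positives, len(rows) - positives
```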

Integration Requirements

I want to start a conversation about implementation design to drive scheduling this on the roadmap. In the shorter-term, I thought perhaps it was worth considering this project while thinking about how to store word2vec models in #225, because this needs to store models too.

rahulbot commented 6 years ago

The UI for this is ready on a branch in the front-end. We decided to generate a model for user validation inline on the front-end with a small sample of stories (so it is fast enough that they can validate it right after uploading the training data). The key requirements on the back end are:

Does that make sense? Should we build the model on the front-end or back-end? (@beckybell13 can you link to the code that builds the models so we can see an example?)

beckybell13 commented 6 years ago

Here's the script I have that builds the model from a csv of training data, pickles the model, and optionally tags the topic stories and outputs a csv of the results.

You can see what we currently have on the front-end branch here.

pypt commented 6 years ago

Some notes of mine:

As for the architecture, I think it would be tremendously useful to decide whether the "similar stories" implementation will be exclusively a backend's or frontend's problem. The proposed "mixed" design in which the frontend generates a model and stores it on the backend in some cases and expects the backend to generate the model using the same code (shared how? Git submodule? Python package?) in some other cases just seems too brittle to me, breaks the separation of concerns and might lead to weird project synchronization issues. I'd suggest that we decide on whether:

Sample user stories for both approaches:

User wants to create a new similarity model.

Snapshot gets generated and the new stories in it need to get run against the model.

rahulbot commented 6 years ago

Thanks for thinking those options through. I think we have to use the back-end approach.

With that backend solution:

If we are all agreed, I think the next step is to spec the API changes needed.

rahulbot commented 6 years ago

Ping! This is overdue. Please turn this into an API spec this week that we can review together.

pypt commented 6 years ago

Some initial questions:

pypt commented 6 years ago

Also:

rahulbot commented 6 years ago

A topic has focal_set_definitions, each of which has many focus_definitions in it. When a new snapshot is generated, these get turned into focal_sets and foci, which are associated with that specific snapshot.

To list all the stories in a snapshot, you pass in a snapshots_id param to the call, like topics/<topics_id>/stories/list?snapshots_id=<snapshots_id>.

rahulbot commented 6 years ago

With the "backend" solution we agree on, you are correct that we don't need to allow the model to be downloaded for any system reason. I think we need to allow it to be downloaded to have a good reproducibility/replicability story. I want to be able to tell a user that they can download the model and use it independently.

pypt commented 6 years ago

A topic has focal_set_definitions, each of which has many focus_definitions in it. When a new snapshot is generated, these get turned into focal_sets and foci, which are associated with that specific snapshot.

Yes, I get that from the database table structure, but what does it all mean? :)

With the "backend" solution we agree on, you are correct that we don't need to allow the model to be downloaded for any system reason. I think we need to allow it to be downloaded to have a good reproducibility/replicability story. I want to be able to tell a user that they can download and use the model and use it independently.

I can add a call that returns a .zip of the model (it apparently consists of two files), but that might not be very useful to the user: the model appears to be readable only by a specific version of sklearn, and we would also need to maintain a Python code sample for loading the model (which might soon go out of sync).
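One possible way to soften the version problem is to ship a small manifest next to the two files, so that whoever downloads the archive at least knows which library version wrote it. The sketch below is purely illustrative: the file names, the manifest layout and the `bundle_model` helper are assumptions, not part of the proposed API.

```python
import io
import json
import pickle
import zipfile


def bundle_model(model, vectorizer, library_version):
    """Return the bytes of a .zip holding both pickles plus a version manifest."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("model.pickle", pickle.dumps(model))
        archive.writestr("vectorizer.pickle", pickle.dumps(vectorizer))
        # Hypothetical manifest: records which library version wrote the pickles.
        manifest = {"library": "scikit-learn", "version": library_version}
        archive.writestr("manifest.json", json.dumps(manifest))
    return buffer.getvalue()
```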

hroberts commented 6 years ago

The focus and focal_set refer to the thing actually created by the snapshot. The focus and the focal_set will never change because they are part of that static snapshot. This is necessary because we don't want any of the results in a snapshot to change over time.

But we often make a series of snapshots over time, so we need a stable definition of which focal_set and foci to create for any new snapshot. So the focus_definition and focal_set_definition objects exist to tell the system which foci and focal_sets to create for any new snapshot for a given topic. They exist as separate objects from the snapshot because even though in theory they might change between every snapshot, in practice they are usually pretty static, so it would be a pain to require the user to recreate them every time.

From a user point of view, the focus_definition and focal_set_definition are what you are editing when you edit the subtopics for a given topic. The foci and focal_sets are created using those definitions at the time of the actual snapshot.
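To make the definition/instance split concrete, here is a toy model of it. The class and field names are simplified stand-ins rather than the real database schema: definitions are editable, while the foci and focal sets generated from them are frozen copies tied to one snapshot.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class FocusDefinition:
    name: str
    query: str


@dataclass
class FocalSetDefinition:
    name: str
    focus_definitions: List[FocusDefinition] = field(default_factory=list)


@dataclass(frozen=True)
class Focus:
    snapshots_id: int
    name: str
    query: str


@dataclass(frozen=True)
class FocalSet:
    snapshots_id: int
    name: str
    foci: Tuple[Focus, ...]


def generate_focal_sets(snapshots_id, definitions):
    """Copy the current definitions into immutable per-snapshot objects."""
    return [
        FocalSet(
            snapshots_id,
            definition.name,
            tuple(
                Focus(snapshots_id, fd.name, fd.query)
                for fd in definition.focus_definitions
            ),
        )
        for definition in definitions
    ]
```

Editing a definition after a snapshot has been generated changes what the next snapshot gets, but leaves the earlier snapshot's foci untouched.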

You can specify a snapshot with the snapshots_id= parameter:

https://api.mediacloud.org/api/v2/topics/2306/stories/list?snapshots_id=2296

Note that you are really always querying a specific timespans_id. If you specify the snapshots_id, the api just selects the overall, no subtopic timespan for you. If you specify neither the timespan nor the snapshot, the system chooses the overall, no subtopic timespan from the latest snapshot.
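The selection rule described above can be sketched as a small function. The `snapshots` structure and its key names are hypothetical, and the assumption that an explicit timespans_id takes precedence over snapshots_id is mine rather than something stated in this thread.

```python
def resolve_timespan(snapshots, timespans_id=None, snapshots_id=None):
    """Pick the timespan that a stories/list call would end up querying.

    `snapshots` is a hypothetical list of dicts ordered oldest to newest, each
    with "snapshots_id" and "overall_timespans_id" keys.
    """
    if timespans_id is not None:
        # Assumed: an explicit timespan wins over everything else.
        return timespans_id
    if snapshots_id is not None:
        for snapshot in snapshots:
            if snapshot["snapshots_id"] == snapshots_id:
                # Overall, no-subtopic timespan of the requested snapshot.
                return snapshot["overall_timespans_id"]
        raise KeyError(f"unknown snapshots_id: {snapshots_id}")
    # Neither given: overall, no-subtopic timespan of the latest snapshot.
    return snapshots[-1]["overall_timespans_id"]
```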

rahulbot commented 6 years ago

I think the process should kind of work like this (based on your outline earlier):

When the user later decides to create a new snapshot:

It is up to you whether to use the tagging architecture and boolean foci for this or not; I don't know how stories are generally associated with foci.

pypt commented 6 years ago

Back-end creates focal_definition of type "similar-stories" with name from user (ie. "pro-Brexit")

@rahulbot, what's a focal definition? Is it a focus definition or focal set definition?

rahulbot commented 6 years ago

Good catch. That should say "back end creates focal_set_definition of type "similar stories"". I'll update it now.

pypt commented 6 years ago

Sorry again for the long delay, this (foc(us|al|i)|subtopic)( set)?( definition)? stuff is very hard to approach by the uninitiated (I'd argue that we need to simplify a little by at least renaming a bunch of stuff).

Here's my API draft, tell me what you think:

User wants to preview a new similarity model

  1. Frontend fetches two sets of random stories for sample model generation using a pre-existing topics/<topics_id>/stories/list?sort=random call. One set is to be used for training, another one for evaluation.
  2. User comes up with True / False judgements of the training sample of the stories on the frontend.
  3. Frontend posts training and evaluation sets of stories to topics/<topics_id>/similarity_models/preview call.
  4. Backend quickly trains a sample model using the training story set, evaluates the evaluation story set using the sample model, and returns the judgements to the frontend.
  5. Frontend fetches another set of stories (see step 1) and comes up with a new improved training set with the help of the user.
  6. Rinse and repeat until the user is happy with the training set of stories.

Notes:

User wants to store a new similarity model

  1. Frontend posts a training set of stories to topics/<topics_id>/similarity_models/create call
  2. Model gets stored on the backend, API returns created model's description with similarity_models_id field in it.
  3. Frontend creates a new focal set definition using topics/<topics_id>/focal_set_definitions/create call.
    • Similar Stories "focal technique" is to be used to create focal set definition for similar stories matching.
  4. Frontend creates one or more focus definitions using topics/<topics_id>/focus_definitions/create call.
    • A single query argument which supported only Boolean Query "focal technique" has been replaced by focal technique-dependent args argument.
    • Focus definition can be made to "negate" the similar stories model by setting negate to true.
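As a rough sketch of how the negate flag might behave, assuming it simply flips the model's verdict, one model can feed both a "similar" focus and its negated counterpart. The helper names below are made up for illustration.

```python
def story_matches_focus(model_says_similar, negate=False):
    """A story belongs to the focus when the verdict, flipped if negate is set, is True."""
    return model_says_similar != negate


def split_into_foci(judgements):
    """Split {stories_id: model verdict} into the plain and the negated focus."""
    similar = [
        sid for sid, verdict in judgements.items()
        if story_matches_focus(verdict)
    ]
    negated = [
        sid for sid, verdict in judgements.items()
        if story_matches_focus(verdict, negate=True)
    ]
    return similar, negated
```

With this reading, every story in the snapshot lands in exactly one of the two foci, which matches the earlier "toxic masculinity" / "NOT toxic masculinity" pair of tags.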

rahulbot commented 6 years ago

I see your approach is to split the model generation into its own process/endpoints, and then to connect the model to the focal_set_def once it is ready. That seems reasonable. A few notes/questions:

pypt commented 6 years ago

Thanks for the feedback!

• shouldn't the required roles for both similarity_models/preview and similarity_models/create be topic-admin?

I've made "/preview", "/list", "/download_model" and "/download_vectorizer" require "tm-readonly" role, and "/create" require "tm" role.

• similarity_models/preview should return precision and recall of the training set following Becky's example here - the backend has the "right" judgement for all the training set so it can do this I think. It should also return the top words associated with the pos/neg result; the code here demonstrates how.

Thanks, precision and recall weren't present in the original sample. Added "precision" and "recall" floats in "/preview", "/create" and "/list" responses.
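For reference, the two floats can be computed from the user's judgements and the model's judgements as follows. This is a plain-Python sketch; returning 0.0 when a denominator is zero is an assumption about how the edge case should be reported.

```python
def precision_recall(expected, predicted):
    """Compute (precision, recall) from parallel lists of boolean judgements."""
    pairs = list(zip(expected, predicted))
    true_pos = sum(1 for e, p in pairs if e and p)
    false_pos = sum(1 for e, p in pairs if not e and p)
    false_neg = sum(1 for e, p in pairs if e and not p)
    # Precision: share of stories tagged similar that really are similar.
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    # Recall: share of truly similar stories that the model found.
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall
```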

• Does calling similarity_models/create actually generate the model (ie. take a few seconds), or just queue the model up to be generated once the snapshot is run?

"/create", as well as "/preview", are supposed to generate the model right away.

• How do I download a model? Does the similarity/list call include a url to download the model from or something?

Given that the model consists of two files - model data and vectorizer data, I've added two API calls to download both:

• I think the similarity models need to include the user_id of the person that trained them (in the results of similarity_models/list at least)

I've added "auth_users_id" field to "/create" and "/list".

rahulbot commented 6 years ago

That all sounds great. Two further reactions:

pypt commented 6 years ago

Thanks for the clarification regarding the permissions. I've made the spec require either write (for /create) or read (for the rest of the API calls) permissions to the specific topic.
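In code form, the per-call requirement could look something like the sketch below. The lookup table and the `is_allowed` helper are illustrative only, and treating write permission as implying read is an assumption.

```python
# Permission each similarity_models call needs on the specific topic.
REQUIRED_PERMISSION = {
    "similarity_models/create": "write",
    "similarity_models/preview": "read",
    "similarity_models/list": "read",
    "similarity_models/download_model": "read",
    "similarity_models/download_vectorizer": "read",
}


def is_allowed(call, topic_permissions):
    """Check a call against the set of permissions the user holds on the topic."""
    needed = REQUIRED_PERMISSION[call]
    if needed == "read":
        # Assumed: anyone who can write to the topic can also read it.
        return bool({"read", "write"} & set(topic_permissions))
    return "write" in topic_permissions
```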

Also, I've replaced auth_users_id with owner which contains full_name parameter, among others.

rahulbot commented 6 years ago

API agreed. Moved implementation to separate issue.