The UI for this is ready on a branch in the front-end. We decided to generate a model for user validation inline on the front-end with a small sample of stories (so it is fast enough that they can validate it right after uploading the training data). The key requirements on the back end are:
- a `focal_set` whose name is the name the user provided
- two `foci` in that `focal_set`, named "matching SUBTOPIC_NAME" and "not matching SUBTOPIC_NAME"
- `foci` membership created based on whether the result of the model (thresholded at 50%) is a match or not

Does that make sense? Should we build the model on the front-end or back-end? (@beckybell13 can you link to the code that builds the models so we can see an example?)
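For concreteness, here is a hypothetical sketch of what those requirements would produce for a subtopic the user named "pro-Brexit" (the JSON layout and field names are illustrative, not an actual schema):

```json
{
  "focal_set": {
    "name": "pro-Brexit",
    "focal_technique": "Similar Stories",
    "foci": [
      { "name": "matching pro-Brexit" },
      { "name": "not matching pro-Brexit" }
    ]
  }
}
```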
Some notes of mine:
Fetching the full text of stories and testing it against the model will be relatively slow for us too. The boolean-query subtopic (focal) technique is fast because we query for the matching stories against an indexed database, so creating subtopics (foci) and whatnot is more or less instant, but to compile the model-based subtopics, we'd have to fetch every story individually and test it against the model. My point here is that it's going to be a slow process independently of whether we do it on the backend or frontend.
I'd consider scrapping the "you email us X, we email you Y back" part because it's a manual process prone to delays, "you can't do it yourself, write us a nice letter" is a true user engagement killer, and we're half-targeting media makers who probably prefer to be kept away from the CSVs, XMLs and JSONs of the world. Maybe users could create the models (and the subsequent subtopics-foci) themselves and simply delete the ones that don't work out for them, thus arriving at their perfect model by basic trial and error?
As for the architecture, I think it would be tremendously useful to decide whether the "similar stories" implementation will be exclusively the backend's or the frontend's problem. The proposed "mixed" design, in which the frontend generates a model and stores it on the backend in some cases, and expects the backend to generate the model using the same code (shared how? Git submodule? Python package?) in other cases, just seems too brittle to me: it breaks the separation of concerns and might lead to weird project synchronization issues. I'd suggest that we decide whether:
1. Generating and storing models will be done exclusively on the backend. The frontend won't generate, store or upload the models, instead relying on the backend to do everything via API calls (even for sample models for the immediate preview to the user).
2. Generating and storing models will be done exclusively on the frontend. The frontend generates and stores the models and makes other decisions on them, while the backend is oblivious to any kind of model -- it just creates subtopics-foci based on a discrete list of `stories_id`s as determined by the frontend (which, in turn, uses its generated model to come up with the list of `stories_id`s to go into a subtopic-focus).
Sample user stories for both approaches:
**User wants to create a new similarity model.**

1. Frontend requests a sample of stories (`stories_id`s) for sample model generation.
2. User does `True` / `False` supervision of the sample of the stories.
3. Frontend sends the `stories_id`s and supervision results back to the backend; the backend quickly generates a sample model using the parameter stories and tests it against yet another sample of stories, returning the picked stories back to the frontend for user preview (model generation with 50 or so stories should be pretty quick).
4. Frontend sends the `stories_id`s and supervision results to the backend; the backend generates and stores a final model.
5. User does `True` / `False` supervision of the sample of the stories.

**Snapshot gets generated and the new stories in it need to get run against the model.**
Thanks for thinking those options through. I think we have to use the back-end approach.
With that backend solution:
- this should be a new `focal_technique`, rather than overloading it onto boolean query

If we are all agreed, I think the next step is to spec the API changes needed.
Ping! This is overdue. Please turn this into an API spec this week that we can review together.
Some initial questions:
- How do I list all the stories in a snapshot - is there a `topics/<topics_id>/snapshots/<snapshots_id>/stories/list` call?

Also:
A topic has `focal_set_definitions`, each of which has many `focus_definitions` in it. When a new snapshot is generated, these get turned into `focal_sets` and `foci`, which are associated with that specific snapshot.
To list all the stories in a snapshot, you pass in a `snapshots_id` param to the call, like `topics/<topics_id>/stories/list?snapshots_id=<snapshots_id>`.
With the "backend" solution we agree on, you are correct that we don't need to allow the model to be downloaded for any system reason. I think we need to allow it to be downloaded to have a good reproducibility/replicability story. I want to be able to tell a user that they can download and use the model and use it independently.
> A topic has `focal_set_definitions`, each of which has many `focus_definitions` in it. When a new snapshot is generated, these get turned into `focal_sets` and `foci`, which are associated with that specific snapshot.
Yes, I get that from the database table structure, but what does it all mean? :)
With the "backend" solution we agree on, you are correct that we don't need to allow the model to be downloaded for any system reason. I think we need to allow it to be downloaded to have a good reproducibility/replicability story. I want to be able to tell a user that they can download and use the model and use it independently.
I can add a call which would return a `.zip` of the model (it apparently consists of two files), but that might not be too useful for the user, as the model appears to be readable only by a specific version of sklearn; also, we'd need to maintain a Python code sample for the user to load the model (which might go out of sync soon), etc.
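For illustration, the kind of code sample in question might look like this sketch, assuming the two files are pickled sklearn objects (a classifier plus its TF-IDF vectorizer); the file names and classes here are assumptions, not the actual artifact format:

```python
import pickle

# Hypothetical file names; the real artifact layout isn't specified in this thread.
with open("model.pickle", "rb") as f:
    model = pickle.load(f)  # e.g. a fitted sklearn MultinomialNB
with open("vectorizer.pickle", "rb") as f:
    vectorizer = pickle.load(f)  # e.g. a fitted sklearn TfidfVectorizer

# Turn a story's text into features and apply the 0.5 threshold from the thread.
story_text = "full text of a story to classify..."
features = vectorizer.transform([story_text])
is_similar = model.predict_proba(features)[0][1] >= 0.5
print(is_similar)
```

This is exactly the kind of snippet that would have to track the sklearn version, which is the maintenance concern above.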
The focus and focal_set refer to the thing actually created by the snapshot. The focus and the focal_set will never change because they are part of that static snapshot. This is necessary because we don't want any of the results in a snapshot to change over time.
But we often make a series of snapshots over time, so we need a stable definition of which focal_set and foci to create for any new snapshot. So the focus_definition and focal_set_definition objects exist to tell the system which foci and focal_sets to create for any new snapshot for a given topic. They exist as separate objects from the snapshot because even though in theory they might change between every snapshot, in practice they are usually pretty static, so it would be a pain to require the user to recreate them every time.
From a user point of view, the focus_definition and focal_set_definition is what you are editing when you edit the subtopics for a given topic. The foci and focal_sets are created using those definitions at the time of the actual snapshot.
You can specify a snapshot with the `snapshots_id=` parameter:
https://api.mediacloud.org/api/v2/topics/2306/stories/list?snapshots_id=2296
Note that you are really always querying a specific `timespans_id`. If you specify the `snapshots_id`, the API just selects the overall, no-subtopic timespan for you. If you specify neither the timespan nor the snapshot, the system chooses the overall, no-subtopic timespan from the latest snapshot.
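For example, pinning the timespan down explicitly would look like this (the `timespans_id` value here is made up): `https://api.mediacloud.org/api/v2/topics/2306/stories/list?timespans_id=12345`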
I think the process should kind of work like this (based on your outline earlier):
1. User uploads a training data CSV with a `judgement` column they filled in.
2. Front-end parses the CSV into a `stories_id` and `judgement` list to send to the back-end.
3. Front-end sends the `stories_id`/`judgement` pairs to the server, with a name for the new subtopic set.
4. Back-end creates a `focal_set_definition` of type "similar-stories" with the name from the user (ie. "pro-Brexit"), and creates two `focus_definitions` that are children of that `focal_set_definition` - one for "is matching" and one for "is not matching".

When the user later decides to create a new snapshot:

1. Back-end reads the `focal_set_definition` and creates the `focal_set` and 2 `foci` (associated with the model and training data).

It is up to you whether to use the tagging architecture and boolean foci for this or not; I don't know how stories are generally associated with foci.
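For illustration, the uploaded training CSV might look like the following; only the `judgement` column is actually named in this thread, so the `stories_id` column and the 1/0 encoding are assumptions:

```
stories_id,judgement
123456789,1
123456790,0
123456791,1
```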
> Back-end creates focal_definition of type "similar-stories" with name from user (ie. "pro-Brexit")
@rahulbot, what's a focal definition? Is it a focus definition or focal set definition?
Good catch. That should say "back end creates focal_set_definition of type "similar stories"". I'll update it now.
Sorry again for the long delay; this `(foc(us|al|i)|subtopic)( set)?( definition)?` stuff is very hard to approach for the uninitiated (I'd argue that we need to simplify a little by at least renaming a bunch of stuff).
Here's my API draft, tell me what you think:
**User wants to preview a new similarity model**

1. Frontend fetches two random samples of stories using the `topics/<topics_id>/stories/list?sort=random` call. One set is to be used for training, another one for evaluation.
2. User does `True` / `False` judgements of the training sample of the stories on the frontend.
3. Frontend sends the judgements to the `topics/<topics_id>/similarity_models/preview` call.

**User wants to store a new similarity model**

1. Frontend sends the judgements to the `topics/<topics_id>/similarity_models/create` call. The response has a `similarity_models_id` field in it.
2. Frontend creates the focal set definition with the `topics/<topics_id>/focal_set_definitions/create` call. A new `Similar Stories` "focal technique" is to be used to create the focal set definition for similar-stories matching.
3. Frontend creates the focus definitions with the `topics/<topics_id>/focus_definitions/create` call. The `query` argument, which supported only the `Boolean Query` "focal technique", has been replaced by a focal technique-dependent `args` argument. The "not matching" focus definition is created by setting `negate` to `true`.
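To make the `args` idea concrete, a hypothetical `focus_definitions/create` request body under this draft could look like the following; besides `args`, `negate` and `similarity_models_id`, the field names and values are guesses:

```json
{
  "focal_set_definitions_id": 456,
  "name": "not matching pro-Brexit",
  "args": {
    "similarity_models_id": 123,
    "negate": true
  }
}
```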
I see your approach is to split the model generation into its own process/endpoints, and then to connect the model to the focal_set_def once it is ready. That seems reasonable. A few notes/questions:

- Shouldn't the required roles for both `similarity_models/preview` and `similarity_models/create` be `topic-admin`?
- `similarity_models/preview` should return precision and recall of the training set following Becky's example here - the backend has the "right" judgement for all the training set so it can do this I think. It should also return the top words associated with pos/neg results; code here demonstrates how.
- Does calling `similarity_models/create` actually generate the model (ie. take a few seconds), or just queue the model up to be generated once the snapshot is run?
- How do I download a model? Does the `similarity/list` call include a url to download the model from or something?
- I think the similarity models need to include the user_id of the person that trained them (in the results of `similarity_models/list` at least).
- The focal technique-dependent `args` property on `focus_definitions/create` & `focus_definitions/update` makes sense; it shouldn't be hard for us to refactor the existing calls to make it work that way.

Thanks for the feedback!
> Shouldn't the required roles for both `similarity_models/preview` and `similarity_models/create` be `topic-admin`?
I've made "/preview", "/list", "/download_model" and "/download_vectorizer" require "tm-readonly" role, and "/create" require "tm" role.
> `similarity_models/preview` should return precision and recall of the training set following Becky's example here - the backend has the "right" judgement for all the training set so it can do this I think. It should also return the top words associated with pos/neg results; code here demonstrates how.
Thanks, precision and recall weren't present in the original sample. Added "precision" and "recall" floats in "/preview", "/create" and "/list" responses.
> Does calling `similarity_models/create` actually generate the model (ie. take a few seconds), or just queue the model up to be generated once the snapshot is run?
"/create", as well as "/preview", are supposed to generate the model right away.
> How do I download a model? Does the `similarity/list` call include a url to download the model from or something?
Given that the model consists of two files (model data and vectorizer data), I've added two API calls to download both: "/download_model" and "/download_vectorizer".
> I think the similarity models need to include the user_id of the person that trained them (in the results of `similarity_models/list` at least)
I've added "auth_users_id" field to "/create" and "/list".
That all sounds great. Two further reactions:
- The UI doesn't know anything about `auth_users_id`. I suppose that'll change if we do what I suggested for the user management API (https://github.com/berkmancenter/mediacloud/issues/404#issuecomment-403606308), but failing that it'd be super helpful to include the user's actual name in the `/create` and `/list` response so I have something useful to show in the UI.

Thanks for the clarification regarding the permissions. I've made the spec require either `write` (for `/create`) or `read` (for the rest of the API calls) permissions on the specific topic.
Also, I've replaced `auth_users_id` with `owner`, which contains a `full_name` parameter, among others.
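For the record, a hypothetical `similarity_models/list` response entry under that spec; only `similarity_models_id`, `precision`, `recall` and `owner` with its `full_name` are fields named in this thread, and all values are made up:

```json
{
  "similarity_models_id": 123,
  "precision": 0.91,
  "recall": 0.84,
  "owner": {
    "auth_users_id": 42,
    "full_name": "Jane Doe"
  }
}
```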
API agreed on. Moved implementation to a separate issue.
We are almost ready to create a new subtopic creation technique (a new "focal technique") - codename "similar stories". This lets you upload a small training set of manually coded articles to train a model to find similar stories in your topic. The end result of this user flow will be a model that lets us tag each story as similar to the training set or not. This is implemented as a naive Bayes classifier using TF-IDF as the feature (sklearn), thresholded at 0.5 to decide if a story is "similar" or not (thx @ColCarroll).
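As a rough illustration of that approach, here is a minimal sklearn sketch along those lines; the exact classes and preprocessing in the real pipeline may differ, and the texts and labels are toy data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training set standing in for the user's manually coded stories.
training_texts = [
    "court orders search engine to delist old personal data",
    "local football team wins the championship final",
    "judge backs right to be forgotten request against publisher",
    "new stadium opens downtown ahead of the season",
]
judgements = [1, 0, 1, 0]  # 1 = matches the subtopic, 0 = does not

vectorizer = TfidfVectorizer(stop_words="english")
model = MultinomialNB().fit(vectorizer.fit_transform(training_texts), judgements)

# Tag a new topic story as "similar" if P(match) >= 0.5, as described above.
story = "court rules that outdated articles must be removed from search results"
p_match = model.predict_proba(vectorizer.transform([story]))[0][1]
print("similar" if p_match >= 0.5 else "not similar")
```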
Current Out-Of-Band Support
With @beckybell13 we've done multiple rounds of iteration using Rebekah's "Right to be Forgotten" topic. Our current out-of-band solution is a manual process that accepts a CSV of coded articles, generates a model, tags stories, and then creates boolean subtopics referring to those tags. For example, if you are looking to identify articles using a "toxic masculinity" framing in a topic about "sexual harassment", this process looks like:
1. You send us a CSV of manually coded articles.
2. We generate a model from it and tag each story in the topic as matching the framing or not.
3. We create boolean query subtopics that use `tags_id_stories` to filter for stories with each of those two tags. This lets you browse the standard Topic Manager interface to see the results of the model and analyze them like everything else.

Integration Requirements
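For example, with made-up tag IDs, the two generated subtopics would use boolean queries like `tags_id_stories:9990001` (matching) and `tags_id_stories:9990002` (not matching).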
I want to start a conversation about implementation design to drive scheduling this on the roadmap. In the shorter-term, I thought perhaps it was worth considering this project while thinking about how to store word2vec models in #225, because this needs to store models too.