An interesting question about the machine learning model is this:
What do we do about already-known facts about a show, such as its characters and actors?
If we have a page P_A and a show S_A that we don't want spoiled, could we detect spoilers based on character and actor names alone?
Getting all the actors and characters in an episode would not be hard, since that information is available online (although importing it for every show would be a challenge).
I would guess that in about 90% of cases, you could detect spoilers this way.
I cannot imagine an article that contains a spoiler without mentioning any character or actor names. If you have an example, please share it with me!
However, many online articles talk about an actor without spoiling the show they star in. While trying to avoid spoilers about Sansa Stark, played by Sophie Turner, you might be blocked from reading an article about X-Men: Apocalypse, because Sophie Turner is in that movie.
This is what is known as a false positive: an article that we classify as a spoiler when it is not actually one.
People might get annoyed if they run into false positives often enough. They may disable the spoiler filter or uninstall it completely.
So we cannot rely solely on censoring an article whenever it contains character and actor names.
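To make the false-positive problem concrete, here is a minimal sketch of the naive name-matching approach. The `BLOCKED_NAMES` list, the `naive_is_spoiler` function and the two headlines are all made up for illustration; they are not part of any existing implementation.

```python
import re

# Hypothetical watch list: names tied to the show we want to protect.
BLOCKED_NAMES = ["Sansa Stark", "Sophie Turner"]

def naive_is_spoiler(text, names=BLOCKED_NAMES):
    """Flag text if it mentions any blocked name (case-insensitive)."""
    return any(re.search(re.escape(n), text, re.IGNORECASE) for n in names)

# Made-up headlines, for illustration only.
recap = "Recap: what Sansa Stark did in this week's episode"
movie_news = "Sophie Turner joins the cast of X-Men: Apocalypse"

print(naive_is_spoiler(recap))       # flagged, as intended
print(naive_is_spoiler(movie_news))  # also flagged: a false positive
```

Both headlines are flagged, even though only the first one is about the show. Matching on names alone cannot tell the two apart.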
Let's think about this in a different way: a spoiler is just an article about an episode that we have not seen yet. So if we can find a way of determining which episode an article is talking about, we can compare it against your list of unseen episodes and censor it based on that.
What we could do is create a model for each episode and measure how similar it is to a model of the article.
So for the first episode of the series, the model would contain a set of tokens (characters, actors, locations, actions): {Ned Stark, Sean Bean, Winterfell, Pentos, Falling, Incest, etc...}
You would compare that to the article's model:
http://www.ew.com/recap/game-of-thrones-season-1-episode-1
tokens (characters, actors, locations, actions): {Ned Stark, Winterfell, direwolves, etc...}
You would then identify that this article has spoilers for season 1, episode 1, and censor it.
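The comparison could be sketched roughly as follows. This is only one possible way to do it: the overlap measure, the 0.5 threshold and all function names are my assumptions, and the token sets are the truncated examples from above (a real model would hold many more tokens).

```python
# Episode models keyed by (season, episode). The token set is the partial
# example from the text; a real model would hold many more tokens.
EPISODE_MODELS = {
    (1, 1): {"Ned Stark", "Sean Bean", "Winterfell", "Pentos", "Falling", "Incest"},
}

def overlap(article_tokens, episode_tokens):
    """Fraction of the article's tokens that also occur in the episode model."""
    if not article_tokens:
        return 0.0
    return len(article_tokens & episode_tokens) / len(article_tokens)

def best_episode(article_tokens, models=EPISODE_MODELS, threshold=0.5):
    """Return the best-matching episode key, or None if nothing clears the threshold."""
    scored = [(overlap(article_tokens, toks), key) for key, toks in models.items()]
    score, key = max(scored, default=(0.0, None))
    return key if score >= threshold else None

def should_censor(article_tokens, unseen_episodes):
    """Censor only when the article matches an episode the user has not seen."""
    return best_episode(article_tokens) in unseen_episodes

recap_tokens = {"Ned Stark", "Winterfell", "direwolves"}
movie_tokens = {"Sophie Turner", "X-Men: Apocalypse"}  # hypothetical article
unseen = {(1, 1)}

print(should_censor(recap_tokens, unseen))  # recap matches season 1, episode 1
print(should_censor(movie_tokens, unseen))  # no episode matches, so not censored
```

Two of the recap's three tokens appear in the episode model, so it clears the threshold, while the movie article shares no tokens with this partial model. A fuller model that lists every actor would need a stricter measure than this one.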
We could also try to create a multi-actor/character model: one that assigns a probability that several names appearing together indicate a spoiler.
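One simple way to turn several name matches into a probability is a noisy-OR combination, where every additional matching name pushes the score up. This is a technique I am swapping in for illustration, not a settled design, and the weights below are placeholder numbers rather than anything learned from data.

```python
import math

# Hypothetical per-name weights: how strongly a single mention suggests
# the article is about the show. These numbers are placeholders, not learned.
NAME_WEIGHTS = {
    "Ned Stark": 0.6,
    "Sansa Stark": 0.6,
    "Sean Bean": 0.2,
    "Sophie Turner": 0.2,
}

def spoiler_probability(text, weights=NAME_WEIGHTS):
    """Noisy-OR combination: each extra matching name raises the score."""
    lowered = text.lower()
    misses = [1.0 - w for name, w in weights.items() if name.lower() in lowered]
    return 1.0 - math.prod(misses)  # math.prod needs Python 3.8+

# Made-up sentences, for illustration only.
print(spoiler_probability("Sophie Turner stars in X-Men: Apocalypse"))
print(spoiler_probability("Ned Stark and Sansa Stark leave Winterfell"))
```

With these placeholder weights, a lone actor name scores low (0.2), while two character names together score much higher (about 0.84), so a threshold in between would let the movie article through.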
The main problem is creating the model for all shows, seasons and episodes. Forget about all the episodes; even a single episode would take a great many data points to model correctly.
This is because we would need to create a text-based classifier, which would need quite a bit of training data because of the high dimensionality of text. That training data could only come from people who have already been spoiled, or from people who volunteer their time to label spoiler examples for us. Not happening.
We need to find a way to build a model with a lot of predictive power that doesn't need much training data from our users in order to identify characters, actors, locations and actions.