mediacloud / backend

Media Cloud is an open source, open data platform that allows researchers to answer quantitative questions about the content of online media.
http://www.mediacloud.org
GNU Affero General Public License v3.0

improve date guessing #198

Closed hroberts closed 6 years ago

hroberts commented 6 years ago

The code we run for date guessing was mostly written a couple of years ago and has not been updated much since then. When we wrote it, we validated it as about 87% accurate, but we have anecdata that it is less accurate now. We should take another look at it to see if we can improve it, and update our metrics about it.

We use date guessing when we discover stories through topic spidering. For stories crawled from RSS feeds, we use the date from the rss item.

The current system just uses a set of serial heuristics to try to guess the date. The date guessing code is here:

https://github.com/berkmancenter/mediacloud/blob/master/lib/MediaWords/TM/GuessDate.pm

The most reliable heuristic is a date in the url ('http://nytimes.com/2017/12/07/foo-bar.html'). There are a bunch of methods looking for specific html or meta tags. The fallback methods are just to use any text in the story that looks like a date (which is pretty common) or finally just to use the date of the story that linked to the story whose date we are guessing.
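As a rough illustration (not the module's actual code, which is in Perl), the URL-slug heuristic could be sketched in Python roughly like this; the regex and function name are assumptions for illustration:

```python
import re
from datetime import datetime

# Many news sites embed the publication date in the URL path,
# e.g. http://nytimes.com/2017/12/07/foo-bar.html
URL_DATE_RE = re.compile(r'/(20\d{2})/(\d{1,2})/(\d{1,2})/')

def date_from_url(url):
    """Return a datetime parsed from the URL path, or None if absent."""
    match = URL_DATE_RE.search(url)
    if not match:
        return None
    year, month, day = (int(g) for g in match.groups())
    try:
        return datetime(year, month, day)
    except ValueError:  # e.g. /2017/13/45/ is not a real date
        return None
```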

My first approach to improving the current code would be just to look at the html source of a collection of current stories from our database and come up with more / better heuristics to capture tags that I'm sure are in use now that weren't a couple of years ago (or weren't noticed by us a couple of years ago). As you get a feel for the data and problem, though, feel free to come up with more creative solutions, or even consider a different framework for the whole approach.

In addition to guessing the date, we need to guess whether a story is 'undateable'. An undateable story is one that inherently does not have a single publication date, such as a wikipedia page, a tag search page on a newspaper site, or the front page of any site. It is very important to properly mark undateable stories because stories improperly dated to a given timespan can badly skew results for that timespan in the topic mapper analytical tools (topic mapper allows the user to analyze a topic not just overall but also broken down into monthly or weekly or even smaller timespans).

To validate the date guessing, we should take a sample of stories from recent topics. Validation should consist of manually dating a couple hundred stories from each of the below topics and using those manually dated stories as the evaluation set to generate accuracy numbers to validate each change to the existing heuristics. We should generate accuracy numbers both overall and for each topic to get a sense of whether different topics behave differently based on country or topic.

Here are some good topics to include in the validation:

U.S. Presidential Election (1404)
Trump 2017 (1643)
Culture of Health (1674)
Aadhar card (1630)
vaccines 2016 (1542)

When I do this sort of work, I treat the data just like I am the machine model. I divide it into a training set and an evaluation set. I let myself look at the sources of errors in the training set but only the top end numbers from the evaluation set. If the results from the training and evaluation set start to diverge, I manually code more stories to add to the training set.

After the improvements are complete, we also need to update the constants at the top of this module to reflect the current accuracy of our date guessing:

https://github.com/berkmancenter/mediacloud/blob/master/lib/MediaWords/TM/Model.pm

Those constants are used to model the accuracy of our list of top media sources for a given timespan of a given topic.

You should also add the validation methods, evaluation set, and results to docs/validate when you are done or as you go.

hroberts commented 6 years ago

To generate a random set of spidered stories for a given topic, you can use this query:

select s.*
  from stories s
    join topic_stories ts using ( stories_id )
    join stories_tags_map stm using ( stories_id )
    join tags t using ( tags_id )
    join tag_sets tas using ( tag_sets_id )
  where 
    ts.topics_id = 123456 and
    t.tag = 'spidered' and
    tas.name = 'spidered'
  order by random()
  limit 200

scripts/mediawords_guess_date.pl and scripts/mediawords_fix_date.pl have simple-ish examples of how to query a story from the database, fetch the content for that story, and then call the date guessing code on the story.

pypt commented 6 years ago

Some unit tests for the date guessing are under:

https://github.com/berkmancenter/mediacloud/blob/master/lib/MediaWords/TM/t/GuessDate.t

To run them, do:

./script/run_in_env.sh prove lib/MediaWords/TM/t/GuessDate.t

or:

$ ./script/run_in_env.sh bash
mediacloud$ prove lib/MediaWords/TM/t/GuessDate.t 
ColCarroll commented 6 years ago

Just leaving a note for reference -- tried two approaches with varying success. The first one is pretty similar to what is currently running, the second is a little funnier. A notebook is here if you'd like to look at it/try it out. I'm going to grab a bigger training set and try to make a PR tomorrow if the performance is reasonable.

ColCarroll commented 6 years ago

Funny issue while scoring this: see, for example, this nytimes link, where:

There are lots of examples like this -- looks like the current code would use the url slug, but it seems like the meta property is "more correct". Wanted to make sure that wasn't intentional!

ColCarroll commented 6 years ago

This one is even funnier, since it redirects from a slug containing May 8, 2014, to a slug containing Nov. 11, 2014, and lists Nov. 10, 2014 as the publish date.

ColCarroll commented 6 years ago

Surprisingly (to me!), the slug is probably the "most correct" version. A link to the previous article is on archive.org, showing the correct publish date.
You can check out the redirects, and see that you get bounced to the newer version with a 301: Moved Permanently status code. (see also, https://www.nytimes.com/content/help/rights/linking/linking.html#linkingq08)

hroberts commented 6 years ago

In our validation of previous date guessing methods, the url slug was the most reliable in the sense that it virtually always represents the date of the story itself. Every date that we find within the html itself is at least sometimes representative of something other than the publication date of the actual story. So we privilege the slug over everything else because we at least know that it is close. Frankly, a day or two of error is not horrible. We are much more worried about the three or six month errors.

We haven't done any serious validation of this stuff in years, though, so I'm completely open to new approaches.

hroberts commented 6 years ago

Thanks for the report and good progress so far. Apologies again for your lack of good working data. I am running some (slow) queries now to generate good sample data for you to work with.

I think that writing date scrapers for individual sources is a non-starter. We just have too many different sources to maintain a list of working scraping patterns. Even worse, we have a lot of similar problems to solve (extracting content, discovering urls, etc) that would in theory work as custom scrapers but in practice would require all of our time just maintaining these custom scraping patterns. For this reason, we have a strong principle as a project not to do custom site scraping if we can at all avoid it, which is almost always.

Looking at the other approaches in your notebook, we have a pretty extensive date text parsing routine in _timestamp_from_html(). The obvious missing thing from that list is international month names, and I'm sure it could be improved somewhat, but I think we covered the low hanging fruit. The second to last stopgap heuristic is just to use that broad date parsing routine to find any date in the text. That method is surprisingly helpful, but ideally we don't want to rely on it for obvious reasons.
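A minimal sketch of that kind of "find any date in the text" fallback (names, regex, and month table here are illustrative assumptions, not _timestamp_from_html() itself; a real version would also cover international month names, as noted above):

```python
import re
from datetime import datetime

# English month names only -- the real routine would want international
# month names as well, which the thread notes are currently missing.
MONTHS = {m.lower(): i + 1 for i, m in enumerate(
    ['January', 'February', 'March', 'April', 'May', 'June', 'July',
     'August', 'September', 'October', 'November', 'December'])}

# Matches dates like "October 17, 2017" anywhere in the text.
TEXT_DATE_RE = re.compile(r'\b([A-Za-z]+)\.?\s+(\d{1,2}),?\s+(20\d{2})\b')

def first_date_in_text(text):
    """Return the first 'Month DD, YYYY' style date found, or None."""
    for month_name, day, year in TEXT_DATE_RE.findall(text):
        month = MONTHS.get(month_name.lower())
        if month:
            try:
                return datetime(int(year), month, int(day))
            except ValueError:
                continue
    return None
```

As the comment above says, this kind of broad matching is surprisingly helpful as a last resort but easy to fool, which is why it sits near the bottom of the heuristic ordering.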

Also note that the data you will get from the query (running now!) to generate random spidered stories from our topics will be more difficult than your examples in the notebook, because the stories we discover via spidering (as opposed to crawling rss feeds) are less likely to be mainstream news outlets (or even in a syndicated news format at all).

ColCarroll commented 6 years ago

Just to update, this has been moved to a branch with more production-like code and some tests. Benchmarking against some labelled data now.

hroberts commented 6 years ago

just scanned through the code, and it looks good so far. eager to see what the validation results look like.

also note that it's always a good idea to poke around to see if someone else has written code we can use. I found this, which seems like it might be useful:

https://github.com/Webhose/article-date-extractor

it's unlikely we'd be able to use this code instead of our own, but we could potentially mine it for ideas or even just plug it in to use as a stopgap if our code can't find anything.

ColCarroll commented 6 years ago

Cool library! I included a few attributions where I grabbed some regexes or patterns from https://github.com/codelucas/newspaper, which seems like a pretty popular library for this sort of thing (article-date-extractor also uses some of those strategies). I would say our extractor is slightly ahead of theirs for most metrics, which is exciting!

hroberts commented 6 years ago

In my experience, external libraries do less well than advertised with our data because we collect a much wider range of types of data. A lot of the newspaper type libraries focus on big, typically newsy sites like the nyt, but our code has to handle a broader array of sites, many of which are goofier and less structured. But that doesn't mean that we can't use and learn from the other libraries. And sometimes they do work better than our own stuff (as with the python-readability library that we replaced our hand rolled extraction code with a couple of years ago).


ColCarroll commented 6 years ago

On 100 randomly sampled articles you sent, here are the scores of the code on the branch compared with the date labels you provided. The keys in the dictionary are the number of days off it can be: a label of January 15 for an article from January 10 would be correct for "7" and "15", but wrong for "1". If the article should not have been labelled (landing pages, tag pages, ...), then it would only be correct if it was marked None. Happy to share the training data. It has a surprising number of cookie recipes!

{
  "1": {
    "new": 76,
    "current": 53
  },
  "7": {
    "new": 79,
    "current": 57
  },
  "15": {
    "new": 83,
    "current": 57
  }
}
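The multi-tolerance metric described above could be computed along these lines (a sketch with illustrative names, not the code from the branch):

```python
from datetime import date

# A guess counts as correct at tolerance N days if it is within N days
# of the label; undateable/unlabelled pairs never count as correct here.
def accuracy_at_tolerances(pairs, tolerances=(1, 7, 15)):
    """pairs: list of (true_date, guessed_date); returns {tolerance: count}."""
    scores = {}
    for tol in tolerances:
        scores[tol] = sum(
            1 for true, guess in pairs
            if true is not None and guess is not None
            and abs((true - guess).days) <= tol)
    return scores
```

So a label of January 15 against a guess of January 10 scores under the 7- and 15-day tolerances but not the 1-day one, matching the example in the comment.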
hroberts commented 6 years ago

Hi Colin,

This is great progress! The old date guessing stuff has indeed aged poorly, and your improvements look to have at least gotten us back to about the point we were initially. The new code is also cleaner and more pleasant to edit, which should help us keep it more up to date.

Some comments after scanning through your code:

ColCarroll commented 6 years ago

Did you ditch the 'use the first date in the text' method altogether?

It depends on the false positive rate we're looking for -- the current method looks pretty good at not labelling articles that it should not label. More on this below.

It looks to me like you are scanning through all of the methods and choosing the one that provides the most specific date... Ignoring the accuracy preference, it looks to me like the code actually reverses the order of preference for the methods.

I think you are misreading the code (which is more the code's fault than anything else). There's a super stateful use of guess_date.DateGuesser._choose_better_guess, which implicitly "trusts" the current guess more than the new guess. In summary, the logic there says:

Specifically, the image url is the least trustworthy method we use. There could also be a lot of speedups by short-circuiting the search once enough accuracy is found.
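One plausible reading of that "choose better guess" behavior, sketched here as an assumption (the Accuracy values, class shape, and function signature are illustrative, not taken from the branch): prefer the more specific guess, and on ties keep the current one, which is what makes the current guess implicitly more trusted.

```python
from enum import IntEnum

class Accuracy(IntEnum):
    NONE = 0
    PARTIAL = 1   # e.g. year or year/month only
    DATE = 2      # full calendar date
    DATETIME = 3  # calendar date plus time of day

def choose_better_guess(current, new):
    """Each guess is an (accuracy, value) tuple; ties keep `current`."""
    cur_acc, _ = current
    new_acc, _ = new
    return new if new_acc > cur_acc else current
```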

with apologies for shoehorning in extra requirements, we have been storing the date guess method used for a given story as a tag

Should be easy enough, and sounds super helpful!

in addition to the date itself, we need to detect whether a given page is altogether undateable.

I'm including those in the validation sets as having date None, and using the metric (in pseudocode):

def score(page):
    true_date = get_true_date(page)
    guessed_date = guess_date(page)
    if true_date is None and guessed_date is None:
        return 1
    elif true_date is None or guessed_date is None:
        return 0
    else:
        return int(abs((true_date - guessed_date).days) < 2)

We can express our preferences for catching "undateable" pages by adjusting the scores in the first two branches in that if statement (or using a different metric altogether).

I would love to validate these dates against 100 stories from each of the topic dumps.

I can do that! Should be another ~3 hours to get everything labelled -- the gzipped html files were missing most of topic 1404, and had only 59 of those, but had 1,000 of the four other topics.

ColCarroll commented 6 years ago

Just an update -- finished labelling the extra 300 articles, and was surprised to see super different accuracy results. All the scores are out of 100. Let me know if you want to tweak the metric at all, and how valuable it would be to improve the date guessing for these particular corpora.

In the meantime, I am adding the date guess method, and will update the branch.

[screenshot: accuracy scores by topic, 2017-10-26]
hroberts commented 6 years ago

I am not surprised that the accuracy is quite different across topics. I am surprised that the india topic (1630) was as high as it is, and I am surprised by the low accuracy of the vaccine topic (1542). My guess is that india topic has more msm sources because the link economy there is much less rich, and that the vaccine topic is lower because it has a very high proportion of primary science articles.

The algorithm you describe for accuracy vs. trustworthiness seems good to me.

I think I am still not describing the concept of undateability well. The assignment of the 'undateable' state to a story has nothing directly to do with whether or not we can detect a date within the story. It has to do with whether we think the content has a single publication date after which the content is changed only minimally. A nyt story is dateable because after its publication date, there should be minimal edits to the story. The same is true of basically all of the stories that we collect through RSS and of a majority of traditional news stories we capture by spidering.

However, when we spider, we find a whole lot of web pages that are not anything like a traditional news story. Some of these are static pages, so if we can find a publication date it is fine to assign that as the publication date. But others of these pages inherently change over time, so there is no single date that can be assigned to the content on the page. The most obvious example is a wikipedia page. There is no one version of a given wikipedia page because the content is designed to change over time.

These undateable pages cause problems for the way that we analyze our data by date ranges, because we define the set of stories within a timespan to be any story published within that date range or any story linked to by a story published within that date range. Before we started marking some pages as undateable, we would commonly have stories from the future inserted into a given timespan because links had been added to an undateable web page after its publication date. This was not a theoretical problem -- it nearly caused us to publish wrong findings early in our topic spidering experiments.

So the solution is to have our date guessing module first guess whether a story falls into this 'undateable' category and if so tag it as undateable. Our story querying code in our api and various interfaces then shows that story as undateable, even if it has a publication_date assigned to it, and we only include undateable stories in timespans in which they were linked to by dateable stories with a publication date within the timespan date range (so for a given weekly timespan, we include a given wikipedia page if it is linked to by a dateable story within the timespan week).

The function that currently tries to guess undateability is here:

https://github.com/berkmancenter/mediacloud/blob/master/lib/MediaWords/TM/GuessDate.pm#L846

Basically, it looks for a few different specific websites (like wikipedia.org), or specific terms in the url path (/search, /tag), or the lack of any digits at all in the url (which I argued would not work but improved the accuracy of this guess when we tested it). As with the date guessing, I have anecdotal evidence that this code has aged poorly. I have made some small specific fixes to it over time, but I think we need to take a fresh look at it as you have with the date guessing stuff.
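The heuristics described above (known inherently-undateable sites, tag/search paths, and the no-digits-in-url rule) could be sketched roughly like this; the specific host and path lists here are illustrative, not the module's own:

```python
import re
from urllib.parse import urlparse

# Illustrative examples of the three heuristic families described above.
UNDATEABLE_HOSTS = ('wikipedia.org',)
UNDATEABLE_PATH_PARTS = ('/search', '/tag')

def looks_undateable(url):
    """Guess whether a URL points at inherently undateable content."""
    parsed = urlparse(url)
    if any(parsed.netloc.endswith(h) for h in UNDATEABLE_HOSTS):
        return True
    if any(p in parsed.path for p in UNDATEABLE_PATH_PARTS):
        return True
    if not re.search(r'\d', parsed.path):  # no digits anywhere in the path
        return True
    return False
```

Note the no-digits rule fires on most site front pages and "about" pages, which may be why it helped accuracy despite sounding crude.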

The converse of all this is true as well. If a story is formally 'dateable', we have just done our best to assign a date to it even if we have to fall back to potentially wildly wrong methods like looking for any date in the content or just using the date of the story that linked to the story. That's why we have those dating methods as fallbacks in the current system. I would at least like to see how the 'use any date in the content' method impacts the accuracy of your improved dating module so that we can make an informed decision about it.

It is often the case that we want to mark a story as 'undateable' even if we are able to detect a date for it. It is always the case that we want to make any reasonable guess we can for a given story if that story is considered to be 'dateable'.

-hal


hroberts commented 6 years ago

I have added a 'hal' column to the 1542 dating spreadsheet you sent me with notes about stories that I think you have coded wrong. The key thing is that dateability is an ideal category that is determined by the nature of the content and not our ability to find or not find a date.

ColCarroll commented 6 years ago

Ah, this is helpful, as is the "undateable" filter -- I'd lean towards thinking of it not as classifying undateable/not, but filtering sites that it should obviously not try to date.

ColCarroll commented 6 years ago

Just pushed an update that includes an "undateability" pass. I did not include a bunch of the tags from the current check, since a quick grep on the 4000 files I have locally shows lots of false negatives for, for example, "/archive/".

Also added logic to process twitter statuses, and that increased scores ~2-4 points in each topic.

ColCarroll commented 6 years ago

@hroberts any further comments on this? I think it is ready to merge and start wiring in.

hroberts commented 6 years ago

apologies for the wait. I have merged this pull request. good work!

next step is that we have to integrate this into the perl code base to operate with the working code.

pypt commented 6 years ago

next step is that we have to integrate this into the perl code base to operate with the working code.

You can create a new GitHub task and assign it to me. Colin is surely very much capable of figuring out the Perl-Python integration, but I have already been through this nasty trial-and-error process.

hroberts commented 6 years ago

marking this done. #218 addresses integration.