Live / auto updating datasets

interrogator commented 4 years ago

There's no reason why datasets (I'm tired of calling them corpora) need to be historical and unchanging. We could easily create automatically updating datasets containing news / social media discussions, via RSS, plus topic modelling to add some useful metadata.

So I'd like to suggest a service, digests, that grabs a couple of hundred news articles per day, runs them through a topic modeller, adds the topic info as extra metadata, then parses and stores the result as CONLL/feather/parquet. Whenever the corpus is reimported, the new data shows up for the user. (We will eventually need to figure out how to reimport the corpus without restarting the app, too.)
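As a rough sketch of the RSS step only: the function below pulls article URLs out of an RSS 2.0 feed using the standard library. The function name, the sample feed and the 200-article cap are illustrative assumptions, not existing digests code; fetching the feed over HTTP and downloading the articles themselves are left out.

```python
import xml.etree.ElementTree as ET


def extract_article_links(rss_xml: str, limit: int = 200) -> list[str]:
    """Return up to `limit` unique article URLs from an RSS 2.0 feed string."""
    root = ET.fromstring(rss_xml)
    links: list[str] = []
    seen: set[str] = set()
    # RSS 2.0 keeps each entry in an <item> element, with its URL in <link>
    for item in root.iter("item"):
        link = (item.findtext("link") or "").strip()
        if link and link not in seen:
            seen.add(link)
            links.append(link)
        if len(links) >= limit:
            break
    return links


# Hypothetical feed, inlined so the example runs without network access
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0"><channel><title>Example</title>
<item><title>A</title><link>https://example.com/a</link></item>
<item><title>B</title><link>https://example.com/b</link></item>
<item><title>A again</title><link>https://example.com/a</link></item>
</channel></rss>"""

links = extract_article_links(SAMPLE_FEED)
print(links)
```

Deduplicating on URL matters here because the same story tends to reappear across daily runs and across feeds.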

It has a different set of dependencies to buzz(word), so the code can go in here (empty-ish for now, I just wanted to register the name on PyPI):

git clone https://github.com/interrogator/digests
pip install digests

So I think it should be a dockerised Python CLI type thing, outputting data in the plaintext plus XML that buzz likes to .parse(). (I was thinking briefly this could be kind of 'general-use', but I don't think there's much point, it's basically gonna just be a wrapper around 'news-please' + gensim, with output in a unique format...)
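A minimal sketch of what the plaintext-plus-XML output step might look like. The `<metadata ...>` tag appended to each line is my assumption about the layout buzz parses and should be checked against the buzz docs; `render_with_metadata` is a hypothetical helper, not part of buzz or digests.

```python
from xml.sax.saxutils import quoteattr


def render_with_metadata(text: str, metadata: dict) -> str:
    """Append one XML-style metadata tag to every non-empty line of text."""
    # quoteattr escapes &, < and > and picks safe quote characters
    attrs = " ".join(
        f"{key}={quoteattr(str(value))}" for key, value in sorted(metadata.items())
    )
    tag = f"<metadata {attrs}>"
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return "\n".join(f"{line} {tag}" for line in lines)


example = render_with_metadata(
    "First paragraph.\nSecond paragraph.",
    {"source": "www.nytimes.com", "topic": "politics"},
)
print(example)
```

Sorting the keys keeps the output stable between runs, which makes diffs of the generated corpus files easier to read.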

I'd like there to be three datasets produced in this way: one, an ever-expanding news corpus, basically containing everything we scrape. Second is 'News of the day', which is just content from the latest 24 hours. Third would be a single topic (e.g. news about Australia). Notice that the second and third are just subsets of the first, and could even be generated while exploring the first with a single .just filter. But users won't notice that, especially when learning the tool. So in reality it is just one dataset, which we can open up with a cronjob, run the .just filter on ourselves, and save to feather/parquet, easy-peasy:

>>> Corpus("news-parsed").just.topic.australia.save('australia.feather')
# ^ wow buzz can be handy, perhaps there should be a CLI for that kind of operation
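To make the "two subsets of one dataset" idea concrete without depending on buzz, here is a plain-Python sketch. The `Article` class, the `topic` field and both helper functions are assumptions for illustration; in practice the filtering would happen on the parsed corpus as in the one-liner above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Article:
    title: str
    topic: str
    date_publish: datetime


def news_of_the_day(articles, now, window=timedelta(hours=24)):
    """Keep articles published within `window` before `now`."""
    return [a for a in articles if now - window <= a.date_publish <= now]


def by_topic(articles, topic):
    """Keep articles whose topic label matches exactly."""
    return [a for a in articles if a.topic == topic]


now = datetime(2019, 12, 27, 23, 0)
everything = [
    Article("Old politics story", "politics", datetime(2019, 12, 23, 10, 0)),
    Article("Fresh Australia story", "australia", datetime(2019, 12, 27, 9, 30)),
    Article("Fresh sport story", "sport", datetime(2019, 12, 27, 20, 15)),
]
today = news_of_the_day(everything, now)
australia = by_topic(everything, "australia")
print([a.title for a in today], [a.title for a in australia])
```

A cronjob would run both filters over the full dataset and write each result out separately.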

For this dataset, I'd like a two-layer topic model applied: a first-pass 12-topic model to distinguish politics, sport, world news, technology etc., and then a submodel of 12 topics within each topic (in a perfect world, this would distinguish between sports, or between countries for world news).
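The two-layer scheme could be structured roughly like this. The sketch is model-agnostic: `fit` is any callable that takes documents and a topic count and returns one topic id per document (a real one might wrap a gensim model and return each document's dominant topic). `toy_fit` is a meaningless stand-in that exists only so the example runs; every name here is an assumption.

```python
from collections import defaultdict


def two_layer_topics(docs, fit, n_top=12, n_sub=12):
    """Label each document with a (top-level topic, subtopic) pair."""
    top = fit(docs, n_top)
    # group document indices by their top-level topic
    groups = defaultdict(list)
    for index, topic in enumerate(top):
        groups[topic].append(index)
    labels = [None] * len(docs)
    for topic, indices in groups.items():
        # never ask for more subtopics than there are documents in the group
        sub = fit([docs[i] for i in indices], min(n_sub, len(indices)))
        for i, subtopic in zip(indices, sub):
            labels[i] = (topic, subtopic)
    return labels


def toy_fit(docs, n_topics):
    """Stand-in for a real model: bucket by length of the first word."""
    return [len(doc.split()[0]) % n_topics for doc in docs]


docs = [
    "cricket score update",
    "tennis final tonight",
    "election results in",
    "senate vote delayed",
]
labels = two_layer_topics(docs, toy_fit, n_top=2, n_sub=3)
print(labels)
```

Fitting a separate submodel per top-level group means subtopic ids are only meaningful alongside their parent topic, hence the pair.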

From here we could also develop a nice page, buzzword.com/digests, which would be an explorer-like interface to these news corpora, pre-populated with interesting views. Users should be able to click on (e.g.) concordance lines to view the content on the original news site, or see it in a nice iframe.

This would show off a totally different use-case for the tool and its underlying methods. For us, it would also encourage the development of topic modeller integration, as well as ways of exploring data metadata-first, rather than data-first. What I mean by this is: develop ways of viewing the most prominent topics of a given day, rather than text by topic, which has so far been the main focus. News corpora should also link in more clearly to tools like allennlp's reading comprehension question answerer (e.g. 'Who did Trump call today?').
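One simple reading of "metadata-first" is counting topic labels per day before looking at any text. A small sketch, assuming each article has already been reduced to a (day, topic) pair; the function name and the sample records are made up for illustration.

```python
from collections import Counter, defaultdict
from datetime import date


def top_topics_by_day(records, n=3):
    """Map each day to its n most frequent topics, most frequent first."""
    counts = defaultdict(Counter)
    for day, topic in records:
        counts[day][topic] += 1
    return {
        day: [topic for topic, _ in counter.most_common(n)]
        for day, counter in counts.items()
    }


records = [
    (date(2019, 12, 26), "sport"),
    (date(2019, 12, 27), "politics"),
    (date(2019, 12, 27), "politics"),
    (date(2019, 12, 27), "technology"),
]
summary = top_topics_by_day(records, n=1)
print(summary)
```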

interrogator commented 4 years ago

Scraper starting point: https://github.com/fhamborg/news-please

News analysis: https://github.com/fhamborg/Giveme5W1H -- could be good for adding metadata, also worth investigating for buzzword generally

interrogator commented 4 years ago

news-please is going to make this job VERY easy:

from newsplease import NewsPlease
u = "https://www.nytimes.com/2019/12/23/us/politics/elizabeth-warren-oklahoma-native-american.html"
test = NewsPlease.from_url(u)
test.__dict__

Gives us:

{'authors': ['Astead W. Herndon'],
 'date_download': datetime.datetime(2019, 12, 27, 22, 59, 37),
 'date_modify': None,
 'date_publish': datetime.datetime(2019, 12, 23, 10, 0, 29),
 'description': 'It was a personal and political homecoming at a key moment in Ms. Warren’s candidacy. Her monthslong rise in national polling has stalled in recent weeks.',
 'filename': 'https%3A%2F%2Fwww.nytimes.com%2F2019%2F12%2F23%2Fus%2Fpolitics%2Felizabeth-warren-oklahoma-native-american.html.json',
 'image_url': 'https://static01.nyt.com/images/2019/12/22/us/politics/22warren-01/22warren-01-facebookJumbo.jpg',
 'language': 'en',
 'localpath': None,
 'title': 'Elizabeth Warren Returns to Oklahoma, Stressing Working-Class Roots',
 'title_page': None,
 'title_rss': None,
 'source_domain': 'www.nytimes.com',
 'text': 'OKLAHOMA CITY — At the age of 16, an “Okie” known to her friends as Liz Herring and to her family as “Betsy” graduated from Northwest Classen High School, her professional prospects limited by her gender and her politics fairly conservative.\nOn Sunday afternoon, she returned to the high school more than a half-century later as Senator Elizabeth Warren of Massachusetts, a left-wing candidate in the Democratic primary who has set out to shatter one of the highest glass ceilings imaginable: becoming the first woman to be elected president.\n“I spent a lot of hours in this gymnasium,” Ms. Warren said. “I never thought I’d be down here, on the floor, doing something like this. But you know what — you don’t get what you don’t fight for.”\nIt was a personal and political homecoming at a key moment in Ms. Warren’s candidacy. Her monthslong rise in national polling has stalled in recent weeks. In response, Ms. Warren and her campaign team have made some marginal changes: a shorter stump speech in favor of more audience questions, and a greater willingness to have Ms. Warren criticize her Democratic rivals, particularly more centrist opponents such as former Mayor Michael R. Bloomberg of New York, Mayor Pete Buttigieg of South Bend, Ind., and the race’s front-runner, former Vice President Joseph R. Biden Jr.',
 'url': 'https://www.nytimes.com/2019/12/23/us/politics/elizabeth-warren-oklahoma-native-american.html'}
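For storage in a columnar format, a dict like this would probably need flattening first. A hedged sketch: the choice of fields in `KEEP` and the `article_to_row` helper are assumptions, and the input is a hand-copied subset of the output above rather than a live news-please object.

```python
from datetime import datetime

# Hypothetical selection of fields worth keeping as metadata columns
KEEP = ("title", "authors", "date_publish", "language", "source_domain", "url")


def article_to_row(article: dict) -> dict:
    """Flatten an article dict into scalar values suitable for one table row."""
    row = {}
    for key in KEEP:
        value = article.get(key)
        if isinstance(value, datetime):
            value = value.isoformat()  # store dates as ISO 8601 strings
        elif isinstance(value, (list, tuple)):
            value = "; ".join(str(v) for v in value)  # e.g. multiple authors
        row[key] = value
    return row


sample = {
    "authors": ["Astead W. Herndon"],
    "date_publish": datetime(2019, 12, 23, 10, 0, 29),
    "language": "en",
    "source_domain": "www.nytimes.com",
    "title": "Elizabeth Warren Returns to Oklahoma, Stressing Working-Class Roots",
}
row = article_to_row(sample)
print(row)
```

Missing keys come through as None, so every row has the same columns.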

interrogator / buzzword

Live / auto updating datasets #38