mediacloud / backend

Media Cloud is an open source, open data platform that allows researchers to answer quantitative questions about the content of online media.
http://www.mediacloud.org
GNU Affero General Public License v3.0

Differentiate news article URLs from the rest in a sitemap #605

Open pypt opened 5 years ago

pypt commented 5 years ago

As a follow-up to #600, it would be beneficial for us to work out some code that categorizes a list of sitemap URLs:

http://www.example.com/
http://www.example.com/about.html
http://www.example.com/contact.html
http://www.example.com/category/apples/
http://www.example.com/category/flowers/
http://www.example.com/2019/01/01/article-1.html
http://www.example.com/2019/01/02/article-2.html
http://www.example.com/2019/01/03/article-3.html
http://www.example.com/2019/01/04/article-4.html
http://www.example.com/2019/01/05/article-5.html
http://www.example.com/2019/01/06/article-6.html
http://www.example.com/2019/01/07/article-7.html

into lists of URLs that point to news articles and the ones that don't:

# Not news article URLs:
http://www.example.com/
http://www.example.com/about.html
http://www.example.com/contact.html
http://www.example.com/category/apples/
http://www.example.com/category/flowers/

# News article URLs:
http://www.example.com/2019/01/01/article-1.html
http://www.example.com/2019/01/02/article-2.html
http://www.example.com/2019/01/03/article-3.html
http://www.example.com/2019/01/04/article-4.html
http://www.example.com/2019/01/05/article-5.html
http://www.example.com/2019/01/06/article-6.html
http://www.example.com/2019/01/07/article-7.html
hroberts commented 5 years ago

My first approach would be just to try this heuristically. I'm sure it will end up being harder than your examples above, but even if we just eliminate 50% of the non-news urls that would be helpful. Do you have an idea of what the volume of the non-news urls is? If it is 1%, we should just make a quick best effort and then ignore the problem. If it is 50%, we have to think a lot harder about it.

If simple heuristics won't work or you want to try something fancier, I would try a supervised learning approach. Pick out a few features (presence of at least one number, length of the url in characters, length of the url in path elements, number of dashes or spaces, presence of certain words like 'category', 'search', or 'tag', etc), code a bunch of urls, run the machine, and see how accurate you can get. If possible, optimize for recall over precision (it's much worse to falsely omit a news url than to falsely include a non-news url). This is a much better approach than the unsupervised one because once we find a model that works, it is easy and quick to apply it to each url that we discover. If we try to do unsupervised clustering instead, we have to batch all of the urls up, cluster them, and then decide which ones are news, which will be a pain to implement within the daily pipeline.

I like to use decision trees for this sort of thing because you can actually look at the resulting tree and verify for yourself that it is not over-relying on a single feature that might not be robust going forward.
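
A minimal sketch of that kind of setup, using the example features above and a scikit-learn decision tree (the features, sample data and parameters here are illustrative assumptions, not actual Media Cloud code):

import re
from urllib.parse import urlparse

from sklearn.tree import DecisionTreeClassifier, export_text


def url_features(url: str) -> list:
    path = urlparse(url).path
    parts = [p for p in path.split('/') if p]
    return [
        int(bool(re.search(r'\d', path))),                            # at least one number in the path
        len(url),                                                     # length of the URL in characters
        len(parts),                                                   # length of the URL in path elements
        url.count('-'),                                               # number of dashes
        int(any(w in path for w in ('category', 'search', 'tag'))),   # presence of certain words
    ]


# Hypothetical manually coded sample: (url, is_news_article)
coded = [
    ('http://www.example.com/2019/01/01/article-1.html', 1),
    ('http://www.example.com/category/apples/', 0),
    # ...a few thousand more coded URLs...
]

X = [url_features(u) for u, _ in coded]
y = [label for _, label in coded]

tree = DecisionTreeClassifier(max_depth=5)
tree.fit(X, y)

# The point of a decision tree: the fitted rules can be read and sanity-checked.
print(export_text(tree))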

-hal

pypt commented 5 years ago

I've tried applying some unsupervised and supervised ML ideas on labeling sitemap URLs.

First off, a disclaimer:

(image)

meaning that this was my first ever attempt at ML, and my knowledge of clustering and such is limited to tutorials found on Medium.com, so my implementations of those attempts might be way off :)

URL feature extractor

It is easy to distinguish URLs that point to news articles just by looking at a list of all sitemap-derived URLs, so it's obvious that news article URLs have a distinct structure that sets them apart from the rest. I tried to capture properties of that structure by parsing every URL and extracting features such as:

  • Whether or not the URL has a path part that looks like a date, e.g. /2017/01/01/
  • Whether or not the URL's path ends with a number (decimal or hexadecimal), e.g. /some-article-17392
  • Whether or not the URL's path ends with a .htm[l] extension
  • Whether or not the URL's query string has a numeric (decimal or hexadecimal) value, e.g. /article?id=12345
  • Length of the longest URL path part, e.g. len(this-is-a-news-article) for https://example.com/a/b/this-is-a-news-article/
  • Length of the URL path
  • ...and so on
The full list of features is in the URLFeatureExtractor class.
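
A rough sketch of what an extractor for the features listed above might look like (illustrative only - the actual implementation is the URLFeatureExtractor class, and the names and regexes here are assumptions):

import re
from urllib.parse import parse_qs, urlparse

def sketch_url_features(url: str) -> dict:
    parsed = urlparse(url)
    path_parts = [p for p in parsed.path.split('/') if p]
    query_values = [v for values in parse_qs(parsed.query).values() for v in values]

    return {
        # Path part that looks like a date, e.g. /2017/01/01/
        'has_date_like_path': int(bool(re.search(r'/\d{4}/\d{1,2}/\d{1,2}(/|$)', parsed.path))),
        # Path ends with a decimal or hexadecimal number, e.g. /some-article-17392
        'path_ends_with_number': int(bool(re.search(r'[0-9a-f]+$', parsed.path, re.IGNORECASE))),
        # Path ends with a .htm[l] extension
        'path_ends_with_html': int(bool(re.search(r'\.html?$', parsed.path, re.IGNORECASE))),
        # Query string has a numeric (decimal or hexadecimal) value, e.g. /article?id=12345
        'query_has_number': int(any(re.fullmatch(r'[0-9a-f]+', v, re.IGNORECASE) for v in query_values)),
        # Length of the longest URL path part
        'longest_path_part': max((len(p) for p in path_parts), default=0),
        # Length of the URL path
        'path_length': len(parsed.path),
    }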

Dataset

I've used the sitemap-derived URLs from a Colombian online news media dataset that I've collected as part of #600. The dataset consists of 6,362,383 URLs in total. As part of a previous task, I've manually tagged all of those URLs using this manually coded function.

Unsupervised: K-means clustering

News article URLs not only have a different structure when compared to other URLs, but more often than not there are just way more of them in a full list of sitemap-derived URLs (i.e. news articles make up the majority of a news website's pages). So my initial idea was to cluster all URLs into 2-3 clusters, take the biggest cluster (made up of news articles), identify the URL pattern (something like a regex) that most of the URLs in the biggest cluster share, and apply that pattern to the full list of URLs.

It seemed to me that such an algorithm (if it worked) would be more accurate because it could use website-specific context to make its decisions. For example, if website A's news article URLs end with .html while its category pages don't, and it's the other way around for website B (as observed in the Colombian news website dataset), an unsupervised clusterer could make decisions within a given website's list of URLs and ignore what other websites are doing, while a supervised neural net would probably have to discard the "path ends with .html" signal altogether as noisy.

Too bad it didn't quite work out (or at least I didn't figure out how to make it work yet). The clusters generated by the K-means clusterer script look kinda right, i.e. the biggest cluster is made up of mostly news article URLs, and the smaller one has the rest (categories, attachments, etc.), but I didn't work out how to extract a pattern from the bigger cluster to apply to all URLs.
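
For reference, roughly what that per-site clustering idea looks like as a sketch (reusing the sketch_url_features() helper sketched earlier; the cluster count, scaling, and the "biggest cluster = news" assumption are illustrative, not the actual clusterer script):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

def split_biggest_cluster(site_urls: list) -> tuple:
    """Cluster one site's sitemap URLs into two groups and return
    (urls_in_biggest_cluster, remaining_urls)."""
    features = np.array([list(sketch_url_features(u).values()) for u in site_urls])
    features = StandardScaler().fit_transform(features)

    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)

    # Assume the bigger cluster is the news articles; the pattern extraction step
    # (turning that cluster into a regex) is the part that never quite worked out.
    biggest = np.bincount(labels).argmax()
    news = [u for u, label in zip(site_urls, labels) if label == biggest]
    rest = [u for u, label in zip(site_urls, labels) if label != biggest]
    return news, rest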

This is a much better approach than the unsupervised one because once we find a model that works, it is easy and quick to apply it to each url that we discover. If we try to do unsupervised clustering instead, we have to batch all of the urls up, cluster them, and then decide which ones are news, which will be a pain to implement within the daily pipeline.

When fetching sitemaps, we see the full sitemap-derived list of URLs in all cases anyway, so IMHO it wouldn't be too hard to implement it in a pipeline (given that it worked).

Supervised: neural network

After randomizing the order of the 6.3M-URL dataset and splitting it into an 80% training set and a 20% evaluation set, I've adapted some neural network code that I found on the web into a training script and built a model. The model appears to be quite accurate (99.7%) when run against the evaluation set made up of Colombian news media URLs, but less so with URL structures that it hasn't seen before. Output of the testing script:

#
# Model seems to be pretty good at identifying true positives:
#

* 1.00 == https://www.nytimes.com/2019/08/08/climate/climate-change-food-supply.html
* 1.00 == https://www.delfi.lt/news/daily/lithuania/sunu-i-ligonine-isgabenes-pogrebnojus-kaltina-simasiu-mano-vaikas-verkia-o-jie-politikuoja.d?id=81942177
* 1.00 == https://www.15min.lt/naujiena/aktualu/lietuva/astravo-atomineje-elektrineje-ivykus-rimtai-avarijai-vilnieciu-nebutu-kur-evakuoti-56-1185646
* 1.00 == https://globalvoices.org/2019/08/07/two-universities-sign-historic-agreement-on-slavery-reparations-in-the-caribbean/
* 1.00 == https://www.kdnuggets.com/2016/10/machine-learning-detect-malicious-urls.html
* 1.00 == https://www.facebook.com/zuck/posts/10108280403736331
* 1.00 == https://stackoverflow.com/questions/45310254/fixed-digits-after-decimal-with-f-strings
* 0.94 == https://www.bbc.com/news/world-asia-china-49317975
* 1.00 == https://www.huffpost.com/entry/acting-dhs-chief-concedes-timing-unfortunate-mississippi-ice-raids_n_5d503738e4b0820e0af6d6ab
* 1.00 == https://www.foxnews.com/auto/jeffrey-epstein-former-cellmate-apparent-suicide-attempt
* 0.97 == https://www.foxnews.com/media/officer-fox-friends-burger-king-worker-drew-pig
* 1.00 == https://www.washingtonpost.com/national/angry-and-fearful-americans-struggle-to-talk-about-guns-and-race/2019/08/11/d040c678-bad2-11e9-b3b4-2bb69e8c4e39_story.html
* 1.00 == https://www.wsj.com/articles/wealth-of-jeffrey-epsteins-brother-is-also-a-mystery-11565607148

#
# ...but it's a hit and miss with non-news article URLs (unless we set the threshold at 0.95 or so):
#

* 0.00 == https://www.nytimes.com/
* 0.98 == https://www.nytimes.com/section/business
* 0.34 == https://www.nytimes.com/newsletters
* 0.00 == https://www.delfi.lt/
* 0.94 == https://www.delfi.lt/krepsinis/turnyrai/europos-taure/
* 0.96 == https://www.15min.lt/naujienos/aktualu/pasaulis
* 0.00 == https://globalvoices.org/
* 0.23 == https://globalvoices.org/-/world/western-europe/
* 0.07 == https://globalvoices.org/-/world/western-europe,eastern-central-europe/
* 0.87 == https://facebook.com/globalvoicesonline/
* 0.90 == https://en.support.wordpress.com/posts/categories/
* 0.07 == http://example.com/tag/news/
* 0.23 == https://disqus.com/by/hussainahmedtariq/
* 0.09 == https://www.facebook.com/zuck
* 0.34 == https://stackoverflow.com/questions/tagged/python-3.x
* 0.00 == https://www.bbc.com/news/world/asia/china
* 0.00 == https://www.huffpost.com/
* 0.00 == https://www.huffpost.com/?guce_referrer=aHR0cDovL3d3dy5lYml6bWJhLmNvbS9hcnRpY2xlcy9uZXdzLXdlYnNpdGVz
* 0.00 == https://www.foxnews.com/
* 0.42 == https://www.foxnews.com/entertainment
* 0.97 == https://www.foxnews.com/category/person/jeffrey-epstein
* 0.92 == https://www.washingtonpost.com/national/investigations/
* 1.00 == https://www.washingtonpost.com/national/investigations/?nid=top_nav_investigations
* 0.99 == https://www.wsj.com/news/types/television-review
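
For context, a minimal sketch of that kind of training setup - shuffle, 80/20 split, a small Keras Sequential model, a few epochs. The layer sizes, batch size, and placeholder data are assumptions, not the actual training script; it reuses the sketch_url_features() helper from earlier:

import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split

# Placeholder for the manually coded dataset (URLs plus 0/1 labels):
urls = ['http://www.example.com/2019/01/01/article-1.html',
        'http://www.example.com/category/apples/'] * 100
labels = [1, 0] * 100

features = np.array([list(sketch_url_features(u).values()) for u in urls], dtype='float32')

X_train, X_eval, y_train, y_eval = train_test_split(
    features, np.array(labels), test_size=0.2, shuffle=True, random_state=42)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation='relu', input_shape=(features.shape[1],)),
    tf.keras.layers.Dense(16, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid'),  # probability of "news article"
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(X_train, y_train, epochs=4, batch_size=1024, validation_data=(X_eval, y_eval))

# Predicting for a new URL is then a single forward pass over its feature vector:
# model.predict(np.array([list(sketch_url_features(url).values())], dtype='float32'))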

I'm still optimistic about this approach, but I think we just need to feed it a bigger and more representative dataset, i.e. one that consists of not only Colombian news media. Capturing a couple more URL properties (e.g. whether or not the URL includes /category/ or /tag/) could help too, although it would be nice to have a model that doesn't depend on English-language keywords.

Notes and questions

My first approach would be just to try this heuristically. I'm sure it will end up being harder than your examples above, but even if we just eliminate 50% of the non-news urls that would be helpful. Do you have an idea of what the volume of the non-news urls is? If it is 1%, we should just make a quick best effort and then ignore the problem. If it is 50%, we have to think a lot harder about it.

As detailed in https://github.com/berkmancenter/mediacloud/issues/600#issuecomment-515518709 and the spreadsheet, a typical news website's sitemap tree consists of 89% or so of news article URLs anyway, so statistically we can afford to just add each and every URL to our database and be done with it. However, 89% is only an average, and some of the top Colombian news websites have their news-article-to-cruft ratio as low as 67% (or even 30%), so we'd end up collecting 89% of "clean" data across the collection but 50% of "dirty" URLs for news websites that actually matter. In other words, it wouldn't be great if a specific system worked in 99% of the cases for the US media but just so happened to fail for NYTimes.com.

Problems with simple heuristics that I see are:

I like to use decision trees for this sort of thing because you can actually look at the resulting tree and verify for yourself that it is not over-relying on a single feature that might not be robust going forward.

Thanks, I'll have a look!

Do you think there's a point in researching various neural network properties before training one? For example, in my training script I create a sequential model, add three layers with two different activations, and train for 4 epochs, and even though it all seems to kinda work, I have no idea what it all means - so maybe there's a way to improve predictions by trying out various other layers, activations and whatnot?
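
A sketch of that kind of experiment: try a few different layer/activation configurations and compare them on the held-out evaluation set. The configurations are made up, and X_train/y_train/X_eval/y_eval are assumed to come from the training sketch above:

import tensorflow as tf

configs = [
    [(32, 'relu'), (16, 'relu')],
    [(64, 'tanh'), (32, 'tanh'), (16, 'relu')],
    [(128, 'relu')],
]

for layers in configs:
    model = tf.keras.Sequential()
    for units, activation in layers:
        model.add(tf.keras.layers.Dense(units, activation=activation))
    model.add(tf.keras.layers.Dense(1, activation='sigmoid'))

    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    model.fit(X_train, y_train, epochs=4, batch_size=1024, verbose=0)

    loss, accuracy = model.evaluate(X_eval, y_eval, verbose=0)
    print(layers, '-> eval accuracy: %.4f' % accuracy)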

hroberts commented 5 years ago

This is good progress, but I think the approach is odd. If we want to make sure that the heuristic approach is valid enough to provide a training and evaluation set, we need to validate it by manually coding some random sample of its results. If we don't find the manual heuristics to be accurate, we cannot use them as the training and evaluation set. If we do find them to be accurate, we just use them instead of the ml approach.

I get that you are trying to train the machine on a subset of media sources and then extrapolate to the larger set. But if you want to do that, you'll need to manually code a random sample of the results from the larger set to figure out the precision and recall of the results. But if you are going to do that, you might as well use that manually coded data from the larger set as the training and evaluation set rather than using the heuristically derived results.

It should be pretty quick to manually code a large set of stories for the training and evaluation set, because for the vast majority of urls you can tell at a glance whether it is a story. Once you have a few thousand urls coded that way, you can just use it as your training and evaluation set to test specific changes in the ml setup -- both adding and removing features and experimenting with different ml engines and configurations for those engines. Once you have a script that runs the data against a given engine and spits out the precision and recall, it will be quick, fun, and easy to experiment.

Looking at the variance between media sources is a great insight. You should include some stats about the variance of accuracy across different media sources in your testing script so that you can judge those results as well.
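
A sketch of the kind of evaluation script being suggested - precision and recall overall and per media source (the 'source', 'is_news' and 'predicted' DataFrame column names are hypothetical):

import pandas as pd
from sklearn.metrics import precision_score, recall_score

def report(df: pd.DataFrame) -> None:
    """Expects one row per URL with columns: source, is_news (manual label), predicted (model output)."""
    print('overall  precision=%.3f  recall=%.3f' % (
        precision_score(df['is_news'], df['predicted']),
        recall_score(df['is_news'], df['predicted'])))

    # A per-source breakdown makes it easy to spot sources where the model fails.
    for source, group in df.groupby('source'):
        print('%-30s precision=%.3f  recall=%.3f  n=%d' % (
            source,
            precision_score(group['is_news'], group['predicted'], zero_division=0),
            recall_score(group['is_news'], group['predicted'], zero_division=0),
            len(group)))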

I don't have a super strong preference for the particular ml engine, other than my preference for decision trees stated above. We have gotten burned by ml in the past when it hyper-focused on one feature that proved not to be robust over time, but that's just a mild preference.

-hal

pypt commented 5 years ago

This is good progress, but I think the approach is odd. If we want to make sure that the heuristic approach is valid enough to provide a training and evaluation set, we need to validate it by manually coding some random sample of its results. If we don't find the manual heuristics to be accurate, we cannot use them as the training and evaluation set. If we do find them to be accurate, we just use them instead of the ml approach.

I think I didn't manage to explain this clearly, my bad! I did manually code a bunch of Colombian news media sources, i.e. I went through ~138 sources, eyeballed their lists of sitemap-derived URLs, and coded every URL as being a news article or some other page. By "heuristics" in this context, I meant that I made my job easier by writing a function that applied a tailored pattern for every individual source so that I wouldn't have to manually mark millions of URLs in Excel.

So basically I do have a manually coded dataset; I'd just like to create a similar manually coded training dataset from a bigger number of more varied sources.

hroberts commented 5 years ago

That doesn't count as a manually coded set. You are assuming that your heuristics are the same as actually manually coding everything. If they are, we may as well just use the heuristics for those sources. If they are not, then they are not a good training / eval set. And in any case, they don't apply to the larger set of non-Colombian sources.

If you repeat the process with the non-Colombian sources, you will end up with the same problem. Either the heuristics are accurate enough to be as good as manual coding, in which case we should just use the heuristics, or they are not, in which case they are not good enough to use for training and eval.

There's no shortcut to manually coding a big random sample of the urls from the global set.

-hal

pypt commented 5 years ago

How is that not a manually coded set if I have manually coded the Colombian training dataset with the help of a bunch of per-source, manually tailored regexes just to speed up the work?

Excerpt from the script:

def url_points_to_news_article(url: str) -> bool:

    # <...>

    # Consider only URLs from rcnradio.com
    elif '//rcnradio.com' in url:

        # By investigating a list of rcnradio.com URLs manually, I figured out that the
        # ones that end with a slash don't point to news articles:
        if url.endswith('/'):
            return False

        # Also, there are some category pages on that specific news website
        # that can be identified by counting the number of path elements,
        # e.g. "http://rcnradio.com/bogota"
        if url.count('/') <= 3:
            return False

    # Rules for publimetro.co specifically
    elif '//publimetro.co' in url:

        # For publimetro.co (and that website alone), the news article pages end
        # with .html and the rest don't
        if not url.endswith('.html'):
            return False

    # <...>

    # If the URL managed to pass through a bunch of per-site rules filtering out
    # the non-news articles, it's a news article by this point
    return True

As you can see, the "code" above is not fit to be used for anything except identifying news articles in a small, specific Colombian list of sources. I've coded it here just to get a training set so that I could train a sample model for identifying news articles among a bigger list of sources. Sure, this dataset might be biased towards how Colombian programmers like to make up their URL structures, so I'd like to manually code a training set from a larger and more representative list of sources.

I now think I misused the term "heuristics" (I just looked it up in a dictionary) and shouldn't have used it at all; that made this thread confusing and misleading - sorry about that!

pushshift commented 5 years ago

If I can chime in here for a moment and offer some suggestions. From what I've read so far, it sounds like the ultimate goal here is to separate links within a site map so that one group is classified as news articles while other links are classified as non-news articles (I just want to make sure I understand what the ultimate goal here is).

That said, every site is probably going to have some specific methods that will more or less globally signify whether a link is a news story or not. In your example for rcnradio.com, you've determined that links ending with a slash are not news stories, and for the site publimetro.co, pages that don't end with .html do not contain news stories (both previous examples have a function returning false).

If the goal is to come up with a method that applies globally to many different (all) sources, these methods appear to be extremely specific to just these two sites. I doubt that we could depend on this logic to work globally for all sources.

In my work analyzing the Python RSS crawler, I've noticed some strategies that might work on a more global set of sources. For instance, links that don't point directly to news stories (for example https://www.nytimes.com/section/business) will often be repeated across many pages within the entire site map.

As a hypothetical, let's assume that nytimes.com has 100,000 pages that are discovered during a full site crawl. There will probably be hundreds or more pages that link to https://www.nytimes.com/section/business within the full site map, whereas there may only be a dozen or fewer pages that reference any particular news story. If you look at the distribution, you might find a steep cut-off between pages that aren't news stories but are referenced repeatedly throughout the site map vs. far fewer pages that reference any particular news story.

This type of analysis would probably hold across many different sources (this is my hypothesis) and you would find something like (letters are not news stories and numbers are news stories):

Non-news pages:
  • Page A: referenced 720 times in the site map
  • Page B: referenced 640 times in the site map
  • Page C: referenced 540 times in the site map

News pages:
  • Page 1: referenced 12 times in the site map
  • Page 2: referenced 8 times in the site map
  • etc...
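
A sketch of how such a reference-count signal could be computed, assuming we already have the full crawl as a hypothetical mapping of page URL -> set of outbound links (the fixed threshold is arbitrary; a real version might look for the steep cut-off in the distribution instead):

from collections import Counter
from typing import Dict, List, Set, Tuple

def split_by_reference_count(crawl: Dict[str, Set[str]],
                             threshold: int = 100) -> Tuple[List[str], List[str]]:
    # Count how many distinct pages reference each URL.
    ref_counts = Counter()
    for page, links in crawl.items():
        for link in links:
            ref_counts[link] += 1

    # Heavily referenced URLs (navigation, sections, categories) are likely not
    # news stories; rarely referenced URLs likely are.
    likely_non_news = [u for u, n in ref_counts.items() if n >= threshold]
    likely_news = [u for u, n in ref_counts.items() if n < threshold]
    return likely_non_news, likely_news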

There may also be some form of content analysis, where pages that are news stories consistently contain certain kinds of content compared to non-news pages.

Also, why couldn't we use Google and pass in URLs to see the earliest references to those pages, or some other heuristic where non-news pages have most likely existed far longer than an actual news story itself? Or look at other metrics like the number of in-bound links to each of the URLs crawled via Google, to see if we can determine whether there is some major difference between non-news links vs. news links?

I think using some external tools would be advantageous: we could set up a score for the pages, and any pages that are indeterminate (i.e. have a nebulous score) could be manually reviewed, provided that process narrowed the pool of candidates down to something manageable for manual review (narrowing one million pages down to a few hundred or a few thousand).

These are just some ideas but I think it's worth exploring some methods that could be applied to all sources. If we want to determine exact methods for every source, that seems like it would be a lot of work but if we could narrow the list of candidates to something manageable, it might make the process easier.

Also, I don't know how we would tolerate false negatives / positives but I do think we could find something that works relatively well across a majority of the sources we currently track.

Just some ideas ...

pypt commented 5 years ago

Thanks Jason, your input is very useful and I appreciate it!

If I can chime in here for a moment and offer some suggestions. From what I've read so far, it sounds like the ultimate goal here is to separate links within a site map so that one group is classified as news articles while other links are classified as non-news articles (I just want to make sure I understand what the ultimate goal here is).

That's absolutely right.

If the goal is to come up with a method that applies globally to many different (all) sources, these methods appear to be extremely specific to just these two sites. I doubt that we could depend on this logic to work globally for all sources.

Obviously so - this is why I'm not even attempting to use such code "globally". Those per-site exceptions are simple helpers to speed up manually tagging a sample dataset that I've collected from Colombian news sources. The purpose of such a tagged dataset is to use it to train a neural net that would later predict whether new incoming URLs are news articles or not.

To put it simply, what I did is tantamount to opening up the full list of 3.8M URLs in Excel and tagging all of them manually. It was easy to spot the specific patterns that individual websites use, so I used that observation to batch-tag a bunch of URLs from a given source.

In my work in analyzing the Python RSS crawler, I've noticed some strategies that might work on a more global set of sources. For instance, links that don't point directly to news stories (for example https://www.nytimes.com/section/business) will often be repeated across many pages within the entire site map.

Ideally, we'd like to learn whether a URL is a news article or not without having to fetch it. It looks like that's doable by making a neural net predict from the URL's structure alone, and it kind of works with a model trained on the manually tagged Colombian sources dataset; we just need to manually tag and then train a model using a more representative set of URLs.

Also, why couldn't we use Google and pass in URLs to see the earliest references to those pages, or some other heuristic where non-news pages have most likely existed far longer than an actual news story itself? Or look at other metrics like the number of in-bound links to each of the URLs crawled via Google, to see if we can determine whether there is some major difference between non-news links vs. news links?

AFAIK, there's no official Google Search API, and I think at least technically we'd be breaking Google's T&C by attempting to parse their SERPs. Again, ideally we should be able to figure out whether a URL is a news article without having to fetch the URL itself or making any kind of an API call - it's very easy for a human being to do (especially if one sees contextual URLs), so supposedly a neural net should be able to figure it out too!