mediacloud / backend

Media Cloud is an open source, open data platform that allows researchers to answer quantitative questions about the content of online media.
http://www.mediacloud.org
GNU Affero General Public License v3.0

test and validate new feed scraping code #581

Open hroberts opened 5 years ago

hroberts commented 5 years ago

We are currently running feed scraping code that is a few years old. Feed scraping for us just means discovering the set of rss feeds that cover all syndicated stories for a given site url. So feed scraping for 'http://nytimes.com' should ideally return a set of rss feeds that combined include all of the stories published at nytimes.com.

We are currently running old perl code (https://github.com/berkmancenter/mediacloud/blob/master/lib/MediaWords/Feed/Scrape.pm) to do this work. When we validated it several years ago, it was about as good as a human at discovering feeds. I'm sure it is worse now, as the internet has moved on.

Last year, we worked on a newly implemented python module (https://github.com/mitmedialab/feed_seeker) to replace the crusty old code, and it seemed to mostly work, but we never validated it or hammered on it enough to get it into production.

The approach of both modules is similar. They use some pretty simple heuristics to discover the feed urls: they look for link tags, check likely feed locations, look for any url containing a relevant keyword like 'rss', and also spider one or two levels deep to look for rss feeds in sub-pages (many sites have an 'rss' page that lists all of the rss feeds but no rss feed urls directly on the home page).
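Roughly, the discovery step looks like the following sketch (helper names are mine and illustrative, not the actual api of either module; it assumes requests, beautifulsoup4, and feedparser):

```python
# Minimal sketch of the heuristic feed discovery both modules use (helper
# names are hypothetical; assumes requests, beautifulsoup4, and feedparser).
from urllib.parse import urljoin

import feedparser
import requests
from bs4 import BeautifulSoup

FEED_KEYWORDS = ("rss", "rdf", "atom", "xml", "feed")


def candidate_feed_urls(page_url):
    """Collect candidate feed urls from <link> tags and keyword-matching hrefs."""
    html = requests.get(page_url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    candidates = set()

    # <link rel="alternate" type="application/rss+xml" href="..."> declarations
    for link in soup.find_all("link", rel="alternate"):
        if "xml" in (link.get("type") or "") and link.get("href"):
            candidates.add(urljoin(page_url, link["href"]))

    # any anchor whose url contains a feed-ish keyword
    for a in soup.find_all("a", href=True):
        href = urljoin(page_url, a["href"])
        if any(kw in href.lower() for kw in FEED_KEYWORDS):
            candidates.add(href)

    return candidates


def is_valid_feed(url):
    """Treat a url as a feed if feedparser can pull entries out of it."""
    parsed = feedparser.parse(url)
    return bool(parsed.entries)
```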

The ideal behavior of the feed scraper is to return the minimal set of feeds that we think will include all stories published by the source. The perl module has a notion of 'default feeds', which is just some set of feed locations that it tries and, if it finds a valid feed at one of those locations, assumes that that feed has all of the stories. So for example, if we find valid rss at 'http://nytimes.com/feed', we just assume that feed has all the stories for the source. If we don't find one of those default feeds, we return all rss feeds that we find. Having lots of feeds for a given source is fine if that is what it takes to get all of the stories, but we don't want to have to download a hundred feeds for a given source if we know that all the stories are included in a single feed.
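The 'default feeds' shortcut looks roughly like this (the path list is illustrative rather than the perl module's actual list; it reuses the hypothetical helpers from the sketch above):

```python
# Rough illustration of the 'default feeds' shortcut (path list is illustrative;
# candidate_feed_urls and is_valid_feed are the hypothetical helpers above).
DEFAULT_FEED_PATHS = ("/feed", "/feeds", "/rss", "/rss.xml", "/atom.xml", "/index.xml")


def scrape_feeds(source_url):
    base = source_url.rstrip("/")

    # If a conventional location hosts a valid feed, assume it carries all stories.
    for path in DEFAULT_FEED_PATHS:
        url = base + path
        if is_valid_feed(url):
            return {url}

    # Otherwise fall back to every valid feed the heuristics can find.
    return {u for u in candidate_feed_urls(source_url) if is_valid_feed(u)}
```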

It is sometimes the case that a given url will include rss feeds that link to other sources. This is bad for us, because it results in us adding stories to a source that don't belong to that source (think of a random blog including the nyt rss feed on its site for whatever reason). We guard against this in the perl code by only returning rss feeds whose domain differs from the domain of the source url when no feeds with the same domain are found.
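That guard amounts to something like this sketch (the domain extraction here is deliberately naive; real code needs public-suffix-aware handling):

```python
# Sketch of the same-domain guard: keep cross-domain feeds only when no feed
# shares the source's domain. get_domain is naive (e.g. it mishandles co.uk).
from urllib.parse import urlparse


def get_domain(url):
    host = urlparse(url).hostname or ""
    return ".".join(host.split(".")[-2:])


def filter_foreign_feeds(source_url, feed_urls):
    source_domain = get_domain(source_url)
    same_domain = {u for u in feed_urls if get_domain(u) == source_domain}
    return same_domain if same_domain else feed_urls
```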

We would like to test how well the new python module works by comparing the feeds it returns against the existing feeds in our system (generated by our old perl module and sometimes manually tweaked by a human curator). In cases where the two sets differ, you should manually look at the differences and try to work out whether each set of results is returning all, some, or none of the stories for the given source, and also whether it is returning feeds that include stories not belonging to that source. There is no easy way to figure out with full certainty whether a given set of feeds is actually returning all stories within a source, so just do your best to eyeball the feeds and get a sense for whether they seem to have all the needed stories (if there's a functional feed called 'all stories' that has dozens of stories per day, we can assume it has all of the stories). The idea here is just to do a quick validation of how well the system is working.
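One way to organize the comparison is a simple per-source diff report that you can then eyeball, along these lines (assuming both feed sets are loaded as dicts of source url -> set of feed urls):

```python
# Sketch of a per-source diff report for the manual review; existing_feeds and
# new_feeds are assumed to be dicts mapping source url -> set of feed urls.
def diff_report(existing_feeds, new_feeds):
    for source in sorted(set(existing_feeds) | set(new_feeds)):
        old = existing_feeds.get(source, set())
        new = new_feeds.get(source, set())
        if old == new:
            continue
        print(f"== {source}")
        for url in sorted(old - new):
            print(f"  only in existing system: {url}")
        for url in sorted(new - old):
            print(f"  only in python module:   {url}")
```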

Elapsed time performance is not nearly as important as accuracy. If it takes us an hour to scrape the feeds for each source, that's fine as long as the scraping is accurate.

We should do this validation for up to 50 sources for each of the following media collections:

You should be able to get a random set of media sources from each of the above collections using the media cloud api.
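Something along these lines should work for pulling a sample of sources from a collection (this assumes the api/v2/media/list endpoint and an api key in an MC_API_KEY environment variable; check the api docs for the exact parameters and paging):

```python
# Hedged sketch of pulling the media sources in a collection tag via the api.
import os
import random

import requests

COLLECTION_TAG_ID = 12345  # placeholder: substitute the collection's tags_id


def media_for_collection(tags_id, rows=100):
    resp = requests.get(
        "https://api.mediacloud.org/api/v2/media/list",
        params={"tags_id": tags_id, "rows": rows, "key": os.environ["MC_API_KEY"]},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()


# e.g. sample up to 50 sources from one collection
sources = media_for_collection(COLLECTION_TAG_ID)
sample = random.sample(sources, min(50, len(sources)))
```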

pushshift commented 5 years ago

I'm getting close to getting all of the data organized, but I've run into an issue. For the Germany National Set, I can see the tags information using this:

https://api.mediacloud.org/api/v2/tags/list?rows=100&tag_sets_id=15765102&search=Germany

tag_sets_id 15765102 is for the Geographic Collection, but when I do a search for the country Columbia, I don't see anything returned. It's possible this may be under another tag set and I just haven't found it yet, but I was curious if you knew which tag_sets_id it is under.

Also, for the retweets, I'm searching using:

https://api.mediacloud.org/api/v2/tags/list?rows=100&search=Retweet%20Right

I'm not finding an exact match for "Retweet Partisanship Left" and "Retweet Partisanship Right" among the tags.

hroberts commented 5 years ago

colombia!


pushshift commented 5 years ago

My bad! Internal misspelling. What about the Retweet collections? (Solved)

Current tag ids:

- US Top Online Media: 9139487
- Retweet Partisanship Left / Right: 9360520, 9360521, 9360522, 9360523, 9360524 (Left: 9360520, Right: 9360524 -- from Hal)
- Colombia National Set: 34412358
- Germany National Set: 34412409

Nevermind -- I found what I needed here: https://gist.github.com/rahulbot/6371c35677655305d2988e77f7b3af05

The Retweet collections may or may not be helpful for testing, as they cover older data (the 2016 election), so it's entirely possible those feeds don't even exist anymore. However, I'll find out soon enough.

pushshift commented 5 years ago

feed_seeker appears to have a redirect loop bug. Bug report here: https://github.com/mitmedialab/feed_seeker/issues/4

pushshift commented 5 years ago

Investigating https://github.com/mitmedialab/feed_seeker/issues/6

pushshift commented 5 years ago

There are a few issues:

1) I'm seeing some major discrepancies for sources like nj.com vs. what we have in the current mediacloud database. Compared to nytimes.com, either nj.com has drastically changed their site map / removed links to feeds since the previous crawl, or there were manual additions to the DB for nj.com. To get a clear comparison, I would like to run the Perl script directly -- so if this requires installing mediacloud, I could do that (assuming there aren't any gotchas with installing the entire system), or I could hop on a dev installation and run it.

2) Some sources have been running for days (washingtonpost.com, for example, has been running since Thursday evening). While the Python script has found feeds for washingtonpost.com, at a spider depth of 2 it seems to be crawling quite a lot of pages. I'm adding logging to the script to get a better sense of what's going on there. Also, if the script bombs out for any reason, it has no memory of where it left off and has to start from the very beginning (see the resume sketch at the end of this comment).

3) While waiting to test out a fix to the script, I registered with feedly and created a method to fetch feeds from feedly. From the spot check that I did, it appears feedly has a comprehensive list of RSS feeds for the sources I checked. If we want to improve on the number of RSS feeds we have in our system, I would highly recommend incorporating that at some point as well.

I'd like to run the Perl script myself so that I can get a direct comparison of the two scripts.
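For point 2, the resume support I have in mind is roughly this (file name and the scrape_feeds helper are illustrative):

```python
# Minimal resume sketch: persist results per source so a crash doesn't force
# a restart from scratch (CHECKPOINT and scrape_feeds are illustrative names).
import json
import os

CHECKPOINT = "scraped_feeds.json"


def load_checkpoint():
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return json.load(f)
    return {}


def scrape_with_resume(source_urls):
    done = load_checkpoint()
    for url in source_urls:
        if url in done:
            continue  # already scraped on a previous run
        done[url] = sorted(scrape_feeds(url))
        with open(CHECKPOINT, "w") as f:
            json.dump(done, f, indent=2)  # checkpoint after every source
    return done
```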

hroberts commented 5 years ago

Thanks for the update and good progress.

  1. I'm seeing some major discrepancies for sources like nj.com vs. what we have in the current mediacloud database. Compared to nytimes.com, either nj.com has drastically changed their site map / removed links to feeds since the previous crawl, or there were manual additions to the DB for nj.com. To get a clear comparison, I would like to run the Perl script directly -- so if this requires installing mediacloud, I could do that (assuming there aren't any gotchas with installing the entire system), or I could hop on a dev installation and run it.

I think a good compromise is for me to just get a local repo of mediacloud working for you, so you'll have something to start with instead of wrestling with it from scratch. I just do my development on my own little virtual machine in the cloud. Would it work for you if I set up a remote virtual machine with a functional media cloud on it?

Please remember that the ultimate thing we want to determine is whether we are getting all of the stories we can from each media source. So it's not necessarily better that one version finds more feeds. It's actually better if we can get all of the stories from fewer feeds. Getting all of the feeds is just the fallback when that's what it takes to make sure we're getting all of the stories.

  2. Some sources have been running for days (washingtonpost.com, for example, has been running since Thursday evening). While the Python script has found feeds for washingtonpost.com, at a spider depth of 2 it seems to be crawling quite a lot of pages. I'm adding logging to the script to get a better sense of what's going on there. Also, if the script bombs out for any reason, it has no memory of where it left off and has to start from the very beginning.

The perl version of this only spiders links that match some small set of keywords like rss, rdf, xml, atom, subscribe, etc. I suspect the python module might be just spidering every single page, which would indeed take a Very Long Time.
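In rough python terms, the perl behavior is something like this sketch (keyword list from memory, not the module's exact list; candidate_feed_urls stands in for the feed-detection step sketched earlier):

```python
# Rough sketch of keyword-limited spidering: only follow links whose url or
# anchor text looks feed-related, up to a small depth.
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

SPIDER_KEYWORDS = ("rss", "rdf", "xml", "atom", "feed", "subscribe")


def spider_for_feeds(page_url, depth=2, seen=None):
    seen = set() if seen is None else seen
    if depth < 0 or page_url in seen:
        return set()
    seen.add(page_url)

    feeds = candidate_feed_urls(page_url)

    soup = BeautifulSoup(requests.get(page_url, timeout=30).text, "html.parser")
    for a in soup.find_all("a", href=True):
        href = urljoin(page_url, a["href"])
        text = a.get_text(" ", strip=True).lower()
        # only descend into links that look feed-related
        if any(kw in href.lower() or kw in text for kw in SPIDER_KEYWORDS):
            feeds |= spider_for_feeds(href, depth - 1, seen)
    return feeds
```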

  3. While waiting to test out a fix to the script, I registered with feedly and created a method to fetch feeds from feedly. From the spot check that I did, it appears feedly has a comprehensive list of RSS feeds for the sources I checked. If we want to improve on the number of RSS feeds we have in our system, I would highly recommend incorporating that at some point as well.

Great, I agree! Please add feedly integration to the python module to the task list. I think it would be good to validate the python module without the feedly data first and then with the feedly data, so we can compare.

pushshift commented 5 years ago
  1. Perfect! A remote virtual machine would be great for this. I appreciate that!

  2. Good point. I'm adding logging and some other basic features to the current Python script as well -- which doesn't take long to add but helps debug issues more quickly and will help with issues down the road.

  3. Awesome, thank you! I agree about validating without using Feedly, since we want to compare against what is currently there. But this will help improve feed detection overall down the road. So I'll create methods for this with the ability to run them separately, so that we can compare without using Feedly (a sketch of the feedly fetch is below).
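The feedly fetch is essentially this (based on what I believe is feedly's public feed-search endpoint; the response shape and any auth requirements should be checked against their current docs):

```python
# Hedged sketch of looking up a site's feeds via feedly's feed-search endpoint.
import requests


def feedly_feeds_for_site(site_url, count=100):
    resp = requests.get(
        "https://cloud.feedly.com/v3/search/feeds",
        params={"query": site_url, "count": count},
        timeout=60,
    )
    resp.raise_for_status()
    # feed ids look like "feed/http://example.com/rss"; strip the prefix
    return {r["feedId"].split("feed/", 1)[-1] for r in resp.json().get("results", [])}
```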
