mediacloud / web-search

Code that drives the public web-based tools for the Media Cloud Online News Archive and Directory.
https://search.mediacloud.org
Apache License 2.0

Feed finding packages research #637

Open · Evan-Leon opened this issue 4 months ago

Evan-Leon commented 4 months ago

We have found that the feed_seeker package isn't finding feeds as effectively as we hoped. We would like to investigate different feed finding packages to see what might be best for our needs. Please test 4 different packages and then make a recommendation about which package you think would be best.

NullPxl commented 4 months ago

I've been continuously testing various packages as I review the Canada - National collection. What I'm finding so far is that there are not very many well-written feed finding Python packages out there, and the few that are (including mediacloud's feed_seeker) are all based on feedfinder2, which is itself based on feedfinder. This is not an inherently bad thing, just an observation.

The library that seems to perform the best so far is feedsearch-crawler, an improvement over feedsearch (by the same creator). It has repeatedly found feeds that feed_seeker has missed, and it includes several options I have found helpful. For example, it allows easy integration of what is essentially a wordlist search for common RSS feed paths; MediaCloud could easily build a list of common RSS paths from our own experience and add it to the list of paths to check. Feed_seeker's spider option seems too slow at the moment to be useful (I've let it run for 30 minutes with no results).

That said, feed_seeker does include a useful function to grab feeds from feedly (find_feedly_feeds). This is not in the current PyPI version, but if you clone from GitHub you can use it locally. find_feedly_feeds has often found feeds that feedsearch-crawler has not; however, it is a bit slow, and there does not seem to be much error handling implemented there at the moment.

As I continue to test I will update this issue with results. At the moment, the best path seems to be switching to feedsearch-crawler. Potentially, we could combine feedsearch-crawler with find_feedly_feeds from feed_seeker; however, we would likely need to implement stronger error handling there to account for the fact that feedly is a third-party service.
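
If it helps, here's a minimal sketch of that combination. One assumption to flag: I'm treating find_feedly_feeds as taking the site URL and yielding feed URL strings, since I haven't pinned down its exact signature yet.

```python
import logging

import feedsearch_crawler
from feed_seeker import find_feedly_feeds  # GitHub version only, not on PyPI yet

logger = logging.getLogger(__name__)

def find_all_feeds(homepage):
    """Union of feedsearch-crawler results and feedly's index for a homepage."""
    urls = {str(feed.url) for feed in feedsearch_crawler.search(homepage)}
    try:
        # feedly is a third-party service, so treat any failure as non-fatal
        urls.update(find_feedly_feeds(homepage))
    except Exception:
        logger.warning("feedly lookup failed for %s", homepage, exc_info=True)
    return urls
```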

rahulbot commented 4 months ago

Is feedsearch-crawler consistently finding a superset of what our own feed_seeker finds? If not, then perhaps we could run both and union the results. That idea of using a "common path" list is intriguing. Is that the sort of thing you imagine generating from our existing database of functioning RSS URLs?

NullPxl commented 4 months ago

In what I've looked at so far, feedsearch-crawler has found everything that feed_seeker has found. I will have to run a more concrete test of this to be sure though (or it might be better for me to just read the code).

And yes, I was thinking that if we grabbed the N most common URL paths from our current feeds, we could create a pretty good list of things to check. For example, sometimes a WordPress site won't have the RSS feed linked in the source, but it is present if you go to /feed.
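
A rough sketch of that computation, assuming we can export the known-good feed URLs as a flat text file (feed_urls.txt here is a hypothetical dump, one URL per line):

```python
from collections import Counter
from urllib.parse import urlsplit

def top_feed_paths(urls, n=20):
    """Count path (plus query string) frequency across known feed URLs."""
    counts = Counter()
    for url in urls:
        parts = urlsplit(url.strip())
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        counts[path] += 1
    return counts.most_common(n)

with open("feed_urls.txt") as f:
    for path, count in top_feed_paths(f):
        print(f"{path}\t{count}")
```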

rahulbot commented 4 months ago

Can you execute on the idea of generating a list of the most common feed paths under a domain? If you don't have a list of feeds from rss-fetcher already, you can ask @philbudne for one. If you produce a histogram of top paths, we can pick an N that makes sense for a set to test as "hints" to feedsearch-crawler.

NullPxl commented 4 months ago

Here are the top 20 paths, @rahulbot:

```
/feed/                                                                    16555
/data/atom                                                                10140
/feeds/posts/default                                                       3846
/rss.xml                                                                   2690
/?feed=rss2                                                                2407
/rss                                                                       2405
/feed                                                                      2244
/data/rss                                                                  2013
/rdf                                                                       1905
/?feed=rdf                                                                 1806
/stories.rss                                                               1152
/search/?f=rss&t=article&l=50&s=start_time&sd=desc&k%5B%5D=%23topstory     1030
/feed/atom/                                                                 908
/comments/feed/                                                             556
/search/?f=rss&t=article&l=50&s=start_time&sd=desc&k[]=%23topstory          528
/rss/                                                                       318
/index.rss                                                                  311
/?format=feed&type=atom                                                     257
/index.php/feed/                                                            222
/feed/?type=100                                                             214
```

Most of these 20 are already present in the default search list for feedsearch-crawler, but it is missing a few of the top ones, such as:

```
stories.rss
search/?f=rss&t=article&l=50&s=start_time&sd=desc&k[]=%23topstory
?format=feed&type=atom
index.php/feed/
feed/?type=100
```

NullPxl commented 4 months ago

I also wanted to add that when I've come across RSS feeds at /search/?f=rss&t=article&l=50&s=start_time&sd=desc&k[]=%23topstory, changing it to search/?f=rss&t=article&l=50&s=start_time&sd=desc seems to give more articles. The topstory key is just the one present in the source. I don't suggest doing a mass change of existing feeds, but if we were to add URL paths to automatically try, we should add search/?f=rss&t=article&l=50&s=start_time&sd=desc as well.
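
For illustration, a small hypothetical helper that strips the k[] filter from these search-style feed URLs (the example.com URL is made up):

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def drop_topstory_filter(url):
    """Remove the k[] query parameter so the feed returns all articles."""
    parts = urlsplit(url)
    # parse_qsl decodes k%5B%5D back to k[], so we can filter on the plain key
    query = [(k, v) for k, v in parse_qsl(parts.query) if k != "k[]"]
    return urlunsplit(parts._replace(query=urlencode(query)))

print(drop_topstory_filter(
    "https://example.com/search/?f=rss&t=article&l=50&s=start_time&sd=desc&k[]=%23topstory"
))
# -> https://example.com/search/?f=rss&t=article&l=50&s=start_time&sd=desc
```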

rahulbot commented 4 months ago

That's great. I think with the findings you've shared it is worth modifying the feed-finding code to switch from feed_seeker to feedsearch_crawler (with these suggestions). Do you want to (a) learn some web-app Django and dive into that code or (b) provide a code sample on here that someone else can later integrate?

NullPxl commented 4 months ago

I can definitely provide a sample of code for integration, but first I would like to see why feed_seeker is missing the URLs it is. For example, I just read through feed_seeker and can see that it also has a list of URL suffixes. If the only reason feedsearch-crawler performs better than feed_seeker is the small difference in their current path lists, it might be better to just update feed_seeker.

I will look into it and comment again with proposed code changes either tomorrow or early next week.

EDIT: I'll write up code for feedsearch-crawler. Here's a test with the New York Times, where it found 79 (valid and unique) feeds compared to feed_seeker's 1 (and not because of the path list).

```
http://nytimes.com

feed_seeker: 10.272264003753662 seconds (1)
feedsearch-crawler: 30.022326231002808 seconds (79)
feedfinder2: 0.5545389652252197 seconds (1)
```

And again, big improvement found here:

```
http://journaldemontreal.com

feed_seeker: 15.894778490066528 seconds (2)
feedsearch-crawler: 9.991537094116211 seconds (111)
```
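
A sketch of how such a comparison can be run, assuming the documented entry points of each library (feed_seeker.generate_feed_urls, feedsearch_crawler.search, feedfinder2.find_feeds); this isn't necessarily the exact harness behind the numbers above:

```python
import time

import feed_seeker
import feedfinder2
import feedsearch_crawler

def benchmark(url):
    finders = [
        ("feed_seeker", lambda u: list(feed_seeker.generate_feed_urls(u))),
        ("feedsearch-crawler", lambda u: [str(f.url) for f in feedsearch_crawler.search(u)]),
        ("feedfinder2", feedfinder2.find_feeds),
    ]
    print(url)
    for name, finder in finders:
        start = time.time()
        found = finder(url)  # count unique results only
        print(f"{name}: {time.time() - start} seconds ({len(set(found))})")

benchmark("http://nytimes.com")
```
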
NullPxl commented 4 months ago

From _scrape_source() in tasks.py, only a few lines need to be changed. new_urls still contains a list of URL strings, so there are no big changes. The comments in the code have more specifics.

```python
import feedsearch_crawler  # new feed finder library - https://github.com/DBeath/feedsearch-crawler
SCRAPE_TIMEOUT_SECONDS = 120 # kept from original file
USER_AGENT = "Mozilla/5.0 (compatible; mediacloud academic archive; mediacloud.org)" # added

def _scrape_source(source_id, homepage, user_email):
    logger.info(f"==== starting _scrape_source(source_id, homepage)")

    ...

    # paths were determined as a union of feed_seeker's wordlist, feedsearch-crawler's wordlist,
    # as well as the top 20 RSS feed paths from the mediacloud database (bit of a self-fulfilling prophecy here, I know)
    # May want to put this list in metadata.
    paths = ['?feed=atom', '?format=feed&type=atom', 'atom.xml', 'index.rss', 'feeds', 'data/rss',
             '?feed=rss', 'search/?f=rss&t=article&l=50&s=start_time&sd=desc', 'stories.rss', 'feed/atom',
             'feed/default', 'feeds/posts/default', 'index.php/feed', 'rss', 'atom', 'rss-feeds',
             'rss.xml', 'index.json', 'index.atom', '?format=feed&type=rss', 'index.rdf', 'data/atom',
             '?feed=rdf', '?type=100', 'articles.atom', 'rdf', 'feeds/posts/default/', 'about/feeds',
             'about', 'feed', '?feed=rss2', 'feeds/default', 'articles.rss', 'index.xml']

    # See https://github.com/DBeath/feedsearch-crawler?tab=readme-ov-file#search-arguments for arguments
    # can set user agent with user_agent, or request timeout with request_timeout for example
    # 10 concurrent requests are allowed by default, can change this with concurrency parameter
    # default max_depth is 10; I am writing it explicitly for clarity.
    feeds = feedsearch_crawler.search(homepage, try_urls=paths, max_depth=10, user_agent=USER_AGENT, total_timeout=SCRAPE_TIMEOUT_SECONDS) 
    new_urls = [str(feed.url) for feed in feeds]
    # More information is available in each feed object, such as the title.
    # For the sake of consistency with the old code, only the url string is returned in this example
    ...

    logger.info(f"==== finished _scrape_source(source_id, homepage)")