mediacloud / rss-fetcher

Intelligently fetch lists of URLs from a large collection of RSS Feeds as part of the Media Cloud Directory.
https://search.mediacloud.org/directory
Apache License 2.0

How much would a different User-Agent string help fetch feeds we aren't getting right now? #36

Closed rahulbot closed 4 months ago

rahulbot commented 4 months ago

Some notable number of feeds, and story URLs, reject our bot-based user-agent string. It'd be helpful to quantify that, and see if a tweaked one would work better. This would help us fetch more RSS feeds and more stories. See #34 for a recent example.

Background: We want to make sure we identify as a bot to be polite and a "good web citizen". However, this means some set of servers reject us. The majority of these are probably CMS-based systems that just have an "accept" list of some type to allow only web browsers. Perhaps some tiny number have us on a "block" list 🤷🏽‍♂️

Proposal: grab the list of feeds URLs failing with something that looks like a UA-block. Try a few patterns of UA strings that are more like browsers, but still identify us as a bot project if someone looks at them. For example, we could try something like: "Mozilla/5.0 (Macintosh; Intel Mac OS X 14_3_1) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.2 Safari/605.1.15 (mediacloud.org academic archive)". Be creative about various options.
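A minimal sketch of such a probe, using only the Python standard library. The `probe` and `compare` helpers and the "plain-bot" string are hypothetical names made up for illustration; only the Safari-like string comes from the proposal above. The `opener` parameter exists so the function can be exercised without touching the network.

```python
from urllib.error import HTTPError
from urllib.request import Request, urlopen

# Candidate strings. "safari-like" is the example from the proposal;
# "plain-bot" is a hypothetical placeholder, NOT the project's real UA.
CANDIDATE_UAS = {
    "safari-like": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 14_3_1) "
        "AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.2 "
        "Safari/605.1.15 (mediacloud.org academic archive)"
    ),
    "plain-bot": "example-bot/1.0 (hypothetical placeholder)",
}


def probe(url, user_agent, timeout=10.0, opener=urlopen):
    """Fetch url with the given User-Agent.

    Returns the HTTP status code (int), or an "error: <type>" string for
    failures that never produce a status (timeouts, DNS, connection resets).
    """
    req = Request(url, headers={"User-Agent": user_agent})
    try:
        with opener(req, timeout=timeout) as resp:
            return resp.status
    except HTTPError as exc:  # 4xx/5xx responses still carry a status code
        return exc.code
    except OSError as exc:  # URLError, socket timeouts, connection errors
        return f"error: {type(exc).__name__}"


def compare(url, user_agents, opener=urlopen):
    """Probe one URL once per candidate UA; returns {label: result}."""
    return {
        label: probe(url, ua, opener=opener)
        for label, ua in user_agents.items()
    }
```

Calling `compare(feed_url, CANDIDATE_UAS)` for each failing feed would give one row per feed with a result per candidate string, which is enough to count how many feeds flip from a failure to 200.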

It'd be helpful to know what type of alternatives work, and then we could consider the ethical question of what appropriate UA strings for a project like ours are. The key is to strike a balance between respecting servers but also the broader social good of having a freely searchable online news archive.

philbudne commented 4 months ago

Regarding: "something that looks like a UA-block"

npr.org's behavior is outside what we previously imagined a UA-block could possibly look like!

NullPxl commented 4 months ago

Around a week ago I ran a test to see the percentage of feeds returning a 403 response code due to our current user agent. This did not include various options for different user agents (just a typical Chrome UA vs mediacloud's current UA), but I think it is still a good initial data point to build off of.

Of the feeds that were returning 403, 9% would return 200 if a typical browser UA were used instead of the Media Cloud UA. Note that LiveJournal made up 70% of the feeds returning 403; if we ignore LiveJournal, the UA-block figure goes from 9% to 29%. I've shared a doc in Slack which has more detail.
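The relationship between the two percentages can be sketched as below. This assumes (it is not stated in the thread) that essentially none of the LiveJournal feeds recovered with a browser UA; `rate_excluding` is a hypothetical helper, and the inputs are the rounded figures quoted above.

```python
def rate_excluding(overall_rate, excluded_share, excluded_rate=0.0):
    """Recovery rate among the remaining feeds once one group is removed.

    overall_rate:   fraction of all 403 feeds that recover (0.09 here)
    excluded_share: fraction of all 403 feeds in the removed group (0.70)
    excluded_rate:  recovery rate inside the removed group (assumed 0.0)
    """
    remaining_share = 1.0 - excluded_share
    recovered_in_remaining = overall_rate - excluded_share * excluded_rate
    return recovered_in_remaining / remaining_share


# With the rounded inputs this lands at roughly 30%, in line with the
# reported 29% once rounding of the 9% and 70% figures is accounted for.
print(round(rate_excluding(0.09, 0.70), 2))
```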

I agree with Phil that a website returning no response is not typical for a 'block'. It's hard to say with certainty, but I still believe most services would opt to return a status code rather than no response at all.

rahulbot commented 4 months ago

So a short summary would be to say that pretending to be a browser might help with < 1% of feeds, but those include a handful of notable examples relevant for specific countries. That's really helpful to know.

NullPxl commented 4 months ago

I'll run my script again tomorrow, including more user agents. If I'm remembering right it took a good few hours to complete last time, so I'll select the UAs sparingly (if it takes crazily long I'll rewrite with asyncio or similar).
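For reference, a rough sketch of what an asyncio version could look like. The `check_all` name and the `fetch_status` callable are hypothetical; `fetch_status` is assumed to be an async function taking `(url, user_agent)` and returning a status, supplied by whatever HTTP client is in use (a blocking fetcher could be wrapped with `asyncio.to_thread`).

```python
import asyncio


async def check_all(urls, user_agents, fetch_status, max_concurrency=50):
    """Run fetch_status(url, ua) for every (url, ua) pair.

    At most max_concurrency requests are in flight at once. Returns a
    list of (url, ua, result) tuples; exceptions are recorded as
    "error: <type>" strings instead of aborting the whole run.
    """
    sem = asyncio.Semaphore(max_concurrency)

    async def one(url, ua):
        async with sem:  # cap the number of concurrent fetches
            try:
                return url, ua, await fetch_status(url, ua)
            except Exception as exc:
                return url, ua, f"error: {type(exc).__name__}"

    tasks = [one(url, ua) for url in urls for ua in user_agents]
    return await asyncio.gather(*tasks)
```

The semaphore keeps the run polite toward remote servers while still overlapping the slow cases (timeouts), which are usually what make a sequential run take hours.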

Does anyone have opinions on if we should stick with looking at 403 responses, or switch to a more generic failure? @philbudne if we were to look at all 4xx and 5xx responses, what would the volume of feeds be?

philbudne commented 4 months ago

Looking at HTTP status will skip cases like NPR.

Here is a count of feeds by "system_status"

rss_fetcher=# select system_status, count(1) from feeds where system_enabled group by system_status order by count desc;
                  system_status                  | count  
-------------------------------------------------+--------
 Working                                         | 117769
 HTTP 404 Not Found                              |   1232
 parse error                                     |    884
 HTTP 429 Too Many Requests                      |    372
 HTTP 500 Internal Server Error                  |    287
 HTTP 403 Forbidden                              |    276
 SSL error                                       |    208
 read timeout                                    |    201
 unknown hostname                                |    192
 too many redirects                              |     93
 connect timeout                                 |     71
 HTTP 429 banned                                 |     49
 HTTP 400 Bad Request                            |     45
 connection error                                |     28
 HTTP 503 Service Unavailable                    |     21
 HTTP 502 Bad Gateway                            |     18
 HTTP 521                                        |      9
 fetch error                                     |      7
                                                 |      6
 HTTP 403 OK                                     |      6
 HTTP 504 Gateway Time-out                       |      6
 HTTP 409 Conflict                               |      5
 DNS error                                       |      5
 HTTP 503 Service Temporarily Unavailable        |      5
 HTTP 404                                        |      4
 HTTP 508 Loop Detected                          |      3
 HTTP 523                                        |      3
 HTTP 410 Gone                                   |      3
 HTTP 101 Switching Protocols                    |      2
 HTTP 429                                        |      2
 HTTP 500                                        |      1
 HTTP 503 unable to get local issuer certificate |      1
 HTTP 509                                        |      1
 HTTP 530                                        |      1
 HTTP 404 Unknown site                           |      1
 HTTP 403                                        |      1
 HTTP 401 Unauthorized                           |      1
 HTTP 204 No Content                             |      1
 HTTP 500 Service unavailable (with message)     |      1
(39 rows)

Summary of npr.org feeds:

rss_fetcher=#  select system_status, count(1) from feeds where url like '%npr.org%' group by system_status order by count desc;
   system_status    | count 
--------------------+-------
 read timeout       |    46
 Working            |    33
 HTTP 404 Not Found |     2
(3 rows)

NullPxl commented 4 months ago

Those are just currently enabled feeds right? So looking at just those would be missing feeds that have been blocking for long enough to trigger deactivation.

philbudne commented 4 months ago

@NullPxl no, the above queries were across all feeds, without consideration of system_enabled which indicates whether the feed has been disabled for excessive errors.

NullPxl commented 4 months ago

@philbudne I think I might be misunderstanding; for the npr query what you said looks to be the case, but the first query says from feeds where system_enabled and the volume of 403s is much less than the previous file you gave me (~17000 in the file vs ~300 in the query above)

philbudne commented 4 months ago

Sorry for the apples to oranges comparison. Maybe I should reconsider my candidacy for president? Here is the first query for all feeds:

rss_fetcher=# select system_status, count(1) from feeds group by system_status order by count desc;
                        system_status                         | count  
--------------------------------------------------------------+--------
 Working                                                      | 129695
 HTTP 404 Not Found                                           |  13634
 HTTP 403 Forbidden                                           |  11753
 parse error                                                  |   9143
 unknown hostname                                             |   2782
 HTTP 410 Gone                                                |   1411
 SSL error                                                    |   1175
 read timeout                                                 |   1025
 connect timeout                                              |    868
 HTTP 500 Internal Server Error                               |    839
 HTTP 429 Too Many Requests                                   |    619
 connection error                                             |    502
 DNS error                                                    |    230
 HTTP 503 Service Unavailable                                 |    196
 too many redirects                                           |    193
 HTTP 401 Unauthorized                                        |    151
 HTTP 400 Bad Request                                         |     91
 HTTP 502 Bad Gateway                                         |     73
 HTTP 405 Method Not Allowed                                  |     71
 HTTP 429 banned                                              |     68
 HTTP 405 Not Allowed                                         |     62
 HTTP 503 Service Temporarily Unavailable                     |     53
 HTTP 522                                                     |     44
 HTTP 404 File Not Found                                      |     33
 HTTP 501 Origin hit suppressed (0)                           |     30
 HTTP 404                                                     |     24
 HTTP 521                                                     |     23
 fetch error                                                  |     22
 HTTP 404 404 Not Found                                       |     16
 HTTP 409 Conflict                                            |     15
 HTTP 403 OK                                                  |     15
 HTTP 202 Accepted                                            |     14
 HTTP 523                                                     |     13
 HTTP 418 Unknown Error                                       |     13
 HTTP 404 Not found                                           |     13
 job timeout                                                  |     10
 HTTP 530                                                     |      9
 HTTP 504 Gateway Time-out                                    |      8
 HTTP 520                                                     |      8
 HTTP 526                                                     |      7
 HTTP 404 Not Fround                                          |      7
                                                              |      6
 HTTP 404 404                                                 |      6
 HTTP 503 Backend fetch failed                                |      5
 HTTP 404 NOT FOUND                                           |      5
 HTTP 423 Locked                                              |      5
 HTTP 404 not found                                           |      4
 HTTP 404 OK                                                  |      4
 bad URL                                                      |      4
 HTTP 508 Loop Detected                                       |      4
 HTTP 524                                                     |      3
 HTTP 204 No Content                                          |      3
 HTTP 401 Restricted                                          |      3
 HTTP 403                                                     |      3
 HTTP 404 Unknown site                                        |      3
 HTTP 406 Not Acceptable                                      |      3
 HTTP 429                                                     |      3
 HTTP 500                                                     |      3
 HTTP 500 500 Service unavailable (with message)              |      3
 HTTP 405 Not allowed.                                        |      2
 HTTP 404 Page not found                                      |      2
 HTTP 101 Switching Protocols                                 |      2
 HTTP 503 Service unavailable                                 |      1
 HTTP 500 Service unavailable (with message)                  |      1
 HTTP 503 Service Unavailable: Back-end server is at capacity |      1
 HTTP 503 unable to get local issuer certificate              |      1
 HTTP 451                                                     |      1
 HTTP 423                                                     |      1
 HTTP 509                                                     |      1
 HTTP 421 Misdirected Request                                 |      1
 HTTP 520 Origin Server Unavailable                           |      1
 HTTP 418 I'm a teapot                                        |      1
 HTTP 502 Proxy Error                                         |      1
 HTTP 410 Not Found                                           |      1
 HTTP 410 Disparu                                             |      1
 HTTP 404 Page not found: /rss.xml                            |      1
 HTTP 404 Page not found: /feed/latest-rss.xml                |      1
 HTTP 404 Page Not Found                                      |      1
 HTTP 403 Site Disabled                                       |      1
 HTTP 403 HTTP Forbidden                                      |      1
 HTTP 403 Access Denied                                       |      1
 HTTP 402                                                     |      1
 HTTP 413 Request Entity Too Large                            |      1
 HTTP 503 Backend unavailable, connection timeout             |      1
 HTTP 502                                                     |      1
(85 rows)

NullPxl commented 4 months ago

Haha, this is the first time I've seen HTTP 418 I'm a teapot in the wild.

@philbudne Whenever you get the chance, would you be able to send me a CSV file from the DB, similar to the previous one you sent, but with the following statuses? All 4xx (except for 404), 5xx, parse error, read timeout, connect timeout, connection error, fetch error, job timeout, and the line with just whitespace (between "404 Not 'Fround'" and "404 404").
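The selection described above could be expressed as a filter over system_status strings, roughly like this. `wanted` is a hypothetical helper, and treating the whitespace-only row as either an empty string or NULL/None is an assumption about how the DB stores it.

```python
import re

# Non-HTTP statuses named in the request above.
NON_HTTP_STATUSES = {
    "parse error",
    "read timeout",
    "connect timeout",
    "connection error",
    "fetch error",
    "job timeout",
}

# system_status values for HTTP failures start with "HTTP <3-digit code>".
_HTTP_RE = re.compile(r"^HTTP (\d{3})\b")


def wanted(system_status):
    """True if a feed with this system_status belongs in the CSV."""
    status = (system_status or "").strip()
    if status == "":
        return True  # the blank row (assumed empty string or NULL)
    if status in NON_HTTP_STATUSES:
        return True
    match = _HTTP_RE.match(status)
    if match:
        code = int(match.group(1))
        # all 4xx and 5xx, except every spelling of 404
        return code != 404 and 400 <= code <= 599
    return False
```

Matching on the numeric code rather than the full string means the many 404 variants in the table ("404 Not Fround", "404 404", and so on) are all excluded without listing each one.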

NullPxl commented 4 months ago

The script finished running, so while there's more analysis to do, here are the main numbers: 16000 feeds/URLs tested with 4 different user agents. With the current mediacloud user agent, 4909 feeds returned status code 200. With the most recent Chrome user agent, 6257 feeds returned 200 (1348 more). One of the strings tested was Mozilla/5.0 (compatible; mediacloud academic archive; mediacloud.org), which is a similar format to Googlebot's user agent. This resulted in 5985 200s (1076 more than the current mediacloud user agent). To me this string presents a good balance.
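The comparison can be reproduced from the counts above with a few lines. The dictionary labels and the `gains_over` helper are made up for illustration, and the 16000 total is taken at face value even though it may be a round figure.

```python
TOTAL_TESTED = 16000  # as reported above; possibly approximate

# Feeds returning 200, per user agent (counts from the comment above).
OK_COUNTS = {
    "mediacloud-current": 4909,
    "chrome-latest": 6257,
    "mozilla-compatible-mediacloud": 5985,
}


def gains_over(counts, baseline_label):
    """Extra 200 responses each UA gets relative to the baseline UA."""
    baseline = counts[baseline_label]
    return {label: n - baseline for label, n in counts.items()}


gains = gains_over(OK_COUNTS, "mediacloud-current")
for label, count in OK_COUNTS.items():
    share = 100 * count / TOTAL_TESTED
    print(f"{label}: {count} OK ({share:.1f}% of tested), {gains[label]:+d} vs current")
```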

(Like I mentioned these are surface level numbers and there's more to look at)

NullPxl commented 4 months ago

I've sent an Excel workbook that outlines the results of a second round of tests. Some numbers shifted, but the general result remains the same: a string similar to Mozilla/5.0 (compatible; mediacloud academic archive; mediacloud.org) would be an improvement.

rahulbot commented 4 months ago

Switching to Mozilla/5.0 (compatible; mediacloud academic archive; mediacloud.org) is uncontroversial with the larger team. @philbudne any concerns about doing that? If not let's make the change.

philbudne commented 4 months ago

@rahulbot wrote:

any concerns about doing that? If not let's make the change.

My one, MAJOR concern is that we currently have MANY places, across many repos where we use, or should use the string. Rather than editing them all time and again (I can't believe this is, or will be a one-time event), I'd like to have one place where the string lives and all our code picks it up.

mcmetadata is somewhat tempting, since it's probably already used by all the projects that need the string. AND mcmetadata.extract CAN do fetches (tho story-indexer doesn't use that feature), AND I'm not sure we want ANY/EVERYONE who picks up the library to use a UA string that puts us at the center of a bullseye by default!

Places I know of that should use our UA string:

  1. fetching RSS feeds (rss-fetcher)
  2. (re)scraping a domain for feeds (web-search)
  3. fetching story pages (story-indexer)

rahulbot commented 4 months ago

Great point 👍🏽 This is kind of the point of mcmetadata (to centralize shared "business" logic across our various projects) so storing it there makes sense to me. With the note that it shouldn't be default.
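One way "shared, but not the default" could look, as a sketch only: the constant and function names below are hypothetical and are not mcmetadata's actual API. The string lives in one place, and callers have to pass it explicitly.

```python
from urllib.request import Request

# Single shared definition of the agreed-upon string (hypothetical name).
MEDIA_CLOUD_USER_AGENT = (
    "Mozilla/5.0 (compatible; mediacloud academic archive; mediacloud.org)"
)


def build_request(url, *, user_agent):
    """Build a request with an explicitly chosen User-Agent.

    user_agent is keyword-only and has no default, so third-party users
    of the library never send the Media Cloud string by accident.
    """
    if not user_agent:
        raise ValueError("user_agent must be provided explicitly")
    return Request(url, headers={"User-Agent": user_agent})
```

Under this sketch, rss-fetcher, web-search, and story-indexer would each import the constant and opt in with `build_request(url, user_agent=MEDIA_CLOUD_USER_AGENT)`, so a future change to the string is a one-line edit.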

rahulbot commented 4 months ago

Closing discussion because I split the decided-upon action off to #37.