Open rahulbot opened 3 months ago
NOTE: There isn't currently code to detect if a feed has been removed from the web-search (mcweb) feed table....
tallies of last system_status for system_disabled feeds:
***@***.***:~/rss-fetcher$ cat status-disabled.psql
select system_status, count(1)
from feeds
where not system_enabled
group by system_status
order by count desc;
***@***.***:~/rss-fetcher$ psql rss-fetcher < !$
psql rss-fetcher < status-disabled.psql
Pseudo-terminal will not be allocated because stdin is not a terminal.
system_status | count
--------------------------------------------------------------+-------
HTTP 404 Not Found | 14494
HTTP 403 Forbidden | 12250
Working | 11978
parse error | 11009
unknown hostname | 2847
HTTP 410 Gone | 1424
SSL error | 1130
read timeout | 937
connect timeout | 894
HTTP 500 Internal Server Error | 740
connection error | 517
DNS error | 240
HTTP 429 Too Many Requests | 228
too many redirects | 196
HTTP 503 Service Unavailable | 189
HTTP 401 Unauthorized | 152
HTTP 400 Bad Request | 102
HTTP 404 | 101
HTTP 405 Method Not Allowed | 71
HTTP 405 Not Allowed | 62
HTTP 502 Bad Gateway | 59
HTTP 503 Service Temporarily Unavailable | 56
HTTP 522 | 43
HTTP 404 File Not Found | 33
HTTP 501 Origin hit suppressed (0) | 30
HTTP 404 404 Not Found | 22
HTTP 521 | 20
fetch error | 17
HTTP 409 Conflict | 14
HTTP 202 Accepted | 14
HTTP 418 Unknown Error | 13
HTTP 404 Not found | 13
HTTP 523 | 13
HTTP 520 | 11
job timeout | 10
HTTP 403 OK | 10
HTTP 530 | 8
HTTP 526 | 7
HTTP 404 Not Fround | 7
HTTP 404 404 | 6
HTTP 503 Backend fetch failed | 6
bad URL | 5
HTTP 404 NOT FOUND | 5
HTTP 423 Locked | 5
HTTP 404 OK | 4
HTTP 404 not found | 4
HTTP 504 Gateway Time-out | 4
HTTP 524 | 3
HTTP 406 Not Acceptable | 3
HTTP 401 Restricted | 3
HTTP 404 Unknown site | 3
HTTP 500 500 Service unavailable (with message) | 3
HTTP 204 No Content | 3
HTTP 404 Page not found | 2
HTTP 101 Switching Protocols | 2
HTTP 403 | 2
HTTP 500 | 2
HTTP 405 Not allowed. | 2
HTTP 503 Under Maintenance | 1
HTTP 509 | 1
HTTP 520 Origin Server Unavailable | 1
HTTP 403 Site Disabled | 1
HTTP 403 HTTP Forbidden | 1
HTTP 403 Access Denied | 1
HTTP 402 | 1
HTTP 401 HTTP Forbidden | 1
HTTP 302 Found | 1
HTTP 413 Request Entity Too Large | 1
HTTP 410 Not Found | 1
HTTP 418 I'm a teapot | 1
HTTP 410 Disparu | 1
HTTP 421 Misdirected Request | 1
HTTP 423 | 1
HTTP 451 | 1
HTTP 502 | 1
HTTP 404 Página no encontrada | 1
HTTP 404 Page not found: /rss.xml | 1
HTTP 503 Backend unavailable, connection timeout | 1
HTTP 404 Page not found: /feed/latest-rss.xml | 1
HTTP 503 Service unavailable | 1
HTTP 404 Page Not Found | 1
HTTP 503 Service Unavailable: Back-end server is at capacity | 1
(82 rows)
The RSS Fetcher tells us that an RSS feed is not working by setting its "system enabled?" value to false. In these cases it puts a machine-generated note in the "System Status" field. Should we audit the most common statuses and investigate whether they mean those feeds should be deleted?
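One wrinkle for such an audit is that the same HTTP code shows up under many reason phrases in the tally (for example "HTTP 404 Not Found", "HTTP 404", "HTTP 404 File Not Found"). A minimal sketch of collapsing those variants by numeric code before counting is below. The helper names (`status_bucket`, `tally_buckets`, `is_likely_permanent`) and the choice of 404 and 410 as "likely permanent" are assumptions for illustration, not anything that exists in rss-fetcher; the sample rows are copied from the table above.

```python
import re
from collections import Counter

# Assumption: HTTP statuses are stored as "HTTP <3-digit code> <free-form reason>".
HTTP_STATUS = re.compile(r"^HTTP (\d{3})\b")

# Assumption: which codes suggest a feed is gone for good is a judgment call.
LIKELY_PERMANENT = {404, 410}


def status_bucket(system_status):
    """Collapse 'HTTP 404 <any reason>' to 'HTTP 404'; leave other statuses as-is."""
    status = system_status.strip()
    match = HTTP_STATUS.match(status)
    return f"HTTP {match.group(1)}" if match else status


def tally_buckets(rows):
    """Sum (system_status, count) rows per bucket."""
    totals = Counter()
    for status, count in rows:
        totals[status_bucket(status)] += count
    return totals


def is_likely_permanent(bucket):
    """True if the bucket is an HTTP code in LIKELY_PERMANENT."""
    match = HTTP_STATUS.match(bucket)
    return bool(match) and int(match.group(1)) in LIKELY_PERMANENT


# A few rows copied from the psql output above.
SAMPLE_ROWS = [
    ("HTTP 404 Not Found", 14494),
    ("HTTP 404", 101),
    ("HTTP 404 File Not Found", 33),
    ("HTTP 410 Gone", 1424),
    ("HTTP 410 Disparu", 1),
    ("unknown hostname", 2847),
]

if __name__ == "__main__":
    for bucket, count in tally_buckets(SAMPLE_ROWS).most_common():
        flag = "review for deletion" if is_likely_permanent(bucket) else "keep for now"
        print(f"{bucket:20} {count:6}  {flag}")
```

Grouping on the code alone ignores the reason phrase, which seems reasonable here since phrases like "HTTP 403 OK" or "HTTP 404 OK" are clearly unreliable. Non-HTTP statuses such as "unknown hostname" or "parse error" would still need a separate decision.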
While looking at CNN, I noticed that some feeds say "HTTP 410 Gone". This indicates that the feed no longer exists. Here's an example showing some from https://search.mediacloud.org/sources/1095/feeds: