mediacloud / rss-fetcher

Intelligently fetch lists of URLs from a large collection of RSS Feeds as part of the Media Cloud Directory.
https://search.mediacloud.org/directory
Apache License 2.0

add feature to save RSS file, and headers, to disk #15

Closed rahulbot closed 1 year ago

rahulbot commented 1 year ago

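If the SAVE_RSS_FILES env-var is set to 1, then two files will be created on disk for each queued job: the fetched RSS file, and a summary JSON file holding the feed's ids, URL, HTTP status code, and response headers.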

Sample summary JSON:

{
    "id": 133746,
    "url": "https://www.destentor.nl/buitenland/rss.xml",
    "mcFeedsId": 874977,
    "mcMediaId": 38933,
    "statusCode": 200,
    "headers": {
        "Content-Type": "application/rss+xml;charset=UTF-8",
        "ETag": "W/\"09e5bea5a36ddee446052eafc7484536f\"",
        "X-Content-Type-Options": "nosniff",
        "X-XSS-Protection": "1; mode=block",
        "Strict-Transport-Security": "max-age=31536000 ; includeSubDomains",
        "Referrer-Policy": "same-origin",
        "X-Frame-Options": "DENY",
        "Content-Encoding": "gzip",
        "Content-Length": "8384",
        "Expires": "Mon, 11 Jul 2022 19:42:23 GMT",
        "Cache-Control": "max-age=0, no-cache, no-store",
        "Pragma": "no-cache",
        "Date": "Mon, 11 Jul 2022 19:42:23 GMT",
        "Connection": "keep-alive",
        "Vary": "Accept-Encoding",
        "Link": "<https://images0.persgroep.net>; rel=preconnect;"
    }
}
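The behavior described above could be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the function names, the `name` parameter, and the `.rss` / `.json` file extensions are all assumptions made for the example. Only the SAVE_RSS_FILES variable and the shape of the summary JSON come from the description.

```python
import json
import os
from pathlib import Path
from typing import Mapping, Optional, Tuple


def rss_saving_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True only when SAVE_RSS_FILES is set to "1"."""
    env = os.environ if env is None else env
    return env.get("SAVE_RSS_FILES") == "1"


def save_fetch_artifacts(
    out_dir: str,
    name: str,
    rss_body: bytes,
    summary: dict,
    env: Optional[Mapping[str, str]] = None,
) -> Optional[Tuple[Path, Path]]:
    """Write the raw feed and its summary JSON side by side.

    Hypothetical helper: `name` is whatever identifier the caller uses
    for the file names (for example a job id or a feed id).
    Returns the two paths, or None when saving is disabled.
    """
    if not rss_saving_enabled(env):
        return None

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    rss_path = out / f"{name}.rss"        # assumed extension
    summary_path = out / f"{name}.json"   # assumed extension

    # Store the body exactly as received so it can be re-parsed later.
    rss_path.write_bytes(rss_body)
    # Store the metadata (ids, URL, status code, headers) as indented JSON.
    summary_path.write_text(json.dumps(summary, indent=4), encoding="utf-8")
    return rss_path, summary_path
```

Keeping the body as bytes avoids guessing at the feed's character encoding; the Content-Type header saved in the summary is enough to decode it afterwards.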

rahulbot commented 1 year ago

Would it be more useful to name the files by mc-feeds-id?

philbudne commented 1 year ago

Yeah, I think mc_feeds_id would be better, at least right now!

rahulbot commented 1 year ago

OK. Any other changes or should I merge/push this and let it run for a while to collect some data?

philbudne commented 1 year ago

No additional requests, plenty of metadata to peruse... Thanks!

philbudne commented 1 year ago

(so yes, merge and run)!