mediacloud / backend

Media Cloud is an open source, open data platform that allows researchers to answer quantitative questions about the content of online media.
http://www.mediacloud.org
GNU Affero General Public License v3.0

explore use of sitemaps as alternative to rss #45

Closed hroberts closed 5 years ago

hroberts commented 8 years ago

We should explore the possibility of using sitemap.xml files as an alternative to rss.

The handful of sites I poked at tonight seemed to have frequently updated links with publication dates. At a minimum, we should look for a sitemap.xml file if we can't find an rss feed for a given source.

Many sites have links to the sitemaps in their robots.txt files. Others seem to just have them at http://foo.bar/sitemap.xml.

pypt commented 8 years ago

Did some investigating too.

The best way to discover sitemaps for news websites seems to be robots.txt (it's a bit silly that I'm proposing to use robots.txt not to limit crawling of a particular website but to find even more things to crawl). For example, robots.txt for csmonitor.com includes the following link to the sitemap file:

Sitemap: http://www.csmonitor.com/sitemap-index.xml

...which then leads to multiple sub-sitemaps, which lead to even more sitemaps, and the hierarchy of all of them seems to cover the whole article archive on csmonitor.com. Other media websites have similar configurations in place.
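A minimal sketch of that discovery step, assuming the robots.txt body has already been fetched. The function name `sitemap_urls_from_robots` and the fallback to `/sitemap.xml` are illustrative assumptions, not existing Media Cloud code:

```python
from typing import List
from urllib.parse import urljoin


def sitemap_urls_from_robots(robots_txt: str, homepage_url: str) -> List[str]:
    """Return sitemap URLs listed in robots.txt, or a guessed default location."""
    found: List[str] = []
    for line in robots_txt.splitlines():
        # Drop trailing comments and surrounding whitespace
        line = line.split("#", 1)[0].strip()
        # The field name is case-insensitive ("Sitemap:", "sitemap:", ...)
        name, sep, value = line.partition(":")
        url = value.strip()
        if sep and name.strip().lower() == "sitemap" and url:
            if url not in found:
                found.append(url)
    if not found:
        # Assumed fallback: many sites simply keep the file at /sitemap.xml
        found.append(urljoin(homepage_url, "/sitemap.xml"))
    return found
```

The returned URLs may point at sitemap indexes, so a real crawler would still have to follow them recursively.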

Another interesting bit is that many of the news websites seem to implement the sitemap-news XML schema for the articles to get imported into Google News, e.g. http://www.csmonitor.com/sitemap-news-auto-1.xml (see n: XML namespace); even Univision does it.

So, I think it's totally worth it to piggyback on Google News support implemented on many websites and collect the news articles that way.

pypt commented 8 years ago

Should I proceed with adding sitemap support as part of feed discovery + periodic fetching?

hroberts commented 8 years ago

This is great. Yes, we should integrate this into our discovery process.


pypt commented 6 years ago

Writing a sitemap evaluation tool now.

The tool will run against the following media sets:

and for every set, find out the following:

Already implemented a prototype tool, however lxml seems to be a bit slow parsing 50 MB of XML, so porting it to Expat.
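For illustration only (this is not the prototype mentioned above), a streaming Expat parser that collects `<loc>` values from a sitemap fed in chunks could look roughly like this. The function name `sitemap_locs` is hypothetical:

```python
import xml.parsers.expat
from typing import Iterable, List


def sitemap_locs(chunks: Iterable[bytes]) -> List[str]:
    """Collect the text of every <loc> element from a sitemap fed in chunks."""
    locs: List[str] = []
    buffer: List[str] = []
    in_loc = False

    def start(name, attrs):
        nonlocal in_loc
        if name == "loc":
            in_loc = True
            buffer.clear()

    def chars(data):
        # Expat may deliver a single text node in several pieces
        if in_loc:
            buffer.append(data)

    def end(name):
        nonlocal in_loc
        if name == "loc":
            locs.append("".join(buffer).strip())
            in_loc = False

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.CharacterDataHandler = chars
    parser.EndElementHandler = end
    for chunk in chunks:
        parser.Parse(chunk, False)
    # Signal end of input so Expat can report truncated documents
    parser.Parse(b"", True)
    return locs
```

Because the document is never held in memory as a tree, the input can be read from the network or disk in fixed-size blocks.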

rahulbot commented 6 years ago

The country-level national collections should be good short-to-medium sized lists of country-wide news websites within each country. Those aren't necessarily the "top" ones, but they are still good lists. Here is a json dump of all the country-level national collections:

[{"tagsId": 38376339, "alpha3": "AFG", "name": "Afghanistan"}, {"tagsId": 34412107, "alpha3": "ALB", "name": "Albania"}, {"tagsId": 34412286, "alpha3": "DZA", "name": "Algeria"}, {"tagsId": 34412058, "alpha3": "ASM", "name": "American Samoa"}, {"tagsId": 34412104, "alpha3": "AND", "name": "Andorra"}, {"tagsId": 34412237, "alpha3": "AGO", "name": "Angola"}, {"tagsId": 34412386, "alpha3": "AIA", "name": "Anguilla"}, {"tagsId": 38376410, "alpha3": "ATA", "name": "Antarctica"}, {"tagsId": 34412355, "alpha3": "ATG", "name": "Antigua and Barbuda"}, {"tagsId": 34412043, "alpha3": "ARG", "name": "Argentina"}, {"tagsId": 34412196, "alpha3": "ARM", "name": "Armenia"}, {"tagsId": 34412036, "alpha3": "ABW", "name": "Aruba"}, {"tagsId": 34412282, "alpha3": "AUS", "name": "Australia"}, {"tagsId": 34412245, "alpha3": "AUT", "name": "Austria"}, {"tagsId": 34412123, "alpha3": "AZE", "name": "Azerbaijan"}, {"tagsId": 34412268, "alpha3": "BHS", "name": "Bahamas"}, {"tagsId": 34412049, "alpha3": "BHR", "name": "Bahrain"}, {"tagsId": 34412132, "alpha3": "BGD", "name": "Bangladesh"}, {"tagsId": 34412366, "alpha3": "BRB", "name": "Barbados"}, {"tagsId": 34412217, "alpha3": "BLR", "name": "Belarus"}, {"tagsId": 34412298, "alpha3": "BEL", "name": "Belgium"}, {"tagsId": 34412305, "alpha3": "BLZ", "name": "Belize"}, {"tagsId": 34412177, "alpha3": "BEN", "name": "Benin"}, {"tagsId": 34412150, "alpha3": "BMU", "name": "Bermuda"}, {"tagsId": 34412376, "alpha3": "BTN", "name": "Bhutan"}, {"tagsId": 34412045, "alpha3": "BOL", "name": "Bolivia, Plurinational State of"}, {"tagsId": 34412336, "alpha3": "BES", "name": "Bonaire, Sint Eustatius and Saba"}, {"tagsId": 34412066, "alpha3": "BIH", "name": "Bosnia and Herzegovina"}, {"tagsId": 38379247, "alpha3": "BWA", "name": "Botswana"}, {"tagsId": 34412257, "alpha3": "BRA", "name": "Brazil"}, {"tagsId": 34412241, "alpha3": "BRN", "name": "Brunei Darussalam"}, {"tagsId": 34412233, "alpha3": "BGR", "name": "Bulgaria"}, {"tagsId": 34412047, "alpha3": 
"BFA", "name": "Burkina Faso"}, {"tagsId": 38379378, "alpha3": "BDI", "name": "Burundi"}, {"tagsId": 38379418, "alpha3": "CPV", "name": "Cabo Verde"}, {"tagsId": 34411590, "alpha3": "KHM", "name": "Cambodia"}, {"tagsId": 38379387, "alpha3": "CMR", "name": "Cameroon"}, {"tagsId": 34411583, "alpha3": "CAN", "name": "Canada"}, {"tagsId": 34412207, "alpha3": "CYM", "name": "Cayman Islands"}, {"tagsId": 38379433, "alpha3": "CAF", "name": "Central African Republic"}, {"tagsId": 38379436, "alpha3": "TCD", "name": "Chad"}, {"tagsId": 34412295, "alpha3": "CHL", "name": "Chile"}, {"tagsId": 34412193, "alpha3": "CHN", "name": "China"}, {"tagsId": 34412091, "alpha3": "CXR", "name": "Christmas Island"}, {"tagsId": 34412368, "alpha3": "CCK", "name": "Cocos (Keeling) Islands"}, {"tagsId": 34412358, "alpha3": "COL", "name": "Colombia"}, {"tagsId": 38379588, "alpha3": "COM", "name": "Comoros"}, {"tagsId": 34412281, "alpha3": "COG", "name": "Congo"}, {"tagsId": 34412042, "alpha3": "COD", "name": "Congo, The Democratic Republic of the"}, {"tagsId": 34412175, "alpha3": "COK", "name": "Cook Islands"}, {"tagsId": 34412266, "alpha3": "CRI", "name": "Costa Rica"}, {"tagsId": 34412323, "alpha3": "HRV", "name": "Croatia"}, {"tagsId": 34412184, "alpha3": "CUB", "name": "Cuba"}, {"tagsId": 34412374, "alpha3": "CUW", "name": "Cura\\u00e7ao"}, {"tagsId": 34412239, "alpha3": "CYP", "name": "Cyprus"}, {"tagsId": 34412292, "alpha3": "CZE", "name": "Czechia"}, {"tagsId": 34412173, "alpha3": "CIV", "name": "C\\u00f4te d\'Ivoire"}, {"tagsId": 34412412, "alpha3": "DNK", "name": "Denmark"}, {"tagsId": 34412350, "alpha3": "DJI", "name": "Djibouti"}, {"tagsId": 34412078, "alpha3": "DMA", "name": "Dominica"}, {"tagsId": 34412198, "alpha3": "DOM", "name": "Dominican Republic"}, {"tagsId": 34412279, "alpha3": "ECU", "name": "Ecuador"}, {"tagsId": 34412471, "alpha3": "EGY", "name": "Egypt"}, {"tagsId": 34412288, "alpha3": "SLV", "name": "El Salvador"}, {"tagsId": 34412470, "alpha3": "GNQ", "name": 
"Equatorial Guinea"}, {"tagsId": 34412418, "alpha3": "ERI", "name": "Eritrea"}, {"tagsId": 34412338, "alpha3": "EST", "name": "Estonia"}, {"tagsId": 34412034, "alpha3": "ETH", "name": "Ethiopia"}, {"tagsId": 34412259, "alpha3": "FLK", "name": "Falkland Islands (Malvinas)"}, {"tagsId": 34412277, "alpha3": "FRO", "name": "Faroe Islands"}, {"tagsId": 34412363, "alpha3": "FJI", "name": "Fiji"}, {"tagsId": 34412208, "alpha3": "FIN", "name": "Finland"}, {"tagsId": 34412146, "alpha3": "FRA", "name": "France"}, {"tagsId": 34412482, "alpha3": "GUF", "name": "French Guiana"}, {"tagsId": 34412145, "alpha3": "PYF", "name": "French Polynesia"}, {"tagsId": 34412473, "alpha3": "ATF", "name": "French Southern Territories"}, {"tagsId": 34412093, "alpha3": "GAB", "name": "Gabon"}, {"tagsId": 34412312, "alpha3": "GMB", "name": "Gambia"}, {"tagsId": 34412310, "alpha3": "GEO", "name": "Georgia"}, {"tagsId": 34412409, "alpha3": "DEU", "name": "Germany"}, {"tagsId": 34412202, "alpha3": "GHA", "name": "Ghana"}, {"tagsId": 34412270, "alpha3": "GIB", "name": "Gibraltar"}, {"tagsId": 34412477, "alpha3": "GRC", "name": "Greece"}, {"tagsId": 34412352, "alpha3": "GRL", "name": "Greenland"}, {"tagsId": 34412332, "alpha3": "GRD", "name": "Grenada"}, {"tagsId": 34412462, "alpha3": "GLP", "name": "Guadeloupe"}, {"tagsId": 34412116, "alpha3": "GUM", "name": "Guam"}, {"tagsId": 34412063, "alpha3": "GTM", "name": "Guatemala"}, {"tagsId": 34412327, "alpha3": "GGY", "name": "Guernsey"}, {"tagsId": 34412263, "alpha3": "GIN", "name": "Guinea"}, {"tagsId": 34412317, "alpha3": "GNB", "name": "Guinea-Bissau"}, {"tagsId": 34412443, "alpha3": "GUY", "name": "Guyana"}, {"tagsId": 34412303, "alpha3": "HTI", "name": "Haiti"}, {"tagsId": 34412389, "alpha3": "VAT", "name": "Holy See (Vatican City State)"}, {"tagsId": 34412466, "alpha3": "HND", "name": "Honduras"}, {"tagsId": 34412306, "alpha3": "HKG", "name": "Hong Kong"}, {"tagsId": 34412252, "alpha3": "HUN", "name": "Hungary"}, {"tagsId": 34412394, "alpha3": 
"ISL", "name": "Iceland"}, {"tagsId": 34412118, "alpha3": "IND", "name": "India"}, {"tagsId": 34412392, "alpha3": "IDN", "name": "Indonesia"}, {"tagsId": 34412284, "alpha3": "IRN", "name": "Iran, Islamic Republic of"}, {"tagsId": 34412423, "alpha3": "IRQ", "name": "Iraq"}, {"tagsId": 34412271, "alpha3": "IRL", "name": "Ireland"}, {"tagsId": 34412255, "alpha3": "IMN", "name": "Isle of Man"}, {"tagsId": 34412391, "alpha3": "ISR", "name": "Israel"}, {"tagsId": 34412372, "alpha3": "ITA", "name": "Italy"}, {"tagsId": 34412082, "alpha3": "JAM", "name": "Jamaica"}, {"tagsId": 34412056, "alpha3": "JPN", "name": "Japan"}, {"tagsId": 34412102, "alpha3": "JEY", "name": "Jersey"}, {"tagsId": 34412072, "alpha3": "JOR", "name": "Jordan"}, {"tagsId": 34412415, "alpha3": "KAZ", "name": "Kazakhstan"}, {"tagsId": 34412126, "alpha3": "KEN", "name": "Kenya"}, {"tagsId": 34412301, "alpha3": "KIR", "name": "Kiribati"}, {"tagsId": 34412434, "alpha3": "PRK", "name": "Korea, Democratic People\'s Republic of"}, {"tagsId": 34412127, "alpha3": "KOR", "name": "Korea, Republic of"}, {"tagsId": 34412340, "alpha3": "XKX", "name": "Kosovo"}, {"tagsId": 34412071, "alpha3": "KWT", "name": "Kuwait"}, {"tagsId": 34412420, "alpha3": "KGZ", "name": "Kyrgyzstan"}, {"tagsId": 34412160, "alpha3": "LAO", "name": "Lao People\'s Democratic Republic"}, {"tagsId": 34412437, "alpha3": "LVA", "name": "Latvia"}, {"tagsId": 34412343, "alpha3": "LBN", "name": "Lebanon"}, {"tagsId": 34412125, "alpha3": "LSO", "name": "Lesotho"}, {"tagsId": 38380274, "alpha3": "LBR", "name": "Liberia"}, {"tagsId": 38380279, "alpha3": "LBY", "name": "Libya"}, {"tagsId": 38380281, "alpha3": "LIE", "name": "Liechtenstein"}, {"tagsId": 38379746, "alpha3": "LTU", "name": "Lithuania"}, {"tagsId": 38380287, "alpha3": "LUX", "name": "Luxembourg"}, {"tagsId": 34412111, "alpha3": "MAC", "name": "Macao"}, {"tagsId": 34412429, "alpha3": "MKD", "name": "Macedonia, Republic of"}, {"tagsId": 34412370, "alpha3": "MDG", "name": "Madagascar"}, 
{"tagsId": 34412402, "alpha3": "MWI", "name": "Malawi"}, {"tagsId": 34412243, "alpha3": "MYS", "name": "Malaysia"}, {"tagsId": 34412080, "alpha3": "MDV", "name": "Maldives"}, {"tagsId": 34412222, "alpha3": "MLI", "name": "Mali"}, {"tagsId": 34412381, "alpha3": "MLT", "name": "Malta"}, {"tagsId": 34412294, "alpha3": "MHL", "name": "Marshall Islands"}, {"tagsId": 34412087, "alpha3": "MTQ", "name": "Martinique"}, {"tagsId": 34412134, "alpha3": "MRT", "name": "Mauritania"}, {"tagsId": 34412215, "alpha3": "MUS", "name": "Mauritius"}, {"tagsId": 38380320, "alpha3": "MYT", "name": "Mayotte"}, {"tagsId": 34412427, "alpha3": "MEX", "name": "Mexico"}, {"tagsId": 34412325, "alpha3": "FSM", "name": "Micronesia, Federated States of"}, {"tagsId": 34412319, "alpha3": "MDA", "name": "Moldova, Republic of"}, {"tagsId": 34412097, "alpha3": "MCO", "name": "Monaco"}, {"tagsId": 34412201, "alpha3": "MNG", "name": "Mongolia"}, {"tagsId": 34412188, "alpha3": "MNE", "name": "Montenegro"}, {"tagsId": 34412425, "alpha3": "MSR", "name": "Montserrat"}, {"tagsId": 34412321, "alpha3": "MAR", "name": "Morocco"}, {"tagsId": 34412248, "alpha3": "MOZ", "name": "Mozambique"}, {"tagsId": 34412468, "alpha3": "MMR", "name": "Myanmar"}, {"tagsId": 34412330, "alpha3": "NAM", "name": "Namibia"}, {"tagsId": 34412168, "alpha3": "NRU", "name": "Nauru"}, {"tagsId": 34412380, "alpha3": "NPL", "name": "Nepal"}, {"tagsId": 34412382, "alpha3": "NLD", "name": "Netherlands"}, {"tagsId": 34412212, "alpha3": "NCL", "name": "New Caledonia"}, {"tagsId": 34412098, "alpha3": "NZL", "name": "New Zealand"}, {"tagsId": 34412113, "alpha3": "NIC", "name": "Nicaragua"}, {"tagsId": 34412253, "alpha3": "NER", "name": "Niger"}, {"tagsId": 38376341, "alpha3": "NGA", "name": "Nigeria"}, {"tagsId": 34412095, "alpha3": "NIU", "name": "Niue"}, {"tagsId": 34412342, "alpha3": "NFK", "name": "Norfolk Island"}, {"tagsId": 34412060, "alpha3": "MNP", "name": "Northern Mariana Islands"}, {"tagsId": 34412171, "alpha3": "NOR", "name": 
"Norway"}, {"tagsId": 34412083, "alpha3": "OMN", "name": "Oman"}, {"tagsId": 34412272, "alpha3": "PAK", "name": "Pakistan"}, {"tagsId": 34412274, "alpha3": "PLW", "name": "Palau"}, {"tagsId": 34412148, "alpha3": "PSE", "name": "Palestine, State of"}, {"tagsId": 34412265, "alpha3": "PAN", "name": "Panama"}, {"tagsId": 34412399, "alpha3": "PNG", "name": "Papua New Guinea"}, {"tagsId": 34412480, "alpha3": "PRY", "name": "Paraguay"}, {"tagsId": 34412158, "alpha3": "PER", "name": "Peru"}, {"tagsId": 34412313, "alpha3": "PHL", "name": "Philippines"}, {"tagsId": 34412261, "alpha3": "PCN", "name": "Pitcairn"}, {"tagsId": 34412416, "alpha3": "POL", "name": "Poland"}, {"tagsId": 34412337, "alpha3": "PRT", "name": "Portugal"}, {"tagsId": 34412297, "alpha3": "PRI", "name": "Puerto Rico"}, {"tagsId": 34412242, "alpha3": "QAT", "name": "Qatar"}, {"tagsId": 34412235, "alpha3": "ROU", "name": "Romania"}, {"tagsId": 34412232, "alpha3": "RUS", "name": "Russian Federation"}, {"tagsId": 34412053, "alpha3": "RWA", "name": "Rwanda"}, {"tagsId": 34412360, "alpha3": "REU", "name": "R\\u00e9union"}, {"tagsId": 34412075, "alpha3": "BLM", "name": "Saint Barth\\u00e9lemy"}, {"tagsId": 38380789, "alpha3": "SHN", "name": "Saint Helena"}, {"tagsId": 34412190, "alpha3": "KNA", "name": "Saint Kitts and Nevis"}, {"tagsId": 34412141, "alpha3": "LCA", "name": "Saint Lucia"}, {"tagsId": 34411586, "alpha3": "MAF", "name": "Saint Martin (French part)"}, {"tagsId": 34412231, "alpha3": "SPM", "name": "Saint Pierre and Miquelon"}, {"tagsId": 34412162, "alpha3": "VCT", "name": "Saint Vincent and the Grenadines"}, {"tagsId": 34412109, "alpha3": "WSM", "name": "Samoa"}, {"tagsId": 34412143, "alpha3": "SMR", "name": "San Marino"}, {"tagsId": 34411588, "alpha3": "STP", "name": "Sao Tome and Principe"}, {"tagsId": 34412050, "alpha3": "SAU", "name": "Saudi Arabia"}, {"tagsId": 38380807, "alpha3": "SEN", "name": "Senegal"}, {"tagsId": 34412475, "alpha3": "SRB", "name": "Serbia"}, {"tagsId": 34412170, "alpha3": 
"SYC", "name": "Seychelles"}, {"tagsId": 34412308, "alpha3": "SLE", "name": "Sierra Leone"}, {"tagsId": 34412474, "alpha3": "SGP", "name": "Singapore"}, {"tagsId": 34412464, "alpha3": "SXM", "name": "Sint Maarten (Dutch part)"}, {"tagsId": 34412152, "alpha3": "SVK", "name": "Slovakia"}, {"tagsId": 34412061, "alpha3": "SVN", "name": "Slovenia"}, {"tagsId": 34412137, "alpha3": "SLB", "name": "Solomon Islands"}, {"tagsId": 34412155, "alpha3": "SOM", "name": "Somalia"}, {"tagsId": 34412238, "alpha3": "ZAF", "name": "South Africa"}, {"tagsId": 34412055, "alpha3": "SGS", "name": "South Georgia and the South Sandwich Islands"}, {"tagsId": 34412439, "alpha3": "SSD", "name": "South Sudan"}, {"tagsId": 34412356, "alpha3": "ESP", "name": "Spain"}, {"tagsId": 34412435, "alpha3": "LKA", "name": "Sri Lanka"}, {"tagsId": 34412379, "alpha3": "SDN", "name": "Sudan"}, {"tagsId": 34412384, "alpha3": "SUR", "name": "Suriname"}, {"tagsId": 34412040, "alpha3": "SJM", "name": "Svalbard and Jan Mayen"}, {"tagsId": 34412038, "alpha3": "SWZ", "name": "Swaziland"}, {"tagsId": 34412223, "alpha3": "SWE", "name": "Sweden"}, {"tagsId": 34411591, "alpha3": "CHE", "name": "Switzerland"}, {"tagsId": 34412453, "alpha3": "SYR", "name": "Syrian Arab Republic"}, {"tagsId": 34412361, "alpha3": "TWN", "name": "Taiwan, Province of China"}, {"tagsId": 34412129, "alpha3": "TJK", "name": "Tajikistan"}, {"tagsId": 34412085, "alpha3": "TZA", "name": "Tanzania, United Republic of"}, {"tagsId": 34412328, "alpha3": "THA", "name": "Thailand"}, {"tagsId": 34412431, "alpha3": "TLS", "name": "Timor-Leste"}, {"tagsId": 34412192, "alpha3": "TGO", "name": "Togo"}, {"tagsId": 34412204, "alpha3": "TON", "name": "Tonga"}, {"tagsId": 34412405, "alpha3": "TTO", "name": "Trinidad and Tobago"}, {"tagsId": 34412348, "alpha3": "TUN", "name": "Tunisia"}, {"tagsId": 34412131, "alpha3": "TUR", "name": "Turkey"}, {"tagsId": 38381094, "alpha3": "TKM", "name": "Turkmenistan"}, {"tagsId": 34412139, "alpha3": "TCA", "name": "Turks and 
Caicos Islands"}, {"tagsId": 34412290, "alpha3": "TUV", "name": "Tuvalu"}, {"tagsId": 34412251, "alpha3": "UGA", "name": "Uganda"}, {"tagsId": 38381103, "alpha3": "UKR", "name": "Ukraine"}, {"tagsId": 34412114, "alpha3": "ARE", "name": "United Arab Emirates"}, {"tagsId": 34412476, "alpha3": "GBR", "name": "United Kingdom"}, {"tagsId": 34412234, "alpha3": "USA", "name": "United States"}, {"tagsId": 34412117, "alpha3": "URY", "name": "Uruguay"}, {"tagsId": 34412346, "alpha3": "UZB", "name": "Uzbekistan"}, {"tagsId": 34412411, "alpha3": "VUT", "name": "Vanuatu"}, {"tagsId": 34412387, "alpha3": "VEN", "name": "Venezuela, Bolivarian Republic of"}, {"tagsId": 34412246, "alpha3": "VNM", "name": "Viet Nam"}, {"tagsId": 34412220, "alpha3": "VGB", "name": "Virgin Islands, British"}, {"tagsId": 34412089, "alpha3": "VIR", "name": "Virgin Islands, U.S."}, {"tagsId": 34412334, "alpha3": "WLF", "name": "Wallis and Futuna"}, {"tagsId": 34412182, "alpha3": "ESH", "name": "Western Sahara"}, {"tagsId": 34412100, "alpha3": "YEM", "name": "Yemen"}, {"tagsId": 34412396, "alpha3": "ZMB", "name": "Zambia"}, {"tagsId": 34412406, "alpha3": "ZWE", "name": "Zimbabwe"}]
hroberts commented 6 years ago

It would also be great to run against a set of sources from the ABYZ country level sets for which we don't have any active feeds. The sitemaps will be useful mostly to the degree that they give us an alternative for sites that don't have rss feeds.

pypt commented 6 years ago

It turns out that some news websites use the nicely parseable Google News format only for the most recent articles that they want to appear on Google News itself (similar to RSS feeds):

<url>
    <loc>https://www.news.com/2018/some-people-still-dont-like-trump.html</loc>

    <!-- "news" is Google News XML schema -->
    <news:news>
        <news:publication>
            <news:name>Some news website</news:name>
            <news:language>en</news:language>
        </news:publication>
        <news:publication_date>2018-11-20T12:14:04+02:00</news:publication_date>
        <news:title>Some people still don't like Trump</news:title>
    </news:news>
</url>

and offload their older article archive in plain, non-Google News sitemaps which lack some essential information, e.g. article titles and publication dates:

<url>
    <loc>https://www.news.com/1879/lightbulb-invented-by-some-dude-named-edison.html</loc>

    <!-- Exposed archives of some news websites go back a long way,
         e.g. The Atlantic goes as far back as 1857 -->
    <lastmod>1879-01-01T12:14:04+02:00</lastmod>

    <!-- No Google News metadata -->
</url>
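A rough sketch of reading both kinds of entries shown above with the standard library. The namespace URIs are the ones used by the sitemaps.org and Google News sitemap schemas; the function name and the dictionary keys are assumptions made for this example:

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Optional

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "news": "http://www.google.com/schemas/sitemap-news/0.9",
}


def parse_urlset(xml_text: str) -> List[Dict[str, Optional[str]]]:
    """Return one dict per <url>; Google News fields are None when absent."""
    pages = []
    root = ET.fromstring(xml_text)
    for url in root.findall("sm:url", NS):
        pages.append({
            "url": url.findtext("sm:loc", default=None, namespaces=NS),
            "lastmod": url.findtext("sm:lastmod", default=None, namespaces=NS),
            # Present only in Google News sitemaps
            "news_title": url.findtext(
                "news:news/news:title", default=None, namespaces=NS),
            "news_publish_date": url.findtext(
                "news:news/news:publication_date", default=None, namespaces=NS),
        })
    return pages
```

Entries where both `news_title` and `news_publish_date` are `None` are the ones that would need title and date guessing.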

I think it would be great to backfill our database with news articles dating back to the inception of a particular news source, but there are some problems with ingesting those non-Google News sitemaps:

  • We wouldn't know the article's title, so we'd have to make a heuristic guess somehow -- look into <h1>, use Readability (which provides a .title() method), or look for the title in <title>:

    <head>
        <title>Trump still not liked by some -- News Website</title>
    </head>

    <h1>Trump still not liked by some</h1>

    ...

  • We'd have to guess the publication date too; the "last modified" date (from <lastmod>) is a good start, and we have Colin's date_guesser Python module.
  • There's no guarantee that pages linked from the plain XML sitemap are news articles at all -- they could be just a bunch of static pages.

So, with #518 (https://github.com/berkmancenter/mediacloud/pull/518) merged and (soon to be) deployed, the idea is to:

  1. Fetch all the sitemap-derived URLs from a limited set of mediums (US Top Online News and US Mainstream Media) into media_sitemap_pages
  2. Use the media_sitemap_pages for initial statistics on:
    • How many URLs linked from sitemaps are not news articles?
    • How many news websites use Google News and how many articles do they expose to Google?
    • How far back do the archived news articles reach, and do we want this data, i.e. is it of any advantage to store news articles from the year 1879?
  3. With the initial sampling done with the US media, do the same for the rest of the mediums and see whether we can get international news article archives too.

hroberts commented 6 years ago

This looks like a great approach to me. The main deliverable here is to understand the coverage and nature of these sitemaps and how they relate to our existing data. So doing a big data collection run and storing it in a way that makes the resulting data easy to play with is perfect.

Note that we already have mediawords.util.parse_html.html_title, which we use for all topic spidered stories and which works pretty well. It looks for a few different <meta>-type tags before falling back to the <title> tag (and ultimately to the URL, but that is used only very rarely, usually for things like PDF files). I'm sure it could be improved a bit by running it against your proposed database of sitemap results, generating a list of sites that have to fall back to the <title> tag, and looking for any <meta> tags they use.

Also, if we are reaching far back into time to recover old stories, we have to do more careful deduping than we do in the regular crawler. That's what the existing code in ImportStories.pm does (it basically applies the topic spidering deduping to the imported stories). That module was designed to be pluggable for importing old stories from any source, but for a long time the only implementations have been for scraped HTML and Feedly, and the only one we use regularly is Feedly. That code is all in Perl and could use a refactor when it gets rewritten in Python, but the approach it uses has been well validated and tested over a couple of years.
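To make the fallback order described for html_title concrete, here is a small stand-alone sketch. It is not the mediawords implementation; the `guess_title` name and the particular `<meta>` keys checked (`og:title`, `dc.title`) are assumptions for illustration:

```python
from html.parser import HTMLParser


class _TitleCollector(HTMLParser):
    # Assumed set of <meta> keys; the real html_title may check others
    META_KEYS = ("og:title", "dc.title")

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.meta_title = None
        self.title_parts = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and self.meta_title is None:
            key = (attrs.get("property") or attrs.get("name") or "").lower()
            if key in self.META_KEYS and attrs.get("content"):
                self.meta_title = attrs["content"].strip()
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)


def guess_title(html: str, url: str) -> str:
    """Prefer a <meta> title, then the <title> tag, and finally the URL."""
    collector = _TitleCollector()
    collector.feed(html)
    collector.close()
    title_tag = "".join(collector.title_parts).strip()
    return collector.meta_title or title_tag or url
```

Running something like this over the collected sitemap pages and counting how often each branch is taken would give the list of sites that fall back to the <title> tag.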

hroberts commented 6 years ago

As mentioned in the meeting, I think we should run the ImportStories code to figure out how many stories from each source are new and how many are duplicates. The ImportStories code uses the topic spider to dedup stories based on both lossily normalized URLs and on title part matching ('LA Times: Trump Send Army to Border' = 'Trump Send Army to Border'). The module supports a dry_run option that allows you to run it without actually importing anything.
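As a rough illustration of those two checks (not the actual ImportStories or topic spider logic), a sketch in Python might look like the following. The normalization rules, the separators, and the three-word minimum for a title part are all assumptions chosen for the example:

```python
import re
from typing import Dict, Set
from urllib.parse import urlsplit


def normalize_url_lossy(url: str) -> str:
    """Drop scheme, 'www.', query, fragment and trailing slash; lowercase the rest."""
    parts = urlsplit(url.strip().lower())
    host = parts.netloc
    if host.startswith("www."):
        host = host[4:]
    return host + parts.path.rstrip("/")


def title_parts(title: str) -> Set[str]:
    """Split a title on common separators and keep the headline-sized parts."""
    pieces = re.split(r"\s*[:|]\s*|\s+--?\s+", title.lower().strip())
    # Assumed threshold: short parts such as "la times" are ignored
    return {p.strip() for p in pieces if len(p.split()) >= 3}


def is_duplicate(story_a: Dict[str, str], story_b: Dict[str, str]) -> bool:
    """Two stories match on a normalized URL or on any shared title part."""
    if normalize_url_lossy(story_a["url"]) == normalize_url_lossy(story_b["url"]):
        return True
    return bool(title_parts(story_a["title"]) & title_parts(story_b["title"]))
```

With these rules the two example titles above are treated as the same story, because "trump send army to border" is a part of both.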

hroberts commented 5 years ago

Spent some time looking at the sitemap data at a high level.

The first and most important finding is that only 1 million of the 228 million stories collected have titles or publish_dates. This means that we will have to guess titles and publish dates for basically all sitemap-discovered stories. Title guessing is not a big deal, but date guessing is more of a problem, because we know that its accuracy is not great.

Just looking at the nyt sitemaps, I see there is a tag. Are we parsing that into the news_publish_date field? If not, we should try collecting that data and using it as a publish_date. It will almost certainly be more accurate than our date guessing module.

Below is a list of story counts for the most prevalent media. Other than nyt, I think we probably don't want most of the first few most common media, just because there are so many of them. We should make a decision for each of them about whether it is worth the cost to import all of those stories.

 34750085 |     2267 | Василисина Обитель                                                                                                                                                                                                                         
 17118432 |     1803 | 7Светлана, 44, г.Москва                                                                                                                                                                                                                    
 15962101 |        1 | New York Times                                                                                                                                                                                                                             
  6302615 |   506465 | manualzz.com                                                                                                                                                                                                                               
  5773154 |     2407 | iLL_Fancy                                                                                                                                                                                                                                  
  5353847 |     2262 | ЛюбимаяМая                                                                                                                                                                                                                                 
  5000406 |   510437 | origin-mnr.barnesandnoble.com                                                                                                                                                                                                              
  4667684 |     4429 | iTunes Top 25 Songs                                                                                                                                                                                                                        
  3723626 |   510268 | hosts.blogtalkradio.com                                                                                                                                                                                                                    
  3430686 |    40944 | Bloomberg                                                                                                                                                                                                                                  
  2846046 |     1763 | Новости - главные новости России, СНГ и мира - лента новостей ИА REGNUM                                                                                                                                                                    
  2812424 |   512439 | worldofbooks.com                                                                                                                                                                                                                           
  2682274 |     1747 | Daily Mail                                                                                                                                                                                                                                 
  2464631 |     1750 | Daily Telegraph                                                                                                                                                                                                                            
  2362600 |     4415 | CNET                                                                                                                                                                                                                                       
  2256033 |      101 | Washington Times                                                                                                                                                                                                                           
  2143696 |     1708 | Deutsche Welle                                                                                                                                                                                                                             
  1964392 |   516706 | linkfang.de                                                                                                                                                                                                                                
  1905183 |   516235 | pressebox.de                                                                                                                                                                                                                               
  1859940 |   510346 | alfa.com                                                                                                                                                                                                                                   
  1692212 |   491464 | ctd.mdibl.org                                                                                                                                                                                                                              
  1642342 |       39 | South Florida Sun-Sentinel                                                                                                                                                                                                                 
  1596670 |      570 | BBC NEWS | Nick Robinson's Newslog                                                                                                                                                                                                         
  1562258 |     1092 | FOX News                                                                                                                                                                                                                                   
  1532200 |       10 | houstonchronicle                                                                                                                                                                                                                           
  1472130 |       57 | The Buffalo News                                                                                                                                                                                                                           
  1351737 |   518781 | digital.zlb.de                                                                                                                                                                                                                             
  1254447 |     6218 | Buzzfeed                                                                                                                                                                                                                                   
  1243544 |   496688 | arabstoday.net                                                                                                                                                                                                                             
  1188794 |   515521 | lr-online.de                                                                                                                                                                                                                               
  1085938 |       14 | sfchronicle                                                                                                                                                                                                                                
  1060294 |      135 | Hugh Hewitt: Townhall.com                                                                                                                                                                                                                  
  1002395 |   509507 | maesaelephantcamp.com                                                                                                                                                                                                                      
   984800 |     1728 | Gazeta.ru                                                                                                                                                                                                                                  
   903924 |   515011 | anwalt.de                                                                                                                                                                                                                                  
   903447 |     1636 | daily legal news                                                                                                                                                                                                                           
   873577 |     2751 | Каждый из нас ответственен за чувства,которые испытывает,и обвянить в этом друг                                                                                                                                                            
   862376 |     1212 | TeleFutura                                                                                                                                                                                                                                 
   857539 |       40 | The Seattle Times                                                                                                                                                                                                                          
   851116 |     2843 | Без названия                                                                                                                                                                                                                               
   841816 |     1775 | Московский Комсомолец: происшествия, общество, культура, мнения, интервью                                                                                                                                                                  
   821810 |   516540 | thurgauerzeitung.ch                                                                                                                                                                                                                        
   816133 |     1752 | cbs news                                                                                                                                                                                                                                   
   795496 |   509283 | us.vwr.com                                                                                                                                                                                                                                 
   792450 |     1729 | Lenta.ru                                                                                                                                                                                                                                   
   776275 |   495043 | springermedizin.de                                                                                                                                                                                                                         
   765209 |     4363 | Ugnich Anton (ugnich) - Juick                                                                                                                                                                                                              
   759803 |     3293 | Filmz.RU. Новости кино                                                                                                                                                                                                                     
   753705 |   517867 | publications.rwth-aachen.de                                                                                                                                                                                                                
   730469 |   520105 | inspirock.com                                                                                                                                                                                                                              
   721470 |     4048 | Украинский Бизнес Ресурс - Украинские бизнес блоги.                                                                                                                                                                                        
   711344 |       45 | Pittsburgh Post-Gazette                                                                                                                                                                                                                    
   688194 |     1731 | RFE/RL                                                                                                                                                                                                                                     
   687484 |   518532 | boersennews.de                                                                                                                                                                                                                             
   675243 |    18710 | Business Insider                                                                                                                                                                                                                           
   628896 |   515531 | allmystery.de                                                                                                                                                                                                                              
   593839 |     1774 | Интерфакс                                                                                                                                                                                                                                  
   587748 |   511109 | iegpolicy.agribusinessintelligence.informa.com                                                                                                                                                                                             
   583964 |     1765 | NEWSru.com :: Самые быстрые новости. Фото и видео дня. Лента новостей в России                                                                                                                                                             
   575777 |   495033 | it.reuters.com                                                                                                                                                                                                                             
   574998 |     1727 | Izvestia                                                                                                                                                                                                                                   
   572534 |     1596 | fargo metro news                                                                                                                                                                                                                           
   571000 |   506840 | redshelf.com                                                                                                                                                                                                                               
   555119 |     1097 | Voice of America                                                                                                                                                                                                                           
   551347 |   505045 | studystack.com                                                                                                                                                                                                                             
   535561 |     2094 | Новости Санкт-Петербурга, последние новости дня, новости бизнеса - Фонтанка.ру                                                                                                                                                             
   529080 |   517229 | die-glocke.de                                                                                                                                                                                                                              
   522362 |       38 | The Orlando Sentinel                                                                                                                                                                                                                       
   521840 |     1770 | Правда.Ру: Новости и аналитика                                                                                                                                                                                                             
   521527 |   500070 | fishersci.com                                                                                                                                                                                                                              
   513908 |   516528 | traderscity.com                                                                                                                                                                                                                            
   512182 |     4470 | PopSugar                                                                                                                                                                                                                                   
   502968 |   516533 | immoscout24.ch                                                                                                                                                                                                                             
   500654 |     4115 | bb - Juick                                                                                                                                                                                                                                 
   498149 |     1095 | CNN                                                                                                                                                                                                                                        
   496262 |     1100 | US News & World Report                                                                                                                                                                                                                     
pypt commented 5 years ago

Just looking at the nyt sitemaps, I see there is a tag. Are we parsing that into the news_publish_date field? If not, we should try collecting that data and using it as a publish_date. It will almost certainly be more accurate than our date guessing module.

Some (many?) news websites have a tendency to expose the most recent news articles in a sitemap with Google News additions (including the publication date), but publish only basic sitemaps for their archive.
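As a rough illustration of that split, here is a minimal sketch that reads a sitemap and returns the publication date only where the Google News extension supplies one. It assumes the standard sitemaps.org namespace and the Google News sitemap namespace with its `news:publication_date` element. The function name `parse_news_sitemap` is made up for this example and is not part of the Media Cloud codebase.

```python
import xml.etree.ElementTree as ET

# Namespaces of the sitemaps.org protocol and the Google News sitemap extension.
NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "news": "http://www.google.com/schemas/sitemap-news/0.9",
}


def parse_news_sitemap(xml_text):
    """Return (url, publication_date or None) pairs for every <url> entry.

    Hypothetical helper for illustration only.
    """
    root = ET.fromstring(xml_text)
    stories = []
    for url_el in root.findall("sm:url", NS):
        loc = url_el.findtext("sm:loc", default="", namespaces=NS).strip()
        if not loc:
            continue  # skip malformed entries that have no URL
        pub_date = url_el.findtext(
            "news:news/news:publication_date", default=None, namespaces=NS
        )
        # Basic (archive) sitemaps have no <news:news> block, so the date is None.
        stories.append((loc, pub_date.strip() if pub_date else None))
    return stories
```

Entries that come back with `None` would still need the date guesser.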

I'm a bit more optimistic about date guessing for news articles, since nowadays many of them have Open Graph tags (article:published_time), which were supported at the very least by the old iteration of the date guesser (I think Colin's code can parse Open Graph too). But yeah, it would be a lot of work to do all the fetching and guessing.
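For reference, pulling article:published_time out of a fetched page can be sketched with the standard library alone. This is not the date guesser's actual code. The class and function names below are invented for the example, and it only looks at `<meta property="article:published_time" content="...">`.

```python
from html.parser import HTMLParser


class PublishedTimeParser(HTMLParser):
    """Remembers the first article:published_time meta tag it sees.

    Hypothetical class for illustration only.
    """

    def __init__(self):
        super().__init__()
        self.published_time = None

    def handle_starttag(self, tag, attrs):
        # Only <meta> tags matter, and only the first match is kept.
        if tag != "meta" or self.published_time is not None:
            return
        attr_map = dict(attrs)
        if attr_map.get("property") == "article:published_time":
            self.published_time = attr_map.get("content")


def guess_published_time(html):
    """Return the raw article:published_time value, or None if absent."""
    parser = PublishedTimeParser()
    parser.feed(html)
    return parser.published_time
```

The returned value is the raw string from the page; parsing it into a timestamp and falling back to other heuristics is left out here.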

34750085 | 2267 | Василисина Обитель
17118432 | 1803 | 7Светлана, 44, г.Москва

The way the sitemap fetcher works is that for every medium (source) it:

1) Reads its URL from media
2) Strips off the path part of the URL (https://nytimes.com/article/ -> https://nytimes.com/)
3) Fetches the robots.txt and sitemaps recursively
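Those steps can be sketched roughly as follows. This is a simplified illustration, not the actual fetcher: all function names are invented, and `fetch` is a hypothetical callable that takes a URL and returns the response body as text, so the example stays independent of any HTTP library.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit, urlunsplit

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def homepage_url(media_url):
    """Step 2: drop path, query and fragment, keeping scheme and host."""
    parts = urlsplit(media_url)
    return urlunsplit((parts.scheme, parts.netloc, "/", "", ""))


def sitemaps_from_robots(robots_txt):
    """Return the URLs listed on 'Sitemap:' lines of a robots.txt body."""
    urls = []
    for line in robots_txt.splitlines():
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "sitemap" and value.strip():
            urls.append(value.strip())
    return urls


def collect_page_urls(sitemap_url, fetch, seen):
    """Step 3: follow sitemap indexes recursively and return page URLs."""
    if sitemap_url in seen:
        return []  # guard against sitemaps that reference each other
    seen.add(sitemap_url)
    root = ET.fromstring(fetch(sitemap_url))
    locs = [el.text.strip() for el in root.iter(SITEMAP_NS + "loc") if el.text]
    if root.tag == SITEMAP_NS + "sitemapindex":
        pages = []
        for child_sitemap in locs:
            pages.extend(collect_page_urls(child_sitemap, fetch, seen))
        return pages
    return locs


def fetch_media_sitemap_urls(media_url, fetch):
    """Run steps 2 and 3 for one medium URL (step 1 is the database read)."""
    home = homepage_url(media_url)
    seen = set()
    pages = []
    for sitemap_url in sitemaps_from_robots(fetch(home + "robots.txt")):
        pages.extend(collect_page_urls(sitemap_url, fetch, seen))
    return pages
```

Note that `homepage_url` discards the path entirely, so a medium whose URL points at one blog or profile page inside a larger site ends up with the sitemaps of the whole site.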

The first two sources point to individual blogs or profile pages from within a bigger website:

mediacloud=# select media_id, url, name from media where media_id in (2267, 1803);
 media_id |                           url                           |          name           
----------+---------------------------------------------------------+-------------------------
     1803 | http://mylove.ru/7svetlana/diary/                       | 7Светлана, 44, г.Москва
     2267 | http://www.liveinternet.ru/community/lj_vasilisa_ogneva | Василисина Обитель
(2 rows)

So essentially the collected links of the top 2 results are full archives of liveinternet.ru, a rather popular blogging platform in Russia, and mylove.ru, a dating website!

rahulbot commented 5 years ago

Ha! Clearly we need to filter for news sites, perhaps by iterating through all of the geographic collections (tag_sets_id=15765102) and only processing sites that are in one of those collections.

rahulbot commented 5 years ago

I think we're punting on this until there is a specific project that needs it. We have an approach for collecting via sitemaps now, and understand the problems that will occur related to title and date guessing. Closing until we pick a list of sites we want to run this on.