mediacloud / backend

Media Cloud is an open source, open data platform that allows researchers to answer quantitative questions about the content of online media.
http://www.mediacloud.org
GNU Affero General Public License v3.0

run sitemaps import on Univision #691

Closed rahulbot closed 4 years ago

rahulbot commented 4 years ago

Fernando has a collaborator trying to run a project using Univision, but since we lost API access a while back we don't have good content. This seems like a good source on which to just run the sitemap import. Their main http://univision.com domain has loads of sitemaps going back years.

There are a number of media sources for Univision in MC. From a quick review, 305166 looks to me like the "canonical" one, though oddly 270007 is the one in the "Left" collection.

I don't think we need #605 or #690 solved before importing this. Thoughts?

rahulbot commented 4 years ago

We've now got giant CSV files of all Univision stories published in 2019 and (so far) in 2020, with URL and publication date. Based on a suggestion from Hal, I've converted these to simple RSS files. @hroberts - do you think it would work to post those two RSS files somewhere temporarily, add their URLs as feeds to the Univision source, and let them be sucked in? It's about 15k URLs total.
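For illustration, the CSV-to-RSS conversion could be sketched as below. The column names `url` and `pub_date` are assumptions (the real dump's headers may differ), and `pub_date` is assumed to already be an RFC 822 date string as RSS 2.0 expects:

```python
import csv
import io
from xml.sax.saxutils import escape

def csv_to_rss(csv_text, feed_title):
    """Convert a CSV of (url, pub_date) rows into a minimal RSS 2.0 feed.

    Assumed header columns: "url" and "pub_date". Each <item> carries
    only <link> and <pubDate>, matching the minimal feeds described above.
    """
    items = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        items.append(
            "<item><link>%s</link><pubDate>%s</pubDate></item>"
            % (escape(row["url"]), escape(row["pub_date"]))
        )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>'
        '<rss version="2.0"><channel><title>%s</title>%s</channel></rss>'
        % (escape(feed_title), "".join(items))
    )
```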

hroberts commented 4 years ago

yes, that should just work.

-hal

pypt commented 4 years ago

Do they have titles, publication dates and all that?

pypt commented 4 years ago

We can put them on our S3.

rahulbot commented 4 years ago

Just URL and date. Is it OK if all the RSS <item>s have only <link> and <pubDate>?

pypt commented 4 years ago

We'll want a <title> too. Hal has some code that tries to guess the title from the page's HTML (AFAIK grabbing the HTML <title> isn't enough; you also have to figure out whether pages append some sort of site-name suffix, e.g. "This is a title - Univision"), but I'm not sure where that code lives.
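A minimal sketch of the suffix-stripping idea described above; this is an illustration, not Hal's actual code, and the separator list is a guess:

```python
# Hypothetical site-name separators; real titles may use others.
SUFFIX_SEPARATORS = [" - ", " | "]

def strip_title_suffix(html_title, site_name):
    """Strip a trailing "separator + site name" suffix from an HTML <title>.

    A simplified illustration of guessing the article title from the
    page's <title> element, not the Media Cloud implementation.
    """
    for sep in SUFFIX_SEPARATORS:
        suffix = sep + site_name
        if html_title.endswith(suffix):
            return html_title[: -len(suffix)]
    return html_title
```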

rahulbot commented 4 years ago

We don't have titles, so we'll have to rely on the existing code that pulls them from the HTML. Does the RSS ingest code do this by default if the <title> element is missing?

pypt commented 4 years ago

Does the RSS ingest code do this by default if the <title> element is missing?

I don't think so. One of the sitemap ingestion subtasks is guessing title and publication date for every URL.

hroberts commented 4 years ago

err, linas is right. we need titles in the rss. next easiest is your idea about importing them into a topic. if you email me the csv I'll do the topic import.

rahulbot commented 4 years ago

We imported a big CSV dump manually, so we can ignore this until broader sitemap support is ready.