Closed rahulbot closed 4 years ago
We've now got giant CSV files for all Univision stories published in 2019 and (so far) in 2020, with URL and pub date. Based on a suggestion from Hal, I've converted these to simple RSS files. @hroberts - do you think it'll work to post those two RSS files I made somewhere temporarily, add their URLs as feeds to the Univision source, and let them be sucked in? It is about 15k total URLs.
yes, that should just work.
-hal
Do they have titles, publication dates and all that?
We can put them on our S3.
Just URL and date. Is that OK if all the RSS <item>s have only <link> and <pubDate>?
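For reference, a minimal sketch of the kind of CSV-to-RSS conversion being discussed, not the actual script used. It assumes the CSV has `url` and `pub_date` columns with dates formatted as `YYYY-MM-DD`; those column names, the date format, and the channel metadata are all placeholders. Note that RSS 2.0 expects each item to carry at least a title or a description, so items with only a link and a pubDate are not strictly valid and depend on the consumer being lenient.

```python
import csv
import io
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from email.utils import format_datetime


def csv_to_rss(csv_text, feed_title="Univision backfill", feed_link="http://univision.com"):
    """Build an RSS 2.0 document whose items carry only <link> and <pubDate>.

    Assumes (hypothetically) a header row with 'url' and 'pub_date' columns.
    """
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = feed_title
    ET.SubElement(channel, "link").text = feed_link
    ET.SubElement(channel, "description").text = "Backfill feed generated from a CSV export"

    for row in csv.DictReader(io.StringIO(csv_text)):
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "link").text = row["url"]
        # The CSV date has no time zone, so treat it as midnight UTC.
        published = datetime.strptime(row["pub_date"], "%Y-%m-%d").replace(tzinfo=timezone.utc)
        # RSS expects RFC 822 style dates; format_datetime produces them.
        ET.SubElement(item, "pubDate").text = format_datetime(published)

    return ET.tostring(rss, encoding="unicode")
```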
We'll want a <title> too. Hal has some advanced code that tries to guess the title from the page's HTML (AFAIK going for HTML's <title> is not enough, you have to figure out if pages have some sort of a suffix too, e.g. "This is a title - Univision"), but I'm not sure where it's placed.
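To illustrate the suffix problem only, here is a rough sketch of pulling the HTML <title> and stripping a trailing site name. This is not Hal's code, which likely handles many more cases; `guess_title` and its `site_name` parameter are hypothetical names, and the sketch assumes the suffix is the site name preceded by a `-`, `|` or `:` separator.

```python
import re
from html.parser import HTMLParser


class _TitleParser(HTMLParser):
    """Collects the text of the first <title> element in a page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self._in_title = False
        self._done = False

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        # Text may arrive in several chunks, so accumulate it.
        if self._in_title:
            self.title += data


def guess_title(html, site_name):
    """Return the page title with a trailing ' - <site_name>' style suffix removed."""
    parser = _TitleParser()
    parser.feed(html)
    title = " ".join(parser.title.split())  # collapse whitespace and newlines
    suffix = r"\s*[-|:]\s*" + re.escape(site_name) + r"\s*$"
    return re.sub(suffix, "", title, flags=re.IGNORECASE)
```

A fixed site name is the simplest possible heuristic; a more robust approach would have to detect the suffix that a source's pages share rather than be told what it is.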
We don’t have title, so we’ll have to rely on the existing code that pulls it from the HTML. Does the RSS ingest processing code do this by default if title element is missing?
> Does the RSS ingest processing code do this by default if title element is missing?
I don't think so. One of the sitemap ingestion subtasks is guessing title and publication date for every URL.
err, linas is right. we need titles in the rss. next easiest is your idea about importing them into a topic. if you send me the csv on email I'll do the topic import.
We imported a big dump CSV manually, so we can ignore this until broader sitemap support is ready.
Fernando has a collaborator trying to run a project using Univision, but since we lost API access a bit back we don't have good content. This seems like a good source to just run the sitemap import on. Their main http://univision.com domain has loads of sitemaps going back years.
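As a rough idea of what reading one of those sitemaps involves, here is a small sketch that extracts URLs and last-modified dates from a sitemap document. It is not the Media Cloud sitemap importer; `parse_sitemap` is a hypothetical helper, it handles only a plain urlset (not sitemap index files), and it does no fetching. Sitemaps carry no titles, so the title and publication date guessing mentioned earlier in the thread would still be needed.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_sitemap(xml_text):
    """Return a list of (url, lastmod) tuples; lastmod is None when absent."""
    root = ET.fromstring(xml_text)
    entries = []
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", default="", namespaces=SITEMAP_NS).strip()
        lastmod = url.findtext("sm:lastmod", default=None, namespaces=SITEMAP_NS)
        if loc:
            entries.append((loc, lastmod.strip() if lastmod else None))
    return entries
```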
There are a number of media sources for Univision in MC. From a quick review, it looks to me like 305166 is the "canonical" one, even though, oddly, 270007 is the one in the "Left" collection.
I don't think we need #605 or #690 solved before importing this. Thoughts?