mediacloud / backend

Media Cloud is an open source, open data platform that allows researchers to answer quantitative questions about the content of online media.
http://www.mediacloud.org
GNU Affero General Public License v3.0

implement support for podcast ingestion #650

Closed rahulbot closed 4 years ago

rahulbot commented 4 years ago

As per the plan developed in #515, we need to support podcast transcription (via Google). I'm creating this task to track implementation, since #515 tracks the planning.

As mentioned on the recent call, this needs to allow us to mark specific feeds / sources as a "podcast" source to queue it up for transcription. I was thinking this could be a new "feed.type" option, because that would make it easy to set in the UI: [image: Update_Feed___The_Boston_Globe___Source_Manager___Media_Cloud]

hroberts commented 4 years ago

Another option is to add a 'download and transcribe podcasts' boolean and just use the existing 'syndicated' feed type.
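
To make the two proposals concrete, here is a rough sketch of how each would decide whether a feed gets queued for transcription. Everything in it is hypothetical: the `Feed` class, the `FeedType` enum, and the `transcribe_podcasts` flag are illustration-only names, not the actual Media Cloud schema.

```python
from dataclasses import dataclass
from enum import Enum


class FeedType(Enum):
    SYNDICATED = "syndicated"
    PODCAST = "podcast"  # hypothetical new value (the "feed.type" proposal)


@dataclass
class Feed:
    url: str
    type: FeedType = FeedType.SYNDICATED
    # Hypothetical boolean (the "download and transcribe podcasts" proposal).
    transcribe_podcasts: bool = False


def needs_transcription_by_type(feed: Feed) -> bool:
    """Proposal 1: a dedicated feed type marks the feed as a podcast."""
    return feed.type is FeedType.PODCAST


def needs_transcription_by_flag(feed: Feed) -> bool:
    """Proposal 2: keep the 'syndicated' type and add a separate flag."""
    return feed.type is FeedType.SYNDICATED and feed.transcribe_podcasts
```

The type-based variant keeps a single field to set in the UI, while the flag-based variant leaves existing 'syndicated' handling untouched and adds one more column to maintain.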

pypt commented 4 years ago

Deployed podcast transcription services, tried it with an initial podcast, seemed to work, so I've added the rest.

Here's the list:

https://docs.google.com/spreadsheets/d/1nUnhKGaazTgrUDhswCjz8KJ_KTlmSqvrF8VEkb2vQIs/edit#gid=0

All added podcast media sources are in the collection tag set (tag_sets_id = 5) and are tagged with a Podcasts tag (tags_id = 196654054).

rahulbot commented 4 years ago

I'm not sure they got added correctly. For instance, the Glenn Beck podcast feed shows up as "syndicated" when I think it should be "podcast". Am I mistaken?

pypt commented 4 years ago

Oh, and I was wondering why they weren't being fetched!

Updated to podcast, let's see if it works now.

rahulbot commented 4 years ago

The new 'podcast' type feed on that source still hasn't fetched any stories. It looks like that is the Apple Podcasts URL, not the raw feed URL. Are you parsing out feed URLs from the Podcasts link automatically?

To check this out I hacked somebody's script and made a Ruby script that fetches the feed URL from the Apple Podcasts URL. For that Glenn Beck feed it spits out https://feeds.megaphone.fm/BMDC3567910388 as the actual feed URL. Do we need to update the ones hosted at Apple Podcasts to be the raw URLs or not?

Update: I switched it and am waiting for a fetch to happen to see if it works or not.
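
For reference, one way to do that kind of lookup is sketched below in Python. This is not the Ruby script mentioned above and not necessarily what the crawler does; it assumes the public iTunes Lookup endpoint (`https://itunes.apple.com/lookup`), which returns a JSON document whose podcast results carry a `feedUrl` field. The function names are made up for illustration.

```python
import json
import re
import urllib.request
from typing import Optional

# Apple Podcasts page URLs end in ".../id<digits>".
_APPLE_ID_RE = re.compile(r"/id(\d+)")


def apple_podcast_id(url: str) -> Optional[str]:
    """Extract the numeric podcast ID from an Apple Podcasts page URL."""
    match = _APPLE_ID_RE.search(url)
    return match.group(1) if match else None


def feed_url_from_lookup(payload: dict) -> Optional[str]:
    """Return the first non-empty 'feedUrl' in a lookup response, if any."""
    for result in payload.get("results", []):
        if result.get("feedUrl"):
            return result["feedUrl"]
    return None


def resolve_apple_podcast_feed(url: str, timeout: float = 10.0) -> Optional[str]:
    """Resolve an Apple Podcasts page URL to its raw feed URL (network call)."""
    podcast_id = apple_podcast_id(url)
    if podcast_id is None:
        return None
    lookup_url = f"https://itunes.apple.com/lookup?id={podcast_id}&entity=podcast"
    with urllib.request.urlopen(lookup_url, timeout=timeout) as response:
        payload = json.load(response)
    return feed_url_from_lookup(payload)
```

Only `resolve_apple_podcast_feed` touches the network; the two helpers are pure functions, so they can be checked against a saved lookup response.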

pypt commented 4 years ago

I can see 1248 stories from media source 1363086, so maybe the crawler didn't get around to fetching that specific podcast six days ago.

The crawler supports both iTunes Podcasts URLs (e.g. https://podcasts.apple.com/us/podcast/lovett-or-leave-it/id1216346463) and Google Podcasts URLs (e.g. https://podcasts.google.com/?feed=aHR0cHM6Ly93d3cubnByLm9yZy9yc3MvcG9kY2FzdC5waHA_aWQ9NTEwMjg5), meaning that you can add them as feeds.url directly, and the crawler should be able to work out the actual feed URL hiding behind that page on every fetch.
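
As an illustration of what "hiding behind that page" can mean for the Google Podcasts case: the `feed` query parameter in the example URL above looks like the feed URL encoded as URL-safe Base64. The sketch below decodes it on that assumption; it is not taken from the crawler's code, which may resolve the feed differently, and `google_podcasts_feed_url` is a made-up name.

```python
import base64
from typing import Optional
from urllib.parse import parse_qs, urlparse


def google_podcasts_feed_url(url: str) -> Optional[str]:
    """Decode the 'feed' query parameter of a Google Podcasts URL.

    Assumes the parameter is the raw feed URL in URL-safe Base64,
    possibly with the trailing '=' padding stripped.
    """
    values = parse_qs(urlparse(url).query).get("feed")
    if not values:
        return None
    encoded = values[0]
    # Restore any stripped padding so the length is a multiple of 4.
    encoded += "=" * (-len(encoded) % 4)
    return base64.urlsafe_b64decode(encoded).decode("utf-8")


if __name__ == "__main__":
    example = (
        "https://podcasts.google.com/"
        "?feed=aHR0cHM6Ly93d3cubnByLm9yZy9yc3MvcG9kY2FzdC5waHA_aWQ9NTEwMjg5"
    )
    print(google_podcasts_feed_url(example))
    # prints https://www.npr.org/rss/podcast.php?id=510289
```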