ookgezellig / Zimmerman-en-Space-podcast

Webscrape of the Zimmerman en Space podcast, and publication on Wikimedia Commons
https://ookgezellig.github.io/Zimmerman-en-Space-podcast/

Zimmerman en Space go Wiki

Webscrape of the Zimmerman en Space podcast, and (re)publication on Wikimedia Commons (and Zenodo in the future). High 5 for CC0 licenses, space, astronomy and nerds!

Latest update: 17 September 2024


Main result

Episodes 1-92 are now available on Wikimedia Commons:


Step by step process

Make initial scrape map

Excel with scraped data, post-processing

Output of the webscrape, post-processed to make the data suitable as input for Wikimedia Commons, OpenRefine and the Python modules used below: https://ookgezellig.github.io/Zimmerman-en-Space-podcast/ZimmermanEnSpacePodcast_episodes1-92.xlsx

Download mp3s from URL
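
A minimal sketch of this step, using only the Python standard library. It assumes the mp3 URLs have already been read from the Excel file; the function names, the `mp3` target folder and the User-Agent string are illustrative and not taken from this repository.

```python
from pathlib import Path, PurePosixPath
from urllib.parse import unquote, urlparse
from urllib.request import Request, urlopen


def filename_from_url(url: str) -> str:
    """Return the last path segment of a URL, ignoring query and fragment."""
    name = PurePosixPath(unquote(urlparse(url).path)).name
    if not name:
        raise ValueError(f"no file name in URL: {url!r}")
    return name


def download_mp3(url: str, target_dir: str = "mp3") -> Path:
    """Download one file into target_dir; files already present are skipped."""
    out_dir = Path(target_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / filename_from_url(url)
    if out_path.exists():
        return out_path
    # Placeholder User-Agent; replace it with something that identifies you.
    request = Request(url, headers={"User-Agent": "podcast-archive-sketch/0.1"})
    with urlopen(request, timeout=60) as response, open(out_path, "wb") as fh:
        # Stream in 64 KiB chunks so large episodes are not held in memory.
        while chunk := response.read(64 * 1024):
            fh.write(chunk)
    return out_path
```

Calling `download_mp3(url)` for each URL in the spreadsheet column would fill the `mp3` folder; rerunning the loop only fetches files that are still missing.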

Convert .mp3 to .oga, a format suitable for Wikimedia Commons

Converting from mp3 to ogg/oga:
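
One possible way to do the conversion is to call the `ffmpeg` command-line tool from Python. This is a sketch under assumptions: it presumes `ffmpeg` is installed with Vorbis support (`libvorbis`), and it is not necessarily the tool used for the files on Commons. The function names and the default quality value are illustrative.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_command(mp3_path: str, quality: int = 4) -> list[str]:
    """Build the ffmpeg call that turns <name>.mp3 into <name>.oga."""
    src = Path(mp3_path)
    dst = src.with_suffix(".oga")
    return [
        "ffmpeg",
        "-i", str(src),      # input file
        "-vn",               # drop embedded cover art (seen as a video stream)
        "-c:a", "libvorbis",  # encode the audio as Ogg Vorbis
        "-q:a", str(quality),  # variable-bitrate quality setting
        str(dst),
    ]


def convert_to_oga(mp3_path: str, quality: int = 4) -> Path:
    """Run ffmpeg unless the .oga file already exists; return the output path."""
    cmd = build_ffmpeg_command(mp3_path, quality)
    dst = Path(cmd[-1])
    if not dst.exists():
        subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    return dst
```

Keeping the command builder separate from the `subprocess` call makes it easy to print and inspect the exact command before converting all episodes.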

Wikimedia Commons:

Input for OpenRefine

Wikimedia Commons

Category & gallery

Stuff in progress

1 - Audio transcriptions

Full-text audio transcriptions will be added to the Commons files bit by bit over the coming months.

2 - Structured file data / main subject

The main subject (P921) property will be added to the structured data of each Commons file bit by bit over the coming months. These episode subjects/keywords will be extracted from the title and the full-text audio transcription using Named Entity Recognition (NER) techniques, and the entities found will then be reconciled against Wikidata. For current status, see this issue.

For a fully worked example, see S01E01 Tsunami's op Mars.
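
The reconciliation step can be approximated with the Wikidata `wbsearchentities` API, which returns candidate items for a search term. The sketch below assumes the NER step has already produced a list of entity strings; it is not the reconciliation workflow used in this project (which may rely on OpenRefine), and the function names and User-Agent string are illustrative.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

WIKIDATA_API = "https://www.wikidata.org/w/api.php"


def build_search_url(term: str, language: str = "nl", limit: int = 5) -> str:
    """Build a wbsearchentities request URL for one entity string."""
    params = {
        "action": "wbsearchentities",
        "search": term,
        "language": language,
        "type": "item",
        "limit": limit,
        "format": "json",
    }
    return f"{WIKIDATA_API}?{urlencode(params)}"


def extract_candidates(response: dict) -> list[tuple[str, str]]:
    """Return (QID, label) pairs from a wbsearchentities JSON response."""
    return [(hit["id"], hit.get("label", "")) for hit in response.get("search", [])]


def search_wikidata(term: str, language: str = "nl") -> list[tuple[str, str]]:
    """Query Wikidata and return candidate items for manual review."""
    # Placeholder User-Agent; replace it with something that identifies you.
    request = Request(
        build_search_url(term, language),
        headers={"User-Agent": "podcast-reconcile-sketch/0.1"},
    )
    with urlopen(request, timeout=30) as response:
        return extract_candidates(json.load(response))
```

The candidates still need a human check before a QID is written to P921, since a plain label search cannot tell apart items that share a name.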

API

Request info about episode 14, AI en Chat GPT in de sterrenkunde
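
A request like this can be made against the MediaWiki Action API on Commons with `action=query` and `prop=imageinfo`. The sketch below is an illustration, not the call used in this README: the function names and User-Agent string are assumptions, and the file title must be the episode's actual title on Commons.

```python
import json
from typing import Optional
from urllib.parse import urlencode
from urllib.request import Request, urlopen

COMMONS_API = "https://commons.wikimedia.org/w/api.php"


def build_fileinfo_url(file_title: str) -> str:
    """Build an imageinfo query URL for one Commons file."""
    if not file_title.startswith("File:"):
        file_title = "File:" + file_title
    params = {
        "action": "query",
        "titles": file_title,
        "prop": "imageinfo",
        "iiprop": "url|size|mime",
        "format": "json",
        "formatversion": "2",  # returns "pages" as a list
    }
    return f"{COMMONS_API}?{urlencode(params)}"


def first_imageinfo(response: dict) -> Optional[dict]:
    """Return the imageinfo record of the first page, or None if absent."""
    pages = response.get("query", {}).get("pages", [])
    if not pages or "imageinfo" not in pages[0]:
        return None
    return pages[0]["imageinfo"][0]


def fetch_fileinfo(file_title: str) -> Optional[dict]:
    """Call the Commons API and return URL, size and MIME type of the file."""
    # Placeholder User-Agent; replace it with something that identifies you.
    request = Request(
        build_fileinfo_url(file_title),
        headers={"User-Agent": "podcast-api-sketch/0.1"},
    )
    with urlopen(request, timeout=30) as response:
        return first_imageinfo(json.load(response))


# Usage (placeholder title, not the real file name of episode 14):
# info = fetch_fileinfo("File:Example.oga")
```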

SPARQL

Structured data has been added to all files, so we can do some (basic) semantic searching via SPARQL queries.
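
As an illustration, the helper below builds a basic query that lists files whose main subject (P921) is a given Wikidata item. It is a sketch, not one of the queries from this README: the function name is hypothetical, and it assumes the query is run in the Wikimedia Commons Query Service, where the `wdt:` and `wd:` prefixes are predefined. Whether any episode matches depends on which files already carry a P921 statement.

```python
import re


def build_main_subject_query(qid: str, limit: int = 50) -> str:
    """Return a SPARQL query for Commons files with main subject = qid."""
    # Accept only well-formed Wikidata item IDs such as "Q111".
    if not re.fullmatch(r"Q[1-9]\d*", qid):
        raise ValueError(f"not a Wikidata item ID: {qid!r}")
    return (
        "SELECT ?file WHERE {\n"
        f"  ?file wdt:P921 wd:{qid} .\n"
        "}\n"
        f"LIMIT {int(limit)}"
    )


# Q111 is the Wikidata item for Mars.
print(build_main_subject_query("Q111", limit=10))
```

The printed query can be pasted into the query service's web interface.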

Wikidata


Copyright

Episodes 1-92 of the Zimmerman en Space podcast are licensed under the Creative Commons CC0 1.0 license, as stated in the show notes of each episode.