ivrit-ai / ivrit.ai

ivrit.ai codebase
MIT License
24 stars 9 forks source link

Implement Podcast Data Enrichment Pipeline Using OpenAI API #28

Closed yanirmr closed 7 months ago

yanirmr commented 7 months ago

Description

This pull request introduces a comprehensive pipeline for enriching podcast episode data using the OpenAI API. Our primary goal is to extract and add insightful metadata to our existing podcast episode dataset, which includes fields like title, number, duration, and description. The enrichment involves adding information about episode participants, their genders, and general topics, obtained via AI-driven analysis.

Key Components

RSS File Parsing (rss_parser.py):

Interacts with the OpenAI API to analyze podcast descriptions (api_integration.py)

Data Aggregation (data_aggregator.py) - Combines the parsed RSS data with API results.

Implementation Details

Usage

The pipeline is primarily intended for internal use to enhance our podcast dataset. It can be triggered via the data_agregator.py script, which is configured to process a random sample of episodes for initial testing and can be easily adjusted for full-scale processing.

yanirmr commented 7 months ago

@yairl refactored by your kindly suggestions.