This project enables the automatic extraction of semantic metadata from YouTube videos using their links. The extracted metadata is provided as a JSON object.
The pipeline has the following structure:
First, the user provides a YouTube link, which is received by FastAPI. FastAPI forwards this link to both PyTube and YT-DLP to obtain the video and essential data from YouTube, including the audio transcript. Then, PySceneDetect is employed to segment the video into meaningful scenes. Katna is utilized to extract the most significant keyframes representing each scene. Using these keyframes and the audio transcript, a Vision-Language Model (VLM) extracts important metadata from the scenes. Since the metadata generated by the VLM may lack context for the entire scene or video, a Large Language Model (LLM) is then used to contextualize this data. Finally, the contextualized data from the LLM is combined with the metadata already provided by PyTube to form the following metadata object:
{
"MetaDataObject": {
"youtube_title": "str",
"youtube_description": "str",
"published_date": "str",
"youtube_video_id": "str",
"youtube_thumbnail_url": "str",
"youtube_rating": "str",
"youtube_views": "str",
"youtube_age_restricted": "str",
"youtube_keywords": ["str"],
"youtube_author": "str",
"youtube_channel_id": "str",
"youtube_length": "int",
"url": "str",
"llm_description": "str",
"learning_resource_type": "str",
"intended_end_user_role": "str",
"context": "str",
"difficulty_level": "str",
"discipline": "str",
"educational_level": "str",
"target_audience_age": "str",
"typical_learning_time": "str",
"scene_objects": ["SceneObject"]
},
"SceneObject":{
"duration": "str",
"scene_start": "str",
"scene_end": "str",
"title": "str",
"caption": "str",
"key-concepts": "str",
"questions": "str",
"text": "str",
"resources": "str",
"language": "str",
"video_type": "str"
}
}
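A rough sketch of the first stages of this pipeline is shown below (illustrative only; file names, options, and the exact calls used in main.py may differ):

from pytube import YouTube
from yt_dlp import YoutubeDL
from scenedetect import detect, ContentDetector

url = "https://www.youtube.com/watch?v=<VIDEO_ID>"  # placeholder link

# Basic metadata via PyTube
yt = YouTube(url)
basic_metadata = {
    "youtube_title": yt.title,
    "youtube_author": yt.author,
    "youtube_length": yt.length,
}

# Download the video file with YT-DLP
with YoutubeDL({"format": "mp4", "outtmpl": "video.mp4"}) as ydl:
    ydl.download([url])

# Segment the downloaded video into scenes with PySceneDetect
scenes = detect("video.mp4", ContentDetector())
for start, end in scenes:
    print(start.get_timecode(), end.get_timecode())

# Keyframe extraction (Katna), VLM metadata extraction and LLM
# contextualization then run on top of these scenes.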
To install the project's Python libraries, run the following command in your terminal:
pip install -r requirements.txt
Additionally, to install Flash Attention, run:
pip install flash-attn --no-build-isolation
As with most Python projects, we recommend setting up a virtual environment.
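For example, using the built-in venv module:
python -m venv .venv
source .venv/bin/activate
Activate the environment before running the pip commands above.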
To install the project's non-Python dependencies, run the equivalent of the following for your system. On Debian/Ubuntu:
sudo apt-get install ffmpeg imagemagick
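On macOS with Homebrew, for example:
brew install ffmpeg imagemagick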
The models are downloaded from Hugging Face when running the program for the first time. To download Mistral-7B-Instruct-v0.3, you first need to create a Hugging Face account and request access on the model page. After access has been granted, create an access token with write permissions and place it in the constants file. All models should then download without any problems.
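As an illustration, the token might be stored and used like this (the variable and file names are assumptions; check the constants file in this repository for the expected name):

# constants.py (hypothetical variable name)
HF_TOKEN = "hf_..."  # your Hugging Face access token

# before the first model download, e.g. in the pipeline setup
from huggingface_hub import login
from constants import HF_TOKEN

login(token=HF_TOKEN)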
We offer multiple ways to run the pipeline, depending on the level of interaction you need.
To avoid import problems when running the scripts below, make sure you have added this repository to your Python path by running:
export PYTHONPATH=$(pwd)
We offer a script which you can run locally. To do this, run:
python main.py <YOUR_YOUTUBE_LINK>
For the least interaction with the actual code, you can start the demo server by running:
python server.py
and access the front-end at localhost:8000.
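The server can also be called without the front-end, for example with requests. The route and parameter name below are hypothetical, so check server.py for the actual endpoint:

import requests

# Hypothetical route and parameter name; see server.py for the real ones.
response = requests.post(
    "http://localhost:8000/extract",
    json={"url": "https://www.youtube.com/watch?v=<VIDEO_ID>"},
)
print(response.json())  # the metadata object described above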
For the evaluation of this project, we created a small dataset containing manually written captions for 10 learning videos from YouTube. The datasets containing the manual and automated captions can be found in the eval directory.
To execute the evaluation, cd into the eval directory and run:
python eval_predictions.py
The scripts for calculating the metrics were taken from pycocoevalcap.
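For reference, these scorers are typically called on dictionaries mapping IDs to lists of reference and candidate captions, roughly as follows (a usage sketch of pycocoevalcap, not necessarily identical to eval_predictions.py):

from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.cider.cider import Cider

# Keys identify a video or scene; values are lists of caption strings.
references = {"vid1": ["a lecturer explains linear regression on a whiteboard"]}
candidates = {"vid1": ["someone explains linear regression using a whiteboard"]}

bleu, _ = Bleu(4).compute_score(references, candidates)   # BLEU-1..BLEU-4
cider, _ = Cider().compute_score(references, candidates)
print("BLEU:", bleu, "CIDEr:", cider)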