iejMac / video2dataset

Easily create large video dataset from video urls
MIT License

YouTube metadata is not saved #319

Open libeanim opened 8 months ago

libeanim commented 8 months ago

Issue

When using video2dataset (1.3.0) to download YouTube videos, I set the following entry in the config to retrieve metadata:

reading:
    yt_args:
        download_size: 360
        download_audio_rate: 44100
        yt_metadata_args:
            writesubtitles: 'all'
            subtitleslangs: ['en', 'de', 'es', 'fr', 'it', 'nl', 'pl', 'ru']
            writeautomaticsub: True
            get_info: True
    timeout: 60
    sampler: null

But in the resulting JSON files, the entry "yt_meta_dict": {} is empty, even though get_info: True is set in the config.
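To see how widespread the problem is, a small script can count how many of the per-sample JSON files have a populated "yt_meta_dict" and how many have an empty one. This is only a sketch: it assumes the dataset was written as loose files with one JSON sidecar per sample, and the function name count_yt_metadata and the directory argument are made up for illustration, not part of video2dataset.

```python
import json
from pathlib import Path


def count_yt_metadata(output_dir):
    """Return (populated, empty) counts of `yt_meta_dict` over all *.json files.

    Assumption: each sample has a JSON sidecar somewhere below `output_dir`.
    Files without a `yt_meta_dict` key (or that are not valid JSON objects)
    are skipped rather than counted.
    """
    populated = 0
    empty = 0
    for path in sorted(Path(output_dir).rglob("*.json")):
        try:
            with path.open(encoding="utf-8") as f:
                sample = json.load(f)
        except (OSError, json.JSONDecodeError):
            continue  # unreadable or malformed file
        if not isinstance(sample, dict) or "yt_meta_dict" not in sample:
            continue  # not a per-sample metadata file
        if sample["yt_meta_dict"]:
            populated += 1
        else:
            empty += 1
    return populated, empty
```

Calling count_yt_metadata("./dataset") on the output folder gives a quick populated/empty ratio without opening files by hand.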

How to reproduce

For example, take https://www.youtube.com/embed/JFUsP1coIKM as the link. When I download it with yt-dlp:

yt-dlp -N 2 \
       --write-subs --convert-subs srt \
       --write-info-json --embed-subs --embed-chapters --embed-metadata \
       --no-progress -q \
       --format 'b[height<=360][ext=mp4]' \
       --output './demo.mp4' \
       https://www.youtube.com/embed/JFUsP1coIKM

I get YouTube metadata such as "categories": ["Entertainment"], "tags": ["Deutsche", "Welle", "Made", "in", "Germany", "Bio", "Lettland", "Getreide"]

But with video2dataset it looks like this:

    "caption": "\"Volles Korn voran\" 28. November 2008 Beitrag \u00fcber den \u00f6kologischen Teil des Ackerbaus von german",
    "url": "https://www.youtube.com/embed/JFUsP1coIKM",
    "key": "0000000",
    "status": "success",
    "error_message": null,
    "yt_meta_dict": {},
    "video_metadata": {...

pabl0 commented 8 months ago

Are you getting empty yt_meta_dict for just some videos or all of them?

What I am seeing is that for every 300 videos, roughly 100 have yt_meta_dict populated and 200 have yt_meta_dict = {}, which is quite strange.

What exactly does ignoring errors in yt_dlp mean? Does it give up on the first try even if you have retries?

https://github.com/iejMac/video2dataset/blob/28e7d1c851a2298f3a75375f6e324950405987e7/video2dataset/data_reader.py#L74

Other yt_dlp codepaths don't seem to set this.

pabl0 commented 8 months ago

Ahh! Now I understand what happens: with multiple clips, only the first one (_00000.json) will have yt_meta_dict populated, not the following clips.

It seems this was a change introduced by the clipping subsampler refactoring (#275). Did it behave differently in v1.2.0?

https://github.com/iejMac/video2dataset/blob/28e7d1c851a2298f3a75375f6e324950405987e7/video2dataset/subsamplers/clipping_subsampler.py#L181-L183

I am not sure if this is a good idea. Depending on your processing pipeline, you might want to have the same metadata available on all the clips.
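As a stopgap for already-downloaded data, the metadata from the populated clip could be copied to its sibling clips in a post-processing step. The sketch below is not video2dataset code: it assumes clips are stored as loose JSON files named <key>_<clip index>.json (as with the _00000.json suffix mentioned above), and the helper name propagate_yt_metadata is hypothetical. It copies the whole dict verbatim, so if yt_meta_dict holds anything clip-specific, that would need separate handling.

```python
import json
import re
from collections import defaultdict
from pathlib import Path

# Assumed naming scheme: "<key>_<clip index>.json", e.g. "0000000_00000.json".
CLIP_NAME = re.compile(r"^(?P<key>.+)_(?P<clip>\d+)$")


def propagate_yt_metadata(output_dir):
    """Copy a populated `yt_meta_dict` to sibling clips where it is empty.

    Returns the number of JSON files that were rewritten.
    """
    # Group clip files by (directory, video key).
    groups = defaultdict(list)
    for path in Path(output_dir).rglob("*.json"):
        match = CLIP_NAME.match(path.stem)
        if match:
            groups[(path.parent, match["key"])].append(path)

    updated = 0
    for paths in groups.values():
        samples = {}
        for path in sorted(paths):
            try:
                data = json.loads(path.read_text(encoding="utf-8"))
            except (OSError, json.JSONDecodeError):
                continue  # skip unreadable or malformed files
            if isinstance(data, dict):
                samples[path] = data

        # Use the first clip that actually carries metadata as the donor.
        donor = next(
            (s["yt_meta_dict"] for s in samples.values() if s.get("yt_meta_dict")),
            None,
        )
        if donor is None:
            continue  # no clip of this video has metadata to copy

        for path, sample in samples.items():
            if "yt_meta_dict" in sample and not sample["yt_meta_dict"]:
                sample["yt_meta_dict"] = donor
                path.write_text(json.dumps(sample), encoding="utf-8")
                updated += 1
    return updated
```

Videos for which no clip has any metadata are left untouched, so a genuinely failed metadata fetch remains visible afterwards.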

rom1504 commented 8 months ago

I agree, duplicating the metadata makes more sense, especially given the size of the data.
