jdepoix / youtube-transcript-api

This is a Python API which allows you to get the transcript/subtitles for a given YouTube video. It also works for automatically generated subtitles, and it requires neither an API key nor a headless browser, unlike Selenium-based solutions!
MIT License
2.87k stars, 326 forks

disabled comments in list_transcript #181

Closed AaronPhilipp closed 1 year ago

AaronPhilipp commented 1 year ago

Hello all,

I have seen a lot of problems caused by disabled comments. I ran into the same issue when I wrote a loop that checks for transcript availability: if a video has comments disabled, my loop stops. I want it to tell me that there are no subtitles without stopping the loop. Because the library raises a special class that tells me why I can't get the list, I can't catch it with an exception; an error would be easier for me to handle.

So is there any way to catch the warning in an exception, or would it make sense to program an error that can be caught with an exception?

Many thanks in advance,

Aaron

jdepoix commented 1 year ago

Hi @AaronPhilipp, I am not 100% sure what you are trying to achieve and what your distinction between exception/error is. Generally speaking, you should be able to catch any exceptions thrown by this module using try/except and ignore them if you don't want the program to stop. Is this what you are looking for? If not, please post the code you are trying to run.

AaronPhilipp commented 1 year ago

Yes! I want to use the try/except statement, but how do I catch the warning when captions are disabled?

jdepoix commented 1 year ago

I am not sure what you mean. This module doesn't provide any functionality that has anything to do with comments 🤔 Please share some code. I can't help you without knowing what you are doing 😄

AaronPhilipp commented 1 year ago

I mean captions of course. Sorry. Long day.

Basically I have this function.

def get_caption(video_id):
    print(video_id)

    df = pd.DataFrame(columns=['video_id', 'subtitle'])

    transcript_list = YouTubeTranscriptApi.list_transcripts(video_id=video_id)
    transcript = transcript_list.find_generated_transcript(['de', 'en'])

    if transcript == 'de ("Deutsch (automatisch erzeugt)")[TRANSLATABLE]':
        result = transcript.fetch()
        text = ''
        for i in result:
            text += i['text'] + ' '
    elif 'en ("Englisch (automatisch erzeugt)")[TRANSLATABLE]':
        translated_transcript = transcript.translate('de')
        result = translated_transcript.fetch()
        text = ''
        for i in result:
            text += i['text'] + ' '

    df = pd.concat([df, pd.DataFrame([{'video_id': video_id, 'subtitle': text}])])

    time.sleep(randint(2, 7))  # short break after request

    return df

Then I have a list of video_ids and pass this list to a loop:

for i in video_ids_list:
    tmp = df = get_caption(video_id=i)

    print(i + ': ok' + ' (' + datetime.datetime.now().strftime("%H:%M:%S") + ')')

    if os.path.isfile(path + 'captions_' + name + '_' + time.strftime("%Y-%m-%d") + '.csv'):
        df = pd.read_csv(path + 'captions_' + name + '_' + time.strftime("%Y-%m-%d") + '.csv',
                         index_col=0)
        df = pd.concat([df, tmp], ignore_index=True)
        df.to_csv((path + 'captions_' + name + '_' + time.strftime("%Y-%m-%d") + '.csv'),
                  encoding='utf-8-sig')
    else:
        tmp.to_csv((path + 'captions_' + name + '_' + time.strftime("%Y-%m-%d") + '.csv'),
                   encoding='utf-8-sig')

Now I don't want to get interrupted by missing captions.

jdepoix commented 1 year ago

Just wrap get_caption in a try/except and skip that iteration:

try:
    tmp = df = get_caption(video_id=i)
except CouldNotRetrieveTranscript:
    continue
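
As a self-contained illustration of this skip-on-exception pattern, here is a minimal sketch that runs offline. The exception class defined here is only a stand-in for the library's exception of the same name, and `collect_captions` and `fake_get_caption` are hypothetical helpers invented for the example; in real code you would use the exception provided by the library and your own `get_caption`.

```python
class CouldNotRetrieveTranscript(Exception):
    """Stand-in for the library's exception, so this sketch runs without network access."""


def collect_captions(video_ids, get_caption):
    """Call get_caption for every ID, skipping IDs whose transcript cannot be retrieved."""
    captions = {}
    skipped = []
    for video_id in video_ids:
        try:
            captions[video_id] = get_caption(video_id)
        except CouldNotRetrieveTranscript:
            # Record the failure and move on instead of stopping the loop.
            skipped.append(video_id)
            continue
    return captions, skipped


def fake_get_caption(video_id):
    """Hypothetical replacement for get_caption that fails for one specific ID."""
    if video_id == "no_captions":
        raise CouldNotRetrieveTranscript(video_id)
    return "text for " + video_id


captions, skipped = collect_captions(["a", "no_captions", "b"], fake_get_caption)
print(captions)
print(skipped)
```

Collecting the skipped IDs in a list is one way to report which videos had no subtitles once the loop has finished.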

Also, you can't compare a Transcript object with a string. So you can't just do transcript == 'de ("Deutsch (automatisch erzeugt)")[TRANSLATABLE]'. You can replace that entire section with:

transcript_list = YouTubeTranscriptApi.list_transcripts(video_id=video_id)
transcript = transcript_list.find_generated_transcript(['de', 'en'])
if transcript.language_code == 'en':
    transcript = transcript.translate('de')
text = ' '.join(snippet['text'] for snippet in transcript.fetch())
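
To see why the attribute comparison works where the string comparison does not, here is a minimal offline sketch. `FakeTranscript` is a hypothetical stand-in that mimics only the three members used above (`language_code`, `translate`, `fetch`); it is not the library's class, and its `translate` merely relabels the snippets instead of translating them.

```python
class FakeTranscript:
    """Hypothetical stand-in for a transcript object; not the library's implementation."""

    def __init__(self, language_code, snippets):
        self.language_code = language_code
        self._snippets = snippets

    def translate(self, language_code):
        # A real translation would change the text; this stand-in only changes the label.
        return FakeTranscript(language_code, self._snippets)

    def fetch(self):
        return self._snippets


def german_text(transcript):
    """Join all snippet texts, translating to German first if the transcript is English."""
    if transcript.language_code == 'en':
        transcript = transcript.translate('de')
    return ' '.join(snippet['text'] for snippet in transcript.fetch())


english = FakeTranscript('en', [{'text': 'hello'}, {'text': 'world'}])

# An object never equals a string unless its class defines such a comparison.
equals_string = (english == 'en')
text = german_text(english)
print(equals_string, text)
```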

AaronPhilipp commented 1 year ago

Great! This works exactly the way I needed! Thank you for your help, and sorry for the question; I'm still learning Python. :)