jdepoix / youtube-transcript-api

This is a Python API which allows you to get the transcript/subtitles for a given YouTube video. It also works for automatically generated subtitles, and it requires neither an API key nor a headless browser, unlike Selenium-based solutions!
MIT License
2.55k stars, 280 forks

Add HTML text formatting option #192

Closed. eseiver closed this 1 year ago

eseiver commented 1 year ago

Decided to address #191 myself! This changes _TranscriptParser() to have a preserve_formatting option that defaults to False. I then added the ability to set this parameter at the user level in YouTubeTranscriptApi.get_transcript(), making changes to the intermediary classes and methods where appropriate. Finally, I updated the testing XML asset in transcript.xml.static to include escaped italics, and added a new test, test_get_transcript_formatted(), that keeps the italics. None of the other similar tests were changed, because the default behavior is still the old one (all tags stripped).
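
For readers following along, here is a minimal, self-contained sketch of the idea described above: strip every HTML tag by default, but keep a small set of formatting tags when preserve_formatting is set. It is not the code from this PR. The names FORMATTING_TAGS, build_tag_regex, parse_transcript and SAMPLE are hypothetical, the tag list is only an illustrative subset, and the sample XML is an assumption about the transcript shape (text elements with start and dur attributes).

```python
import re
from html import unescape
from xml.etree import ElementTree

# Hypothetical, illustrative subset of tags worth preserving.
FORMATTING_TAGS = ("b", "i", "em", "strong")


def build_tag_regex(preserve_formatting=False):
    """Return a regex matching the tags that should be stripped."""
    if preserve_formatting:
        keep = "|".join(FORMATTING_TAGS)
        # Match any tag EXCEPT opening/closing formatting tags.
        # The lookahead covers the optional "/" so closing tags are kept too.
        return re.compile(rf"<(?!/?(?:{keep})\b)[^>]*>", re.IGNORECASE)
    # Default: match every tag.
    return re.compile(r"<[^>]*>")


def parse_transcript(xml_data, preserve_formatting=False):
    """Turn transcript XML into a list of {'text', 'start', 'duration'} dicts."""
    tag_regex = build_tag_regex(preserve_formatting)
    return [
        {
            # unescape() resolves any entities still left after XML parsing.
            "text": tag_regex.sub("", unescape(element.text)),
            "start": float(element.attrib["start"]),
            "duration": float(element.attrib.get("dur", "0.0")),
        }
        for element in ElementTree.fromstring(xml_data)
        if element.text is not None
    ]


# Assumed sample input with escaped italics and a non-formatting <span>.
SAMPLE = (
    "<transcript>"
    '<text start="0" dur="1.5">'
    "this is &lt;i&gt;italic&lt;/i&gt; and &lt;span&gt;plain&lt;/span&gt;"
    "</text>"
    "</transcript>"
)
```

With this sketch, parse_transcript(SAMPLE) yields the text "this is italic and plain", while parse_transcript(SAMPLE, preserve_formatting=True) yields "this is <i>italic</i> and plain": the span is still removed, only the italics survive.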

eseiver commented 1 year ago

Thanks so much for this helpful review! I have addressed all the points in some additional commits and updated a few more tests. Let me know if there's anything else needed.

jdepoix commented 1 year ago

Somehow the builds haven't run yet. I will look into that!

jdepoix commented 1 year ago

Apparently Travis CI has to explicitly grant me OSS credits now (that did not use to be the case). I have opened a support case with them and hope that this will be resolved soon!

eseiver commented 1 year ago

Alright, it makes sense that preserve_formatting can be handled mostly in class methods rather than in __init__. Updated, and also moved _FORMATTING_TAGS into _TranscriptParser.

jdepoix commented 1 year ago

Hi @eseiver, thanks for the additional fixes! I just added a few of the missing things myself. I will merge this now and release it with v0.6.0 🥳

jdepoix commented 1 year ago

I just released this in v0.6.0. Thank you for your contribution, @eseiver! 😊 🙏