pbs / pycaption

Python module to read/write popular video caption formats
Apache License 2.0
pycaption cannot cope with subtitles in dfxp format if they contain a namespace #213

Closed. HaydonBerrow closed this issue 11 months ago

HaydonBerrow commented 3 years ago

pycaption cannot cope with subtitles in dfxp format if they contain a namespace. Attached example: Unforgotten-episode-xx.en.dfxp.txt

This command will produce an example (episode 6 of the UK programme 'Unforgotten'; previous episodes didn't use a namespace):

youtube-dl --skip-download --write-srt --sub-lang en -o Unforgotten-episode-xx.mp4 -f hls-2172-2 https://player.stv.tv/episode/42m5/unforgotten

and the file starts with

<?xml version="1.0" encoding="UTF-8"?>

The issue first arises at line 72 in pycaption-1.1.0-py3.8.egg/pycaption/dfxp/base.py

with the code for div in dfxp_document.find_all('div'): because the correct element is still/now ''
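To illustrate the mismatch, here is a minimal sketch. It uses Python's standard-library xml.etree.ElementTree rather than the parser pycaption itself uses, so it only shows the general effect of a namespace prefix on a lookup by bare name, not pycaption's exact behaviour. The two sample documents and the count_divs helper are made up for this example; the namespace URI is the standard TTML one.

```python
import xml.etree.ElementTree as ET

# Standard TTML namespace URI.
TTML_NS = "http://www.w3.org/ns/ttml"

# Hypothetical sample with a "tt" prefix on every element.
PREFIXED = (
    '<tt:tt xmlns:tt="http://www.w3.org/ns/ttml">'
    '<tt:body><tt:div>'
    '<tt:p begin="00:00:01.000" end="00:00:02.000">Hello</tt:p>'
    '</tt:div></tt:body></tt:tt>'
)

# Hypothetical sample with no namespace at all.
UNPREFIXED = (
    '<tt><body><div>'
    '<p begin="00:00:01.000" end="00:00:02.000">Hello</p>'
    '</div></body></tt>'
)


def count_divs(xml_text):
    """Return (matches for bare 'div', matches for namespace-qualified 'div')."""
    root = ET.fromstring(xml_text)
    plain = list(root.iter("div"))
    qualified = list(root.iter("{%s}div" % TTML_NS))
    return len(plain), len(qualified)


if __name__ == "__main__":
    print("prefixed:  ", count_divs(PREFIXED))
    print("unprefixed:", count_divs(UNPREFIXED))
```

In the prefixed document the bare name 'div' matches nothing, and only the namespace-qualified name finds the element.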

ana-nichifor commented 3 years ago

Hi, can you provide more details on the use case of <tt:tt>? Are there any benefits over simple <tt> tags? Are there situations where tt is required as a namespace?

HaydonBerrow commented 3 years ago

Sorry, namespaces in XML files are terra incognita to me. I have no idea why this site (player.stv.tv) adds them, but I have since found multiple instances of it. I get around it by editing the dfxp file.
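For anyone wanting to script the same manual edit, a rough sketch follows. It assumes the prefix is "tt", that it is used only on element names, and that the file does not already declare a default xmlns (otherwise the rewrite would produce a duplicate attribute). strip_element_prefix is a hypothetical helper, not part of pycaption, and the sample string is invented for the example.

```python
import re
import xml.etree.ElementTree as ET


def strip_element_prefix(dfxp_text, prefix="tt"):
    """Remove a namespace prefix from element names in a DFXP/TTML string.

    Assumption: the input has no existing default xmlns declaration.
    """
    escaped = re.escape(prefix)
    # <tt:name ...> becomes <name ...>, and </tt:name> becomes </name>.
    text = re.sub(r"<(/?)%s:" % escaped, r"<\1", dfxp_text)
    # xmlns:tt="..." becomes a default namespace declaration xmlns="...".
    return re.sub(r"\bxmlns:%s=" % escaped, "xmlns=", text)


# Hypothetical input; the tts: attribute prefix is deliberately left untouched.
SAMPLE = (
    '<tt:tt xmlns:tt="http://www.w3.org/ns/ttml" '
    'xmlns:tts="http://www.w3.org/ns/ttml#styling">'
    '<tt:body><tt:div>'
    '<tt:p begin="00:00:01.000" end="00:00:02.000" tts:color="white">Hi</tt:p>'
    '</tt:div></tt:body></tt:tt>'
)

if __name__ == "__main__":
    print(strip_element_prefix(SAMPLE))
```

The output can be written back to a .dfxp file and passed to pycaption as before. Because this is a plain text substitution rather than a real XML transformation, it would also rewrite matching text inside comments or CDATA sections.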