readbeyond / aeneas

aeneas is a Python/C library and a set of tools to automagically synchronize audio and text (aka forced alignment)
http://www.readbeyond.it/aeneas/
GNU Affero General Public License v3.0

Create PLAIN TextFile programmatically #232

Closed loretoparisi closed 5 years ago

loretoparisi commented 5 years ago

I want to load a TextFile programmatically

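from aeneas.language import Language
from aeneas.task import Task
from aeneas.textfile import TextFile, TextFragment

task = Task()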
textfile = TextFile()
for identifier, frag_text in [
    (u"f001", [u"first fragment"]),
    (u"f002", [u"second fragment"]),
    (u"f003", [u"third fragment"])
]:
    textfile.add_fragment(TextFragment(identifier, Language.ENG, frag_text, frag_text))
task.text_file = textfile

But in my case I do not have the fragments, just plain text like this:

First sentence of Paragraph One.
Second sentence of Paragraph One.

First sentence of Paragraph Two.

First sentence of Paragraph Three.
Second sentence of Paragraph Three.
Third sentence of Paragraph Three.

If I try this:

from aeneas.textfile import TextFile
from aeneas.textfile import TextFileFormat
from aeneas.textfile import TextFragment
from aeneas.language import Language

text = 'First sentence of Paragraph One.\nSecond sentence of Paragraph One.\n\nFirst sentence of Paragraph Two.'

textfile = TextFile(file_format=TextFileFormat.PLAIN)
for line in text.splitlines():
    textfile.add_fragment(TextFragment("", Language.ENG, line, line))

I get an error

Traceback (most recent call last):
  File "<stdin>", line 3, in <module>
  File "/home/ubuntu/.local/lib/python3.5/site-packages/aeneas/textfile.py", line 269, in __init__
    self.lines = lines
  File "/home/ubuntu/.local/lib/python3.5/site-packages/aeneas/textfile.py", line 327, in lines
    raise TypeError(u"lines is not an instance of list")
TypeError: lines is not an instance of list

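However, if I pass the lines and filtered_lines arguments as lists, like this:

tf = TextFile(file_format=TextFileFormat.PLAIN)
for line in text.splitlines():
    # wrap each line in a list: lines and filtered_lines must be lists
    tf.add_fragment(TextFragment("", Language.ENG, [line], [line]))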

it creates a TextFile instance that seems to respect the structure:

>>> print(tf)
 First sentence of Paragraph One.
 Second sentence of Paragraph One.

 First sentence of Paragraph Two.
 First sentence of Paragraph One.
 Second sentence of Paragraph One.

Is that correct?

pettarin commented 5 years ago

@loretoparisi yes, that is correct, because each text fragment is a list of lines. This accounts for cases such as closed captioning, where you want to compute the time alignment for the text formed by the union of the lines in a caption, but the lines still need to be stored as separate entities so that they can be rendered on two different lines. See the docs: https://www.readbeyond.it/aeneas/docs/textfile.html#aeneas.textfile.TextFragment
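
For anyone landing here later, the point above can be sketched as a small helper. This is only an illustration, not part of aeneas: the function names plain_text_to_pairs and build_text_file, the group_paragraphs flag, the "f%06d" identifier pattern and the decision to drop blank lines are all assumptions made for this example. The only aeneas calls used are the ones already shown in this thread (TextFile, TextFragment, add_fragment).

```python
def plain_text_to_pairs(text, group_paragraphs=False, id_format=u"f%06d"):
    """
    Split plain text into (identifier, lines) pairs (hypothetical helper).

    With group_paragraphs=False, every non-blank line becomes its own
    single-line fragment. With group_paragraphs=True, consecutive
    non-blank lines are kept together as one multi-line fragment,
    similar to the two-line caption case described above.
    """
    groups = []
    current = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            # Blank line: close the paragraph collected so far, if any.
            if current:
                groups.append(current)
                current = []
        elif group_paragraphs:
            current.append(line)
        else:
            groups.append([line])
    if current:
        groups.append(current)
    # Identifiers are generated here; the pattern is an assumption.
    return [(id_format % i, lines) for i, lines in enumerate(groups, 1)]


def build_text_file(text, language, group_paragraphs=False):
    """Build an aeneas TextFile from plain text (requires aeneas)."""
    # Imported lazily so plain_text_to_pairs() can be tried without aeneas.
    from aeneas.textfile import TextFile, TextFragment

    text_file = TextFile()
    for identifier, lines in plain_text_to_pairs(text, group_paragraphs):
        # lines and filtered_lines must both be lists, even for one line.
        text_file.add_fragment(TextFragment(identifier, language, lines, lines))
    return text_file
```

With aeneas installed, something like `task.text_file = build_text_file(text, Language.ENG)` should then behave like the loop in the question, except that blank lines are skipped and each fragment gets a non-empty identifier.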

loretoparisi commented 5 years ago

@pettarin thank you, closing then!