I am currently working on this and found the following things:
Word boundaries cannot be obtained, because each sentence is synthesized as a whole.
Synthesizing individual words and accumulating their lengths (including silent bytes) to get per-word alignment data is possible, but it takes much longer and is anything but accurate.
Synthesized sentences/words differ in length from run to run.
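
For illustration, the accumulation approach from the second point could look roughly like the sketch below. It is not the actual implementation: the function name `accumulate_alignment`, the raw 16-bit mono PCM format, and the 22050 Hz sample rate are all assumptions made for the example. It only shows how per-word byte lengths turn into start/end times.

```python
def accumulate_alignment(words, chunk_byte_lengths,
                         sample_rate=22050, bytes_per_sample=2):
    """Return (word, start_s, end_s) tuples from per-word audio sizes.

    Assumes each word was synthesized separately into raw mono PCM and
    that chunk_byte_lengths[i] is the size of the audio for words[i],
    including any leading/trailing silence. Sample rate and sample
    width are hypothetical defaults, not values from a real engine.
    """
    bytes_per_second = sample_rate * bytes_per_sample
    alignment = []
    offset = 0  # running byte offset into the concatenated audio
    for word, n_bytes in zip(words, chunk_byte_lengths):
        start = offset / bytes_per_second
        offset += n_bytes
        end = offset / bytes_per_second
        alignment.append((word, start, end))
    return alignment
```

Because every word is synthesized in isolation, these times include per-word silence and lose the prosody of the full sentence, which is why the result does not match a sentence synthesized as a whole.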
My current implementation would output alignment data for sentences in CSV:
This is useful to determine e.g. the word boundaries in the output waveform.
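
A rough sketch of how sentence-level alignment could be written as CSV, assuming the duration of each synthesized sentence is known in seconds. The column names (`start_s`, `end_s`, `text`) and the helper `sentence_alignment_csv` are hypothetical and only illustrate the idea. They are not the format of my implementation.

```python
import csv
import io


def sentence_alignment_csv(sentences, durations_s):
    """Build CSV text with one row per sentence: start, end, text.

    durations_s[i] is the measured length in seconds of the audio
    synthesized for sentences[i]. The column layout is an assumption
    chosen for this example.
    """
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["start_s", "end_s", "text"])
    start = 0.0
    for text, duration in zip(sentences, durations_s):
        end = start + duration
        writer.writerow([f"{start:.3f}", f"{end:.3f}", text])
        start = end  # the next sentence begins where this one ends
    return buf.getvalue()
```

Since sentence lengths differ between runs, the durations would have to be measured from the same audio that is actually returned, not from a separate synthesis pass.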