Does the model output phoneme-level timing info?

p0p4k / pflowtts_pytorch

Unofficial implementation of NVIDIA P-Flow TTS paper

https://neurips.cc/virtual/2023/poster/69899

MIT License

198 stars, 28 forks

Status: Open. lumpidu opened this issue 4 months ago.

lumpidu commented 4 months ago

Hi, thanks for your work. I'd like to know whether the model also provides phoneme-level timing information at inference.

p0p4k commented 4 months ago

Yes. During inference, the attn matrix can be used to get the frame numbers for each phoneme, and those frame numbers can then be converted to time in seconds.
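For readers who want to try this, here is a minimal sketch of the frame-to-time conversion, not code from the repository. It assumes `attn` is a hard, monotonic alignment of shape `(n_phonemes, n_frames)` in which each frame is assigned to exactly one phoneme. The function name `phoneme_timings` and the defaults `hop_length=256` and `sample_rate=22050` are illustrative assumptions; use the values from your own training config.

```python
import numpy as np

def phoneme_timings(attn, hop_length=256, sample_rate=22050):
    """Return a (start_sec, end_sec) pair per phoneme.

    attn: array of shape (n_phonemes, n_frames), assumed to be a hard
    monotonic alignment: attn[i, j] is nonzero when mel frame j
    belongs to phoneme i.
    hop_length / sample_rate: assumed defaults, not taken from the repo.
    """
    attn = np.asarray(attn)
    # Number of mel frames assigned to each phoneme.
    frames_per_phoneme = (attn > 0).sum(axis=1)
    # With a monotonic alignment, phoneme i ends where its cumulative
    # frame count ends and starts where the previous phoneme ended.
    end_frames = np.cumsum(frames_per_phoneme)
    start_frames = end_frames - frames_per_phoneme
    # One mel frame spans hop_length audio samples.
    sec_per_frame = hop_length / sample_rate
    return [
        (float(s * sec_per_frame), float(e * sec_per_frame))
        for s, e in zip(start_frames, end_frames)
    ]

# Toy alignment: 3 phonemes over 6 frames, with durations 2, 1 and 3 frames.
toy_attn = np.array([
    [1, 1, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],
])
for idx, (start, end) in enumerate(phoneme_timings(toy_attn)):
    print(f"phoneme {idx}: {start:.4f}s to {end:.4f}s")
```

If the alignment you get from the model has a batch or channel dimension, squeeze it down to two dimensions first, and transpose it if frames come before phonemes.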