roshideen closed this issue 3 years ago
Hi, you can use the frame-wise confidence ('fconfm' inside SyncNetInstance.py) and set a threshold. Each index in it is a frame number, so you divide the frame index by 25 (frames per second) to get the time in seconds. To make datasets such as LRS and VoxCeleb, we used thresholds of 3 to 4.
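As a rough sketch of the thresholding step, something like the following could turn per-frame confidences into start/end times. It assumes the confidences have already been flattened into a plain 1-D list with one value per video frame. The actual shape and any frame offset of 'fconfm' in SyncNetInstance.py may differ. The function name `confident_segments` and the sample scores are made up for illustration and are not part of the repository.

```python
def confident_segments(frame_conf, threshold=3.0, fps=25.0):
    """Return (start_sec, end_sec) spans where confidence >= threshold.

    frame_conf: assumed to be a 1-D sequence, one confidence per frame.
    threshold:  3 to 4 is the range mentioned above for LRS / VoxCeleb.
    fps:        25, so frame index / 25 gives seconds.
    """
    segments = []
    start = None  # frame index where the current run began, if any
    for i, conf in enumerate(frame_conf):
        if conf >= threshold and start is None:
            start = i  # a confident run begins at this frame
        elif conf < threshold and start is not None:
            # the run ended at the previous frame; end time is exclusive
            segments.append((start / fps, i / fps))
            start = None
    if start is not None:
        # the run continues to the last frame of the clip
        segments.append((start / fps, len(frame_conf) / fps))
    return segments


# Illustrative values only, not real SyncNet output.
scores = [0.5, 3.2, 4.1, 3.8, 1.0, 0.2, 3.5, 3.9]
print(confident_segments(scores))  # [(0.04, 0.16), (0.24, 0.32)]
```

In practice you would probably also want to merge runs separated by only a few frames and drop very short ones, since the frame-wise confidence can be noisy.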
Hi, is it possible to extract the time at which each speaker's speech starts and ends? I want to extract the speech of each speaker, so I need to know when the speech matched to each speaker starts and ends.