Closed: turian closed this issue 3 years ago
By default, the audio is padded with `window_size // 2` zeros on both sides. So a signal `x` will produce `1 + int(len(x) // hop_size)` frames. The first frame is centered on sample index 0. You can turn off the padding and use your own if you need different behavior.
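If the padding behaves as described above, frame `i` would be centered on sample `i * hop_size`, so the timestamps follow directly from the hop size. Here is a minimal sketch of that arithmetic. The helper `frame_centers` and the `sample_rate` argument are hypothetical names for illustration only and are not part of torchcrepe; the sketch also assumes the default padding is left on.

```python
def frame_centers(num_samples, hop_size, sample_rate):
    """Return (frame count, frame center times in seconds).

    Assumes the signal is padded with window_size // 2 zeros on both
    sides, so that frame i is centered on sample i * hop_size.
    """
    # Frame count quoted in the documentation: 1 + len(x) // hop_size
    num_frames = 1 + num_samples // hop_size
    # Center of frame i, converted from samples to seconds
    centers = [i * hop_size / sample_rate for i in range(num_frames)]
    return num_frames, centers


# One second of 16 kHz audio with a 10 ms hop
num_frames, centers = frame_centers(num_samples=16000, hop_size=160, sample_rate=16000)
print(num_frames)   # 101
print(centers[:3])  # [0.0, 0.01, 0.02]
print(centers[-1])  # 1.0
```

Under the same assumption, when `len(x)` is not divisible by `hop_size`, the last frame is centered on sample `(len(x) // hop_size) * hop_size`, and the remaining samples fall only in the right half of that final window.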
How do I retrieve the timestamps of the embedding? Are they centered?
Can I assume it starts at hop_size / 2? If the audio is not divisible by hop_size, where precisely does it end?
Edit: based on the number of frames, it doesn't necessarily appear to do centering. Can you confirm whether this is correct?
https://github.com/neuralaudio/hear-baseline/blob/main/hearbaseline/torchcrepe.py#L126