Open taisa1 opened 1 week ago
The PPG extraction generally works well, but it appears to have issues with audio segments of specific lengths. For example, when processing an audio file of approximately 7000 frames, the resulting PPG values appear correct. However, if we extract only the first ~4500 frames and attempt to generate PPGs, the results are not as expected. This issue seems to occur with audio files in the 1000 to 5000 frame range.

Can this issue be reproduced? I suspect this may be related to chunking behavior during processing.

My environment:

Code:
Can you try adding the line `ppgs.MAX_INFERENCE_FRAMES = 500` before `ppgs.from_audio`? Let me know whether or not that fixes it.
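For reference, a minimal sketch of where that line would go, assuming the README-style loading API (the file path and GPU index are placeholders):

```python
import ppgs

# Force chunked inference by lowering the maximum frames per forward pass
ppgs.MAX_INFERENCE_FRAMES = 500

# Load audio at the expected sample rate and infer PPGs
audio = ppgs.load.audio('speech.wav')  # placeholder path
ppg = ppgs.from_audio(audio, ppgs.SAMPLE_RATE, gpu=0)  # gpu=None for CPU
```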
I added that line, but the output didn't change. Instead, the issue seems to be fixed by changing `transformer.py` as follows:
```python
class Transformer(torch.nn.Module):

    def __init__(
        self,
        ...
        max_len=500  # lowered so that longer inputs are chunked
    ):
        super().__init__()
        self.position = PositionalEncoding(hidden_channels, max_len=500)
        ...
```
I changed the `max_len` values to force chunking in the problematic frame length range.
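For context on why this matters: `max_len` bounds the precomputed positional encoding table, so any input longer than it cannot go through the model in a single pass and must be chunked. A typical sinusoidal implementation (a generic sketch of the usual pattern, not necessarily the exact code in ppgs) looks like this:

```python
import math

import torch


class PositionalEncoding(torch.nn.Module):
    """Standard sinusoidal positional encoding (generic sketch)"""

    def __init__(self, channels, max_len=500):
        super().__init__()
        position = torch.arange(max_len).unsqueeze(1)
        frequency = torch.exp(
            torch.arange(0, channels, 2) * (-math.log(10000.0) / channels))
        encoding = torch.zeros(max_len, channels)
        encoding[:, 0::2] = torch.sin(position * frequency)
        encoding[:, 1::2] = torch.cos(position * frequency)
        self.register_buffer('encoding', encoding)

    def forward(self, x):
        # x has shape (batch, frames, channels); frames beyond max_len have
        # no precomputed encoding, which is why longer inputs must be chunked
        return x + self.encoding[:x.size(1)]
```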
That makes sense: beyond some length there's very little training data. Thanks for finding that.
@CameronChurchwell could I get you to work this into a new release? Training should still allow the longer maximum length, but for inference, we should use what @taisa1 is proposing here. You might want to break it into two variables: one describing the maximum positional encoding length during training, and one describing the optimal chunk size for inference, which I recall you previously mentioned was ~500.
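A sketch of how that split might look as module-level configuration (`MAX_TRAINING_FRAMES` is a hypothetical name, and the values are placeholders rather than the library's actual settings):

```python
# Maximum sequence length the positional encoding must support during training
MAX_TRAINING_FRAMES = 5000  # placeholder value

# Chunk size used during inference; longer inputs are split into chunks of
# this size before being passed through the model
MAX_INFERENCE_FRAMES = 500  # ~optimal chunk size per this thread
```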