interactiveaudiolab / ppgs

High-Fidelity Neural Phonetic Posteriorgrams
https://maxrmorrison.com/sites/ppgs
MIT License

Inference issues with audio of specific lengths #17

Open taisa1 opened 1 week ago

taisa1 commented 1 week ago

The PPG extraction generally works well, but it appears to have issues with audio segments of specific lengths.

For example, when processing an audio file of approximately 7000 frames, the resulting PPG values appear correct. [screenshot: ppg_elevenlabs_long]

However, if we extract only the first ~4500 frames and attempt to generate PPGs, the results are not as expected. [screenshot: ppg_elevenlabs_45s]

This issue seems to occur with audio files in the 1000 to 5000 frame range.

Can this issue be reproduced? I suspect this may be related to chunking behavior during processing.

My environment:

Code:

```python
import ppgs
import torch
import numpy as np
import matplotlib.pyplot as plt

# Load audio and infer PPGs
audio_file = "audio.wav"
audio = ppgs.load.audio(audio_file)
ppg = ppgs.from_audio(audio[0].unsqueeze(0), ppgs.SAMPLE_RATE).to(torch.float32)

labels = ["aa","ae","ah","ao","aw","ay","b","ch","d","dh","eh","er","ey","f","g","hh","ih","iy","jh","k","l","m","n","ng","ow","oy","p","r","s","sh","t","th","uh","uw","v","w","y","z","zh","<silent>"]

# Plot the posteriorgram (move to CPU in case inference ran on GPU)
plt.figure(figsize=(14, 8))
plt.imshow(ppg[0].cpu().numpy(), cmap='viridis', aspect='auto')
plt.yticks(ticks=np.arange(len(labels)), labels=labels)
plt.colorbar()
plt.title('PPGs Visualization')
plt.ylabel('phonemes')
plt.xlabel('frames')
plt.savefig("ppg_audio.png")
```
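If chunking is indeed the culprit, a manual workaround is to split the frame axis before inference and run each window separately. A minimal sketch of the splitting logic (the `chunk_spans` helper and the 500-frame window size are my own guesses, not part of the ppgs API):

```python
# Hypothetical helper (not part of the ppgs API): split an n-frame
# input into windows of at most `max_frames`, so each window can be
# inferred separately and the resulting PPGs concatenated.
def chunk_spans(num_frames, max_frames=500):
    spans = []
    start = 0
    while start < num_frames:
        end = min(start + max_frames, num_frames)
        spans.append((start, end))
        start = end
    return spans

# A 4500-frame input splits into nine 500-frame windows;
# the ~7000-frame file gets full windows plus a final remainder.
print(len(chunk_spans(4500)))  # -> 9
print(chunk_spans(7000)[-1])   # -> (6500, 7000)
```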
maxrmorrison commented 23 hours ago

Can you try adding the line `ppgs.MAX_INFERENCE_FRAMES = 500` before `ppgs.from_audio`? Let me know whether or not that fixes it.

taisa1 commented 16 hours ago

I added that line but the output didn't change. Instead, the issue seems to be fixed by changing `transformer.py` as follows:

```python
class Transformer(torch.nn.Module):

    def __init__(
        ...
        max_len=500
    ):
        super().__init__()
        self.position = PositionalEncoding(hidden_channels, max_len=5000)
    ...
```

I changed `max_len` to force chunking in the problematic frame length range.
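To spell out why this helps (my reading, with an assumed ceil-division chunking rule): with a 5000-frame limit, inputs in the 1000–5000 frame range are processed in a single unchunked pass, exactly the lengths where training data is sparse; a 500-frame limit forces those inputs to be split.

```python
import math

# Assumed chunking rule (illustrative): an n-frame input is split
# into ceil(n / max_len) chunks, with a minimum of one chunk.
def num_chunks(n, max_len):
    return max(1, math.ceil(n / max_len))

# With the original 5000-frame limit the problematic inputs run
# unchunked; lowering the limit to 500 splits them into pieces.
print(num_chunks(4500, 5000))  # -> 1
print(num_chunks(4500, 500))   # -> 9
print(num_chunks(1000, 500))   # -> 2
```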

maxrmorrison commented 3 hours ago

That makes sense: beyond some length there's very little training data. Thanks for finding that.

@CameronChurchwell could I get you to work this into a new release? Training should still allow the longer maximum length, but for inference, we should use what @taisa1 is proposing here. You might want to break it into two variables: one describing the maximum positional encoding length during training, and one describing the optimal chunk size for inference, which I recall you previously mentioned was ~500.
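A sketch of the split being proposed above (the variable names and the chunking rule here are illustrative, not the actual ppgs configuration):

```python
# Illustrative names, not the real ppgs config:
MAX_TRAINING_FRAMES = 5000    # positional-encoding length used in training
MAX_INFERENCE_FRAMES = 500    # empirically good chunk size at inference

def inference_spans(num_frames, chunk=MAX_INFERENCE_FRAMES):
    """Frame spans to run at inference time; training still
    supports sequences up to MAX_TRAINING_FRAMES."""
    return [(start, min(start + chunk, num_frames))
            for start in range(0, num_frames, chunk)]

# Every inference chunk stays well inside the trained range
spans = inference_spans(7000)
assert all(end - start <= MAX_TRAINING_FRAMES for start, end in spans)
print(spans[:2], spans[-1])  # -> [(0, 500), (500, 1000)] (6500, 7000)
```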