myshell-ai / OpenVoice

Instant voice cloning by MIT and MyShell.
https://research.myshell.ai/open-voice
MIT License

AttributeError: 'SynthesizerTrn' object has no attribute 'enc_p'. Did you mean: 'enc_q'? #305

Open fury88 opened 2 months ago

fury88 commented 2 months ago

This one is related to https://github.com/myshell-ai/OpenVoice/issues/290. I see some of the configuration entries from v1 are left out of v2. Adding them does the trick, but at the end I'm now getting Python errors from the repo code.

@dosu, AttributeError: 'SynthesizerTrn' object has no attribute 'enc_p'. Did you mean: 'enc_q'?

dosubot[bot] commented 2 months ago

To resolve the AttributeError: 'SynthesizerTrn' object has no attribute 'enc_p', ensure that the n_speakers parameter is not set to 0 when initializing the SynthesizerTrn object. The enc_p attribute is only defined if n_speakers is not 0.

Here's the relevant part of the SynthesizerTrn class:

class SynthesizerTrn(nn.Module):
    def __init__(
        self,
        n_vocab,
        spec_channels,
        inter_channels,
        hidden_channels,
        filter_channels,
        n_heads,
        n_layers,
        kernel_size,
        p_dropout,
        resblock,
        resblock_kernel_sizes,
        resblock_dilation_sizes,
        upsample_rates,
        upsample_initial_channel,
        upsample_kernel_sizes,
        n_speakers=256,
        gin_channels=256,
        zero_g=False,
        **kwargs
    ):
        super().__init__()

        self.dec = Generator(
            inter_channels,
            resblock,
            resblock_kernel_sizes,
            resblock_dilation_sizes,
            upsample_rates,
            upsample_initial_channel,
            upsample_kernel_sizes,
            gin_channels=gin_channels,
        )
        self.enc_q = PosteriorEncoder(
            spec_channels,
            inter_channels,
            hidden_channels,
            5,
            1,
            16,
            gin_channels=gin_channels,
        )

        self.flow = ResidualCouplingBlock(inter_channels, hidden_channels, 5, 1, 4, gin_channels=gin_channels)

        self.n_speakers = n_speakers
        if n_speakers == 0:
            self.ref_enc = ReferenceEncoder(spec_channels, gin_channels)
        else:
            self.enc_p = TextEncoder(n_vocab,
                inter_channels,
                hidden_channels,
                filter_channels,
                n_heads,
                n_layers,
                kernel_size,
                p_dropout)
            self.sdp = StochasticDurationPredictor(hidden_channels, 192, 3, 0.5, 4, gin_channels=gin_channels)
            self.dp = DurationPredictor(hidden_channels, 256, 3, 0.5, gin_channels=gin_channels)
            self.emb_g = nn.Embedding(n_speakers, gin_channels)
        self.zero_g = zero_g

Ensure that n_speakers is not 0 when you initialize the SynthesizerTrn object:

synthesizer = SynthesizerTrn(
    n_vocab=100,  # example value
    spec_channels=80,  # example value
    inter_channels=192,  # example value
    hidden_channels=192,  # example value
    filter_channels=768,  # example value
    n_heads=2,  # example value
    n_layers=6,  # example value
    kernel_size=3,  # example value
    p_dropout=0.1,  # example value
    resblock="1",  # example value
    resblock_kernel_sizes=[3, 7, 11],  # example value
    resblock_dilation_sizes=[[1, 3, 5], [1, 3, 5], [1, 3, 5]],  # example value
    upsample_rates=[8, 8, 2, 2],  # example value
    upsample_initial_channel=512,  # example value
    upsample_kernel_sizes=[16, 16, 4, 4],  # example value
    n_speakers=256,  # ensure this is not 0
    gin_channels=256,  # example value
    zero_g=False  # example value
)

Additionally, ensure that your configuration file correctly sets the n_speakers parameter. For example, in the OpenVoiceBaseClass initialization:

class OpenVoiceBaseClass(object):
    def __init__(self, 
                config_path, 
                device='cuda:0'):
        if 'cuda' in device:
            assert torch.cuda.is_available()

        hps = utils.get_hparams_from_file(config_path)

        model = SynthesizerTrn(
            len(getattr(hps, 'symbols', [])),
            hps.data.filter_length // 2 + 1,
            n_speakers=hps.data.n_speakers,
            **hps.model,
        ).to(device)

        model.eval()
        self.model = model
        self.hps = hps
        self.device = device

Ensure that hps.data.n_speakers is not set to 0 in your configuration file. This should resolve the AttributeError [1][2].
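Before constructing the model, the branch taken in the excerpt above can be predicted from the config alone. The following is a minimal sketch, not code from the repository: synthesizer_branch is a hypothetical helper, and it assumes the config is JSON with n_speakers under "data", as the hps.data.n_speakers lookup suggests.

```python
import json


def synthesizer_branch(config: dict) -> str:
    """Predict which branch of the SynthesizerTrn excerpt a config selects.

    Hypothetical helper for diagnosis only; it does not build a model.
    """
    n_speakers = config.get("data", {}).get("n_speakers")
    if n_speakers is None:
        # The loader reads hps.data.n_speakers, so a missing key is its own problem.
        return "missing"
    if n_speakers == 0:
        # Only ref_enc is created; enc_p, sdp, dp and emb_g are absent.
        return "ref_enc"
    # enc_p, sdp, dp and emb_g are created; ref_enc is absent.
    return "enc_p"


# Illustrative config text, not a real OpenVoice config file.
example = json.loads('{"data": {"n_speakers": 0, "filter_length": 1024}, "model": {}}')
print(synthesizer_branch(example))
```

If this reports "ref_enc" for a config that is then used with code calling model.enc_p, the AttributeError from the issue title is the expected result.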

To continue talking to Dosu, mention @dosu.

fury88 commented 2 months ago

@dosu, ok I may have the n_speakers in the wrong place. I assume 'n_speakers' is tied to the 'speakers' array? Why not just use the length of the array?

dosubot[bot] commented 2 months ago

Yes, the n_speakers parameter in the SynthesizerTrn object is tied to the length of the 'speakers' array, and it can be set using the length of this array. The n_speakers parameter is used to initialize the nn.Embedding layer, which requires the number of speakers to be specified [1].

You can set n_speakers using the length of the 'speakers' array like this:

# speakers_array is a placeholder for your own collection of speakers
n_speakers = len(speakers_array)
synthesizer = SynthesizerTrn(
    n_vocab=100,  # example value
    spec_channels=80,  # example value
    inter_channels=192,  # example value
    hidden_channels=192,  # example value
    filter_channels=768,  # example value
    n_heads=2,  # example value
    n_layers=6,  # example value
    kernel_size=3,  # example value
    p_dropout=0.1,  # example value
    resblock="1",  # example value
    resblock_kernel_sizes=[3, 7, 11],  # example value
    resblock_dilation_sizes=[[1, 3, 5], [1, 3, 5], [1, 3, 5]],  # example value
    upsample_rates=[8, 8, 2, 2],  # example value
    upsample_initial_channel=512,  # example value
    upsample_kernel_sizes=[16, 16, 4, 4],  # example value
    n_speakers=n_speakers,  # set using the length of the speakers array
    gin_channels=256,  # example value
    zero_g=False  # example value
)

This ensures that the n_speakers parameter correctly reflects the number of speakers in your array.
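One caveat with len(): nn.Embedding(num_embeddings, embedding_dim) only accepts indices from 0 to num_embeddings - 1, so what matters is the largest speaker id, not the number of entries. The sketch below assumes the speakers entry is either a plain list or a mapping from speaker name to integer id; required_n_speakers and the speaker names are hypothetical.

```python
def required_n_speakers(speakers) -> int:
    """Smallest embedding-table size that can hold every speaker id."""
    if isinstance(speakers, dict):
        # name -> integer id mapping: the ids index the embedding table directly
        return max(speakers.values(), default=-1) + 1
    # plain sequence: ids are assumed to be the positions 0 .. len - 1
    return len(speakers)


# A list of three speakers needs three rows.
list_size = required_n_speakers(["alice", "bob", "carol"])

# A mapping whose ids start at 1 needs one more row than it has entries.
dict_size = required_n_speakers({"alice": 1, "bob": 2})
```

With the mapping above, len() would give 2, while id 2 needs a table of at least 3 rows.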


fury88 commented 2 months ago

@dosu, I'm setting all of this in the configuration file, not directly in the python code. I would expect the configuration file is the proper way, otherwise we'd have to modify the code and put in a pull request if that needs a permanent change.

dosubot[bot] commented 2 months ago

You can indeed set the n_speakers parameter in the configuration file. This is the proper way to handle it without modifying the code directly. Ensure that your configuration file includes the n_speakers parameter, and it is set to the length of your speakers array.

Here is an example of how to set it in the configuration file. The loading code reads n_speakers from the "data" section (hps.data.n_speakers) and unpacks the "model" section into the remaining keyword arguments, so the keys are grouped accordingly. The values are illustrative, and JSON does not allow comments, so none are included:

{
    "data": {
        "filter_length": 1024,
        "n_speakers": 256
    },
    "model": {
        "inter_channels": 192,
        "hidden_channels": 192,
        "filter_channels": 768,
        "n_heads": 2,
        "n_layers": 6,
        "kernel_size": 3,
        "p_dropout": 0.1,
        "resblock": "1",
        "resblock_kernel_sizes": [3, 7, 11],
        "resblock_dilation_sizes": [[1, 3, 5], [1, 3, 5], [1, 3, 5]],
        "upsample_rates": [8, 8, 2, 2],
        "upsample_initial_channel": 512,
        "upsample_kernel_sizes": [16, 16, 4, 4],
        "gin_channels": 256,
        "zero_g": false
    }
}

When the configuration file is loaded, data.n_speakers is used to initialize the SynthesizerTrn object. n_vocab and spec_channels are not configuration keys here: the loader derives them from the length of the symbols list and from filter_length // 2 + 1. Ensure that the n_speakers value matches the number of speakers you intend to support.

The configuration file is parsed and loaded into the hps object using the utils.get_hparams_from_file function, and then passed to the SynthesizerTrn object as follows:

class OpenVoiceBaseClass(object):
    def __init__(self, 
                config_path, 
                device='cuda:0'):
        if 'cuda' in device:
            assert torch.cuda.is_available()

        hps = utils.get_hparams_from_file(config_path)

        model = SynthesizerTrn(
            len(getattr(hps, 'symbols', [])),
            hps.data.filter_length // 2 + 1,
            n_speakers=hps.data.n_speakers,
            **hps.model,
        ).to(device)

        model.eval()
        self.model = model
        self.hps = hps
        self.device = device

Ensure that your configuration file correctly sets the n_speakers parameter to avoid any issues [1][2].
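To make the data flow concrete, here is a minimal sketch of how a nested config can end up as constructor arguments. HParams below is a hypothetical stand-in for whatever utils.get_hparams_from_file returns, and fake_synthesizer is a stand-in for SynthesizerTrn that only echoes its arguments; neither is the repository's code.

```python
import json


class HParams:
    """Hypothetical stand-in for the object returned by the config loader."""

    def __init__(self, **kwargs):
        for key, value in kwargs.items():
            # Nested JSON objects become nested HParams for attribute access
            setattr(self, key, HParams(**value) if isinstance(value, dict) else value)

    # keys() and __getitem__ are what allow `**hps.model` unpacking
    def keys(self):
        return self.__dict__.keys()

    def __getitem__(self, key):
        return getattr(self, key)


def fake_synthesizer(n_vocab, spec_channels, n_speakers=256, gin_channels=256, **kwargs):
    """Stand-in for SynthesizerTrn.__init__ that records what it received."""
    return {
        "n_vocab": n_vocab,
        "spec_channels": spec_channels,
        "n_speakers": n_speakers,
        "gin_channels": gin_channels,
        "extra": sorted(kwargs),
    }


CONFIG_TEXT = """
{
    "data": {"filter_length": 1024, "n_speakers": 0},
    "model": {"inter_channels": 192, "hidden_channels": 192, "gin_channels": 256}
}
"""

hps = HParams(**json.loads(CONFIG_TEXT))
args = fake_synthesizer(
    len(getattr(hps, "symbols", [])),   # no "symbols" key, so n_vocab is 0
    hps.data.filter_length // 2 + 1,    # 1024 // 2 + 1
    n_speakers=hps.data.n_speakers,
    **hps.model,
)

# Putting n_speakers in "model" as well collides with the explicit keyword.
duplicate_error = None
bad = HParams(**{"data": {"n_speakers": 4}, "model": {"n_speakers": 4}})
try:
    fake_synthesizer(0, 513, n_speakers=bad.data.n_speakers, **bad.model)
except TypeError as err:
    duplicate_error = str(err)
```

The second half illustrates why n_speakers should appear in only one section when the constructor is called this way: Python rejects a keyword that is supplied both explicitly and through ** unpacking.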


fury88 commented 2 months ago

@dosu, yes, then this goes back to having the code read the length of the array instead of having to put in the parameter. ;-)

dosubot[bot] commented 2 months ago

Yes, the code can be modified to automatically read the length of the speakers array instead of manually setting the n_speakers parameter in the configuration file. You can achieve this by changing the initialization of the SynthesizerTrn model to use the length of the speakers array from the hps object.

Here is the modified code:

class OpenVoiceBaseClass(object):
    def __init__(self, 
                config_path, 
                device='cuda:0'):
        if 'cuda' in device:
            assert torch.cuda.is_available()

        hps = utils.get_hparams_from_file(config_path)

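        # Automatically set n_speakers based on the length of the speakers array.
        # Caution: if the config has no 'speakers' entry this yields 0, which
        # selects the ref_enc branch and leaves enc_p undefined.
        n_speakers = len(getattr(hps, 'speakers', []))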

        model = SynthesizerTrn(
            len(getattr(hps, 'symbols', [])),
            hps.data.filter_length // 2 + 1,
            inter_channels=hps.model.inter_channels,
            hidden_channels=hps.model.hidden_channels,
            filter_channels=hps.model.filter_channels,
            n_heads=hps.model.n_heads,
            n_layers=hps.model.n_layers,
            kernel_size=hps.model.kernel_size,
            p_dropout=hps.model.p_dropout,
            resblock=hps.model.resblock,
            resblock_kernel_sizes=hps.model.resblock_kernel_sizes,
            resblock_dilation_sizes=hps.model.resblock_dilation_sizes,
            upsample_rates=hps.model.upsample_rates,
            upsample_initial_channel=hps.model.upsample_initial_channel,
            upsample_kernel_sizes=hps.model.upsample_kernel_sizes,
            n_speakers=n_speakers,
            gin_channels=hps.model.gin_channels,
            zero_g=hps.model.zero_g,
        ).to(device)

        model.eval()
        self.model = model
        self.hps = hps
        self.device = device

In this modification, n_speakers is set to the length of the speakers array from the hps object, ensuring that it is automatically determined based on the configuration file [1][2].
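Because a silent default of 0 changes which submodules exist, a stricter variant may be preferable. The sketch below works on the raw config dict rather than on hps; resolve_n_speakers is a hypothetical helper, and the key names "data", "n_speakers" and "speakers" are assumptions taken from this thread. It keeps an explicit n_speakers when present and otherwise falls back to the same len() rule as the modification above, which assumes speaker ids run from 0 to len - 1.

```python
def resolve_n_speakers(config: dict) -> int:
    """Pick n_speakers from a config dict, failing loudly when it is undetermined."""
    data = config.get("data", {})
    if "n_speakers" in data:
        # An explicit value always wins, including an intentional 0.
        return int(data["n_speakers"])
    speakers = config.get("speakers")
    if speakers is None:
        # Refuse to guess: defaulting to 0 would silently drop enc_p.
        raise KeyError("config has neither data.n_speakers nor a 'speakers' entry")
    return len(speakers)


explicit = resolve_n_speakers({"data": {"n_speakers": 0}, "speakers": ["a", "b"]})
derived = resolve_n_speakers({"data": {}, "speakers": ["a", "b"]})
```

Here explicit stays 0 because the config says so, while derived is taken from the two listed speakers.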


fury88 commented 2 months ago

@dosu, more missing parameters. Just FYI, I'm using the example so something is not right when you stray off the path. 'ref_enc' is another attribute missing.

dosubot[bot] commented 2 months ago

To resolve the missing ref_enc attribute error in the SynthesizerTrn class, you need to ensure that the ref_enc attribute is properly initialized. The current implementation initializes ref_enc only if n_speakers is 0. If n_speakers is not 0, ref_enc is not initialized, which leads to the missing attribute error.

To fix this, you should initialize ref_enc regardless of the value of n_speakers. Here is the corrected SynthesizerTrn class:

class SynthesizerTrn(nn.Module):
    """
    Synthesizer for Training
    """

    def __init__(
        self,
        n_vocab,
        spec_channels,
        inter_channels,
        hidden_channels,
        filter_channels,
        n_heads,
        n_layers,
        kernel_size,
        p_dropout,
        resblock,
        resblock_kernel_sizes,
        resblock_dilation_sizes,
        upsample_rates,
        upsample_initial_channel,
        upsample_kernel_sizes,
        n_speakers=256,
        gin_channels=256,
        zero_g=False,
        **kwargs
    ):
        super().__init__()

        self.dec = Generator(
            inter_channels,
            resblock,
            resblock_kernel_sizes,
            resblock_dilation_sizes,
            upsample_rates,
            upsample_initial_channel,
            upsample_kernel_sizes,
            gin_channels=gin_channels,
        )
        self.enc_q = PosteriorEncoder(
            spec_channels,
            inter_channels,
            hidden_channels,
            5,
            1,
            16,
            gin_channels=gin_channels,
        )

        self.flow = ResidualCouplingBlock(inter_channels, hidden_channels, 5, 1, 4, gin_channels=gin_channels)

        self.n_speakers = n_speakers
        self.ref_enc = ReferenceEncoder(spec_channels, gin_channels)  # Initialize ref_enc regardless of n_speakers
        if n_speakers != 0:
            self.enc_p = TextEncoder(n_vocab,
                inter_channels,
                hidden_channels,
                filter_channels,
                n_heads,
                n_layers,
                kernel_size,
                p_dropout)
            self.sdp = StochasticDurationPredictor(hidden_channels, 192, 3, 0.5, 4, gin_channels=gin_channels)
            self.dp = DurationPredictor(hidden_channels, 256, 3, 0.5, gin_channels=gin_channels)
            self.emb_g = nn.Embedding(n_speakers, gin_channels)
        self.zero_g = zero_g

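This change ensures that ref_enc is always initialized, preventing the missing attribute error [1]. Note that the added ref_enc only carries trained weights if the loaded checkpoint contains ref_enc parameters; otherwise it stays randomly initialized.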
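An alternative to patching the class is to check which branch the checkpoint itself was trained with, since PyTorch state-dict keys are prefixed with the submodule attribute name (for example "enc_p." or "ref_enc."). This is a minimal sketch under that assumption; the helper names and the sample keys are hypothetical. In practice the keys might come from something like torch.load(path, map_location="cpu")["model"].keys(), which is also an assumption about the checkpoint layout.

```python
def submodules_in_checkpoint(state_dict_keys) -> set:
    """Top-level submodule names, e.g. 'enc_p' from 'enc_p.emb.weight'."""
    return {key.split(".", 1)[0] for key in state_dict_keys}


def checkpoint_expects_zero_speakers(state_dict_keys) -> bool:
    """True if the checkpoint matches the n_speakers == 0 (ref_enc) branch."""
    modules = submodules_in_checkpoint(state_dict_keys)
    has_ref_enc = "ref_enc" in modules
    has_enc_p = "enc_p" in modules
    if has_ref_enc == has_enc_p:
        # Both or neither present: the excerpt's either/or branch does not apply.
        raise ValueError(f"cannot infer branch from submodules: {sorted(modules)}")
    return has_ref_enc


# Hypothetical key names, for illustration only.
converter_like = ["ref_enc.proj.weight", "dec.conv_pre.weight", "flow.flows.0.pre.weight"]
tts_like = ["enc_p.emb.weight", "dp.proj.weight", "emb_g.weight", "dec.conv_pre.weight"]

converter_zero = checkpoint_expects_zero_speakers(converter_like)
tts_zero = checkpoint_expects_zero_speakers(tts_like)
```

If the checkpoint and the config disagree about the branch, correcting n_speakers in the config is likely a safer fix than creating both sets of submodules.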
