nii-yamagishilab / self-attention-tacotron

An implementation of "Investigation of enhanced Tacotron text-to-speech synthesis systems with self-attention for pitch accent language" https://arxiv.org/abs/1810.11960
BSD 3-Clause "New" or "Revised" License

Pitch accent label does not match x >= y condition from tacotron2 #35

Closed philippbb closed 3 years ago

philippbb commented 3 years ago

Hello,

I am trying to make Self-attention Tacotron run with pitch accent labels.

In my case the pitch accent sequence has a minimum value of 0 and a maximum of 99, where 0 is the accent_type_unknown pitch accent.

Whenever I set accent_type_offset to a number higher than 0 for the pitch accent, I keep hitting this line: https://github.com/nii-yamagishilab/tacotron2/blob/5052c254f95fa6b51c4c1939a0171abb0c06835c/tacotron/modules.py#L65

```
Condition x >= y did not hold element-wise:
x (IteratorGetNext:4) = [[0 50 50...]...]
y (Const_1:0) = [1]
[[{{node embedding_1/assert_greater_equal/Assert/AssertGuard/Assert}}]]
```

The default value in the hparams is 0x3100, so I assume setting it to 0 produces unwanted output. I can train it this way, but I think I am making a mistake somewhere.
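
For context, here is a minimal sketch of what an offset-aware embedding lookup along the lines of tacotron/modules.py presumably does; the names and structure are illustrative assumptions, not the repository's actual code:

```python
import tensorflow as tf  # TF 1.x, matching the rest of the codebase

def embed_accent_type(accent_ids, num_accent_type, embedding_dim, index_offset):
    # Assumed behaviour: labels are stored in [index_offset, index_offset + num_accent_type),
    # so the lookup first checks the lower bound, then shifts the ids down to [0, num_accent_type).
    assert_op = tf.assert_greater_equal(
        accent_ids, tf.cast(index_offset, accent_ids.dtype),
        message="accent label below accent_type_offset")
    with tf.control_dependencies([assert_op]):
        shifted_ids = accent_ids - tf.cast(index_offset, accent_ids.dtype)
    table = tf.get_variable("accent_type_embedding",
                            shape=[num_accent_type, embedding_dim])
    return tf.nn.embedding_lookup(table, shifted_ids)
```

With labels in [0, 99], any accent_type_offset greater than 0 makes such an assert fail as soon as a 0 label appears, which is consistent with the x >= y error above.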

Do you have any idea why this could be happening?

Something tells me it should compare indices rather than the actual values, but I have not yet found the reason.

Kind Regards

philippbb commented 3 years ago

I am sorry, for some reason the link to where the error happens did not work: https://github.com/nii-yamagishilab/tacotron2/blob/5052c254f95fa6b51c4c1939a0171abb0c06835c/tacotron/modules.py#L65

Here are some additional details from my hparams:

"initial_learning_rate": 0.0005, "outputs_per_step": 2, "max_iters": 500, "attention": "forward", "cumulative_weights": false, "attention_kernel": 10, "attention_filters": 5, "use_zoneout_at_encoder": true, "decoder_version": "v2", "dataset": "jsut.dataset.DatasetSource", "target_file_extension": "target.tfrecord", "save_checkpoints_steps": 1343, "tacotron_model": "DualSourceSelfAttentionTacotronModel", "encoder": "SelfAttentionCBHGEncoderWithAccentType", "decoder": "DualSourceTransformerDecoder",

embedding_dim=224,

### accent
use_accent_type=True,
accent_type_embedding_dim=32,
num_accent_type=100,
accent_type_offset=0,#0xC80,
accent_type_unknown=0,#0xCE4, #0xC80,
accent_type_prenet_out_units=(32, 16),
encoder_prenet_out_units_if_accent=(224, 112),
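
If the usual interpretation of accent_type_offset and num_accent_type holds (this is an assumption about how the two relate, not something documented in the thread), valid labels lie in [accent_type_offset, accent_type_offset + num_accent_type):

```python
# Hypothetical helper showing the assumed relationship between the hparams
# and the range of accent labels the embedding table can accept.
def valid_accent_range(accent_type_offset, num_accent_type):
    return accent_type_offset, accent_type_offset + num_accent_type  # [low, high)

print(valid_accent_range(0, 100))  # (0, 100): labels 0..99, matching the data described above
```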

In the jsut dataset I configured the data like below and adjusted the rest of the file.

```python
class SourceData(namedtuple("SourceData",
                            ["id", "key", "source", "source_length", "openjsource", "text"])):
    pass


def parse_preprocessed_source_data(proto):
    features = {
        'id': tf.FixedLenFeature((), tf.int64),
        'key': tf.FixedLenFeature((), tf.string),
        'source': tf.FixedLenFeature((), tf.string),
        'source_length': tf.FixedLenFeature((), tf.int64),
        'openjsource': tf.FixedLenFeature((), tf.string),
        'text': tf.FixedLenFeature((), tf.string),
    }
    parsed_features = tf.parse_single_example(proto, features)
    return parsed_features
```
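
Since the 'source' and 'openjsource' fields are parsed as raw byte strings, the dataset code also has to decode them back into integer tensors somewhere. A rough sketch of what that step might look like (the function name and dtypes are assumptions, so check them against the repo's own dataset code):

```python
import tensorflow as tf  # TF 1.x

def decode_preprocessed_source_data(parsed):
    # Hypothetical decode step: turn the serialized byte strings back into id sequences.
    source = tf.decode_raw(parsed["source"], tf.int64)        # text / phoneme ids
    accent = tf.decode_raw(parsed["openjsource"], tf.int64)   # accent type ids (assumed dtype)
    return SourceData(
        id=parsed["id"],
        key=parsed["key"],
        source=source,
        source_length=parsed["source_length"],
        openjsource=accent,
        text=parsed["text"],
    )
```

Forgetting to update one of these helpers after changing the schema would also be consistent with the next comment below.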

philippbb commented 3 years ago

Looking at the code now, I think I must have missed updating some of the other functions in the dataset file.

TanUkkii007 commented 3 years ago

Sorry for my late response. Setting accent_type_offset=0 and accent_type_unknown=0 looks right for accent type labels with a minimum value of 0 and a maximum of 99. The only cause I can think of is that your data contains accent types outside [0, 99], so please check the range of your accent type labels.
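
One way to double-check that (a standalone script over the preprocessed labels, not something from the repository) could look like this:

```python
import numpy as np

def check_accent_range(accent_label_sequences, offset=0, num_accent_type=100):
    # Report any accent label outside the assumed valid range [offset, offset + num_accent_type).
    lo, hi = offset, offset + num_accent_type
    bad = sorted({int(v)
                  for seq in accent_label_sequences
                  for v in np.asarray(seq).ravel()
                  if v < lo or v >= hi})
    if bad:
        print("out-of-range accent labels:", bad)
    else:
        print("all accent labels are within [%d, %d)" % (lo, hi))
```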

philippbb commented 3 years ago

Thank you for your answer.

I was unsure whether the offset value comes from the pitch accent dataset or not, but your answer makes it clear.