make_data_splits has different parameters in na subclass

shuttle1987 commented 6 years ago

https://github.com/persephone-tools/persephone/blob/cfe5096e929edcb45b0eb8133c873b9f6e8361f0/persephone/datasets/na.py#L531

But in the base class we have:

https://github.com/persephone-tools/persephone/blob/cfe5096e929edcb45b0eb8133c873b9f6e8361f0/persephone/corpus.py#L292

shuttle1987 commented 6 years ago

Is this in some way related to #157?

alexis-michaud commented 6 years ago

Does that have to do with the recent (2018) modifications to the Na data as hosted on GitHub? These modifications include distinguishing 2 types of separators above the syllable (in addition to the vertical bar | for tone-group boundaries, there's now a diamond ♢ before extrametrical syllables).

If so, let me know if I can help.

shuttle1987 commented 6 years ago

I would like @oadams to confirm but I suspect that ♢ is handled here:

https://github.com/persephone-tools/persephone/blob/cfe5096e929edcb45b0eb8133c873b9f6e8361f0/persephone/datasets/na.py#L172-L179

alexis-michaud commented 6 years ago

Oh yes I see! Apologies for my novice comments & questions.

Passing comment: it is fantastic to see all the great work being done on the code: clean & crisp comments, neat overall architecture... This is wonderfully crafted!

oadams commented 6 years ago

The way the diamond is handled by that above code is to treat it the same way as the vertical bar is treated: remove it if we are not predicting tone group boundaries (TGBs), or replace it with a vertical pipe if we are. Moving forward though, perhaps we might want to try predicting extrametrical syllables using the distinct diamond symbol? That isn't covered by the above code.
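A minimal sketch of that behaviour, for illustration only. This is not the code in na.py: the function name `normalize_boundaries`, the `predict_tgb` flag, and the assumption that a transcription is already a list of tokens are all hypothetical.

```python
TGB = "|"            # tone-group boundary
EXTRAMETRICAL = "♢"  # diamond placed before extrametrical syllables


def normalize_boundaries(tokens, predict_tgb):
    """Hypothetical helper: treat the diamond exactly like the vertical bar.

    If tone-group boundaries are being predicted, both symbols become "|";
    otherwise both are removed from the token sequence.
    """
    out = []
    for tok in tokens:
        if tok in (TGB, EXTRAMETRICAL):
            if predict_tgb:
                out.append(TGB)
            # When not predicting TGBs, the boundary is simply dropped.
        else:
            out.append(tok)
    return out
```

Predicting extrametrical syllables separately would mean keeping the diamond as its own output label instead of collapsing it into "|", which this sketch (like the linked code) does not do.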

oadams commented 6 years ago

As for the OP: I don't think it's directly related to #157. The reason the parameters are different is that the Na data has some features that may not be common to all corpora. That is, the corpus is divided into a number of narratives. The valid_story and test_story parameters were used to specify which of these stories were used in the validation and test sets. This is as opposed to the default behaviour, which randomly selects utterances from the whole corpus to serve as training, validation and test sets.
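To make the contrast concrete, here is a rough sketch of the two strategies. It is not the persephone implementation: the function names, the `(story, utterance_id)` pair layout, and the `valid_frac`, `test_frac` and `seed` parameters are assumptions made for illustration; only `valid_story` and `test_story` come from the discussion above.

```python
import random


def split_by_story(utterances, valid_story, test_story):
    """Story-based split: whole narratives are held out.

    `utterances` is assumed to be an iterable of (story, utterance_id) pairs.
    """
    train, valid, test = [], [], []
    for story, utt_id in utterances:
        if story == valid_story:
            valid.append(utt_id)
        elif story == test_story:
            test.append(utt_id)
        else:
            train.append(utt_id)
    return train, valid, test


def split_randomly(utterance_ids, valid_frac=0.1, test_frac=0.1, seed=0):
    """Random split: utterances are drawn from the whole corpus."""
    ids = list(utterance_ids)
    random.Random(seed).shuffle(ids)  # seeded so the split is reproducible
    n_valid = int(len(ids) * valid_frac)
    n_test = int(len(ids) * test_frac)
    valid = ids[:n_valid]
    test = ids[n_valid:n_valid + n_test]
    train = ids[n_valid + n_test:]
    return train, valid, test
```

The two functions need different arguments, which is the same reason the subclass signature differs from the base class one.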

alexis-michaud commented 6 years ago

"Moving forward though, perhaps we might want to try predicting extrametrical syllables using the distinct diamond symbol?"

This is not feasible, I think. Extrametricality is a morpho-phonological concept, not a phonetic one. My hypothesis is that in some cases it is simply undetectable in the audio: extrametrical syllables cannot be identified as such from the acoustic signal unless there is a higher-level model (language model, not acoustic model).

For optimal results of a phonemic transcription tool, the transcription would need to be closer to the phonetic surface: instead of a diamond, there should be a bona fide tone-group boundary in those cases where the 'diamond' boundary makes a difference to the tonal string, and otherwise, no indication at all (no boundary). This would require an additional layer in the annotation: one that's closer to the phonetics.

Since I don't plan to do this improvement to the corpus anytime soon, the choice is between treating diamonds as tone-group boundaries, or overlooking the diamonds altogether. If someone had time to lavish on this issue, this could be empirically tested: comparing the results (error rates for tone-group boundaries) under these 2 settings. Pending this empirical test, I believe that the current setup is the better of the two: treating diamonds as tone-group boundaries.

oadams commented 6 years ago

Thanks for the thoughts. Let's leave the preprocessing of diamonds as is for now then.

make_data_splits has different parameters in na subclass #160