persephone-tools / persephone

A tool for automatic phoneme transcription
Apache License 2.0

How to handle labels found in corpus but not in supplied label parameter? #170

Closed. shuttle1987 closed this issue 6 years ago.

shuttle1987 commented 6 years ago

Since implementing #169, which deals with #167, I have the following situation that I'd like some help designing the API for:

How do we handle the case where the Corpus contains a set of labels, say {"a", "b", "c"}, and someone then constructs a Corpus with a labels parameter containing only {"a", "b"}?

Do we want to show the user some sort of message that there are labels found in the corpus that were not in the supplied labels parameter (for example: label "c" found in corpus but not in provided labels)? Should the labels parameter act as some sort of filter on the corpus or not? Or do we want to do something else entirely?
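To make the options concrete, here is a rough sketch of what warning vs. filtering could look like (the helper name and signature are made up for illustration, not existing API):

```python
import warnings

def reconcile_labels(corpus_labels, supplied_labels, filter_corpus=False):
    """Hypothetical helper: compare labels found in the corpus with the
    user-supplied labels parameter."""
    unexpected = corpus_labels - supplied_labels
    if unexpected:
        if filter_corpus:
            # Option A: treat the labels parameter as a filter.
            return corpus_labels & supplied_labels
        # Option B: keep everything but tell the user about the mismatch.
        warnings.warn("Labels {} found in corpus but not in provided "
                      "labels".format(sorted(unexpected)))
    return corpus_labels

reconcile_labels({"a", "b", "c"}, {"a", "b"})
# UserWarning: Labels ['c'] found in corpus but not in provided labels
```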

I'd appreciate some advice on how this case should be handled.

alexis-michaud commented 6 years ago

That's an easy one: automate the sending of an e-mail message to the ethics committee back at the home institution of the author of the transcription & charge them with Deceit for mismatch between declared set of labels & actual set of labels.

Or maybe notify the user first: yes, it is very useful for the author of the transcription to get a report on which labels are used in the corpus. Among future users of Persephone there will be linguists who don't have the computational skills to do a 'sanity check' on their transcriptions (crank out the list of labels they actually used, so they can compare it to the list of labels that they think they use). Transcription systems change over the years (from the first fieldwork to transcription work done years or decades later) & updating texts is such a lot of work that not everyone carries out the updates consistently.
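Such a sanity check need not be complicated. A sketch, assuming plain-text transcriptions with one whitespace-separated label per token (real fieldwork formats are messier than this):

```python
from collections import Counter

def labels_actually_used(paths):
    """Count every label that actually occurs in a set of transcriptions."""
    counts = Counter()
    for path in paths:
        with open(path, encoding="utf-8") as f:
            for line in f:
                counts.update(line.split())
    return counts

# Labels used but never declared -- the ones worth investigating:
# sorted(set(labels_actually_used(paths)) - declared_labels)
```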

alexis-michaud commented 6 years ago

Quoting from a paper reporting on early tests with Yongning Na (section 4.1):

> Data preparation offered a chance to check that all the data conformed to the phonological description. At data conversion from XML to plain text, a list of segments was produced, and compared with the list of sounds provided for the language (by the second author). This comparison brought out a handful of inconsistencies in the notation, such as the use of /ẽ/ for an interjection appearing in some of the texts. This prompted a return to the data, which revealed that these were in fact instances of ‘yes’ (canonical transcription: /ĩ/) that had been transcribed before the second author identified the nature of this morpheme. Systematic examination of the passages at issue revealed that this /ẽ/ was in most cases a response to a comment or yes/no question on the part of someone in the audience, confirming the interpretation of this morpheme as a sign of approval.

shuttle1987 commented 6 years ago

@alexis-michaud thanks for this; it shows that clear and immediate feedback on data discrepancies is important, so we will make sure to address this.

alexis-michaud commented 6 years ago

Yes, clear and immediate feedback on data discrepancies will be great. This relates to preprocessing: the feedback could be used as an early-stage 'diagnosis' of how much preprocessing remains to be done on the corpus before Persephone can be applied with best results.

In addition to the issue of transcriptional inconsistencies, there is the issue of code-switching. In principle, preprocessing needs to be done before using a corpus as input to Persephone, so that the input contains only data from one language. But I guess that in many (most?) corpora there will remain some words (or even entire sentences) in other languages, due to code-switching: in fieldwork, the investigator & the consultant often share a contact language that is native to neither of them (typically the national language: for instance, English or Chinese or Arabic or Portuguese) and they may talk in that language during part of the recordings. If this has not been properly encoded in the transcription files (as language information inside the XML tag for a given chunk of transcription: lang="cmn" or the like) and taken into account at preprocessing, it will make trouble for training an acoustic model.
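A sketch of that preprocessing step, assuming the transcription is XML with a lang attribute on each chunk (the element and attribute names here are placeholders; real formats such as ELAN differ):

```python
import xml.etree.ElementTree as ET

TARGET_LANG = "xyz"  # placeholder ISO 639-3 code of the corpus language

def target_language_chunks(xml_path):
    """Yield only the transcription chunks tagged as the target language."""
    tree = ET.parse(xml_path)
    for chunk in tree.iter("S"):  # "S" is a placeholder element name
        # Untagged chunks are assumed to be in the target language;
        # anything explicitly tagged otherwise (e.g. lang="cmn") is skipped.
        if chunk.get("lang", TARGET_LANG) == TARGET_LANG:
            yield chunk
```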

This is a situation where interdisciplinary collaboration matters: these cases, once spotted, are easy for the corpus author to explain (for her it goes without saying, which is why she may not realize that it needs to be encoded explicitly), whereas they are hard to puzzle out for people who don't know the languages at issue. Each data set has its own context, and if sufficient information is not provided and neatly encoded, it can be excruciatingly hard for users to 'reconstruct' later.

For instance, in the Japhug corpus (transcribed in International Phonetic Alphabet), some Chinese loanwords are transcribed with Chinese characters, thus:

> tɕe ɯ-me nɯnɯ andi xiaoshuigou 小水沟 kɯre tha-zmɤrʑaβ-nɯ tɕe, nɯre thɯ-ɣe

There is an implicit convention that is clear to someone who knows Chinese: the author, Guillaume Jacques @rgyalrong, provided a transcription of the loan in Standard Mandarin romanization, xiaoshuigou, followed by the transcription in Chinese characters, 小水沟. Something needs to be done about this so that the data can serve as input to Persephone: simply excluding sentences containing such loans from the training corpus, or maybe converting the transcription of loanwords to IPA.

(The latter solution would be the most interesting thing to do linguistically, as it would add value to the transcription: speakers of Japhug don't pronounce this word in the usual Southwestern Mandarin way. It would also enrich the acoustic model, which might ultimately succeed in identifying the phonetic form of loanwords with high accuracy, too. And it would be good to choose the way to go as early as possible, because such adjustments are time-consuming: in this case it requires making up one's mind about the best way to transcribe the consultant's Chinese pronunciation, and this is a non-trivial linguistic issue.)
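For the first, simpler option (excluding such sentences from the training corpus), a rough filter on the Unicode block for CJK ideographs would do. A sketch:

```python
import re

# CJK Unified Ideographs block, which covers loans written as 小水沟.
CJK = re.compile(r"[\u4e00-\u9fff]")

def without_chinese_character_loans(utterances):
    """Drop utterances containing Chinese-character loanword notation."""
    return [u for u in utterances if not CJK.search(u)]

utterances = [
    "tɕe ɯ-me nɯnɯ andi xiaoshuigou 小水沟 kɯre tha-zmɤrʑaβ-nɯ tɕe, nɯre thɯ-ɣe",
    "nɯre thɯ-ɣe",
]
print(without_chinese_character_loans(utterances))  # keeps only the second
```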

oadams commented 6 years ago

For now, if a user supplies labels and they are inconsistent with what is automatically determined, an exception is thrown. This applies specifically to the `Corpus.__init__()` constructor, which assumes user-supplied data is already preprocessed. Therefore the only function the labels argument to this constructor serves is to ensure that what is automatically found is consistent with the user's expectations.
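In sketch form, the check amounts to something like this (a simplified illustration of the behaviour described above, not the literal source):

```python
def check_supplied_labels(found_labels, supplied_labels):
    """Raise if user-supplied labels disagree with labels found in the data.

    Simplified sketch; the real check lives in the Corpus constructor and
    may use a library-specific exception type."""
    if supplied_labels != found_labels:
        raise ValueError(
            "Supplied labels {} are inconsistent with labels found in the "
            "corpus: {}".format(sorted(supplied_labels), sorted(found_labels)))
```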

Filtering, on the other hand, is handled via the `LabelSegmenter` passed to other corpus constructors. For example, `from_elan()` will not expect pre-segmented units, so the behaviour there is going to be more sophisticated.
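A toy illustration of a segmenter doubling as a filter (the interface is illustrative, not the actual `LabelSegmenter` signature):

```python
def greedy_segment(utterance, labels):
    """Segment an utterance into known labels, longest match first;
    return None if some stretch cannot be segmented (i.e. filter it out)."""
    segments, i = [], 0
    while i < len(utterance):
        match = max((lab for lab in labels if utterance.startswith(lab, i)),
                    key=len, default=None)
        if match is None:
            return None  # contains material outside the label set
        segments.append(match)
        i += len(match)
    return segments

print(greedy_segment("abba", {"a", "b", "bb"}))  # ['a', 'bb', 'a']
print(greedy_segment("abca", {"a", "b", "bb"}))  # None
```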

EDIT: Given Alexis's comments, I'm happy to re-open this for the purposes of thinking about automatic segmentation and filtering of utterances in constructors like from_elan().