Speakers parameter in `Corpus` is unused

shuttle1987 commented 6 years ago

Currently the speakers parameter in Corpus.__init__ is unused.

This strikes me as an issue because from_elan calls the Corpus constructor with this speakers parameter but then nothing happens.

shuttle1987 commented 6 years ago

This looks like a bit of a bigger task than I first thought because it implies that a fix will involve figuring out how to deal with speakers then implementing the code that will do that.

I added in a NotImplemented exception if speakers is passed in as a parameter since the parameter is unused right now and a loud failure seems to be the best immediate fix.

alexis-michaud commented 6 years ago

"figuring out how to deal with speakers": this is a very important feature for use of Persephone on a broader range of fieldwork data.

As I understand, the idea would be to sort the data (roughly speaking: the sentence-level units) by speakers. Then:

in case it's just a few sentences in total for a given speaker, simply exclude those sentences from the acoustic model.
in case there's enough data for each speaker (for instance, if the training corpus is a big corpus of dialogues with speaker identity encoded in the transcription), train 1 model per speaker?

By default, the acoustic model created is speaker-specific, right? So the simplest use of the speakers parameter is to exclude those portions of the corpus that are said by another speaker.

For instance, as part of efforts to improve the Na corpus, I recently added a tag to passages said by the investigator, so that they can be automatically identified for exclusion from the training set. The tag is inside the <S> markup. For example:

<S id="MUSHROOMS_S107" who="investigator">

In the present state of the Na corpus, there are just thirty <S> units said by someone other than the main consultant, and it's typically just a couple of words spoken as a rejoinder: hundreds or thousands of times less than the amount said by the main consultant. So the obvious thing to do is to exclude these <S> units. That is one of various ways to improve the acoustic model for Na in future. Nothing urgent or really important, as the improvement is likely to be very small for this corpus (though it would still be good fun to gain further lowering of error rates: trying to break our own records!), but for most other fieldwork corpora there are more speakers and more frequent changes in speech turns than in the Na corpus, so this feature is really important.

(As ever: apologies if this comment is too wide of the mark!)

shuttle1987 commented 6 years ago

Appreciate the feedback here.

"By default, the acoustic model created is speaker-specific, right?" while the library can do this I don't know that the implementation as it stands does this in all cases. This is something I'd very much like to clarify as this would let me improve the code that implements this functionality.

shuttle1987 commented 6 years ago

@oadams:

For instance, as part of efforts to improve the Na corpus, I recently added a tag to passages said by the investigator, so that they can be automatically identified for exclusion from the training set. The tag is inside the <S> markup. For example:

Is this something we should make a feature request issue for?

oadams commented 6 years ago

I added in a NotImplemented exception if speakers is passed in as a parameter since the parameter is unused right now and a loud failure seems to be the best immediate fix.

Sounds good!

Yes, the idea was to filter the corpus based on speakers, but I never actually got around to doing that.

Looking at the code again, it seems it doesn't really make sense to have a speakers parameter in the default Corpus constructor, since that constructor won't have any information about the speakers in the utterances anyway. If we want to filter on speakers from ELAN files or Pangloss XML files, that should be happening in their respective class method constructors. Or to minimize code reuse, there should be a class method that takes Utterances and can thus do speaker filtering. The ELAN corpus constructor would then call this, as would any other format-specific class methods such as a yet-to-be-implemented Pangloss one. Does this make sense?

shuttle1987 commented 6 years ago

I propose the Corpus keep a record of who the speakers are but that the base Corpus doesn't do the processing on the speakers. If this is parameter is not given it will just store an empty list.

Does this sound reasonable?

oadams commented 6 years ago

This sounds good to me. We might not want it passed as an argument though, but rather it can be inferred from the Utterance objects. There's not much gain knowing that speakers x,y,z spoke in a Corpus unless we have a mapping from utterances to speakers. The Utterance objects satisfy this.

So basically, methods that call the default constructor just provide Utterance objects and if they know the speaker ID, that is part of the utterance object. The constructor can infer the speakers from these utterances.

shuttle1987 commented 6 years ago

Given the ability to infer the speakers from the utterances I think we should get rid of the speakers parameter in the base Corpus.

oadams commented 6 years ago

Sounds good.

persephone-tools / persephone

Speakers parameter in `Corpus` is unused #145