Open shuttle1987 opened 6 years ago
This looks like a bit of a bigger task than I first thought because it implies that a fix will involve figuring out how to deal with speakers
then implementing the code that will do that.
I added in a NotImplemented exception if speakers
is passed in as a parameter since the parameter is unused right now and a loud failure seems to be the best immediate fix.
"figuring out how to deal with speakers
": this is a very important feature for use of Persephone on a broader range of fieldwork data.
As I understand, the idea would be to sort the data (roughly speaking: the sentence-level units) by speakers. Then:
By default, the acoustic model created is speaker-specific, right? So the simplest use of the speakers
parameter is to exclude those portions of the corpus that are said by another speaker.
For instance, as part of efforts to improve the Na corpus, I recently added a tag to passages said by the investigator, so that they can be automatically identified for exclusion from the training set. The tag is inside the <S>
markup. For example:
<S id="MUSHROOMS_S107" who="investigator">
In the present state of the Na corpus, there are just thirty <S>
units said by someone other than the main consultant, and it's typically just a couple of words spoken as a rejoinder: hundreds or thousands of times less than the amount said by the main consultant. So the obvious thing to do is to exclude these <S>
units.
That is one of various ways to improve the acoustic model for Na in future. Nothing urgent or really important, as the improvement is likely to be very small for this corpus (though it would still be good fun to gain further lowering of error rates: trying to break our own records!), but for most other fieldwork corpora there are more speakers and more frequent changes in speech turns than in the Na corpus, so this feature is really important.
(As ever: apologies if this comment is too wide of the mark!)
Appreciate the feedback here.
"By default, the acoustic model created is speaker-specific, right?" while the library can do this I don't know that the implementation as it stands does this in all cases. This is something I'd very much like to clarify as this would let me improve the code that implements this functionality.
@oadams:
For instance, as part of efforts to improve the Na corpus, I recently added a tag to passages said by the investigator, so that they can be automatically identified for exclusion from the training set. The tag is inside the
<S>
markup. For example:
Is this something we should make a feature request issue for?
I added in a NotImplemented exception if speakers is
passed
in as a parameter since the parameter is unused right now and a loud failure seems to be the best immediate fix.
Sounds good!
Yes, the idea was to filter the corpus based on speakers, but I never actually got around to doing that.
Looking at the code again, it seems it doesn't really make sense to have a speakers parameter in the default Corpus
constructor, since that constructor won't have any information about the speakers in the utterances anyway. If we want to filter on speakers from ELAN files or Pangloss XML files, that should be happening in their respective class method constructors. Or to minimize code reuse, there should be a class method that takes Utterance
s and can thus do speaker filtering. The ELAN corpus constructor would then call this, as would any other format-specific class methods such as a yet-to-be-implemented Pangloss one. Does this make sense?
I propose the Corpus
keep a record of who the speakers are but that the base Corpus
doesn't do the processing on the speakers. If this is parameter is not given it will just store an empty list.
Does this sound reasonable?
This sounds good to me. We might not want it passed as an argument though, but rather it can be inferred from the Utterance objects. There's not much gain knowing that speakers x,y,z spoke in a Corpus unless we have a mapping from utterances to speakers. The Utterance objects satisfy this.
So basically, methods that call the default constructor just provide Utterance objects and if they know the speaker ID, that is part of the utterance object. The constructor can infer the speakers from these utterances.
Given the ability to infer the speakers from the utterances I think we should get rid of the speakers parameter in the base Corpus.
Sounds good.
Currently the
speakers
parameter inCorpus.__init__
is unused.This strikes me as an issue because
from_elan
calls theCorpus
constructor with thisspeakers
parameter but then nothing happens.