alecristia closed this issue 6 years ago
Datasets that could be used for this (all of the following need to be processed):
Hopefully this is trivial; you could imagine just filtering the RTTM produced by Yunitator with sed and awk, replacing every class label other than CHI. As I recall, the label set is:
CHI
FEM
MAL
SIL
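To make the idea concrete, here is one way the relabeling could be sketched with awk, assuming standard RTTM output where the speaker label is the 8th field of each SPEAKER line; the file names and the OTHER label are illustrative, not anything Yunitator actually produces:

```shell
# Collapse every non-CHI label in a Yunitator RTTM into a single OTHER class.
# Assumes standard RTTM: "SPEAKER <file> <chan> <tbeg> <tdur> <ortho> <stype> <name> <conf> <slat>",
# so the class label (CHI/FEM/MAL/SIL) sits in field 8.
# "yunitator_output.rttm" and "two_class.rttm" are hypothetical file names.
awk '$1 == "SPEAKER" && $8 != "CHI" { $8 = "OTHER" } { print }' \
    yunitator_output.rttm > two_class.rttm
```

Alternatively, SIL lines could be dropped entirely rather than merged into OTHER, depending on whether the samplers want explicit non-speech segments.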
I think the main idea is to train a new Yunitator on "ACLEW round 1-3" data in addition to Vandam and ASE (ACLEW Starter) - not sure if CHILDES and IDS/ADS samples are available.
I second Eric's comment -- it might be good to get started with the data you already have, even if we'll increase the training set later on (which might make for a nice experiment anyway). Remember that the data folks are waiting for this tool to do their next round of sampling.
All of the data I mentioned are already available. All that remains is to format the annotations in the same format you have used before. Here are the links:
CHILDES Paidologos:
IDS/ADS samples:
Oh good. I also just had to request the latest media.talkbank.org data password, and can supply if needed.
This is maybe the second situation where we'd like to train up a new Yunitator variant. Maybe we can generalize and create a task for Yun: produce documentation and give examples of how we ("a novice") can train Yunitator on new data to produce new models & class labels?
Closing this issue - this is a suboptimal solution, and we should focus our forces on better ones (e.g. 4- or 5-class labeling).
I meant that if we can re-train a new Yunitator, it could be on 4- or 5-class labeled data, making it more optimal. Agree the number of classes currently produced by Yunitator is suboptimal.
Oh, I meant that the 2-class solution I had brought up in this issue was even more suboptimal (even less optimal?). It was just an idea that came up given the scarcity of data -- but I think we can do better, for instance with the 3-class version currently implemented. Re: training, I don't think that is a priority given our current user base (ACLEW), since none of us independently has enough data for retraining. In fact, it seems that even all together we don't have enough data for training 3 classes! So a retrainable module sounds more useful in theory than in our current world...
That is, distinguish the key child (one class) from all others (a second class containing the mother, all female adults, male adults, other children, and none).