mlml / autovot

Trainable algorithm for automatic measurement of voice onset time
GNU Lesser General Public License v3.0

Training and testing options? #32

Closed msonderegger closed 10 years ago

msonderegger commented 10 years ago

We currently have most of the parameters used by InitialVotTrain and InitialVotDecode hard-coded in the Python helper scripts, e.g. epochs=2, loss_eps=4, min_vot_length=5 in training. The exceptions are min_vot_length and max_vot_length in auto_vot_decode.py, which I exposed as arguments because they needed to differ between the voiced and voiceless CAS data (which is being posted as the working example) for best results.

In general, the user might need to change some of the parameters for best results. Yossi and I exchanged some email about this a while back, and I think the conclusion was that we should (1) make clear somewhere in the documentation what the default parameter values are and what each parameter means, and (2) allow the user to change them by supplying a configuration file in some fixed format (so we don't end up with even more command line arguments, one per parameter!). However, this will take a non-trivial amount of extra work on the wiki and in the Python scripts. It seems important, but it's something we can do after the beta is released. @thealk @jkeshet: how does that sound?

(We may even want to explicitly put in the documentation the example of using different min_vot_length values for testing on the voiced CAS data -- if you set this too high for voiced stops or too low for voiceless stops (in English), you get worse results.)
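To make the configuration-file idea concrete, here is a minimal sketch of how a helper script might merge user-supplied values over the hard-coded training defaults quoted above (epochs=2, loss_eps=4, min_vot_length=5). The INI format, the `[train]` section name, and the names `TRAIN_DEFAULTS` and `load_train_params` are all assumptions made for illustration; nothing in this thread fixes the file format or says how autovot would implement it.

```python
import configparser

# Defaults quoted in this issue as currently hard-coded for training.
TRAIN_DEFAULTS = {"epochs": 2, "loss_eps": 4, "min_vot_length": 5}


def _convert(text):
    """Parse a config value as an int if possible, otherwise as a float."""
    try:
        return int(text)
    except ValueError:
        return float(text)


def load_train_params(config_text, section="train"):
    """Return the defaults, overridden by values in config_text.

    The INI layout and the section name are hypothetical choices.
    """
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    params = dict(TRAIN_DEFAULTS)
    if parser.has_section(section):
        for key, value in parser.items(section):
            if key not in TRAIN_DEFAULTS:
                raise KeyError("unknown training parameter: %s" % key)
            params[key] = _convert(value)
    return params


# A user who only wants more epochs overrides just that one value.
example_config = """
[train]
epochs = 5
"""
print(load_train_params(example_config))
```

Rejecting unknown keys is a deliberate choice in this sketch: a misspelled parameter name fails loudly instead of silently falling back to the default.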

thealk commented 10 years ago

I think this sounds like a good idea. So the user would still be able to toggle the parameters via the command line options, but wouldn't necessarily have to? That would clean up the documentation as well: all default parameter settings could be defined ahead of time, which would make the usage descriptions for each file a lot simpler. This could actually be done now, I think (I was going to post a question about that), but it might make things more confusing at this point. I can at least mention the min/max vot length settings in the tutorial.

msonderegger commented 10 years ago

No -- unfortunately, this wouldn't affect the current command line options (except for min_vot_length and max_vot_length) for the python scripts. I'm thinking of the command line options for InitialVotDecode and InitialVotTrain (you can see them all by running either one with no arguments on the command line), for example:

Morgans-MacBook-Air:bin morgan$ ./InitialVotTrain

USAGE: InitialVotTrain [options] train_instances_filelist train_labels_filename classifier_filename

Initial VOT detection - Passive Aggressive training

OPTIONS:
 -val_instances_filelist  validation instances filelist
 -val_labels_filename     validation labels filename
 -epochs                  number of epochs [1]
 -min_vot_length          min. vot duration in msec [10]
 -max_vot_length          max. vot duration in msec [200]
 -max_onset               max. time to onset in msec [150]
 -C                       Trade-off between regularization and loss (running PA)
 -loss_eps                epsilon parameter of the loss
 -ep_on                   epsilon parameter of the onset loss
 -ep_off                  epsilon parameter of the offset loss
 -ignore_features         ignore the following features. E.g., "3,7,19".
 -direct_loss             use direct-loss update with the given epsilon [0.0]
 -vot_loss                use the VOT loss instead of alignment loss
 -training_method         PA, Pegasos, Perceptron, DirectLossMin
 -pos_only                Assume only positive VOTs
 -load_classifier         load classifier for adaptation
 -kernel_expansion        use kernel expansion of type 'poly2' or 'rbf2'
 -sigma                   if kernel is rbf2 or rbf3 this is the sigma
 -verbose                 log reporting level [ERROR, WARNING, INFO, or DEBUG]

These programs are called by the Python scripts, so the user never has to run them directly. Nonetheless, she may want to change some of these parameters, depending on the data. (For example, for a very small dataset, it would make sense to increase epochs.)
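As a rough illustration of how a helper script could pass such parameters through, the sketch below turns a parameter dictionary into an argument list that follows the USAGE line above (options first, then the three positional file names). The function name `build_train_command` and the example file names are hypothetical. It also assumes each option is passed as `-name value`, which the usage text suggests but does not spell out, so it only covers options that take a value.

```python
def build_train_command(params, train_instances, train_labels, classifier,
                        binary="InitialVotTrain"):
    """Build an argument list for InitialVotTrain from a parameter dict.

    Assumes every option is given as "-name value"; flags that take no
    value (e.g. -pos_only) are not handled in this sketch.
    """
    cmd = [binary]
    for name in sorted(params):  # sorted for a reproducible command line
        cmd += ["-" + name, str(params[name])]
    # Positional arguments come after the options, as in the USAGE line.
    cmd += [train_instances, train_labels, classifier]
    return cmd


# Hypothetical file names; a small dataset might call for more epochs.
cmd = build_train_command(
    {"epochs": 5, "min_vot_length": 5},
    "train_instances.txt", "train_labels.txt", "vot.classifier",
)
print(" ".join(cmd))
```

The resulting list could be handed to `subprocess.check_call(cmd)` by the calling script, which would keep the individual parameters off the Python scripts' own command lines.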

thealk commented 10 years ago

Ah, got it. Still sounds like a good idea to me, but yes, probably for next release.

jkeshet commented 10 years ago

We should definitely do that. I will add the support for the training parameter configuration files and you or I will add the description.

jkeshet commented 10 years ago

We'll defer this to the next version.