ufal / udpipe

UDPipe: Trainable pipeline for tokenizing, tagging, lemmatizing and parsing Universal Treebanks and other CoNLL-U files
Mozilla Public License 2.0

Documentation examples or validation of command line arguments #31

ftyers closed this issue 7 years ago

ftyers commented 7 years ago

It would be handy to have some examples of how to use the training parameters in the documentation, or some kind of validation of command line arguments.

I am perhaps not the typical use case, but I have been trying to change the parameters for a while, and there is no indication of which ones are valid or invalid:

$ udpipe --tokenizer none --tagger none --train tr-ud-train.0.udpipe < UD_Turkish/tr-ud-train.conllu 
$ udpipe --tokenizer none --tagger none --parser swap --train tr-ud-train.1.udpipe < UD_Turkish/tr-ud-train.conllu 
$ udpipe --tokenizer none --tagger none --parser "structured_interval=8" --train tr-ud-train.2.udpipe < UD_Turkish/tr-ud-train.conllu 

These all seem to produce the same output model. I'm sure I'm doing something wrong, but I can't work out what it is from the documentation :)

vinbo8 commented 7 years ago

I believe you need to add an = sign for every parameter, so your commands ought to be:

$ udpipe --tokenizer=none --tagger=none --train tr-ud-train.0.udpipe < UD_Turkish/tr-ud-train.conllu
$ udpipe --tokenizer=none --tagger=none --parser=swap --train tr-ud-train.1.udpipe < UD_Turkish/tr-ud-train.conllu
$ udpipe --tokenizer=none --tagger=none --parser=structured_interval=8 --train tr-ud-train.2.udpipe < UD_Turkish/tr-ud-train.conllu

That said, you're likely getting the same model in the third case because structured_interval is supposed to be 8 by default. That doesn't explain the swap model being the same, though.

ftyers commented 7 years ago

I added some comments to the source, and it looks like you're right:

$ udpipe --tokenizer none --tagger none --parser "transition_system=swap" --train tr-ud-train.1.udpipe < UD_Turkish/tr-ud-train.conllu 
Loading training data: done.
Training the UDPipe model.

transition_system: swap

transition_oracle: static_lazy

structured_interval: 8
...

Also, it seems that structured_interval is 8 by default, so that wasn't a good thing to test with.

Perhaps there could be a --config or --verbose option which prints out the internal config?

foxik commented 7 years ago

The = signs are not really necessary: --parser a is the same as --parser=a.
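To illustrate why the two spellings are equivalent, here is a minimal sketch in POSIX sh. It is not UDPipe's actual argument handling; normalize_args is a hypothetical helper that prints a name and value pair for each long option, whichever form was used. It assumes every long option takes a value, so a flag without a value would be mis-parsed.

```shell
# Hypothetical helper, not UDPipe code: print "name<TAB>value" for each long
# option, accepting both "--name value" and "--name=value".
# Assumption: every long option takes a value.
normalize_args() {
  while [ "$#" -gt 0 ]; do
    case "$1" in
      --*=*)
        name=${1%%=*}      # everything before the first '='
        name=${name#--}    # drop the leading "--"
        value=${1#*=}      # everything after the first '='
        printf '%s\t%s\n' "$name" "$value"
        shift
        ;;
      --*)
        name=${1#--}
        if [ "$#" -ge 2 ]; then
          value=$2         # the value is the next argument
          shift 2
        else
          value=
          shift
        fi
        printf '%s\t%s\n' "$name" "$value"
        ;;
      *)
        shift              # positional arguments are skipped in this sketch
        ;;
    esac
  done
}
```

Both `normalize_args --parser transition_system=swap` and `normalize_args --parser=transition_system=swap` print the same pair, because only the first = is treated as the separator. For the same reason, --parser=structured_interval=8 yields the value structured_interval=8.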

The option parser currently ignores unknown options, which can unfortunately cause problems. That is what happened with --parser=swap: swap is not a known option name, so it was silently ignored. The transition system has to be given as transition_system=swap.
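Until the tool itself reports unknown options, a check before training can catch this kind of typo. The sketch below is in POSIX sh and is not part of UDPipe. It assumes the option string consists of name=value pairs separated by semicolons, and its list of known names is only an illustrative subset (the three parser options mentioned in this thread). unknown_parser_options and KNOWN_PARSER_OPTIONS are hypothetical names.

```shell
# Illustrative subset only: the three parser option names seen in this thread.
KNOWN_PARSER_OPTIONS="transition_system transition_oracle structured_interval"

# Hypothetical pre-flight check, not part of UDPipe.
# Assumption: options are "name=value" pairs separated by ';'.
# Prints each option name that is not in KNOWN_PARSER_OPTIONS and
# returns 1 if any was found, 0 otherwise.
unknown_parser_options() {
  opts=$1
  status=0
  old_ifs=$IFS
  IFS=';'
  set -f              # disable globbing while the string is split
  set -- $opts        # split on ';' into positional parameters
  set +f
  IFS=$old_ifs
  for pair in "$@"; do
    [ -n "$pair" ] || continue
    name=${pair%%=*}  # a bare word such as "swap" is its own name
    case " $KNOWN_PARSER_OPTIONS " in
      *" $name "*) ;;
      *) printf '%s\n' "$name"; status=1 ;;
    esac
  done
  return "$status"
}
```

With these assumptions, `unknown_parser_options "swap"` prints swap and returns 1, while `unknown_parser_options "transition_system=swap;structured_interval=8"` prints nothing and returns 0.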

I will think about improving the situation for UDPipe 1.1, at least by adding a --verbose option that reports the configuration used, or even by failing when unknown options are given.

ftyers commented 7 years ago

That's great, thanks! :)

foxik commented 7 years ago

UDPipe 1.1 now prints all configurable options during training, so you can check that the correct ones are used. (No --verbose is needed.)

In the end, UDPipe still does not fail if unknown tokenizer/tagger/parser options are passed, but I may revisit that decision in the future.