jflanigan / jamr

JAMR Parser and Generator
BSD 2-Clause "Simplified" License

Little Prince #40

Open Fije opened 5 years ago

Fije commented 5 years ago

Hi,

In the README it says there should be a parser model trained on the Little Prince data, with a corresponding config file. Is it correct that it isn't there, and if so, could you please share it? It would be very helpful!

Many thanks.

goodmami commented 5 years ago

@Fije I also found that scripts/config_Little_Prince.sh, mentioned in the Parser_Performance document, did not exist. However, the Step by Step Training document had a link to http://cs.cmu.edu/~jmflanig/models.tgz, which contains the Little Prince model. Then you just need a config pointing to these files. There is a scripts/old_configs/config_ACL2014_Little_Prince.sh config, but it may be too old to be usable, as I get the following error:
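As a sketch of what "point to these files" could look like: the directory layout below is an assumption about how models.tgz unpacks (check with `tar -tzf models.tgz`), not something I've verified against the old config.

```shell
# Hypothetical config fragment: point the old config at the unpacked
# models.tgz. Both paths here are assumptions; adjust to match your
# checkout and the actual tarball layout.
export JAMR_HOME="$HOME/jamr"
export MODEL_DIR="$JAMR_HOME/models/Little_Prince"
```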

```
...
 ### Running JAMR ###
Stage1 features = List(bias, length, fromNERTagger, conceptGivenPhrase)
Exception in thread "main" java.util.NoSuchElementException: key not found: 'stage1PhraseCounts
    at scala.collection.MapLike$class.default(MapLike.scala:228)
    at scala.collection.AbstractMap.default(Map.scala:58)
    at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
    at edu.cmu.lti.nlp.amr.ConceptInvoke.package$.Decoder(package.scala:25)
    at edu.cmu.lti.nlp.amr.AMRParser$.main(AMRParser.scala:113)
    at edu.cmu.lti.nlp.amr.AMRParser.main(AMRParser.scala)
```

There is no wordCounts.train in the models directory, and no --stage1-phrase-counts option in PARSER_OPTIONS in the old config. If you can find the tokenized input file (e.g., /tmp/jamr-1234.snt.tok), you can create the counts file by adapting the command in scripts/training/cmd.conceptTable.train:

```
$ cat /tmp/jamr-1234.snt.tok | scripts/training/wf | awk '{print $2,$1}' > models/Little_Prince/wordCounts.train
```
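For anyone without the JAMR scripts at hand, the pipeline above can be mimicked with standard tools, assuming scripts/training/wf behaves like a per-token frequency counter emitting `count word` pairs (as `uniq -c` does), which the awk step then swaps to `word count`:

```shell
# Sketch of the same transformation with coreutils; the assumption is
# that scripts/training/wf outputs lines of the form "<count> <word>".
printf 'the cat\nthe dog\n' \
  | tr ' ' '\n' \
  | sort \
  | uniq -c \
  | awk '{print $2,$1}'
# prints:
# cat 1
# dog 1
# the 2
```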

Then add the following to the config:

```
export PARSER_OPTIONS="
    --stage1-phrase-counts ${MODEL_DIR}/wordCounts.train  # <-- add this
    ...
```

This, at least, got me to decoding, but I'm not sure what other parameters that old config is missing, so the results may not be the same. When I parsed the Little Prince data using the standard config, smatch gave me an F-score of 0.57. When I parsed using the old config and the Little_Prince model, it gave 0.64, so it's an improvement.

Also see: #22 and #25, which are related to the error I had above.

Fije commented 5 years ago

Hi @goodmami,

Many thanks for your extensive explanation. cmd.conceptTable.train also asks for an aligned input file ("$INPUT".aligned). I'm not exactly sure what this should be: was it the original input file with alignments used for training the Little Prince model?

On my side, I only have a text file with a single sentence, to create an example. Thanks again!

goodmami commented 5 years ago

@Fije I'm not sure about that one, but it wasn't required for creating wordCounts.train as long as $INPUT.snt.tok already existed.

I just did `grep -R '\.aligned'` in the scripts/ directory, and it looks like scripts/preprocessing/cmd.aligned is what writes that file. Maybe try running that or scripts/preprocessing/PREPROCESS.sh (which calls cmd.aligned) and see what happens? I'd be happy to hear what you discover; I'll probably try again in a few days.