Hi Nicola,
Did you reimport your tasks? Unfortunately, MT-ComparEval cannot compute new metrics once a task has been imported, so I suggest running the following to reimport all tasks:
rm storage/database && sqlite3 storage/database < schema.sql
find data -name ".imported" | xargs rm
./bin/watcher.sh
Our demo at http://wmt.ufal.cz uses a different configuration so that the names of BLEU and BLEU-cased match those at http://matrix.statmt.org.
Ondrej
To be sure, I created a new experiment and loaded two tasks, but the problem is still there.
Ok, I will look at it. Thank you.
Hi Nicola,
I found the problem. After changing the configuration file, the cache has to be removed with:
rm -rf temp/cache/*
This is a little bit annoying, and it can be fixed by deleting this line: https://github.com/choko/MT-ComparEval/blob/master/app/bootstrap.php#L7 (for some reason we need that line for our deployments).
I will discuss deleting this line with @martinpopel.
Thank you very much again.
Ondrej
After removing the temp directory everything works fine.
Thanks for your help.
I started using your MT-ComparEval toolkit, which I installed on my machine.
Almost everything works as expected, except for the statistics based on bootstrap sampling. For 'bleu-cis' the plots are correct, whereas for all the other metrics I get empty plots. By adding some logging in "./app/templates/Tasks/compare.latte" I discovered that the variable "data.samples.data.length" is actually 0 (line 215); I suspect that something in the sampling went wrong.
Hence, I dug into the code and logs, and (I think) I understand that the sampling is done once each task is loaded. I suspect that the statistics based on score differences (like paired bootstrap sampling) exploit the sampling mentioned above; is that correct?
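To check that I read the code correctly, here is a minimal sketch of what I assume the paired comparison does with those samples. It is purely illustrative Python, not the toolkit's actual code (the toolkit itself is PHP, and for a corpus-level metric like BLEU one would aggregate n-gram counts per sample rather than simply sum sentence scores as I do here):

import random

def paired_bootstrap(scores_a, scores_b, n_samples=1000, seed=0):
    # Paired bootstrap resampling over per-sentence scores of two systems
    # evaluated on the same test set: draw sentence indices with replacement
    # and count how often system A outscores system B on the resampled set.
    assert len(scores_a) == len(scores_b)
    rng = random.Random(seed)
    n = len(scores_a)
    wins_a = 0
    for _ in range(n_samples):
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(scores_a[i] for i in idx) > sum(scores_b[i] for i in idx):
            wins_a += 1
    return wins_a / n_samples

# e.g. paired_bootstrap([0.3, 0.5, 0.2], [0.25, 0.4, 0.35]) gives the fraction
# of resampled test sets on which system A wins.

If this is roughly what happens, then an empty "data.samples" for a metric would simply mean that no samples were generated for that metric when the task was imported.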
However, it seems that the computation of this score difference is done only for 'bleu-cis', as I see in the watcher log: "Generating BLEU-cis samples for system1."
This is somehow confirmed by the config "app/config/config.neon", where the only metric for which the flag compute_bootstrap is set to true is "bleu-cis", whereas for the rest it is set to false.
I tried to activate this flag for all measures, but I did not see any change.
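To be concrete, the change I tried looks roughly like the following. This is only an illustrative excerpt and the exact nesting of the metric entries in "app/config/config.neon" may differ from what I sketch here, but the relevant flag is compute_bootstrap:

# illustrative excerpt of app/config/config.neon (nesting may differ)
bleu-cis:
    compute_bootstrap: true    # already enabled by default
bleu:
    compute_bootstrap: true    # changed from false
precision:
    compute_bootstrap: true    # changed from false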
Just a final note which may be helpful: compared to the demo at http://wmt.ufal.cz, the list of available metrics differs. In our version we have: brevity-penalty, bleu-cis, bleu, precision, recall, f1-measure (in this order, which is the same as in the config.neon file).
Nicola