m-popovic / hjerson

Hjerson: An Open Source Tool for Automatic Error Classification of Machine Translation Output

how to use? #1

Closed genbei closed 3 years ago

genbei commented 3 years ago

Hello, I have a question. What is the difference between `--ref reference` and `--baseref reference.base`? I thought that, given a machine translation and a manually modified translation, I could calculate the error classification.

Also, regarding the usage of the script, for example `python hjerson.py -R data/dev.pe -H data/dev.mt -B TEXT -b TEXT -s -m`: are all four files (`-R`, `-H`, `-B`, `-b`) necessary? I only have two files on hand, `dev.mt` and `dev.pe`.

Looking forward to your reply.

m-popovic commented 3 years ago

Hi, `--ref` is the human reference translation and `--baseref` contains the base forms (lemmas) of that reference. You need a morphological analyser for the target language to get those. If you don't have one, the tool simulates the base forms by taking the first 4 letters of each word. The machine translation output is given by `-H` (the text itself) and `-b` (base forms of the MT output, or the 4-letter approximation).

In your case, you can run `hjerson.py -R dev.pe -H dev.mt` and the tool will truncate the words to 4 letters to provide the missing "base" files. It's a good approximation if you do not have a morphological analyser for your language, but if you do, I advise generating the base forms and using them, since real base forms work better than the approximation.
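If you prefer to pass all four files explicitly, the 4-letter approximation can also be produced ahead of time. This is only a minimal sketch, assuming plain-text files with one whitespace-tokenised sentence per line. The function names `truncate_line` and `write_base_file` and the `.base` output file names are hypothetical and not part of hjerson.

```python
from pathlib import Path


def truncate_line(line, n=4):
    """Approximate base forms by keeping the first n characters of each word."""
    return " ".join(word[:n] for word in line.split())


def write_base_file(src_path, dst_path, n=4):
    """Write a pseudo base-form file: one truncated line per input line."""
    lines = Path(src_path).read_text(encoding="utf-8").splitlines()
    Path(dst_path).write_text(
        "".join(truncate_line(line, n) + "\n" for line in lines),
        encoding="utf-8",
    )


if __name__ == "__main__":
    # dev.pe and dev.mt are the two files mentioned in the question;
    # the ".base" suffix is an assumed naming choice for the outputs.
    for name in ("dev.pe", "dev.mt"):
        if Path(name).exists():
            write_base_file(name, name + ".base")
```

With those files in place, all four inputs can be supplied explicitly: `python hjerson.py -R dev.pe -H dev.mt -B dev.pe.base -b dev.mt.base`.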

genbei commented 3 years ago

The script won't run with `python hjerson.py -R dev.pe -H dev.mt`, as shown in the figure.

[figure]

Only the following usage message appears:

    hjerson.py  -R, --ref      reference
                -H, --hyp      hypothesis
                -B, --baseref  reference.base
                -b, --basehyp  hypothesis.base

    optional inputs:
                -A, --addref   reference.additional
                -a, --addhyp   hypothesis.additional

    optional outputs:
                -s, --sent     file.sent   write sentence error rates
                -m, --html     file.html   write error categories in a html file
                -c, --cats     file.cats   write error categories in a text file

m-popovic commented 3 years ago

That's strange. I've just checked it on my computer and it works: it backs off to four letters.

The following function enables this back-off:

    def take_four_letters(line):
        bline = ""
        words = line.strip().split()
        for w in words:
            # keep only the first four letters of each word
            bline += w[:4] + " "
        return bline

And this checks whether base forms were provided:

    if not (args.reference_base or args.hypothesis_base):
        # no base files given: approximate them from the surface forms
        baserline = take_four_letters(rline)
        basehline = take_four_letters(hline)
    else:
        baserline = args.reference_base.readline()
        basehline = args.hypothesis_base.readline()

Maybe running it with python3 will help?
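To make the check easier to follow in isolation, here is a small self-contained sketch of the same back-off idea. It is not the actual hjerson code: `base_pair` and its parameters are hypothetical names, the truncation helper is rewritten without the trailing space, and, unlike the snippet above, it decides for each side separately, so supplying only one of the two base files would still work.

```python
import io


def take_four_letters(line):
    """Approximate base forms: keep the first four letters of each word."""
    return " ".join(w[:4] for w in line.strip().split())


def base_pair(rline, hline, reference_base=None, hypothesis_base=None):
    """Return (base reference line, base hypothesis line).

    reference_base and hypothesis_base are open file objects or None.
    A missing file is replaced by the four-letter approximation.
    """
    if reference_base is not None:
        baserline = reference_base.readline().strip()
    else:
        baserline = take_four_letters(rline)
    if hypothesis_base is not None:
        basehline = hypothesis_base.readline().strip()
    else:
        basehline = take_four_letters(hline)
    return baserline, basehline


# No base files at all: both sides back off to four letters.
print(base_pair("the houses are big", "the house is bigger"))

# Only the reference has real lemmas (simulated with an in-memory file).
ref_base = io.StringIO("the house be big\n")
print(base_pair("the houses are big", "the house is bigger",
                reference_base=ref_base))
```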

genbei commented 3 years ago

OK, it runs normally now. Maybe there was something wrong with the code I downloaded earlier. Thank you very much for your answer.

m-popovic commented 3 years ago

You're welcome :)