google-research / bleurt

BLEURT is a metric for Natural Language Generation based on transfer learning.
https://arxiv.org/abs/2004.04696
Apache License 2.0
697 stars 85 forks source link

Answer comparison inaccuracy #16

Closed npb12 closed 3 years ago

npb12 commented 4 years ago

I'm wondering if I'm doing something wrong. Setting the candidate and reference with the command:

python -m bleurt.score   -candidate_file=bleurt/test_data/candidates   -reference_file=bleurt/test_data/references   -bleurt_checkpoint=bleurt/test_checkpoint   -scores_file=scores

I'm comparing the candidate and reference of the following:

"A group of tasks that you monitor as a single unit." "Aggregates a set of tasks and synchronize behaviors on the group."

The result yields: -0.4743865132331848

My experience with NLP is still early, so I apologize if my expectations have been set too high with this.

tsellam commented 3 years ago

Hi, thanks for your message! I am not sure I understand, what was the expected output here?