Answer comparison inaccuracy

I'm wondering if I'm doing something wrong. I pass the candidate and reference files to BLEURT with this command:

python -m bleurt.score \
  -candidate_file=bleurt/test_data/candidates \
  -reference_file=bleurt/test_data/references \
  -bleurt_checkpoint=bleurt/test_checkpoint \
  -scores_file=scores

I'm comparing the following candidate and reference:

"A group of tasks that you monitor as a single unit." "Aggregates a set of tasks and synchronize behaviors on the group."

The resulting score is: -0.4743865132331848
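
For anyone trying to reproduce this, here is a minimal sketch of how a custom pair could be laid out for the command above. It assumes the input format is one sentence per line, with line i of the candidate file paired with line i of the reference file. The file names `my_candidates` and `my_references` and both helper functions are hypothetical and not part of BLEURT; only the flags are taken from the command above.

```python
import sys
from pathlib import Path


def write_pair_files(candidates, references, out_dir="."):
    """Write parallel candidate/reference files, one sentence per line.

    The file names are hypothetical; any paths can be used as long as
    line i of one file corresponds to line i of the other.
    """
    if len(candidates) != len(references):
        raise ValueError("candidates and references must have the same length")
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    cand_path = out / "my_candidates"
    ref_path = out / "my_references"
    cand_path.write_text("\n".join(candidates) + "\n", encoding="utf-8")
    ref_path.write_text("\n".join(references) + "\n", encoding="utf-8")
    return cand_path, ref_path


def build_command(cand_path, ref_path,
                  checkpoint="bleurt/test_checkpoint",
                  scores_file="scores"):
    """Build the argument list for the scoring command (flags as in the post)."""
    return [
        sys.executable, "-m", "bleurt.score",
        f"-candidate_file={cand_path}",
        f"-reference_file={ref_path}",
        f"-bleurt_checkpoint={checkpoint}",
        f"-scores_file={scores_file}",
    ]


if __name__ == "__main__":
    cand, ref = write_pair_files(
        ["A group of tasks that you monitor as a single unit."],
        ["Aggregates a set of tasks and synchronize behaviors on the group."],
    )
    # Print the command rather than running it, so the sketch works
    # even where BLEURT is not installed.
    print(" ".join(build_command(cand, ref)))
```

Running the printed command from the repository root should write one score per input line to the `scores` file.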

I'm still new to NLP, so I apologize if my expectations for this were set too high.

google-research / bleurt, issue #16