Before using EaaS, please see the terms of use. Detailed documentation can be found here. To install EaaS, simply run:
pip install eaas
A minimal EaaS application looks something like this:
from eaas import Config, Client

# Create a client with the default configuration.
client = Client(Config())

# Each input is a dict with a source, a list of references, and a hypothesis.
inputs = [{
    "source": "Hello, my world",
    "references": ["Hello, world", "Hello my world"],
    "hypothesis": "Hi, my world"
}]

# Request scores for the chosen metrics.
metrics = ["rouge1", "bleu", "chrf"]
score_dic = client.score(inputs, metrics=metrics)
print(score_dic)
If eaas has been installed successfully, you should get the results below by printing `score_dic`. Each entry corresponds to one of the metrics passed to `metrics` (in the same order). The `corpus` entry holds the corpus-level score, and the `sample` entry is a list of sample-level scores:
score_dic = {'scores':
[
{'corpus': 0.6666666666666666, 'sample': [0.6666666666666666]},
{'corpus': 0.35355339059327373, 'sample': [0.35355339059327373]},
{'corpus': 0.4900623006253688, 'sample': [0.4900623006253688]}
]
}
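Since the entries in `score_dic["scores"]` follow the order of the `metrics` list, you can pair each metric name with its scores directly. This is just an illustrative snippet using standard Python, not part of the EaaS API:
# Pair each metric name with its returned scores; the order matches `metrics`.
for name, entry in zip(metrics, score_dic["scores"]):
    print(name, "corpus:", entry["corpus"], "sample:", entry["sample"])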
Notably, each input is a dict with three keys: `source` (string, optional), `references` (list of strings, optional), and `hypothesis` (string, required). `source` and `references` are optional depending on the metrics you want to use.
Currently, EaaS supports the following metrics:
- `bart_score_en_ref`: BARTScore is a sequence-to-sequence framework based on the pre-trained language model BART. `bart_score_cnn_hypo_ref` uses the CNNDM-finetuned BART. It calculates the average generation score of Score(hypothesis|reference) and Score(reference|hypothesis).
- `bart_score_en_src`: BARTScore using the CNNDM-finetuned BART. It calculates Score(hypothesis|source).
- `bert_score_p`: BERTScore is a metric designed for evaluating translated text using a BERT-based matching framework. `bert_score_p` calculates the BERTScore precision.
- `bert_score_r`: BERTScore recall.
- `bert_score_f`: BERTScore F score.
- `bleu`: BLEU measures modified n-gram matches between each candidate translation and the reference translations.
- `chrf`: chrF measures the character-level n-gram matches between hypothesis and reference.
- `comet`: COMET is a neural framework for training multilingual machine translation evaluation models. `comet` uses the wmt20-comet-da checkpoint, which utilizes source, hypothesis, and reference.
- `comet_qe`: COMET for quality estimation. `comet_qe` uses the wmt20-comet-qe-da checkpoint, which utilizes only source and hypothesis.
- `mover_score`: MoverScore is a metric similar to BERTScore. Different from BERTScore, it uses the Earth Mover's Distance instead of the Euclidean Distance.
- `prism`: PRISM is a sequence-to-sequence framework trained from scratch. `prism` calculates the average generation score of Score(hypothesis|reference) and Score(reference|hypothesis).
- `prism_qe`: PRISM for quality estimation. It calculates Score(hypothesis|source).
- `rouge1`: ROUGE-1 refers to the overlap of unigrams (each word) between the system and reference summaries.
- `rouge2`: ROUGE-2 refers to the overlap of bigrams between the system and reference summaries.
- `rougeL`: ROUGE-L refers to the longest common subsequence between the system and reference summaries.
The default configuration for each metric can be found in this doc.
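Since `comet_qe` uses only the source and hypothesis, the `references` field can be left out of the inputs for that metric. Below is a minimal sketch reusing the `client` from above; whether omitted fields must be dropped entirely or passed as empty values is an assumption here:
# comet_qe is a quality-estimation metric: per the list above it only needs
# `source` and `hypothesis`, so `references` is omitted from each input dict.
qe_inputs = [{
    "source": "Hello, my world",
    "hypothesis": "Hi, my world"
}]
qe_scores = client.score(qe_inputs, metrics=["comet_qe"])
print(qe_scores)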
If you want to send a request to the EaaS server to calculate some metrics and continue local computation while waiting for the result, you can use the asynchronous client:
from eaas import Config
from eaas.async_client import AsyncClient

config = Config()
client = AsyncClient(config)

inputs = ...
# async_score returns a request handle without waiting for the scores.
req = client.async_score(inputs, metrics=["bleu"])
# do some other computation
# Retrieve the scores once they are needed.
result = req.get_result()