This is another type of test failure I observed in CI.
test_eaas_decomposabiltiy in the TestMetric class in explainaboard/tests/test_metric.py is failing.
The test fails for the metrics "bleu" and "chrf". It would be worth double-checking the test script and the code
that computes these scores. See the output below for details.
test_accuracy (explainaboard.tests.test_metric.TestMetric) ... ok
test_eaas_decomposabiltiy (explainaboard.tests.test_metric.TestMetric) ... WARNING: corpus-level bleu is currently calculated as the average of sentence-level bleu, which is not technically correct. This is a known issue that we are working on: https://github.com/neulab/ExplainaBoard/issues/161
WARNING: corpus-level chrf is currently calculated as the average of sentence-level chrf, which is not technically correct. This is a known issue that we are working on: https://github.com/neulab/ExplainaBoard/issues/161
EaaS: Your request has been sent.
EaaS: Your request has been sent.
Calculating scores.: 0%| | 0/1 [00:00<?, ?it/s]
Calculating scores.: 0%| | 0/1 [00:00<?, ?it/s]
Calculating scores.: 100%|██████████| 1/1 [00:03<00:00, 3.58s/it]
Calculating scores.: 100%|██████████| 1/1 [00:03<00:00, 3.58s/it]
Calculating scores.: 100%|██████████| 1/1 [00:03<00:00, 3.57s/it]
Calculating scores.: 100%|██████████| 1/1 [00:03<00:00, 3.57s/it]
test_f1_macro (explainaboard.tests.test_metric.TestMetric) ... ok
test_f1_micro (explainaboard.tests.test_metric.TestMetric) ... ok
test_hits (explainaboard.tests.test_metric.TestMetric) ... ok
test_mrr (explainaboard.tests.test_metric.TestMetric) ... ok
test_ner_f1 (explainaboard.tests.test_metric.TestMetric) ... ok
test_qa_metrics (explainaboard.tests.test_metric.TestMetric) ...
The dataset hasn't been supported by DataLab so no training set dependent features will
be supported by ExplainaBoard. You can add the dataset by:
https://github.com/ExpressAI/DataLab/blob/main/docs/SDK/add_new_datasets_into_sdk.md
featurizing: 0it [00:00, ?it/s]
featurizing: 1190it [00:00, 141635.12it/s]
bucketing: 0%| | 0/3 [00:00<?, ?it/s]
bucketing: 33%|███▎ | 1/3 [00:00<00:01, 1.87it/s]
bucketing: 67%|██████▋ | 2/3 [00:01<00:00, 1.91it/s]
bucketing: 100%|██████████| 3/3 [00:01<00:00, 1.93it/s]
bucketing: 100%|██████████| 3/3 [00:01<00:00, 1.92it/s]
the information of #context_length#
bucket_interval F1ScoreQA #samples
[31.0,108.0] 0.8266471986900867 302
[109.0,135.0] 0.7978706534189427 307
[136.0,183.0] 0.8203817460545562 301
[186.0,1000000] 0.853220377158347 280
the information of #context_length#
bucket_interval ExactMatchQA #samples
[31.0,108.0] 0.7086092715231788 302
[109.0,135.0] 0.6710097719869706 307
[136.0,183.0] 0.6744186046511628 301
[186.0,1000000] 0.7428571428571429 280
the information of #question_length#
bucket_interval F1ScoreQA #samples
[3.0,9.0] 0.8224365988003545 357
[10.0,12.0] 0.8354749735105328 388
[13.0,16.0] 0.8015834541483496 320
[17.0,1000000] 0.8491959595959596 125
the information of #question_length#
bucket_interval ExactMatchQA #samples
[3.0,9.0] 0.6862745098039216 357
[10.0,12.0] 0.7190721649484536 388
[13.0,16.0] 0.678125 320
[17.0,1000000] 0.72 125
the information of #answer_length#
bucket_interval F1ScoreQA #samples
[1.0,] 0.8670023564981548 357
[2.0,3.0] 0.8389183753889637 481
[4.0,11.0] 0.7699579311537493 305
[12.0,1000000] 0.6926299348288045 47
the information of #answer_length#
bucket_interval ExactMatchQA #samples
[1.0,] 0.7983193277310925 357
[2.0,3.0] 0.7442827442827443 481
[4.0,11.0] 0.5540983606557377 305
[12.0,1000000] 0.40425531914893614 47
ok
======================================================================
FAIL: test_eaas_decomposabiltiy (explainaboard.tests.test_metric.TestMetric) [bleu]
----------------------------------------------------------------------
Traceback (most recent call last):
File "/home/t/ExplainaBoard/explainaboard/tests/test_metric.py", line 147, in test_eaas_decomposabiltiy
metric.evaluate_from_stats(full_stats).value,
AssertionError: 33.111963381862374 != 31.397278899799424 within 7 places (1.7146844820629497 difference)
======================================================================
FAIL: test_eaas_decomposabiltiy (explainaboard.tests.test_metric.TestMetric) [chrf]
----------------------------------------------------------------------
Traceback (most recent call last):
File "/home/t/ExplainaBoard/explainaboard/tests/test_metric.py", line 147, in test_eaas_decomposabiltiy
metric.evaluate_from_stats(full_stats).value,
AssertionError: 52.8149705325668 != 51.484112387583785 within 7 places (1.3308581449830115 difference)
----------------------------------------------------------------------
Ran 8 tests in 14.647s
FAILED (failures=2)
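The warnings in the output say that corpus-level bleu and chrf are currently computed as the average of sentence-level scores. That might be related to the mismatch, since averaging per-sentence ratios generally does not equal the score computed from pooled statistics. Here is a minimal sketch of that effect. It does not use ExplainaBoard or EaaS code: the toy unigram-precision metric and the names `sentence_stats` and `score_from_stats` are hypothetical stand-ins for BLEU/chrF and their sufficient statistics.

```python
from typing import List, Tuple


def sentence_stats(hyp: str, ref: str) -> Tuple[int, int]:
    """Return (matched unigrams, hypothesis length) for one sentence pair."""
    remaining = ref.split()
    matches = 0
    for tok in hyp.split():
        if tok in remaining:
            remaining.remove(tok)  # each reference token can match only once
            matches += 1
    return matches, len(hyp.split())


def score_from_stats(matches: int, total: int) -> float:
    """Toy metric (unigram precision) computed from sufficient statistics."""
    return matches / total if total else 0.0


# Hypothetical data: one short perfect hypothesis, one long noisy one.
pairs: List[Tuple[str, str]] = [
    ("the cat", "the cat"),
    ("a dog sat on the mat today", "the dog sat"),
]

stats = [sentence_stats(hyp, ref) for hyp, ref in pairs]

# Option 1: average of sentence-level scores.
averaged = sum(score_from_stats(m, t) for m, t in stats) / len(stats)

# Option 2: pool the statistics first, then score once.
corpus = score_from_stats(sum(m for m, _ in stats), sum(t for _, t in stats))

print(f"averaged={averaged:.4f} corpus={corpus:.4f}")
```

The two numbers differ whenever sentence lengths differ, so an equality check within 7 decimal places between the two kinds of aggregation would fail. Whether this is what actually happens in the test at line 147 still needs to be confirmed against the test code.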