neulab / ExplainaBoard

Interpretable Evaluation for AI Systems

test_eaas_decomposabiltiy in TestMetric fails #219

Closed: tetsuok closed this issue 2 years ago

tetsuok commented 2 years ago

This is another type of test failure I observed in CI. test_eaas_decomposabiltiy in the TestMetric class in explainaboard/tests/test_metric.py is failing. The test fails when the metric is "bleu" or "chrf". It seems worth double-checking the test script and the code that computes these scores. See the output below for details.
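For context, the failing assertion (see the traceback below) compares a metric value computed from the full set of sufficient statistics against one computed after splitting and re-merging those statistics. Here is a minimal, self-contained sketch of that decomposability property with a hypothetical toy metric; it is not the actual ExplainaBoard test code, which uses metric.evaluate_from_stats(...).value on real EaaS metrics:

```python
import unittest


class ToyAccuracy:
    """Toy metric whose sufficient statistic is a list of 0/1 values."""

    def evaluate_from_stats(self, stats):
        return sum(stats) / len(stats)


class DecomposabilitySketch(unittest.TestCase):
    def test_decomposable(self):
        metric = ToyAccuracy()
        full_stats = [1, 0, 1, 1, 0, 1]
        # Split the statistics into batches, then merge them back.
        split_stats = [full_stats[:3], full_stats[3:]]
        merged = [s for split in split_stats for s in split]
        # Same assertion shape as the failing test: equal to 7 places.
        # A decomposable metric must give the same answer both ways.
        self.assertAlmostEqual(
            metric.evaluate_from_stats(full_stats),
            metric.evaluate_from_stats(merged),
            places=7,
        )


if __name__ == "__main__":
    unittest.main()
```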

Environment

Repro steps

git clone git@github.com:tetsuok/ExplainaBoard.git
cd ExplainaBoard
python3 -m venv venv
source venv/bin/activate
python -m pip install --upgrade pip
pip install .
python -m spacy download en_core_web_sm
python -m nltk.downloader punkt
python -m unittest -v explainaboard/tests/test_metric.py

Output

test_accuracy (explainaboard.tests.test_metric.TestMetric) ... ok
test_eaas_decomposabiltiy (explainaboard.tests.test_metric.TestMetric) ... WARNING: corpus-level bleu is currently calculated as the average of sentence-level bleu, which is not technically correct. This is a known issue that we are working on: https://github.com/neulab/ExplainaBoard/issues/161
WARNING: corpus-level chrf is currently calculated as the average of sentence-level chrf, which is not technically correct. This is a known issue that we are working on: https://github.com/neulab/ExplainaBoard/issues/161
EaaS: Your request has been sent.
EaaS: Your request has been sent.

Calculating scores.: 100%|██████████| 1/1 [00:03<00:00,  3.58s/it]
Calculating scores.: 100%|██████████| 1/1 [00:03<00:00,  3.57s/it]
test_f1_macro (explainaboard.tests.test_metric.TestMetric) ... ok
test_f1_micro (explainaboard.tests.test_metric.TestMetric) ... ok
test_hits (explainaboard.tests.test_metric.TestMetric) ... ok
test_mrr (explainaboard.tests.test_metric.TestMetric) ... ok
test_ner_f1 (explainaboard.tests.test_metric.TestMetric) ... ok
test_qa_metrics (explainaboard.tests.test_metric.TestMetric) ... 
The dataset hasn't been supported by DataLab so no training set dependent features will
be supported by ExplainaBoard. You can add the dataset by:
https://github.com/ExpressAI/DataLab/blob/main/docs/SDK/add_new_datasets_into_sdk.md

featurizing: 1190it [00:00, 141635.12it/s]

bucketing: 100%|██████████| 3/3 [00:01<00:00,  1.92it/s]
the information of #context_length#
bucket_interval F1ScoreQA   #samples
[31.0,108.0]    0.8266471986900867  302
[109.0,135.0]   0.7978706534189427  307
[136.0,183.0]   0.8203817460545562  301
[186.0,1000000] 0.853220377158347   280

the information of #context_length#
bucket_interval ExactMatchQA    #samples
[31.0,108.0]    0.7086092715231788  302
[109.0,135.0]   0.6710097719869706  307
[136.0,183.0]   0.6744186046511628  301
[186.0,1000000] 0.7428571428571429  280

the information of #question_length#
bucket_interval F1ScoreQA   #samples
[3.0,9.0]   0.8224365988003545  357
[10.0,12.0] 0.8354749735105328  388
[13.0,16.0] 0.8015834541483496  320
[17.0,1000000]  0.8491959595959596  125

the information of #question_length#
bucket_interval ExactMatchQA    #samples
[3.0,9.0]   0.6862745098039216  357
[10.0,12.0] 0.7190721649484536  388
[13.0,16.0] 0.678125    320
[17.0,1000000]  0.72    125

the information of #answer_length#
bucket_interval F1ScoreQA   #samples
[1.0,]  0.8670023564981548  357
[2.0,3.0]   0.8389183753889637  481
[4.0,11.0]  0.7699579311537493  305
[12.0,1000000]  0.6926299348288045  47

the information of #answer_length#
bucket_interval ExactMatchQA    #samples
[1.0,]  0.7983193277310925  357
[2.0,3.0]   0.7442827442827443  481
[4.0,11.0]  0.5540983606557377  305
[12.0,1000000]  0.40425531914893614 47

ok

======================================================================
FAIL: test_eaas_decomposabiltiy (explainaboard.tests.test_metric.TestMetric) [bleu]
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/t/ExplainaBoard/explainaboard/tests/test_metric.py", line 147, in test_eaas_decomposabiltiy
    metric.evaluate_from_stats(full_stats).value,
AssertionError: 33.111963381862374 != 31.397278899799424 within 7 places (1.7146844820629497 difference)

======================================================================
FAIL: test_eaas_decomposabiltiy (explainaboard.tests.test_metric.TestMetric) [chrf]
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/t/ExplainaBoard/explainaboard/tests/test_metric.py", line 147, in test_eaas_decomposabiltiy
    metric.evaluate_from_stats(full_stats).value,
AssertionError: 52.8149705325668 != 51.484112387583785 within 7 places (1.3308581449830115 difference)

----------------------------------------------------------------------
Ran 8 tests in 14.647s

FAILED (failures=2)
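The warnings at the top of the output point at the likely cause: corpus-level bleu and chrf are currently computed as the average of sentence-level scores (issue #161), and BLEU's pooled n-gram counts and brevity penalty do not decompose into a simple mean over sentences. A quick sacrebleu illustration with made-up data (the EaaS backend's exact configuration may differ):

```python
import sacrebleu

# Made-up hypotheses and references, purely for illustration.
hyps = ["the cat sat on the mat", "dogs bark loudly at night"]
refs = ["the cat is on the mat", "the dog barks loudly at night"]

# True corpus-level BLEU: n-gram counts are pooled before scoring.
corpus = sacrebleu.corpus_bleu(hyps, [refs]).score

# Average of per-sentence BLEU: what the warning says is used today.
sentence_avg = sum(
    sacrebleu.sentence_bleu(h, [r]).score for h, r in zip(hyps, refs)
) / len(hyps)

# The two generally differ, so an equality-to-7-places assertion on
# this kind of aggregation fails, as seen above for bleu and chrf.
print(corpus, sentence_avg)
```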
pfliu-nlp commented 2 years ago

Aha, yes, these two failures are also expected; we haven't fixed them yet since we're waiting for other functionality to be implemented first. But thank you a lot!

neubig commented 2 years ago

Fixed through #220