test_generate_system_analysis in integration_tests.summarization_test.SummarizationTest is too slow #478

Closed tetsuok closed 1 year ago

tetsuok commented 2 years ago

Commit 8c514c3d81a079d967d208f8bc330c2f202620bb (#437) increases the execution time of integration_tests.summarization_test.SummarizationTest. Measured on my GCP VM, the test's running time increased by 430 seconds (from 6 seconds to 436 seconds), which is too slow to run as an automated test on pull requests. Slow tests need to be removed or replaced with faster, more focused tests. In general, slow tests drain productivity: updating pull requests takes longer, developers work around slow CI by packing large commits into pull requests, and those pull requests become expensive to review, which makes it harder to identify bugs or design flaws in code review.

Repro steps

rm -rf ~/.cache/explainaboard
time python -m unittest -v integration_tests.summarization_test.SummarizationTest

Output

test_datalab_loader (integration_tests.summarization_test.SummarizationTest) ... skipped 'time consuming'
test_default_features_dont_modify_condgen (integration_tests.summarization_test.SummarizationTest) ... ok
test_generate_system_analysis (integration_tests.summarization_test.SummarizationTest) ... WARNING:datalabs.load:Couldn't find a directory or a dataset named 'cnn_dailymail' in this version. It was picked from the master branch on github instead.
WARNING:datalabs.builder:No config specified, defaulting to: cnn_dailymail/3.0.0
WARNING:datalabs.builder:Reusing dataset cnn_dailymail (/home/t/.cache/expressai/datalab/cnn_dailymail/3.0.0/3.0.0/6e2f5d689f0225c4f22eb78d11ba7a21399810c5cb853edafe39b1d006a1ff95)
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 287113/287113 [06:20<00:00, 755.03it/s]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 287113/287113 [00:29<00:00, 9616.19it/s]
INFO:explainaboard:caching stats for cnn_dailymail None
calculating example-level features: 3it [00:00, 51.88it/s]
calculating token-level features: 3it [00:00, 139.83it/s]
/home/t/explainaboard-fork/explainaboard/metrics/metric.py:336: DeprecationWarning: Use of keyword argument `alpha` for method `interval` is deprecated. Use first positional argument or keyword argument `confidence` instead.
  return stats_t.interval(
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 14/14 [00:00<00:00, 349.50it/s]
ok
test_generate_system_human_eval (integration_tests.summarization_test.SummarizationTest) ... skipped 'Not yet fixed in v0.11'
test_load_tsv (integration_tests.summarization_test.SummarizationTest) ... ok

----------------------------------------------------------------------
Ran 5 tests in 438.659s

OK (skipped=2)
python -m unittest -v integration_tests.summarization_test.SummarizationTest  434.35s user 2.58s system 98% cpu 7:22.46 total
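
One common way to keep such heavy tests out of pull-request CI while still allowing them to run on demand is to gate them behind an opt-in environment variable. Below is a minimal sketch of that pattern; the EXPLAINABOARD_RUN_SLOW_TESTS flag is hypothetical, not an existing switch in the code base (the suite currently uses a plain unittest.skip, as the "skipped 'time consuming'" line above shows).

import os
import unittest

# Hypothetical opt-in flag: slow tests run only when it is set, e.g. in a
# nightly job: EXPLAINABOARD_RUN_SLOW_TESTS=1 python -m unittest ...
RUN_SLOW = os.environ.get("EXPLAINABOARD_RUN_SLOW_TESTS") == "1"

class SummarizationTest(unittest.TestCase):
    @unittest.skipUnless(RUN_SLOW, "slow: downloads and scans cnn_dailymail")
    def test_generate_system_analysis(self):
        ...  # existing heavy test body, unchanged
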
tetsuok commented 2 years ago

RFC: @odashi @neubig @pfliu-nlp

odashi commented 2 years ago

Yes, this is a non-negligible degradation and should be fixed.

I also think the current integration tests are basically harmful for rapid development because many of them require downloading large amounts of data from the Internet. I think we have to:

pfliu-nlp commented 2 years ago

@tetsuok Thanks. I have the same feeling and strongly think we need to reorganize the existing integration tests: remove, skip, or slim them down. (BTW, the reason some of the existing tests cost so much is that testing training-set-dependent features requires loading the whole training set. But we could probably add a dummy dataset in DataLab, as sketched below.)
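
To illustrate the dummy-dataset idea, a fast variant of the test could compute training-set-dependent features over a handful of in-memory examples instead of all 287,113 cnn_dailymail records. Here is a minimal, self-contained sketch; the record fields and the load_train_split seam are hypothetical, not DataLab's actual API:

import unittest

# Tiny stand-in for the cnn_dailymail training split (hypothetical fields).
DUMMY_TRAIN = [
    {"text": "The quick brown fox jumps over the lazy dog.",
     "summary": "A fox jumps over a dog."},
    {"text": "Stocks rose on Tuesday after strong earnings reports.",
     "summary": "Stocks rose on earnings."},
]

def load_train_split(dataset_name):
    """Hypothetical seam: in the real suite this function (or a patched
    dataset loader) would return the dummy split so no download happens."""
    return DUMMY_TRAIN

class FastSummarizationTest(unittest.TestCase):
    def test_training_set_features_on_dummy_data(self):
        train = load_train_split("cnn_dailymail")
        # Example of a training-set-dependent statistic: mean summary length.
        avg_len = sum(len(r["summary"].split()) for r in train) / len(train)
        self.assertGreater(avg_len, 0)
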

odashi commented 2 years ago

@pfliu-nlp Having heavy tests is not bad if we really need them to achieve better quality, but they should live in a separate space to avoid placing an unnecessary burden on every pull request. I think it may be better to set up an independent place to run these heavy tests continuously.

neubig commented 2 years ago

As a quick solution, I wonder if it'd be possible to set up the heavy tests to run on CI only when we update version.py?

(EDIT: for now... I agree with @odashi 's solution as a good long-term one)
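
As a sketch of how that trigger could work, pull-request CI could run a small gate script and execute the heavy suite only when it exits 0. This assumes the script runs in a git checkout with an origin/main ref and that the version file lives at explainaboard/version.py; none of this is existing project tooling:

# ci_gate.py: exit 0 (run the heavy tests) only if version.py changed.
import subprocess
import sys

changed = subprocess.run(
    ["git", "diff", "--name-only", "origin/main...HEAD"],
    capture_output=True, text=True, check=True,
).stdout.splitlines()

sys.exit(0 if "explainaboard/version.py" in changed else 1)
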

odashi commented 2 years ago

@neubig I can set up some independent CI using our infrastructure; it shouldn't be too difficult to configure, I think.

As for "only when we update version.py": though I thought that trigger was too optimistic, it can also be configured.

(Related to this topic: @tetsuok and I recently discussed versioning of this repository.)

tetsuok commented 2 years ago

The version of a Python package needs to be defined in a single place. (DataLab currently defines its version in two places.)
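
One widely used single-source pattern (a sketch of the general technique, not necessarily what was adopted later in #490) is to have setup.py read __version__ out of explainaboard/version.py instead of declaring the version a second time:

# setup.py
import os
from setuptools import setup, find_packages

# Execute version.py in an isolated namespace to pick up __version__
# without importing the package (which may have unmet dependencies
# at build time).
version = {}
with open(os.path.join("explainaboard", "version.py")) as f:
    exec(f.read(), version)

setup(
    name="explainaboard",
    version=version["__version__"],
    packages=find_packages(),
)
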

odashi commented 2 years ago

Correct versioning may be required to follow the newest practices around setup.py. I created another issue: #490

@tetsuok Maybe you can work on #490 first.

tetsuok commented 2 years ago

@odashi Sure.

neubig commented 1 year ago

I think this is fixed by #549. @pfliu-nlp and @tetsuok, could you confirm and close the issue if so?

tetsuok commented 1 year ago

Thanks. Verified.

tetsuok commented 1 year ago

Closing.