JoelNiklaus opened 3 days ago
Hi @JoelNiklaus, thanks a lot for this PR, I'll take a deeper look hopefully before Monday
(it's looking good from a first glance but I want to take some time to test it deeply)
Hi! You get quite a huge difference in the bootstrap stderr compared to the results we hardcoded in our test suite (like, an order of magnitude) - can you check why? (I was expecting a diff in a few decimal places, not something this huge)
I am trying to run your tests with `python -m pytest tests/test_main.py`, but it just hangs.
It should take some time (around 30 min if you're on CPU), as it first needs to generate a bunch of predictions using a gpt2 model. It will be way faster if you have a GPU available.
I aborted it after 30 min on an A100 GPU, and I only ran the lite version.
Which metrics did you check? BLEU or CHRF? For those, the original version computes corpus-level metrics; I switched to sample-level metrics to speed up computation. There it would make sense to me that the stderr differs, since each bootstrap resample yields a different corpus. For the metrics that are already sample-level, I don't see a reason why they should differ.
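
To illustrate what I mean, here's a minimal toy sketch (not lighteval's actual code; `corpus_metric` and the random scores are just hypothetical stand-ins) of how the bootstrap stderr of a corpus-level metric can differ from that of a sample-level one:

```python
# Toy sketch contrasting bootstrap stderr for a corpus-level metric
# vs. a sample-level (per-example, then averaged) metric.
import random
import statistics

random.seed(0)
scores = [random.random() for _ in range(100)]  # stand-in per-sample scores


def corpus_metric(sample_scores):
    # Hypothetical corpus-level aggregation (e.g. how corpus BLEU pools
    # statistics before dividing) -- a placeholder that is not a plain mean.
    return sum(s ** 2 for s in sample_scores) / len(sample_scores)


def bootstrap_stderr(metric_fn, sample_scores, iters=1000):
    # Resample the corpus with replacement, recompute the metric each time;
    # the std-dev of the resampled values estimates the standard error.
    values = []
    for _ in range(iters):
        resample = random.choices(sample_scores, k=len(sample_scores))
        values.append(metric_fn(resample))
    return statistics.stdev(values)


# Sample-level metric: just the mean of per-example scores.
print("sample-level stderr:", bootstrap_stderr(statistics.mean, scores))
# Corpus-level metric: recomputed on every resampled corpus.
print("corpus-level stderr:", bootstrap_stderr(corpus_metric, scores))
```

With a true per-sample metric the bootstrap only resamples the per-example scores, whereas a corpus-level metric is recomputed on each resampled corpus, so the two standard errors need not match.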
30 min on an A100 is not normal. I wonder if there's an issue with the command you're running. Let me share the raw logs with you. (Do you have the rights to access them, by clicking on "Details" next to the failing test in the checks list?)
2024-11-29T07:08:27.7720188Z ##[endgroup]
2024-11-29T07:08:34.7243761Z ============================= test session starts ==============================
2024-11-29T07:08:34.7245155Z platform linux -- Python 3.10.15, pytest-7.4.0, pluggy-1.5.0
2024-11-29T07:08:34.7245990Z rootdir: /home/runner/work/lighteval/lighteval
2024-11-29T07:08:34.7246821Z plugins: anyio-4.6.2.post1
2024-11-29T07:08:34.7247425Z collected 576 items
2024-11-29T07:08:34.7247713Z
2024-11-29T07:43:28.1461575Z tests/test_main.py .F...F.F.F.F.F.F.F.F...F.F.F.F.F.F.F.F.F.F.F.F.F.F.F. [ 9%]
2024-11-29T07:43:28.2989743Z F.F.F.F.F.F.F.F.F.F.F.F.F.F.F.F.F.F.F...F.F.F.F.F.F...F.F.F.F.F.F.F.F.F. [ 21%]
2024-11-29T07:43:28.4408527Z F.F.F.F.F.F.F.F.F.F.F.F.F.F.F.F.F.F.F.F.F.F...F.F.F.F.F.F.F...F.F.F [ 33%]
2024-11-29T07:43:28.6269725Z tests/test_unit_base_metrics.py .........sss [ 35%]
2024-11-29T07:43:28.6296287Z tests/test_unit_harness_metrics.py . [ 35%]
2024-11-29T07:43:28.6317847Z tests/test_unit_harness_prompts.py . [ 35%]
2024-11-29T07:43:28.6340754Z tests/test_unit_harness_metrics.py . [ 35%]
2024-11-29T07:43:28.6357885Z tests/test_unit_harness_prompts.py . [ 36%]
2024-11-29T07:43:28.6380080Z tests/test_unit_harness_metrics.py . [ 36%]
2024-11-29T07:43:28.6417832Z tests/test_unit_harness_prompts.py . [ 36%]
2024-11-29T07:43:28.6442610Z tests/test_unit_harness_metrics.py . [ 36%]
2024-11-29T07:43:28.6636677Z tests/test_unit_harness_prompts.py . [ 36%]
2024-11-29T07:43:28.6657953Z tests/test_unit_harness_metrics.py . [ 36%]
2024-11-29T07:43:28.6671555Z tests/test_unit_harness_prompts.py . [ 37%]
2024-11-29T07:43:28.6693571Z tests/test_unit_harness_metrics.py . [ 37%]
2024-11-29T07:43:28.6709853Z tests/test_unit_harness_prompts.py . [ 37%]
2024-11-29T07:43:28.6733091Z tests/test_unit_harness_metrics.py . [ 37%]
2024-11-29T07:43:28.6745199Z tests/test_unit_harness_prompts.py . [ 37%]
2024-11-29T07:43:28.6766346Z tests/test_unit_harness_metrics.py . [ 38%]
2024-11-29T07:43:28.6780038Z tests/test_unit_harness_prompts.py . [ 38%]
2024-11-29T07:43:28.6801796Z tests/test_unit_harness_metrics.py . [ 38%]
2024-11-29T07:43:28.6814506Z tests/test_unit_harness_prompts.py . [ 38%]
2024-11-29T07:43:28.6837702Z tests/test_unit_harness_metrics.py . [ 38%]
2024-11-29T07:43:28.6850587Z tests/test_unit_harness_prompts.py . [ 38%]
2024-11-29T07:43:28.6871904Z tests/test_unit_harness_metrics.py . [ 39%]
2024-11-29T07:43:28.6885430Z tests/test_unit_harness_prompts.py . [ 39%]
2024-11-29T07:43:28.6908359Z tests/test_unit_harness_metrics.py . [ 39%]
2024-11-29T07:43:28.6921181Z tests/test_unit_harness_prompts.py . [ 39%]
2024-11-29T07:43:28.6942161Z tests/test_unit_harness_metrics.py . [ 39%]
2024-11-29T07:43:28.6958442Z tests/test_unit_harness_prompts.py . [ 39%]
2024-11-29T07:43:28.6980167Z tests/test_unit_harness_metrics.py . [ 40%]
2024-11-29T07:43:28.6992955Z tests/test_unit_harness_prompts.py . [ 40%]
2024-11-29T07:43:28.7016468Z tests/test_unit_harness_metrics.py . [ 40%]
2024-11-29T07:43:28.7029699Z tests/test_unit_harness_prompts.py . [ 40%]
2024-11-29T07:43:28.7052872Z tests/test_unit_harness_metrics.py . [ 40%]
2024-11-29T07:43:28.7065482Z tests/test_unit_harness_prompts.py . [ 40%]
2024-11-29T07:43:28.7089440Z tests/test_unit_harness_metrics.py . [ 41%]
2024-11-29T07:43:28.7102445Z tests/test_unit_harness_prompts.py . [ 41%]
2024-11-29T07:43:28.7126781Z tests/test_unit_harness_metrics.py . [ 41%]
2024-11-29T07:43:28.7139341Z tests/test_unit_harness_prompts.py . [ 41%]
2024-11-29T07:43:28.7165058Z tests/test_unit_harness_metrics.py . [ 41%]
2024-11-29T07:43:28.7178028Z tests/test_unit_harness_prompts.py . [ 42%]
2024-11-29T07:43:28.7202080Z tests/test_unit_harness_metrics.py . [ 42%]
2024-11-29T07:43:28.7218747Z tests/test_unit_harness_prompts.py . [ 42%]
2024-11-29T07:43:28.7241644Z tests/test_unit_harness_metrics.py . [ 42%]
2024-11-29T07:43:28.7255891Z tests/test_unit_harness_prompts.py . [ 42%]
2024-11-29T07:43:28.7278849Z tests/test_unit_harness_metrics.py . [ 42%]
2024-11-29T07:43:28.7294533Z tests/test_unit_harness_prompts.py . [ 43%]
2024-11-29T07:43:28.7330266Z tests/test_unit_harness_metrics.py . [ 43%]
2024-11-29T07:43:28.7346074Z tests/test_unit_harness_prompts.py . [ 43%]
2024-11-29T07:43:28.7381971Z tests/test_unit_harness_metrics.py . [ 43%]
2024-11-29T07:43:28.7395807Z tests/test_unit_harness_prompts.py . [ 43%]
2024-11-29T07:43:28.7431748Z tests/test_unit_harness_metrics.py . [ 43%]
2024-11-29T07:43:28.7464638Z tests/test_unit_harness_prompts.py . [ 44%]
2024-11-29T07:43:28.7488992Z tests/test_unit_harness_metrics.py . [ 44%]
2024-11-29T07:43:28.7504537Z tests/test_unit_harness_prompts.py . [ 44%]
2024-11-29T07:43:28.7527730Z tests/test_unit_harness_metrics.py . [ 44%]
2024-11-29T07:43:28.7540260Z tests/test_unit_harness_prompts.py . [ 44%]
2024-11-29T07:43:28.7562518Z tests/test_unit_harness_metrics.py . [ 44%]
2024-11-29T07:43:28.7574965Z tests/test_unit_harness_prompts.py . [ 45%]
2024-11-29T07:43:28.7596202Z tests/test_unit_harness_metrics.py . [ 45%]
2024-11-29T07:43:28.7609130Z tests/test_unit_harness_prompts.py . [ 45%]
2024-11-29T07:43:28.7630209Z tests/test_unit_harness_metrics.py . [ 45%]
2024-11-29T07:43:28.7643498Z tests/test_unit_harness_prompts.py . [ 45%]
2024-11-29T07:43:28.7665104Z tests/test_unit_harness_metrics.py . [ 46%]
2024-11-29T07:43:28.7678031Z tests/test_unit_harness_prompts.py . [ 46%]
2024-11-29T07:43:28.7699259Z tests/test_unit_harness_metrics.py . [ 46%]
2024-11-29T07:43:28.7713917Z tests/test_unit_harness_prompts.py . [ 46%]
2024-11-29T07:43:28.7735261Z tests/test_unit_harness_metrics.py . [ 46%]
2024-11-29T07:43:28.7748116Z tests/test_unit_harness_prompts.py . [ 46%]
2024-11-29T07:43:28.7769310Z tests/test_unit_harness_metrics.py . [ 47%]
2024-11-29T07:43:28.7782113Z tests/test_unit_harness_prompts.py . [ 47%]
2024-11-29T07:43:28.7805247Z tests/test_unit_harness_metrics.py . [ 47%]
2024-11-29T07:43:28.7818063Z tests/test_unit_harness_prompts.py . [ 47%]
2024-11-29T07:43:28.7839104Z tests/test_unit_harness_metrics.py . [ 47%]
2024-11-29T07:43:28.7852034Z tests/test_unit_harness_prompts.py . [ 47%]
2024-11-29T07:43:28.7873028Z tests/test_unit_harness_metrics.py . [ 48%]
2024-11-29T07:43:28.7886250Z tests/test_unit_harness_prompts.py . [ 48%]
2024-11-29T07:43:28.7912642Z tests/test_unit_harness_metrics.py . [ 48%]
2024-11-29T07:43:28.8116588Z tests/test_unit_harness_prompts.py . [ 48%]
2024-11-29T07:43:28.8139064Z tests/test_unit_harness_metrics.py . [ 48%]
2024-11-29T07:43:28.8160162Z tests/test_unit_harness_prompts.py . [ 48%]
2024-11-29T07:43:28.8181821Z tests/test_unit_harness_metrics.py . [ 49%]
2024-11-29T07:43:28.8194800Z tests/test_unit_harness_prompts.py . [ 49%]
2024-11-29T07:43:28.8216495Z tests/test_unit_harness_metrics.py . [ 49%]
2024-11-29T07:43:28.8229316Z tests/test_unit_harness_prompts.py . [ 49%]
2024-11-29T07:43:28.8250711Z tests/test_unit_harness_metrics.py . [ 49%]
2024-11-29T07:43:28.8263436Z tests/test_unit_harness_prompts.py . [ 50%]
2024-11-29T07:43:28.8285262Z tests/test_unit_harness_metrics.py . [ 50%]
2024-11-29T07:43:28.8299482Z tests/test_unit_harness_prompts.py . [ 50%]
2024-11-29T07:43:28.8321526Z tests/test_unit_harness_metrics.py . [ 50%]
2024-11-29T07:43:28.8356679Z tests/test_unit_harness_prompts.py . [ 50%]
2024-11-29T07:43:28.8378065Z tests/test_unit_harness_metrics.py . [ 50%]
2024-11-29T07:43:28.8398278Z tests/test_unit_harness_prompts.py . [ 51%]
2024-11-29T07:43:28.8419146Z tests/test_unit_harness_metrics.py . [ 51%]
2024-11-29T07:43:28.8433857Z tests/test_unit_harness_prompts.py . [ 51%]
2024-11-29T07:43:28.8455390Z tests/test_unit_harness_metrics.py . [ 51%]
2024-11-29T07:43:28.8468444Z tests/test_unit_harness_prompts.py . [ 51%]
2024-11-29T07:43:28.8490077Z tests/test_unit_harness_metrics.py . [ 51%]
2024-11-29T07:43:28.8502997Z tests/test_unit_harness_prompts.py . [ 52%]
2024-11-29T07:43:28.8524713Z tests/test_unit_harness_metrics.py . [ 52%]
2024-11-29T07:43:28.8537241Z tests/test_unit_harness_prompts.py . [ 52%]
2024-11-29T07:43:28.8558792Z tests/test_unit_harness_metrics.py . [ 52%]
2024-11-29T07:43:28.8575050Z tests/test_unit_harness_prompts.py . [ 52%]
2024-11-29T07:43:28.8596437Z tests/test_unit_harness_metrics.py . [ 52%]
2024-11-29T07:43:28.8610039Z tests/test_unit_harness_prompts.py . [ 53%]
2024-11-29T07:43:28.8631025Z tests/test_unit_harness_metrics.py . [ 53%]
2024-11-29T07:43:28.8644056Z tests/test_unit_harness_prompts.py . [ 53%]
2024-11-29T07:43:28.8665090Z tests/test_unit_harness_metrics.py . [ 53%]
2024-11-29T07:43:28.8678070Z tests/test_unit_harness_prompts.py . [ 53%]
2024-11-29T07:43:28.8699420Z tests/test_unit_harness_metrics.py . [ 53%]
2024-11-29T07:43:28.8712319Z tests/test_unit_harness_prompts.py . [ 54%]
2024-11-29T07:43:28.8733862Z tests/test_unit_harness_metrics.py . [ 54%]
2024-11-29T07:43:28.9868071Z tests/test_unit_harness_prompts.py . [ 54%]
2024-11-29T07:43:28.9890873Z tests/test_unit_harness_metrics.py . [ 54%]
2024-11-29T07:43:28.9903844Z tests/test_unit_harness_prompts.py . [ 54%]
2024-11-29T07:43:28.9927274Z tests/test_unit_harness_metrics.py . [ 55%]
2024-11-29T07:43:28.9942995Z tests/test_unit_harness_prompts.py . [ 55%]
2024-11-29T07:43:28.9964596Z tests/test_unit_harness_metrics.py . [ 55%]
2024-11-29T07:43:28.9981129Z tests/test_unit_harness_prompts.py . [ 55%]
2024-11-29T07:43:29.0002443Z tests/test_unit_harness_metrics.py . [ 55%]
2024-11-29T07:43:29.0024719Z tests/test_unit_harness_prompts.py . [ 55%]
2024-11-29T07:43:29.0046243Z tests/test_unit_harness_metrics.py . [ 56%]
2024-11-29T07:43:29.0059105Z tests/test_unit_harness_prompts.py . [ 56%]
2024-11-29T07:43:29.0080273Z tests/test_unit_harness_metrics.py . [ 56%]
2024-11-29T07:43:29.0165262Z tests/test_unit_harness_prompts.py . [ 56%]
2024-11-29T07:43:29.0186303Z tests/test_unit_harness_metrics.py . [ 56%]
2024-11-29T07:43:29.0216954Z tests/test_unit_harness_prompts.py . [ 56%]
2024-11-29T07:43:29.0238102Z tests/test_unit_harness_metrics.py . [ 57%]
2024-11-29T07:43:29.0251189Z tests/test_unit_harness_prompts.py . [ 57%]
2024-11-29T07:43:29.0272105Z tests/test_unit_harness_metrics.py . [ 57%]
2024-11-29T07:43:29.0285383Z tests/test_unit_harness_prompts.py . [ 57%]
2024-11-29T07:43:29.0306486Z tests/test_unit_harness_metrics.py . [ 57%]
2024-11-29T07:43:29.0319497Z tests/test_unit_harness_prompts.py . [ 57%]
2024-11-29T07:43:29.0340447Z tests/test_unit_harness_metrics.py . [ 58%]
2024-11-29T07:43:29.0353097Z tests/test_unit_harness_prompts.py . [ 58%]
2024-11-29T07:43:29.0374131Z tests/test_unit_harness_metrics.py . [ 58%]
2024-11-29T07:43:29.0386914Z tests/test_unit_harness_prompts.py . [ 58%]
2024-11-29T07:43:29.0408470Z tests/test_unit_harness_metrics.py . [ 58%]
2024-11-29T07:43:29.0420934Z tests/test_unit_harness_prompts.py . [ 59%]
2024-11-29T07:43:29.0441956Z tests/test_unit_harness_metrics.py . [ 59%]
2024-11-29T07:43:29.0454504Z tests/test_unit_harness_prompts.py . [ 59%]
2024-11-29T07:43:29.0475302Z tests/test_unit_harness_metrics.py . [ 59%]
2024-11-29T07:43:29.0488403Z tests/test_unit_harness_prompts.py . [ 59%]
2024-11-29T07:43:29.0510043Z tests/test_unit_harness_metrics.py . [ 59%]
2024-11-29T07:43:29.0523273Z tests/test_unit_harness_prompts.py . [ 60%]
2024-11-29T07:43:29.0543852Z tests/test_unit_harness_metrics.py . [ 60%]
2024-11-29T07:43:29.0556996Z tests/test_unit_harness_prompts.py . [ 60%]
2024-11-29T07:43:29.0577935Z tests/test_unit_harness_metrics.py . [ 60%]
2024-11-29T07:43:29.0590646Z tests/test_unit_harness_prompts.py . [ 60%]
2024-11-29T07:43:29.0611937Z tests/test_unit_harness_metrics.py . [ 60%]
2024-11-29T07:43:29.0625526Z tests/test_unit_harness_prompts.py . [ 61%]
2024-11-29T07:43:29.0646575Z tests/test_unit_harness_metrics.py . [ 61%]
2024-11-29T07:43:29.0691670Z tests/test_unit_harness_prompts.py . [ 61%]
2024-11-29T07:43:29.1441540Z tests/test_unit_harness_metrics.py ..................................... [ 67%]
2024-11-29T07:43:29.2921174Z ........................................................................ [ 80%]
2024-11-29T07:43:29.6749320Z ................................................................... [ 92%]
2024-11-29T07:43:29.8402245Z tests/test_unit_reorder.py .. [ 92%]
2024-11-29T07:43:29.8901694Z tests/logging/test_evaluation_tracker.py ...s [ 93%]
2024-11-29T07:43:31.7466737Z tests/metrics/test_metric_requests.py ... [ 93%]
2024-11-29T07:43:31.7499546Z tests/metrics/test_normalizations.py .... [ 94%]
2024-11-29T07:43:35.1781251Z tests/models/test_abstract_model.py . [ 94%]
2024-11-29T07:43:36.2062836Z tests/models/test_base_model.py . [ 94%]
2024-11-29T07:43:37.5228883Z tests/tasks/test_lighteval_task.py .. [ 94%]
2024-11-29T07:43:37.5541528Z tests/tasks/test_registry.py ........ [ 96%]
2024-11-29T07:43:37.5572620Z tests/tasks/templates/test_continuation.py .... [ 97%]
2024-11-29T07:43:37.5591099Z tests/tasks/templates/test_copa.py .. [ 97%]
2024-11-29T07:43:37.5622767Z tests/tasks/templates/test_hellaswag.py .... [ 98%]
2024-11-29T07:43:37.5659938Z tests/tasks/templates/test_multichoice.py ..... [ 98%]
2024-11-29T07:43:37.5683530Z tests/tasks/templates/test_nli.py ... [ 99%]
2024-11-29T07:43:37.5771004Z tests/tasks/templates/test_translation.py ... [100%]
2024-11-29T07:43:37.5771703Z
2024-11-29T07:43:37.5771957Z =================================== FAILURES ===================================
2024-11-29T07:43:37.5772928Z ___ test_model_prediction[gpt2_lite_leaderboard|arc:challenge|25_acc_stderr] ___
2024-11-29T07:43:37.5773560Z
2024-11-29T07:43:37.5775505Z model_input = ('gpt2', 'lite', 'leaderboard|arc:challenge|25', 'acc_stderr', functools.partial(<functools._lru_cache_wrapper object ...fled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.15275252316519466)
2024-11-29T07:43:37.5777471Z
2024-11-29T07:43:37.5777880Z @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.5778693Z def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.5779605Z """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.5780707Z model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.5782007Z prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.5782843Z > assert reference == approx(
2024-11-29T07:43:37.5783380Z prediction, rel=1e-4
2024-11-29T07:43:37.5784202Z ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.5785593Z E AssertionError: Model gpt2 on lite samples, for eval leaderboard|arc:challenge|25, metric acc_stderr incorrect
2024-11-29T07:43:37.5787101Z E assert 0.15275252316519466 == 0.004569520819574033 ± 4.6e-07
2024-11-29T07:43:37.5787772Z E comparison failed
2024-11-29T07:43:37.5788257Z E Obtained: 0.15275252316519466
2024-11-29T07:43:37.5788826Z E Expected: 0.004569520819574033 ± 4.6e-07
2024-11-29T07:43:37.5789153Z
2024-11-29T07:43:37.5789319Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.5790029Z _ test_model_prediction[gpt2_lite_leaderboard|truthfulqa:mc|0_truthfulqa_mc1_stderr] _
2024-11-29T07:43:37.5790612Z
2024-11-29T07:43:37.5791970Z model_input = ('gpt2', 'lite', 'leaderboard|truthfulqa:mc|0', 'truthfulqa_mc1_stderr', functools.partial(<functools._lru_cache_wrapp...fled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.15275252316519466)
2024-11-29T07:43:37.5793419Z
2024-11-29T07:43:37.5793699Z @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.5794349Z def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.5795072Z """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.5795897Z model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.5796415Z prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.5796803Z > assert reference == approx(
2024-11-29T07:43:37.5797060Z prediction, rel=1e-4
2024-11-29T07:43:37.5797720Z ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.5798396Z E AssertionError: Model gpt2 on lite samples, for eval leaderboard|truthfulqa:mc|0, metric truthfulqa_mc1_stderr incorrect
2024-11-29T07:43:37.5799037Z E assert 0.15275252316519466 == 0.004619651629850591 ± 4.6e-07
2024-11-29T07:43:37.5799345Z E comparison failed
2024-11-29T07:43:37.5799570Z E Obtained: 0.15275252316519466
2024-11-29T07:43:37.5799885Z E Expected: 0.004619651629850591 ± 4.6e-07
2024-11-29T07:43:37.5800075Z
2024-11-29T07:43:37.5800173Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.5800564Z _ test_model_prediction[gpt2_lite_leaderboard|truthfulqa:mc|0_truthfulqa_mc2_stderr] _
2024-11-29T07:43:37.5800881Z
2024-11-29T07:43:37.5801581Z model_input = ('gpt2', 'lite', 'leaderboard|truthfulqa:mc|0', 'truthfulqa_mc2_stderr', functools.partial(<functools._lru_cache_wrapp...fled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.14105533101540416)
2024-11-29T07:43:37.5802385Z
2024-11-29T07:43:37.5802549Z @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.5802923Z def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.5803340Z """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.5804015Z model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.5804517Z prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.5804889Z > assert reference == approx(
2024-11-29T07:43:37.5805136Z prediction, rel=1e-4
2024-11-29T07:43:37.5805516Z ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.5806164Z E AssertionError: Model gpt2 on lite samples, for eval leaderboard|truthfulqa:mc|0, metric truthfulqa_mc2_stderr incorrect
2024-11-29T07:43:37.5806922Z E assert 0.14105533101540416 == 0.004258753966872427 ± 4.3e-07
2024-11-29T07:43:37.5807230Z E comparison failed
2024-11-29T07:43:37.5807468Z E Obtained: 0.14105533101540416
2024-11-29T07:43:37.5807776Z E Expected: 0.004258753966872427 ± 4.3e-07
2024-11-29T07:43:37.5807958Z
2024-11-29T07:43:37.5808056Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.5808427Z _____ test_model_prediction[gpt2_lite_leaderboard|hellaswag|10_acc_stderr] _____
2024-11-29T07:43:37.5808713Z
2024-11-29T07:43:37.5809414Z model_input = ('gpt2', 'lite', 'leaderboard|hellaswag|10', 'acc_stderr', functools.partial(<functools._lru_cache_wrapper object at 0...fled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.16329931618554522)
2024-11-29T07:43:37.5810184Z
2024-11-29T07:43:37.5810346Z @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.5810721Z def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.5811126Z """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.5811619Z model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.5812110Z prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.5812527Z > assert reference == approx(
2024-11-29T07:43:37.5812870Z prediction, rel=1e-4
2024-11-29T07:43:37.5813529Z ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.5814260Z E AssertionError: Model gpt2 on lite samples, for eval leaderboard|hellaswag|10, metric acc_stderr incorrect
2024-11-29T07:43:37.5814851Z E assert 0.16329931618554522 == 0.004968770338693327 ± 5.0e-07
2024-11-29T07:43:37.5815164Z E comparison failed
2024-11-29T07:43:37.5815391Z E Obtained: 0.16329931618554522
2024-11-29T07:43:37.5815698Z E Expected: 0.004968770338693327 ± 5.0e-07
2024-11-29T07:43:37.5815879Z
2024-11-29T07:43:37.5815978Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.5816346Z __ test_model_prediction[gpt2_lite_leaderboard|hellaswag|10_acc_norm_stderr] ___
2024-11-29T07:43:37.5816629Z
2024-11-29T07:43:37.5817340Z model_input = ('gpt2', 'lite', 'leaderboard|hellaswag|10', 'acc_norm_stderr', functools.partial(<functools._lru_cache_wrapper object...fled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.16329931618554522)
2024-11-29T07:43:37.5818107Z
2024-11-29T07:43:37.5818272Z @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.5818638Z def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.5819050Z """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.5819553Z model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.5820043Z prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.5820411Z > assert reference == approx(
2024-11-29T07:43:37.5820658Z prediction, rel=1e-4
2024-11-29T07:43:37.5821036Z ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.5821794Z E AssertionError: Model gpt2 on lite samples, for eval leaderboard|hellaswag|10, metric acc_norm_stderr incorrect
2024-11-29T07:43:37.5822396Z E assert 0.16329931618554522 == 0.004785693561320304 ± 4.8e-07
2024-11-29T07:43:37.5822701Z E comparison failed
2024-11-29T07:43:37.5822926Z E Obtained: 0.16329931618554522
2024-11-29T07:43:37.5823234Z E Expected: 0.004785693561320304 ± 4.8e-07
2024-11-29T07:43:37.5823423Z
2024-11-29T07:43:37.5823517Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.5824022Z _ test_model_prediction[gpt2_lite_leaderboard|mmlu:abstract_algebra|5_acc_stderr] _
2024-11-29T07:43:37.5824328Z
2024-11-29T07:43:37.5825018Z model_input = ('gpt2', 'lite', 'leaderboard|mmlu:abstract_algebra|5', 'acc_stderr', functools.partial(<functools._lru_cache_wrapper ...fled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.16329931618554522)
2024-11-29T07:43:37.5825793Z
2024-11-29T07:43:37.5825950Z @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.5826324Z def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.5826733Z """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.5827223Z model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.5827721Z prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.5828098Z > assert reference == approx(
2024-11-29T07:43:37.5828344Z prediction, rel=1e-4
2024-11-29T07:43:37.5828722Z ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.5829365Z E AssertionError: Model gpt2 on lite samples, for eval leaderboard|mmlu:abstract_algebra|5, metric acc_stderr incorrect
2024-11-29T07:43:37.5829988Z E assert 0.16329931618554522 == 0.00497641685...3716 ± 5.0e-07
2024-11-29T07:43:37.5830294Z E comparison failed
2024-11-29T07:43:37.5830524Z E Obtained: 0.16329931618554522
2024-11-29T07:43:37.5830835Z E Expected: 0.0049764168560043716 ± 5.0e-07
2024-11-29T07:43:37.5831016Z
2024-11-29T07:43:37.5831115Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.5831500Z _ test_model_prediction[gpt2_lite_leaderboard|mmlu:college_chemistry|5_acc_stderr] _
2024-11-29T07:43:37.5831802Z
2024-11-29T07:43:37.5832518Z model_input = ('gpt2', 'lite', 'leaderboard|mmlu:college_chemistry|5', 'acc_stderr', functools.partial(<functools._lru_cache_wrapper...fled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.09999999999999999)
2024-11-29T07:43:37.5833789Z
2024-11-29T07:43:37.5833958Z @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.5834323Z def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.5834731Z """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.5835285Z model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.5835772Z prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.5836148Z > assert reference == approx(
2024-11-29T07:43:37.5836401Z prediction, rel=1e-4
2024-11-29T07:43:37.5836777Z ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.5837691Z E AssertionError: Model gpt2 on lite samples, for eval leaderboard|mmlu:college_chemistry|5, metric acc_stderr incorrect
2024-11-29T07:43:37.5838334Z E assert 0.09999999999999999 == 0.002971327782118411 ± 3.0e-07
2024-11-29T07:43:37.5838650Z E comparison failed
2024-11-29T07:43:37.5838879Z E Obtained: 0.09999999999999999
2024-11-29T07:43:37.5839189Z E Expected: 0.002971327782118411 ± 3.0e-07
2024-11-29T07:43:37.5839377Z
2024-11-29T07:43:37.5839473Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.5840039Z _ test_model_prediction[gpt2_lite_leaderboard|mmlu:computer_security|5_acc_stderr] _
2024-11-29T07:43:37.5840358Z
2024-11-29T07:43:37.5841054Z model_input = ('gpt2', 'lite', 'leaderboard|mmlu:computer_security|5', 'acc_stderr', functools.partial(<functools._lru_cache_wrapper...fled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.09999999999999999)
2024-11-29T07:43:37.5842026Z
2024-11-29T07:43:37.5842188Z @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.5842559Z def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.5842968Z """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.5843463Z model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.5843953Z prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.5844335Z > assert reference == approx(
2024-11-29T07:43:37.5844592Z prediction, rel=1e-4
2024-11-29T07:43:37.5844981Z ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.5845630Z E AssertionError: Model gpt2 on lite samples, for eval leaderboard|mmlu:computer_security|5, metric acc_stderr incorrect
2024-11-29T07:43:37.5846251Z E assert 0.09999999999999999 == 0.00301340183...0794 ± 3.0e-07
2024-11-29T07:43:37.5846564Z E comparison failed
2024-11-29T07:43:37.5846787Z E Obtained: 0.09999999999999999
2024-11-29T07:43:37.5847095Z E Expected: 0.0030134018302560794 ± 3.0e-07
2024-11-29T07:43:37.5847280Z
2024-11-29T07:43:37.5847384Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.5847771Z _ test_model_prediction[gpt2_lite_leaderboard|mmlu:us_foreign_policy|5_acc_stderr] _
2024-11-29T07:43:37.5848079Z
2024-11-29T07:43:37.5848770Z model_input = ('gpt2', 'lite', 'leaderboard|mmlu:us_foreign_policy|5', 'acc_stderr', functools.partial(<functools._lru_cache_wrapper...fled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.15275252316519466)
2024-11-29T07:43:37.5849541Z
2024-11-29T07:43:37.5849700Z @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.5850066Z def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.5850472Z """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.5851029Z model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.5851903Z prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.5852548Z > assert reference == approx(
2024-11-29T07:43:37.5852798Z prediction, rel=1e-4
2024-11-29T07:43:37.5853178Z ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.5853825Z E AssertionError: Model gpt2 on lite samples, for eval leaderboard|mmlu:us_foreign_policy|5, metric acc_stderr incorrect
2024-11-29T07:43:37.5854452Z E assert 0.15275252316519466 == 0.004633704913049727 ± 4.6e-07
2024-11-29T07:43:37.5854757Z E comparison failed
2024-11-29T07:43:37.5854986Z E Obtained: 0.15275252316519466
2024-11-29T07:43:37.5855297Z E Expected: 0.004633704913049727 ± 4.6e-07
2024-11-29T07:43:37.5855486Z
2024-11-29T07:43:37.5855585Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.5855954Z ___ test_model_prediction[gpt2_lite_helm|mmlu:abstract_algebra|5_em_stderr] ____
2024-11-29T07:43:37.5856244Z
2024-11-29T07:43:37.5856924Z model_input = ('gpt2', 'lite', 'helm|mmlu:abstract_algebra|5', 'em_stderr', functools.partial(<functools._lru_cache_wrapper object a...fled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.16329931618554522)
2024-11-29T07:43:37.5857697Z
2024-11-29T07:43:37.5858009Z @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.5858382Z def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.5858783Z """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.5859278Z model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.5859767Z prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.5860443Z > assert reference == approx(
2024-11-29T07:43:37.5860690Z prediction, rel=1e-4
2024-11-29T07:43:37.5861067Z ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.5861671Z E AssertionError: Model gpt2 on lite samples, for eval helm|mmlu:abstract_algebra|5, metric em_stderr incorrect
2024-11-29T07:43:37.5862260Z E assert 0.16329931618554522 == 0.00497641685...3716 ± 5.0e-07
2024-11-29T07:43:37.5862574Z E comparison failed
2024-11-29T07:43:37.5862802Z E Obtained: 0.16329931618554522
2024-11-29T07:43:37.5863110Z E Expected: 0.0049764168560043716 ± 5.0e-07
2024-11-29T07:43:37.5863292Z
2024-11-29T07:43:37.5863390Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.5863760Z __ test_model_prediction[gpt2_lite_helm|mmlu:abstract_algebra|5_pqem_stderr] ___
2024-11-29T07:43:37.5864053Z
2024-11-29T07:43:37.5864743Z model_input = ('gpt2', 'lite', 'helm|mmlu:abstract_algebra|5', 'pqem_stderr', functools.partial(<functools._lru_cache_wrapper object...fled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.16329931618554522)
2024-11-29T07:43:37.5865529Z
2024-11-29T07:43:37.5865689Z @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.5866054Z def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.5866464Z """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.5866956Z model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.5867445Z prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.5867820Z > assert reference == approx(
2024-11-29T07:43:37.5868061Z prediction, rel=1e-4
2024-11-29T07:43:37.5868436Z ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.5869052Z E AssertionError: Model gpt2 on lite samples, for eval helm|mmlu:abstract_algebra|5, metric pqem_stderr incorrect
2024-11-29T07:43:37.5869640Z E assert 0.16329931618554522 == 0.00497641685...3716 ± 5.0e-07
2024-11-29T07:43:37.5869948Z E comparison failed
2024-11-29T07:43:37.5870174Z E Obtained: 0.16329931618554522
2024-11-29T07:43:37.5870481Z E Expected: 0.0049764168560043716 ± 5.0e-07
2024-11-29T07:43:37.5870669Z
2024-11-29T07:43:37.5870772Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.5871224Z ___ test_model_prediction[gpt2_lite_helm|mmlu:college_chemistry|5_em_stderr] ___
2024-11-29T07:43:37.5871743Z
2024-11-29T07:43:37.5872652Z model_input = ('gpt2', 'lite', 'helm|mmlu:college_chemistry|5', 'em_stderr', functools.partial(<functools._lru_cache_wrapper object ...fled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.15275252316519466)
2024-11-29T07:43:37.5873425Z
2024-11-29T07:43:37.5873581Z @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.5873947Z def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.5874352Z """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.5874845Z model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.5875468Z prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.5875857Z > assert reference == approx(
2024-11-29T07:43:37.5876102Z prediction, rel=1e-4
2024-11-29T07:43:37.5876476Z ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.5877083Z E AssertionError: Model gpt2 on lite samples, for eval helm|mmlu:college_chemistry|5, metric em_stderr incorrect
2024-11-29T07:43:37.5878044Z E assert 0.15275252316519466 == 0.00457283509661358 ± 4.6e-07
2024-11-29T07:43:37.5878353Z E comparison failed
2024-11-29T07:43:37.5878580Z E Obtained: 0.15275252316519466
2024-11-29T07:43:37.5878887Z E Expected: 0.00457283509661358 ± 4.6e-07
2024-11-29T07:43:37.5879074Z
2024-11-29T07:43:37.5879170Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.5879541Z __ test_model_prediction[gpt2_lite_helm|mmlu:college_chemistry|5_pqem_stderr] __
2024-11-29T07:43:37.5879830Z
2024-11-29T07:43:37.5880521Z model_input = ('gpt2', 'lite', 'helm|mmlu:college_chemistry|5', 'pqem_stderr', functools.partial(<functools._lru_cache_wrapper objec...fled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.16329931618554522)
2024-11-29T07:43:37.5881299Z
2024-11-29T07:43:37.5881458Z @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.5881825Z def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.5882237Z """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.5882728Z model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.5883213Z prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.5883588Z > assert reference == approx(
2024-11-29T07:43:37.5883836Z prediction, rel=1e-4
2024-11-29T07:43:37.5884220Z ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.5884831Z E AssertionError: Model gpt2 on lite samples, for eval helm|mmlu:college_chemistry|5, metric pqem_stderr incorrect
2024-11-29T07:43:37.5885422Z E assert 0.16329931618554522 == 0.004802280906184263 ± 4.8e-07
2024-11-29T07:43:37.5885732Z E comparison failed
2024-11-29T07:43:37.5885956Z E Obtained: 0.16329931618554522
2024-11-29T07:43:37.5886267Z E Expected: 0.004802280906184263 ± 4.8e-07
2024-11-29T07:43:37.5886457Z
2024-11-29T07:43:37.5886551Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.5886918Z ___ test_model_prediction[gpt2_lite_helm|mmlu:computer_security|5_em_stderr] ___
2024-11-29T07:43:37.5887212Z
2024-11-29T07:43:37.5887893Z model_input = ('gpt2', 'lite', 'helm|mmlu:computer_security|5', 'em_stderr', functools.partial(<functools._lru_cache_wrapper object ...fled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.09999999999999999)
2024-11-29T07:43:37.5888674Z
2024-11-29T07:43:37.5888958Z @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.5889593Z def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.5890150Z """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.5890644Z model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.5891136Z prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.5891542Z > assert reference == approx(
2024-11-29T07:43:37.5891788Z prediction, rel=1e-4
2024-11-29T07:43:37.5892167Z ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.5892776Z E AssertionError: Model gpt2 on lite samples, for eval helm|mmlu:computer_security|5, metric em_stderr incorrect
2024-11-29T07:43:37.5893541Z E assert 0.09999999999999999 == 0.00301340183...0794 ± 3.0e-07
2024-11-29T07:43:37.5893854Z E comparison failed
2024-11-29T07:43:37.5894080Z E Obtained: 0.09999999999999999
2024-11-29T07:43:37.5894393Z E Expected: 0.0030134018302560794 ± 3.0e-07
2024-11-29T07:43:37.5894575Z
2024-11-29T07:43:37.5894675Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.5895043Z __ test_model_prediction[gpt2_lite_helm|mmlu:computer_security|5_pqem_stderr] __
2024-11-29T07:43:37.5895464Z
2024-11-29T07:43:37.5896154Z model_input = ('gpt2', 'lite', 'helm|mmlu:computer_security|5', 'pqem_stderr', functools.partial(<functools._lru_cache_wrapper objec...fled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.15275252316519464)
2024-11-29T07:43:37.5896915Z
2024-11-29T07:43:37.5897075Z @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.5897448Z def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.5897862Z """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.5898352Z model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.5898841Z prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.5899213Z > assert reference == approx(
2024-11-29T07:43:37.5899462Z prediction, rel=1e-4
2024-11-29T07:43:37.5899846Z ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.5900461Z E AssertionError: Model gpt2 on lite samples, for eval helm|mmlu:computer_security|5, metric pqem_stderr incorrect
2024-11-29T07:43:37.5901053Z E assert 0.15275252316519464 == 0.004774519456032858 ± 4.8e-07
2024-11-29T07:43:37.5901360Z E comparison failed
2024-11-29T07:43:37.5901589Z E Obtained: 0.15275252316519464
2024-11-29T07:43:37.5901901Z E Expected: 0.004774519456032858 ± 4.8e-07
2024-11-29T07:43:37.5902091Z
2024-11-29T07:43:37.5902193Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.5902558Z ___ test_model_prediction[gpt2_lite_helm|mmlu:us_foreign_policy|5_em_stderr] ___
2024-11-29T07:43:37.5902842Z
2024-11-29T07:43:37.5903512Z model_input = ('gpt2', 'lite', 'helm|mmlu:us_foreign_policy|5', 'em_stderr', functools.partial(<functools._lru_cache_wrapper object ...fled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.15275252316519466)
2024-11-29T07:43:37.5904269Z
2024-11-29T07:43:37.5904431Z @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.5904795Z def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.5905199Z """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.5905684Z model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.5906174Z prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.5906553Z > assert reference == approx(
2024-11-29T07:43:37.5906888Z prediction, rel=1e-4
2024-11-29T07:43:37.5907289Z ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.5907895Z E AssertionError: Model gpt2 on lite samples, for eval helm|mmlu:us_foreign_policy|5, metric em_stderr incorrect
2024-11-29T07:43:37.5908875Z E assert 0.15275252316519466 == 0.004633704913049727 ± 4.6e-07
2024-11-29T07:43:37.5909278Z E comparison failed
2024-11-29T07:43:37.5909502Z E Obtained: 0.15275252316519466
2024-11-29T07:43:37.5909815Z E Expected: 0.004633704913049727 ± 4.6e-07
2024-11-29T07:43:37.5910000Z
2024-11-29T07:43:37.5910093Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.5910460Z __ test_model_prediction[gpt2_lite_helm|mmlu:us_foreign_policy|5_pqem_stderr] __
2024-11-29T07:43:37.5910763Z
2024-11-29T07:43:37.5911603Z model_input = ('gpt2', 'lite', 'helm|mmlu:us_foreign_policy|5', 'pqem_stderr', functools.partial(<functools._lru_cache_wrapper objec...fled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.16329931618554522)
2024-11-29T07:43:37.5912371Z
2024-11-29T07:43:37.5912529Z @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.5912896Z def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.5913414Z """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.5913900Z model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.5914388Z prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.5916656Z > assert reference == approx(
2024-11-29T07:43:37.5917064Z prediction, rel=1e-4
2024-11-29T07:43:37.5918084Z ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.5919133Z E AssertionError: Model gpt2 on lite samples, for eval helm|mmlu:us_foreign_policy|5, metric pqem_stderr incorrect
2024-11-29T07:43:37.5920179Z E assert 0.16329931618554522 == 0.004872014627084626 ± 4.9e-07
2024-11-29T07:43:37.5920679Z E comparison failed
2024-11-29T07:43:37.5920915Z E Obtained: 0.16329931618554522
2024-11-29T07:43:37.5921263Z E Expected: 0.004872014627084626 ± 4.9e-07
2024-11-29T07:43:37.5921462Z
2024-11-29T07:43:37.5921562Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.5921935Z _______ test_model_prediction[gpt2_lite_lighteval|anli:r1|0_acc_stderr] ________
2024-11-29T07:43:37.5922223Z
2024-11-29T07:43:37.5922924Z model_input = ('gpt2', 'lite', 'lighteval|anli:r1|0', 'acc_stderr', functools.partial(<functools._lru_cache_wrapper object at 0x7f63...fled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.16666666666666666)
2024-11-29T07:43:37.5923695Z
2024-11-29T07:43:37.5923860Z @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.5924234Z def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.5924653Z """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.5925153Z model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.5925689Z prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.5926348Z > assert reference == approx(
2024-11-29T07:43:37.5926770Z prediction, rel=1e-4
2024-11-29T07:43:37.5927162Z ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.5927752Z E AssertionError: Model gpt2 on lite samples, for eval lighteval|anli:r1|0, metric acc_stderr incorrect
2024-11-29T07:43:37.5928323Z E assert 0.16666666666666666 == 0.00514299138248941 ± 5.1e-07
2024-11-29T07:43:37.5928631Z E comparison failed
2024-11-29T07:43:37.5928857Z E Obtained: 0.16666666666666666
2024-11-29T07:43:37.5929167Z E Expected: 0.00514299138248941 ± 5.1e-07
2024-11-29T07:43:37.5929345Z
2024-11-29T07:43:37.5929446Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.5929815Z _ test_model_prediction[gpt2_lite_lighteval|blimp:adjunct_island|0_acc_stderr] _
2024-11-29T07:43:37.5930114Z
2024-11-29T07:43:37.5930809Z model_input = ('gpt2', 'lite', 'lighteval|blimp:adjunct_island|0', 'acc_stderr', functools.partial(<functools._lru_cache_wrapper obj...fled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.13333333333333333)
2024-11-29T07:43:37.5931601Z
2024-11-29T07:43:37.5931763Z @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.5932136Z def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.5932756Z """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.5933259Z model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.5933742Z prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.5934112Z > assert reference == approx(
2024-11-29T07:43:37.5934362Z prediction, rel=1e-4
2024-11-29T07:43:37.5934878Z ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.5935493Z E AssertionError: Model gpt2 on lite samples, for eval lighteval|blimp:adjunct_island|0, metric acc_stderr incorrect
2024-11-29T07:43:37.5936097Z E assert 0.13333333333333333 == 0.003921139545506534 ± 3.9e-07
2024-11-29T07:43:37.5936407Z E comparison failed
2024-11-29T07:43:37.5936634Z E Obtained: 0.13333333333333333
2024-11-29T07:43:37.5936950Z E Expected: 0.003921139545506534 ± 3.9e-07
2024-11-29T07:43:37.5937132Z
2024-11-29T07:43:37.5937231Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.5937609Z _ test_model_prediction[gpt2_lite_lighteval|blimp:ellipsis_n_bar_1|0_acc_stderr] _
2024-11-29T07:43:37.5937912Z
2024-11-29T07:43:37.5938588Z model_input = ('gpt2', 'lite', 'lighteval|blimp:ellipsis_n_bar_1|0', 'acc_stderr', functools.partial(<functools._lru_cache_wrapper o...fled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.15275252316519466)
2024-11-29T07:43:37.5939340Z
2024-11-29T07:43:37.5939501Z @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.5939864Z def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.5940267Z """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.5940751Z model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.5941238Z prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.5941606Z > assert reference == approx(
2024-11-29T07:43:37.5941854Z prediction, rel=1e-4
2024-11-29T07:43:37.5942231Z ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.5942855Z E AssertionError: Model gpt2 on lite samples, for eval lighteval|blimp:ellipsis_n_bar_1|0, metric acc_stderr incorrect
2024-11-29T07:43:37.5943750Z E assert 0.15275252316519466 == 0.004709524351738684 ± 4.7e-07
2024-11-29T07:43:37.5944254Z E comparison failed
2024-11-29T07:43:37.5944480Z E Obtained: 0.15275252316519466
2024-11-29T07:43:37.5944790Z E Expected: 0.004709524351738684 ± 4.7e-07
2024-11-29T07:43:37.5944969Z
2024-11-29T07:43:37.5945068Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.5945421Z ___________ test_model_prediction[gpt2_lite_helm|boolq|5_em_stderr] ____________
2024-11-29T07:43:37.5945695Z
2024-11-29T07:43:37.5946392Z model_input = ('gpt2', 'lite', 'helm|boolq|5', 'em_stderr', functools.partial(<functools._lru_cache_wrapper object at 0x7f6332c883b0...fled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.16329931618554522)
2024-11-29T07:43:37.5947153Z
2024-11-29T07:43:37.5947321Z @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.5947704Z def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.5948126Z """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.5948620Z model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.5949106Z prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.5949481Z > assert reference == approx(
2024-11-29T07:43:37.5949729Z prediction, rel=1e-4
2024-11-29T07:43:37.5950234Z ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.5950802Z E AssertionError: Model gpt2 on lite samples, for eval helm|boolq|5, metric em_stderr incorrect
2024-11-29T07:43:37.5951344Z E assert 0.16329931618554522 == 0.004922599108396079 ± 4.9e-07
2024-11-29T07:43:37.5951646Z E comparison failed
2024-11-29T07:43:37.5951872Z E Obtained: 0.16329931618554522
2024-11-29T07:43:37.5952301Z E Expected: 0.004922599108396079 ± 4.9e-07
2024-11-29T07:43:37.5952485Z
2024-11-29T07:43:37.5952579Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.5952930Z ___________ test_model_prediction[gpt2_lite_helm|boolq|5_qem_stderr] ___________
2024-11-29T07:43:37.5953203Z
2024-11-29T07:43:37.5953893Z model_input = ('gpt2', 'lite', 'helm|boolq|5', 'qem_stderr', functools.partial(<functools._lru_cache_wrapper object at 0x7f6332c883b...fled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.16329931618554522)
2024-11-29T07:43:37.5954653Z
2024-11-29T07:43:37.5954810Z @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.5955176Z def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.5955574Z """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.5956067Z model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.5956567Z prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.5956939Z > assert reference == approx(
2024-11-29T07:43:37.5957184Z prediction, rel=1e-4
2024-11-29T07:43:37.5957832Z ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.5958395Z E AssertionError: Model gpt2 on lite samples, for eval helm|boolq|5, metric qem_stderr incorrect
2024-11-29T07:43:37.5958946Z E assert 0.16329931618554522 == 0.004922599108396079 ± 4.9e-07
2024-11-29T07:43:37.5959248Z E comparison failed
2024-11-29T07:43:37.5959470Z E Obtained: 0.16329931618554522
2024-11-29T07:43:37.5959777Z E Expected: 0.004922599108396079 ± 4.9e-07
2024-11-29T07:43:37.5959961Z
2024-11-29T07:43:37.5960055Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.5960406Z ___________ test_model_prediction[gpt2_lite_helm|boolq|5_pem_stderr] ___________
2024-11-29T07:43:37.5960693Z
2024-11-29T07:43:37.5961374Z model_input = ('gpt2', 'lite', 'helm|boolq|5', 'pem_stderr', functools.partial(<functools._lru_cache_wrapper object at 0x7f6332c883b...fled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.16329931618554522)
2024-11-29T07:43:37.5962133Z
2024-11-29T07:43:37.5962291Z @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.5962659Z def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.5963313Z """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.5964188Z model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.5964689Z prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.5965067Z > assert reference == approx(
2024-11-29T07:43:37.5965320Z prediction, rel=1e-4
2024-11-29T07:43:37.5965701Z ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.5966267Z E AssertionError: Model gpt2 on lite samples, for eval helm|boolq|5, metric pem_stderr incorrect
2024-11-29T07:43:37.5966818Z E assert 0.16329931618554522 == 0.004922599108396079 ± 4.9e-07
2024-11-29T07:43:37.5967128Z E comparison failed
2024-11-29T07:43:37.5967355Z E Obtained: 0.16329931618554522
2024-11-29T07:43:37.5967843Z E Expected: 0.004922599108396079 ± 4.9e-07
2024-11-29T07:43:37.5968029Z
2024-11-29T07:43:37.5968129Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.5968477Z __________ test_model_prediction[gpt2_lite_helm|boolq|5_pqem_stderr] ___________
2024-11-29T07:43:37.5968758Z
2024-11-29T07:43:37.5969434Z model_input = ('gpt2', 'lite', 'helm|boolq|5', 'pqem_stderr', functools.partial(<functools._lru_cache_wrapper object at 0x7f6332c883...fled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.16329931618554522)
2024-11-29T07:43:37.5970369Z
2024-11-29T07:43:37.5970539Z @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.5970908Z def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.5971320Z """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.5971814Z model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.5972313Z prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.5972692Z > assert reference == approx(
2024-11-29T07:43:37.5972940Z prediction, rel=1e-4
2024-11-29T07:43:37.5973323Z ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.5973887Z E AssertionError: Model gpt2 on lite samples, for eval helm|boolq|5, metric pqem_stderr incorrect
2024-11-29T07:43:37.5974436Z E assert 0.16329931618554522 == 0.004922599108396079 ± 4.9e-07
2024-11-29T07:43:37.5974748Z E comparison failed
2024-11-29T07:43:37.5974977Z E Obtained: 0.16329931618554522
2024-11-29T07:43:37.5975287Z E Expected: 0.004922599108396079 ± 4.9e-07
2024-11-29T07:43:37.5975468Z
2024-11-29T07:43:37.5975567Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.5975939Z ___ test_model_prediction[gpt2_lite_lighteval|agieval:aqua-rat|0_acc_stderr] ___
2024-11-29T07:43:37.5976228Z
2024-11-29T07:43:37.5976938Z model_input = ('gpt2', 'lite', 'lighteval|agieval:aqua-rat|0', 'acc_stderr', functools.partial(<functools._lru_cache_wrapper object ...racking_shuffled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.15275)
2024-11-29T07:43:37.5977707Z
2024-11-29T07:43:37.5977870Z @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.5978250Z def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.5978655Z """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.5979154Z model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.5979649Z prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.5980019Z > assert reference == approx(
2024-11-29T07:43:37.5980383Z prediction, rel=1e-4
2024-11-29T07:43:37.5981044Z ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.5981772Z E AssertionError: Model gpt2 on lite samples, for eval lighteval|agieval:aqua-rat|0, metric acc_stderr incorrect
2024-11-29T07:43:37.5982342Z E assert 0.15275 == 0.004774519456032858 ± 4.8e-07
2024-11-29T07:43:37.5982620Z E comparison failed
2024-11-29T07:43:37.5982841Z E Obtained: 0.15275
2024-11-29T07:43:37.5983125Z E Expected: 0.004774519456032858 ± 4.8e-07
2024-11-29T07:43:37.5983305Z
2024-11-29T07:43:37.5983407Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.5983791Z _ test_model_prediction[gpt2_lite_lighteval|agieval:aqua-rat|0_acc_norm_stderr] _
2024-11-29T07:43:37.5984080Z
2024-11-29T07:43:37.5984781Z model_input = ('gpt2', 'lite', 'lighteval|agieval:aqua-rat|0', 'acc_norm_stderr', functools.partial(<functools._lru_cache_wrapper ob...racking_shuffled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.15275)
2024-11-29T07:43:37.5985683Z
2024-11-29T07:43:37.5985853Z @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.5986220Z def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.5986624Z """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.5987118Z model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.5987721Z prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.5988105Z > assert reference == approx(
2024-11-29T07:43:37.5988351Z prediction, rel=1e-4
2024-11-29T07:43:37.5988735Z ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.5989360Z E AssertionError: Model gpt2 on lite samples, for eval lighteval|agieval:aqua-rat|0, metric acc_norm_stderr incorrect
2024-11-29T07:43:37.5989936Z E assert 0.15275 == 0.004774519456032858 ± 4.8e-07
2024-11-29T07:43:37.5990211Z E comparison failed
2024-11-29T07:43:37.5990430Z E Obtained: 0.15275
2024-11-29T07:43:37.5990709Z E Expected: 0.004774519456032858 ± 4.8e-07
2024-11-29T07:43:37.5990892Z
2024-11-29T07:43:37.5990986Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.5991518Z __ test_model_prediction[gpt2_lite_lighteval|agieval:logiqa-en|0_acc_stderr] ___
2024-11-29T07:43:37.5991818Z
2024-11-29T07:43:37.5992516Z model_input = ('gpt2', 'lite', 'lighteval|agieval:logiqa-en|0', 'acc_stderr', functools.partial(<functools._lru_cache_wrapper object...ch:tracking_shuffled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.1)
2024-11-29T07:43:37.5993285Z
2024-11-29T07:43:37.5993443Z @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.5993809Z def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.5994216Z """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.5994701Z model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.5995188Z prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.5995563Z > assert reference == approx(
2024-11-29T07:43:37.5995817Z prediction, rel=1e-4
2024-11-29T07:43:37.5996203Z ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.5996818Z E AssertionError: Model gpt2 on lite samples, for eval lighteval|agieval:logiqa-en|0, metric acc_stderr incorrect
2024-11-29T07:43:37.5997608Z E assert 0.1 == 0.00309049205...1304 ± 3.1e-07
2024-11-29T07:43:37.5997879Z E comparison failed
2024-11-29T07:43:37.5998090Z E Obtained: 0.1
2024-11-29T07:43:37.5998367Z E Expected: 0.0030904920548581304 ± 3.1e-07
2024-11-29T07:43:37.5998554Z
2024-11-29T07:43:37.5998655Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.5999036Z _ test_model_prediction[gpt2_lite_lighteval|agieval:logiqa-en|0_acc_norm_stderr] _
2024-11-29T07:43:37.5999337Z
2024-11-29T07:43:37.6000059Z model_input = ('gpt2', 'lite', 'lighteval|agieval:logiqa-en|0', 'acc_norm_stderr', functools.partial(<functools._lru_cache_wrapper o...racking_shuffled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.15275)
2024-11-29T07:43:37.6001390Z
2024-11-29T07:43:37.6001548Z @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.6001914Z def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.6002319Z """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.6002811Z model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.6003470Z prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.6003857Z > assert reference == approx(
2024-11-29T07:43:37.6004102Z prediction, rel=1e-4
2024-11-29T07:43:37.6004478Z ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.6005116Z E AssertionError: Model gpt2 on lite samples, for eval lighteval|agieval:logiqa-en|0, metric acc_norm_stderr incorrect
2024-11-29T07:43:37.6005829Z E assert 0.15275 == 0.00457742206...5185 ± 4.6e-07
2024-11-29T07:43:37.6006101Z E comparison failed
2024-11-29T07:43:37.6006319Z E Obtained: 0.15275
2024-11-29T07:43:37.6006597Z E Expected: 0.0045774220684565185 ± 4.6e-07
2024-11-29T07:43:37.6006784Z
2024-11-29T07:43:37.6006957Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6007517Z ___ test_model_prediction[gpt2_lite_lighteval|agieval:lsat-ar|0_acc_stderr] ____
2024-11-29T07:43:37.6007826Z
2024-11-29T07:43:37.6008536Z model_input = ('gpt2', 'lite', 'lighteval|agieval:lsat-ar|0', 'acc_stderr', functools.partial(<functools._lru_cache_wrapper object a...ch:tracking_shuffled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.1)
2024-11-29T07:43:37.6009298Z
2024-11-29T07:43:37.6009458Z @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.6009822Z def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.6010232Z """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.6010718Z model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.6011205Z prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.6011579Z > assert reference == approx(
2024-11-29T07:43:37.6011822Z prediction, rel=1e-4
2024-11-29T07:43:37.6012203Z ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.6012835Z E AssertionError: Model gpt2 on lite samples, for eval lighteval|agieval:lsat-ar|0, metric acc_stderr incorrect
2024-11-29T07:43:37.6013389Z E assert 0.1 == 0.00301340183...0794 ± 3.0e-07
2024-11-29T07:43:37.6013659Z E comparison failed
2024-11-29T07:43:37.6013876Z E Obtained: 0.1
2024-11-29T07:43:37.6014140Z E Expected: 0.0030134018302560794 ± 3.0e-07
2024-11-29T07:43:37.6014325Z
2024-11-29T07:43:37.6014423Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6014801Z _ test_model_prediction[gpt2_lite_lighteval|agieval:lsat-ar|0_acc_norm_stderr] _
2024-11-29T07:43:37.6015092Z
2024-11-29T07:43:37.6015797Z model_input = ('gpt2', 'lite', 'lighteval|agieval:lsat-ar|0', 'acc_norm_stderr', functools.partial(<functools._lru_cache_wrapper obj...ch:tracking_shuffled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.1)
2024-11-29T07:43:37.6016560Z
2024-11-29T07:43:37.6016735Z @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.6017115Z def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.6017530Z """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.6018043Z model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.6018542Z prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.6018922Z > assert reference == approx(
2024-11-29T07:43:37.6019175Z prediction, rel=1e-4
2024-11-29T07:43:37.6019565Z ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.6020194Z E AssertionError: Model gpt2 on lite samples, for eval lighteval|agieval:lsat-ar|0, metric acc_norm_stderr incorrect
2024-11-29T07:43:37.6020749Z E assert 0.1 == 0.00301340183...0794 ± 3.0e-07
2024-11-29T07:43:37.6021169Z E comparison failed
2024-11-29T07:43:37.6021399Z E Obtained: 0.1
2024-11-29T07:43:37.6021674Z E Expected: 0.0030134018302560794 ± 3.0e-07
2024-11-29T07:43:37.6021855Z
2024-11-29T07:43:37.6021956Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6022328Z ___ test_model_prediction[gpt2_lite_lighteval|agieval:lsat-lr|0_acc_stderr] ____
2024-11-29T07:43:37.6022615Z
2024-11-29T07:43:37.6023428Z model_input = ('gpt2', 'lite', 'lighteval|agieval:lsat-lr|0', 'acc_stderr', functools.partial(<functools._lru_cache_wrapper object a...racking_shuffled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.13333)
2024-11-29T07:43:37.6024220Z
2024-11-29T07:43:37.6024377Z @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.6024745Z def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.6025152Z """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.6025652Z model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.6026139Z prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.6026514Z > assert reference == approx(
2024-11-29T07:43:37.6026763Z prediction, rel=1e-4
2024-11-29T07:43:37.6027141Z ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.6027754Z E AssertionError: Model gpt2 on lite samples, for eval lighteval|agieval:lsat-lr|0, metric acc_stderr incorrect
2024-11-29T07:43:37.6028311Z E assert 0.13333 == 0.004077368628777015 ± 4.1e-07
2024-11-29T07:43:37.6028585Z E comparison failed
2024-11-29T07:43:37.6028805Z E Obtained: 0.13333
2024-11-29T07:43:37.6029088Z E Expected: 0.004077368628777015 ± 4.1e-07
2024-11-29T07:43:37.6029279Z
2024-11-29T07:43:37.6029377Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6029761Z _ test_model_prediction[gpt2_lite_lighteval|agieval:lsat-lr|0_acc_norm_stderr] _
2024-11-29T07:43:37.6030051Z
2024-11-29T07:43:37.6030738Z model_input = ('gpt2', 'lite', 'lighteval|agieval:lsat-lr|0', 'acc_norm_stderr', functools.partial(<functools._lru_cache_wrapper obj...racking_shuffled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.13333)
2024-11-29T07:43:37.6031527Z
2024-11-29T07:43:37.6031685Z @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.6032055Z def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.6032459Z """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.6032947Z model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.6033434Z prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.6033812Z > assert reference == approx(
2024-11-29T07:43:37.6034060Z prediction, rel=1e-4
2024-11-29T07:43:37.6034438Z ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.6035061Z E AssertionError: Model gpt2 on lite samples, for eval lighteval|agieval:lsat-lr|0, metric acc_norm_stderr incorrect
2024-11-29T07:43:37.6035623Z E assert 0.13333 == 0.004077368628777015 ± 4.1e-07
2024-11-29T07:43:37.6036009Z E comparison failed
2024-11-29T07:43:37.6036359Z E Obtained: 0.13333
2024-11-29T07:43:37.6036664Z E Expected: 0.004077368628777015 ± 4.1e-07
2024-11-29T07:43:37.6036986Z
2024-11-29T07:43:37.6037146Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6037957Z ___ test_model_prediction[gpt2_lite_lighteval|agieval:lsat-rc|0_acc_stderr] ____
2024-11-29T07:43:37.6038311Z
2024-11-29T07:43:37.6039835Z model_input = ('gpt2', 'lite', 'lighteval|agieval:lsat-rc|0', 'acc_stderr', functools.partial(<functools._lru_cache_wrapper object a...racking_shuffled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.15275)
2024-11-29T07:43:37.6041353Z
2024-11-29T07:43:37.6041630Z @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.6042266Z def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.6042756Z """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.6043426Z model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.6043918Z prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.6044294Z > assert reference == approx(
2024-11-29T07:43:37.6044543Z prediction, rel=1e-4
2024-11-29T07:43:37.6044925Z ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.6045539Z E AssertionError: Model gpt2 on lite samples, for eval lighteval|agieval:lsat-rc|0, metric acc_stderr incorrect
2024-11-29T07:43:37.6046115Z E assert 0.15275 == 0.004582352884486063 ± 4.6e-07
2024-11-29T07:43:37.6046391Z E comparison failed
2024-11-29T07:43:37.6046613Z E Obtained: 0.15275
2024-11-29T07:43:37.6047101Z E Expected: 0.004582352884486063 ± 4.6e-07
2024-11-29T07:43:37.6047412Z
2024-11-29T07:43:37.6047594Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6047972Z _ test_model_prediction[gpt2_lite_lighteval|agieval:lsat-rc|0_acc_norm_stderr] _
2024-11-29T07:43:37.6048269Z
2024-11-29T07:43:37.6048983Z model_input = ('gpt2', 'lite', 'lighteval|agieval:lsat-rc|0', 'acc_norm_stderr', functools.partial(<functools._lru_cache_wrapper obj...racking_shuffled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.13333)
2024-11-29T07:43:37.6049747Z
2024-11-29T07:43:37.6049909Z @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.6050282Z def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.6050683Z """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.6051170Z model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.6051662Z prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.6052041Z > assert reference == approx(
2024-11-29T07:43:37.6052289Z prediction, rel=1e-4
2024-11-29T07:43:37.6052665Z ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.6053286Z E AssertionError: Model gpt2 on lite samples, for eval lighteval|agieval:lsat-rc|0, metric acc_norm_stderr incorrect
2024-11-29T07:43:37.6053846Z E assert 0.13333 == 0.004080117321518739 ± 4.1e-07
2024-11-29T07:43:37.6054122Z E comparison failed
2024-11-29T07:43:37.6054345Z E Obtained: 0.13333
2024-11-29T07:43:37.6054626Z E Expected: 0.004080117321518739 ± 4.1e-07
2024-11-29T07:43:37.6054808Z
2024-11-29T07:43:37.6054907Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6055614Z _ test_model_prediction[gpt2_lite_lighteval|agieval:sat-en-without-passage|0_acc_stderr] _
2024-11-29T07:43:37.6056149Z
2024-11-29T07:43:37.6056861Z model_input = ('gpt2', 'lite', 'lighteval|agieval:sat-en-without-passage|0', 'acc_stderr', functools.partial(<functools._lru_cache_w...racking_shuffled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.13333)
2024-11-29T07:43:37.6057653Z
2024-11-29T07:43:37.6057824Z @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.6058190Z def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.6058596Z """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.6059245Z model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.6059753Z prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.6060127Z > assert reference == approx(
2024-11-29T07:43:37.6060380Z prediction, rel=1e-4
2024-11-29T07:43:37.6060763Z ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.6061539Z E AssertionError: Model gpt2 on lite samples, for eval lighteval|agieval:sat-en-without-passage|0, metric acc_stderr incorrect
2024-11-29T07:43:37.6062164Z E assert 0.13333 == 0.00388625951...6192 ± 3.9e-07
2024-11-29T07:43:37.6062445Z E comparison failed
2024-11-29T07:43:37.6062667Z E Obtained: 0.13333
2024-11-29T07:43:37.6062946Z E Expected: 0.0038862595143676192 ± 3.9e-07
2024-11-29T07:43:37.6063127Z
2024-11-29T07:43:37.6063227Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6063669Z _ test_model_prediction[gpt2_lite_lighteval|agieval:sat-en-without-passage|0_acc_norm_stderr] _
2024-11-29T07:43:37.6064012Z
2024-11-29T07:43:37.6064713Z model_input = ('gpt2', 'lite', 'lighteval|agieval:sat-en-without-passage|0', 'acc_norm_stderr', functools.partial(<functools._lru_ca...racking_shuffled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.15275)
2024-11-29T07:43:37.6065503Z
2024-11-29T07:43:37.6065659Z @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.6066028Z def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.6066447Z """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.6066945Z model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.6067435Z prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.6067815Z > assert reference == approx(
2024-11-29T07:43:37.6068064Z prediction, rel=1e-4
2024-11-29T07:43:37.6068448Z ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.6069118Z E AssertionError: Model gpt2 on lite samples, for eval lighteval|agieval:sat-en-without-passage|0, metric acc_norm_stderr incorrect
2024-11-29T07:43:37.6069728Z E assert 0.15275 == 0.004538042951960014 ± 4.5e-07
2024-11-29T07:43:37.6070009Z E comparison failed
2024-11-29T07:43:37.6070228Z E Obtained: 0.15275
2024-11-29T07:43:37.6070504Z E Expected: 0.004538042951960014 ± 4.5e-07
2024-11-29T07:43:37.6070689Z
2024-11-29T07:43:37.6070784Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6071154Z ____ test_model_prediction[gpt2_lite_lighteval|agieval:sat-en|0_acc_stderr] ____
2024-11-29T07:43:37.6071441Z
2024-11-29T07:43:37.6072141Z model_input = ('gpt2', 'lite', 'lighteval|agieval:sat-en|0', 'acc_stderr', functools.partial(<functools._lru_cache_wrapper object at...racking_shuffled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.13333)
2024-11-29T07:43:37.6072913Z
2024-11-29T07:43:37.6073073Z @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.6073440Z def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.6073845Z """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.6074341Z model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.6074835Z prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.6075212Z > assert reference == approx(
2024-11-29T07:43:37.6075462Z prediction, rel=1e-4
2024-11-29T07:43:37.6075840Z ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.6076568Z E AssertionError: Model gpt2 on lite samples, for eval lighteval|agieval:sat-en|0, metric acc_stderr incorrect
2024-11-29T07:43:37.6077117Z E assert 0.13333 == 0.00388625951...6192 ± 3.9e-07
2024-11-29T07:43:37.6077672Z E comparison failed
2024-11-29T07:43:37.6077892Z E Obtained: 0.13333
2024-11-29T07:43:37.6078188Z E Expected: 0.0038862595143676192 ± 3.9e-07
2024-11-29T07:43:37.6078376Z
2024-11-29T07:43:37.6078638Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6079015Z _ test_model_prediction[gpt2_lite_lighteval|agieval:sat-en|0_acc_norm_stderr] __
2024-11-29T07:43:37.6079312Z
2024-11-29T07:43:37.6080014Z model_input = ('gpt2', 'lite', 'lighteval|agieval:sat-en|0', 'acc_norm_stderr', functools.partial(<functools._lru_cache_wrapper obje...racking_shuffled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.15275)
2024-11-29T07:43:37.6080787Z
2024-11-29T07:43:37.6080953Z @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.6081325Z def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.6081732Z """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.6082302Z model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.6082800Z prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.6083183Z > assert reference == approx(
2024-11-29T07:43:37.6083436Z prediction, rel=1e-4
2024-11-29T07:43:37.6083819Z ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.6084451Z E AssertionError: Model gpt2 on lite samples, for eval lighteval|agieval:sat-en|0, metric acc_norm_stderr incorrect
2024-11-29T07:43:37.6085030Z E assert 0.15275 == 0.004538042951960014 ± 4.5e-07
2024-11-29T07:43:37.6085351Z E comparison failed
2024-11-29T07:43:37.6085761Z E Obtained: 0.15275
2024-11-29T07:43:37.6086114Z E Expected: 0.004538042951960014 ± 4.5e-07
2024-11-29T07:43:37.6086318Z
2024-11-29T07:43:37.6086419Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6086787Z ___ test_model_prediction[gpt2_lite_lighteval|agieval:sat-math|0_acc_stderr] ___
2024-11-29T07:43:37.6087080Z
2024-11-29T07:43:37.6087779Z model_input = ('gpt2', 'lite', 'lighteval|agieval:sat-math|0', 'acc_stderr', functools.partial(<functools._lru_cache_wrapper object ...racking_shuffled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.15275)
2024-11-29T07:43:37.6088561Z
2024-11-29T07:43:37.6088725Z @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.6089096Z def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.6089506Z """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.6090007Z model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.6090497Z prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.6090878Z > assert reference == approx(
2024-11-29T07:43:37.6091129Z prediction, rel=1e-4
2024-11-29T07:43:37.6091637Z ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.6092259Z E AssertionError: Model gpt2 on lite samples, for eval lighteval|agieval:sat-math|0, metric acc_stderr incorrect
2024-11-29T07:43:37.6092817Z E assert 0.15275 == 0.004664521171326971 ± 4.7e-07
2024-11-29T07:43:37.6093091Z E comparison failed
2024-11-29T07:43:37.6093712Z E Obtained: 0.15275
2024-11-29T07:43:37.6094005Z E Expected: 0.004664521171326971 ± 4.7e-07
2024-11-29T07:43:37.6094190Z
2024-11-29T07:43:37.6094292Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6094849Z _ test_model_prediction[gpt2_lite_lighteval|agieval:sat-math|0_acc_norm_stderr] _
2024-11-29T07:43:37.6095151Z
2024-11-29T07:43:37.6095842Z model_input = ('gpt2', 'lite', 'lighteval|agieval:sat-math|0', 'acc_norm_stderr', functools.partial(<functools._lru_cache_wrapper ob...ch:tracking_shuffled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.1)
2024-11-29T07:43:37.6096799Z
2024-11-29T07:43:37.6097112Z @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.6097485Z def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.6097899Z """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.6098403Z model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.6098890Z prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.6099269Z > assert reference == approx(
2024-11-29T07:43:37.6099522Z prediction, rel=1e-4
2024-11-29T07:43:37.6099908Z ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.6100552Z E AssertionError: Model gpt2 on lite samples, for eval lighteval|agieval:sat-math|0, metric acc_norm_stderr incorrect
2024-11-29T07:43:37.6101128Z E assert 0.1 == 0.00301340183...0794 ± 3.0e-07
2024-11-29T07:43:37.6101406Z E comparison failed
2024-11-29T07:43:37.6101624Z E Obtained: 0.1
2024-11-29T07:43:37.6101896Z E Expected: 0.0030134018302560794 ± 3.0e-07
2024-11-29T07:43:37.6102083Z
2024-11-29T07:43:37.6102183Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6102571Z _ test_model_prediction[gpt2_lite_lighteval|bigbench:causal_judgment|3_acc_stderr] _
2024-11-29T07:43:37.6103084Z
2024-11-29T07:43:37.6103794Z model_input = ('gpt2', 'lite', 'lighteval|bigbench:causal_judgment|3', 'acc_stderr', functools.partial(<functools._lru_cache_wrapper...fled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.16666666666666666)
2024-11-29T07:43:37.6104594Z
2024-11-29T07:43:37.6104752Z @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.6105116Z def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.6105523Z """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.6106021Z model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.6106504Z prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.6106879Z > assert reference == approx(
2024-11-29T07:43:37.6107127Z prediction, rel=1e-4
2024-11-29T07:43:37.6107505Z ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.6108147Z E AssertionError: Model gpt2 on lite samples, for eval lighteval|bigbench:causal_judgment|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6108769Z E assert 0.16666666666666666 == 0.004861068811484776 ± 4.9e-07
2024-11-29T07:43:37.6109079Z E comparison failed
2024-11-29T07:43:37.6109306Z E Obtained: 0.16666666666666666
2024-11-29T07:43:37.6109616Z E Expected: 0.004861068811484776 ± 4.9e-07
2024-11-29T07:43:37.6109803Z
2024-11-29T07:43:37.6109903Z tests/test_main.py:134: AssertionError
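The bracketed case names in each failure header come from pytest parametrization with explicit ids; a simplified sketch of that mechanism (not lighteval's actual test setup, with the prediction loading stubbed out):

```python
# Simplified sketch of the parametrization behind ids like
# "gpt2_lite_helm|boolq|5_pqem_stderr"; not lighteval's actual test code.
import pytest
from pytest import approx

# Each tuple is (model, split, task, metric, reference); the real suite builds many of these.
parameters = [
    ("gpt2", "lite", "helm|boolq|5", "pqem_stderr", 0.16329931618554522),
]
ids = ["_".join(p[:4]) for p in parameters]

@pytest.mark.parametrize("model_input", parameters, ids=ids)
def test_model_prediction(model_input):
    model_name, test_type, eval_name, metric, reference = model_input
    prediction = reference  # stub; the real test reads get_predictions()["results"]
    assert reference == approx(
        prediction, rel=1e-4
    ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
```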
2024-11-29T07:43:37.6110282Z _ test_model_prediction[gpt2_lite_harness|bigbench:causal_judgment|3_acc_stderr] _
2024-11-29T07:43:37.6110604Z
2024-11-29T07:43:37.6111390Z model_input = ('gpt2', 'lite', 'harness|bigbench:causal_judgment|3', 'acc_stderr', functools.partial(<functools._lru_cache_wrapper o...tracking_shuffled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.1633)
2024-11-29T07:43:37.6112167Z
2024-11-29T07:43:37.6112326Z @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.6112855Z def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.6113265Z """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.6113759Z model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.6114248Z prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.6114740Z > assert reference == approx(
2024-11-29T07:43:37.6114989Z prediction, rel=1e-4
2024-11-29T07:43:37.6115369Z ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.6116001Z E AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:causal_judgment|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6116577Z E assert 0.1633 == 0.004699965923246645 ± 4.7e-07
2024-11-29T07:43:37.6116853Z E comparison failed
2024-11-29T07:43:37.6117079Z E Obtained: 0.1633
2024-11-29T07:43:37.6117602Z E Expected: 0.004699965923246645 ± 4.7e-07
2024-11-29T07:43:37.6117793Z
2024-11-29T07:43:37.6117887Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6118276Z _ test_model_prediction[gpt2_lite_harness|bigbench:causal_judgment|3_acc_norm_stderr] _
2024-11-29T07:43:37.6118595Z
2024-11-29T07:43:37.6119279Z model_input = ('gpt2', 'lite', 'harness|bigbench:causal_judgment|3', 'acc_norm_stderr', functools.partial(<functools._lru_cache_wrap...fled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.16666666666666666)
2024-11-29T07:43:37.6120057Z
2024-11-29T07:43:37.6120212Z @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.6120579Z def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.6120987Z """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.6121485Z model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.6121973Z prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.6122347Z > assert reference == approx(
2024-11-29T07:43:37.6122594Z prediction, rel=1e-4
2024-11-29T07:43:37.6122973Z ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.6123630Z E AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:causal_judgment|3, metric acc_norm_stderr incorrect
2024-11-29T07:43:37.6124246Z E assert 0.16666666666666666 == 0.004861068811484776 ± 4.9e-07
2024-11-29T07:43:37.6124553Z E comparison failed
2024-11-29T07:43:37.6124780Z E Obtained: 0.16666666666666666
2024-11-29T07:43:37.6125086Z E Expected: 0.004861068811484776 ± 4.9e-07
2024-11-29T07:43:37.6125265Z
2024-11-29T07:43:37.6125362Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6125756Z _ test_model_prediction[gpt2_lite_harness|bigbench:date_understanding|3_acc_stderr] _
2024-11-29T07:43:37.6126064Z
2024-11-29T07:43:37.6126756Z model_input = ('gpt2', 'lite', 'harness|bigbench:date_understanding|3', 'acc_stderr', functools.partial(<functools._lru_cache_wrappe...fled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.13333333333333333)
2024-11-29T07:43:37.6127527Z
2024-11-29T07:43:37.6127697Z @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.6128062Z def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.6128461Z """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.6128955Z model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.6129443Z prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.6129823Z > assert reference == approx(
2024-11-29T07:43:37.6130228Z prediction, rel=1e-4
2024-11-29T07:43:37.6130609Z ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.6131249Z E AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:date_understanding|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6131858Z E assert 0.13333333333333333 == 0.003987436475939113 ± 4.0e-07
2024-11-29T07:43:37.6132302Z E comparison failed
2024-11-29T07:43:37.6132530Z E Obtained: 0.13333333333333333
2024-11-29T07:43:37.6132839Z E Expected: 0.003987436475939113 ± 4.0e-07
2024-11-29T07:43:37.6133019Z
2024-11-29T07:43:37.6133118Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6133519Z _ test_model_prediction[gpt2_lite_harness|bigbench:date_understanding|3_acc_norm_stderr] _
2024-11-29T07:43:37.6133837Z
2024-11-29T07:43:37.6134530Z model_input = ('gpt2', 'lite', 'harness|bigbench:date_understanding|3', 'acc_norm_stderr', functools.partial(<functools._lru_cache_w...fled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.13333333333333333)
2024-11-29T07:43:37.6135321Z
2024-11-29T07:43:37.6135482Z @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.6135846Z def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.6136249Z """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.6136744Z model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.6137232Z prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.6137606Z > assert reference == approx(
2024-11-29T07:43:37.6137849Z prediction, rel=1e-4
2024-11-29T07:43:37.6138232Z ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.6138891Z E AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:date_understanding|3, metric acc_norm_stderr incorrect
2024-11-29T07:43:37.6139524Z E assert 0.13333333333333333 == 0.003987436475939113 ± 4.0e-07
2024-11-29T07:43:37.6139828Z E comparison failed
2024-11-29T07:43:37.6140052Z E Obtained: 0.13333333333333333
2024-11-29T07:43:37.6140356Z E Expected: 0.003987436475939113 ± 4.0e-07
2024-11-29T07:43:37.6140546Z
2024-11-29T07:43:37.6140640Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6141028Z _ test_model_prediction[gpt2_lite_lighteval|bigbench:disambiguation_qa|3_acc_stderr] _
2024-11-29T07:43:37.6141338Z
2024-11-29T07:43:37.6142028Z model_input = ('gpt2', 'lite', 'lighteval|bigbench:disambiguation_qa|3', 'acc_stderr', functools.partial(<functools._lru_cache_wrapp...fled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.15275252316519466)
2024-11-29T07:43:37.6142801Z
2024-11-29T07:43:37.6142961Z @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.6143326Z def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.6143731Z """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.6144220Z model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.6144705Z prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.6145083Z > assert reference == approx(
2024-11-29T07:43:37.6145328Z prediction, rel=1e-4
2024-11-29T07:43:37.6145705Z ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.6146351Z E AssertionError: Model gpt2 on lite samples, for eval lighteval|bigbench:disambiguation_qa|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6146974Z E assert 0.15275252316519466 == 0.004650507199996266 ± 4.7e-07
2024-11-29T07:43:37.6147396Z E comparison failed
2024-11-29T07:43:37.6147627Z E Obtained: 0.15275252316519466
2024-11-29T07:43:37.6147935Z E Expected: 0.004650507199996266 ± 4.7e-07
2024-11-29T07:43:37.6148120Z
2024-11-29T07:43:37.6148215Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6148599Z _ test_model_prediction[gpt2_lite_harness|bigbench:disambiguation_qa|3_acc_stderr] _
2024-11-29T07:43:37.6149024Z
2024-11-29T07:43:37.6149713Z model_input = ('gpt2', 'lite', 'harness|bigbench:disambiguation_qa|3', 'acc_stderr', functools.partial(<functools._lru_cache_wrapper...fled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.15275252316519466)
2024-11-29T07:43:37.6150485Z
2024-11-29T07:43:37.6150642Z @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.6151004Z def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.6151416Z """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.6151909Z model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.6152391Z prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.6152766Z > assert reference == approx(
2024-11-29T07:43:37.6153012Z prediction, rel=1e-4
2024-11-29T07:43:37.6153397Z ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.6154039Z E AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:disambiguation_qa|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6154651Z E assert 0.15275252316519466 == 0.004650507199996266 ± 4.7e-07
2024-11-29T07:43:37.6154958Z E comparison failed
2024-11-29T07:43:37.6155180Z E Obtained: 0.15275252316519466
2024-11-29T07:43:37.6155480Z E Expected: 0.004650507199996266 ± 4.7e-07
2024-11-29T07:43:37.6155671Z
2024-11-29T07:43:37.6155771Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6156170Z _ test_model_prediction[gpt2_lite_harness|bigbench:disambiguation_qa|3_acc_norm_stderr] _
2024-11-29T07:43:37.6156492Z
2024-11-29T07:43:37.6157172Z model_input = ('gpt2', 'lite', 'harness|bigbench:disambiguation_qa|3', 'acc_norm_stderr', functools.partial(<functools._lru_cache_wr...fled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.15275252316519464)
2024-11-29T07:43:37.6158198Z
2024-11-29T07:43:37.6158358Z @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.6158724Z def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.6159130Z """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.6159621Z model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.6160110Z prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.6160489Z > assert reference == approx(
2024-11-29T07:43:37.6160804Z prediction, rel=1e-4
2024-11-29T07:43:37.6161185Z ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.6161857Z E AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:disambiguation_qa|3, metric acc_norm_stderr incorrect
2024-11-29T07:43:37.6162500Z E assert 0.15275252316519464 == 0.004582439170218064 ± 4.6e-07
2024-11-29T07:43:37.6162807Z E comparison failed
2024-11-29T07:43:37.6163031Z E Obtained: 0.15275252316519464
2024-11-29T07:43:37.6163334Z E Expected: 0.004582439170218064 ± 4.6e-07
2024-11-29T07:43:37.6163514Z
2024-11-29T07:43:37.6163614Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6163999Z _ test_model_prediction[gpt2_lite_lighteval|bigbench:geometric_shapes|3_acc_stderr] _
2024-11-29T07:43:37.6164308Z
2024-11-29T07:43:37.6165169Z model_input = ('gpt2', 'lite', 'lighteval|bigbench:geometric_shapes|3', 'acc_stderr', functools.partial(<functools._lru_cache_wrappe...fled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.13333333333333333)
2024-11-29T07:43:37.6165949Z
2024-11-29T07:43:37.6166113Z @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.6166612Z def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.6167021Z """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.6167511Z model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.6167997Z prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.6168365Z > assert reference == approx(
2024-11-29T07:43:37.6168610Z prediction, rel=1e-4
2024-11-29T07:43:37.6168995Z ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.6169641Z E AssertionError: Model gpt2 on lite samples, for eval lighteval|bigbench:geometric_shapes|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6170268Z E assert 0.13333333333333333 == 0.004000255247111385 ± 4.0e-07
2024-11-29T07:43:37.6170575Z E comparison failed
2024-11-29T07:43:37.6170799Z E Obtained: 0.13333333333333333
2024-11-29T07:43:37.6171110Z E Expected: 0.004000255247111385 ± 4.0e-07
2024-11-29T07:43:37.6171292Z
2024-11-29T07:43:37.6171393Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6171788Z _ test_model_prediction[gpt2_lite_harness|bigbench:geometric_shapes|3_acc_norm_stderr] _
2024-11-29T07:43:37.6172109Z
2024-11-29T07:43:37.6172820Z model_input = ('gpt2', 'lite', 'harness|bigbench:geometric_shapes|3', 'acc_norm_stderr', functools.partial(<functools._lru_cache_wra...fled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.13333333333333333)
2024-11-29T07:43:37.6173581Z
2024-11-29T07:43:37.6173744Z @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.6174110Z def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.6174517Z """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.6175001Z model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.6175492Z prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.6175863Z > assert reference == approx(
2024-11-29T07:43:37.6176109Z prediction, rel=1e-4
2024-11-29T07:43:37.6176488Z ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.6177135Z E AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:geometric_shapes|3, metric acc_norm_stderr incorrect
2024-11-29T07:43:37.6177761Z E assert 0.13333333333333333 == 0.004041744140305727 ± 4.0e-07
2024-11-29T07:43:37.6178069Z E comparison failed
2024-11-29T07:43:37.6178301Z E Obtained: 0.13333333333333333
2024-11-29T07:43:37.6178606Z E Expected: 0.004041744140305727 ± 4.0e-07
2024-11-29T07:43:37.6178790Z
2024-11-29T07:43:37.6178885Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6179325Z _ test_model_prediction[gpt2_lite_lighteval|bigbench:logical_deduction_five_objects|3_acc_stderr] _
2024-11-29T07:43:37.6179687Z
2024-11-29T07:43:37.6180377Z model_input = ('gpt2', 'lite', 'lighteval|bigbench:logical_deduction_five_objects|3', 'acc_stderr', functools.partial(<functools._lr...ch:tracking_shuffled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.1)
2024-11-29T07:43:37.6181150Z
2024-11-29T07:43:37.6181310Z @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.6181818Z def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.6182228Z """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.6182717Z model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.6183203Z prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.6183717Z > assert reference == approx(
2024-11-29T07:43:37.6183971Z prediction, rel=1e-4
2024-11-29T07:43:37.6184347Z ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.6185043Z E AssertionError: Model gpt2 on lite samples, for eval lighteval|bigbench:logical_deduction_five_objects|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6185660Z E assert 0.1 == 0.003030251214408201 ± 3.0e-07
2024-11-29T07:43:37.6185930Z E comparison failed
2024-11-29T07:43:37.6186151Z E Obtained: 0.1
2024-11-29T07:43:37.6186424Z E Expected: 0.003030251214408201 ± 3.0e-07
2024-11-29T07:43:37.6186608Z
2024-11-29T07:43:37.6186702Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6187125Z _ test_model_prediction[gpt2_lite_harness|bigbench:logical_deduction_five_objects|3_acc_stderr] _
2024-11-29T07:43:37.6187480Z
2024-11-29T07:43:37.6188160Z model_input = ('gpt2', 'lite', 'harness|bigbench:logical_deduction_five_objects|3', 'acc_stderr', functools.partial(<functools._lru_...fled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.15275252316519464)
2024-11-29T07:43:37.6188934Z
2024-11-29T07:43:37.6189090Z @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.6189454Z def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.6189857Z """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.6190351Z model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.6190836Z prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.6191208Z > assert reference == approx(
2024-11-29T07:43:37.6191456Z prediction, rel=1e-4
2024-11-29T07:43:37.6191871Z ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.6192549Z E AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:logical_deduction_five_objects|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6193200Z E assert 0.15275252316519464 == 0.004381117916034022 ± 4.4e-07
2024-11-29T07:43:37.6193507Z E comparison failed
2024-11-29T07:43:37.6193735Z E Obtained: 0.15275252316519464
2024-11-29T07:43:37.6194044Z E Expected: 0.004381117916034022 ± 4.4e-07
2024-11-29T07:43:37.6194226Z
2024-11-29T07:43:37.6194324Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6194767Z _ test_model_prediction[gpt2_lite_harness|bigbench:logical_deduction_five_objects|3_acc_norm_stderr] _
2024-11-29T07:43:37.6195122Z
2024-11-29T07:43:37.6195831Z model_input = ('gpt2', 'lite', 'harness|bigbench:logical_deduction_five_objects|3', 'acc_norm_stderr', functools.partial(<functools....fled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.15275252316519464)
2024-11-29T07:43:37.6196601Z
2024-11-29T07:43:37.6196763Z @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.6197125Z def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.6197921Z """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.6198415Z model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.6198897Z prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.6199439Z > assert reference == approx(
2024-11-29T07:43:37.6199695Z prediction, rel=1e-4
2024-11-29T07:43:37.6200072Z ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.6200769Z E AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:logical_deduction_five_objects|3, metric acc_norm_stderr incorrect
2024-11-29T07:43:37.6201455Z E assert 0.15275252316519464 == 0.004480319549235682 ± 4.5e-07
2024-11-29T07:43:37.6201929Z E comparison failed
2024-11-29T07:43:37.6202157Z E Obtained: 0.15275252316519464
2024-11-29T07:43:37.6202468Z E Expected: 0.004480319549235682 ± 4.5e-07
2024-11-29T07:43:37.6202650Z
2024-11-29T07:43:37.6202750Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6203184Z _ test_model_prediction[gpt2_lite_lighteval|bigbench:logical_deduction_seven_objects|3_acc_stderr] _
2024-11-29T07:43:37.6203543Z
2024-11-29T07:43:37.6204246Z model_input = ('gpt2', 'lite', 'lighteval|bigbench:logical_deduction_seven_objects|3', 'acc_stderr', functools.partial(<functools._l...fled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.13333333333333333)
2024-11-29T07:43:37.6205014Z
2024-11-29T07:43:37.6205176Z @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.6205541Z def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.6205955Z """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.6206441Z model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.6206929Z prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.6207299Z > assert reference == approx(
2024-11-29T07:43:37.6207546Z prediction, rel=1e-4
2024-11-29T07:43:37.6207922Z ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.6208624Z E AssertionError: Model gpt2 on lite samples, for eval lighteval|bigbench:logical_deduction_seven_objects|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6209284Z E assert 0.13333333333333333 == 0.00393258505632432 ± 3.9e-07
2024-11-29T07:43:37.6209590Z E comparison failed
2024-11-29T07:43:37.6209814Z E Obtained: 0.13333333333333333
2024-11-29T07:43:37.6210130Z E Expected: 0.00393258505632432 ± 3.9e-07
2024-11-29T07:43:37.6210313Z
2024-11-29T07:43:37.6210408Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6210838Z _ test_model_prediction[gpt2_lite_harness|bigbench:logical_deduction_seven_objects|3_acc_stderr] _
2024-11-29T07:43:37.6211194Z
2024-11-29T07:43:37.6211880Z model_input = ('gpt2', 'lite', 'harness|bigbench:logical_deduction_seven_objects|3', 'acc_stderr', functools.partial(<functools._lru...fled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.13333333333333333)
2024-11-29T07:43:37.6212657Z
2024-11-29T07:43:37.6212816Z @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.6213180Z def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.6213585Z """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.6214073Z model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.6214569Z prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.6214948Z > assert reference == approx(
2024-11-29T07:43:37.6215194Z prediction, rel=1e-4
2024-11-29T07:43:37.6215570Z ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.6216246Z E AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:logical_deduction_seven_objects|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6217029Z E assert 0.13333333333333333 == 0.004006756056224812 ± 4.0e-07
2024-11-29T07:43:37.6217340Z E comparison failed
2024-11-29T07:43:37.6217561Z E Obtained: 0.13333333333333333
2024-11-29T07:43:37.6217868Z E Expected: 0.004006756056224812 ± 4.0e-07
2024-11-29T07:43:37.6218054Z
2024-11-29T07:43:37.6218146Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6218588Z _ test_model_prediction[gpt2_lite_harness|bigbench:logical_deduction_seven_objects|3_acc_norm_stderr] _
2024-11-29T07:43:37.6219068Z
2024-11-29T07:43:37.6219757Z model_input = ('gpt2', 'lite', 'harness|bigbench:logical_deduction_seven_objects|3', 'acc_norm_stderr', functools.partial(<functools...fled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.13333333333333333)
2024-11-29T07:43:37.6220527Z
2024-11-29T07:43:37.6220683Z @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.6221058Z def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.6221463Z """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.6221955Z model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.6222442Z prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.6222814Z > assert reference == approx(
2024-11-29T07:43:37.6223064Z prediction, rel=1e-4
2024-11-29T07:43:37.6223443Z ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.6224140Z E AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:logical_deduction_seven_objects|3, metric acc_norm_stderr incorrect
2024-11-29T07:43:37.6224811Z E assert 0.13333333333333333 == 0.004006756056224812 ± 4.0e-07
2024-11-29T07:43:37.6225114Z E comparison failed
2024-11-29T07:43:37.6225338Z E Obtained: 0.13333333333333333
2024-11-29T07:43:37.6225650Z E Expected: 0.004006756056224812 ± 4.0e-07
2024-11-29T07:43:37.6225834Z
2024-11-29T07:43:37.6225932Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6226366Z _ test_model_prediction[gpt2_lite_lighteval|bigbench:logical_deduction_three_objects|3_acc_stderr] _
2024-11-29T07:43:37.6226720Z
2024-11-29T07:43:37.6227418Z model_input = ('gpt2', 'lite', 'lighteval|bigbench:logical_deduction_three_objects|3', 'acc_stderr', functools.partial(<functools._l...tracking_shuffled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.1633)
2024-11-29T07:43:37.6228200Z
2024-11-29T07:43:37.6228360Z @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.6228724Z def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.6229125Z """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.6229617Z model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.6230108Z prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.6230484Z > assert reference == approx(
2024-11-29T07:43:37.6230731Z prediction, rel=1e-4
2024-11-29T07:43:37.6231109Z ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.6231803Z E AssertionError: Model gpt2 on lite samples, for eval lighteval|bigbench:logical_deduction_three_objects|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6232425Z E assert 0.1633 == 0.00504952053...2955 ± 5.0e-07
2024-11-29T07:43:37.6232699Z E comparison failed
2024-11-29T07:43:37.6232920Z E Obtained: 0.1633
2024-11-29T07:43:37.6233202Z E Expected: 0.0050495205374032955 ± 5.0e-07
2024-11-29T07:43:37.6233382Z
2024-11-29T07:43:37.6233481Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6234031Z _ test_model_prediction[gpt2_lite_harness|bigbench:logical_deduction_three_objects|3_acc_stderr] _
2024-11-29T07:43:37.6234387Z
2024-11-29T07:43:37.6235076Z model_input = ('gpt2', 'lite', 'harness|bigbench:logical_deduction_three_objects|3', 'acc_stderr', functools.partial(<functools._lru...fled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.13333333333333333)
2024-11-29T07:43:37.6235970Z
2024-11-29T07:43:37.6236131Z @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.6236492Z def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.6236896Z """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.6237734Z model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.6238236Z prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.6238625Z > assert reference == approx(
2024-11-29T07:43:37.6238879Z prediction, rel=1e-4
2024-11-29T07:43:37.6239258Z ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.6239939Z E AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:logical_deduction_three_objects|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6240612Z E assert 0.13333333333333333 == 0.00393258505632432 ± 3.9e-07
2024-11-29T07:43:37.6240925Z E comparison failed
2024-11-29T07:43:37.6241150Z E Obtained: 0.13333333333333333
2024-11-29T07:43:37.6241456Z E Expected: 0.00393258505632432 ± 3.9e-07
2024-11-29T07:43:37.6241642Z
2024-11-29T07:43:37.6241736Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6242190Z _ test_model_prediction[gpt2_lite_harness|bigbench:logical_deduction_three_objects|3_acc_norm_stderr] _
2024-11-29T07:43:37.6242559Z
2024-11-29T07:43:37.6243287Z model_input = ('gpt2', 'lite', 'harness|bigbench:logical_deduction_three_objects|3', 'acc_norm_stderr', functools.partial(<functools...fled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.15275252316519466)
2024-11-29T07:43:37.6244062Z
2024-11-29T07:43:37.6247685Z E AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:logical_deduction_three_objects|3, metric acc_norm_stderr incorrect
2024-11-29T07:43:37.6248367Z E assert 0.15275252316519466 == 0.004569520819574033 ± 4.6e-07
2024-11-29T07:43:37.6248672Z E comparison failed
2024-11-29T07:43:37.6248892Z E Obtained: 0.15275252316519466
2024-11-29T07:43:37.6249199Z E Expected: 0.004569520819574033 ± 4.6e-07
2024-11-29T07:43:37.6249391Z
2024-11-29T07:43:37.6249485Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6249885Z _ test_model_prediction[gpt2_lite_lighteval|bigbench:movie_recommendation|3_acc_stderr] _
2024-11-29T07:43:37.6250209Z
2024-11-29T07:43:37.6250918Z model_input = ('gpt2', 'lite', 'lighteval|bigbench:movie_recommendation|3', 'acc_stderr', functools.partial(<functools._lru_cache_wr...fled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.15275252316519466)
2024-11-29T07:43:37.6251861Z
2024-11-29T07:43:37.6255567Z E AssertionError: Model gpt2 on lite samples, for eval lighteval|bigbench:movie_recommendation|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6256205Z E assert 0.15275252316519466 == 0.004428245629971239 ± 4.4e-07
2024-11-29T07:43:37.6256509Z E comparison failed
2024-11-29T07:43:37.6256735Z E Obtained: 0.15275252316519466
2024-11-29T07:43:37.6257047Z E Expected: 0.004428245629971239 ± 4.4e-07
2024-11-29T07:43:37.6257227Z
2024-11-29T07:43:37.6257326Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6257714Z _ test_model_prediction[gpt2_lite_harness|bigbench:movie_recommendation|3_acc_stderr] _
2024-11-29T07:43:37.6258039Z
2024-11-29T07:43:37.6258734Z model_input = ('gpt2', 'lite', 'harness|bigbench:movie_recommendation|3', 'acc_stderr', functools.partial(<functools._lru_cache_wrap...fled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.16666666666666666)
2024-11-29T07:43:37.6259529Z
2024-11-29T07:43:37.6263080Z E AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:movie_recommendation|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6263697Z E assert 0.16666666666666666 == 0.004907190695349086 ± 4.9e-07
2024-11-29T07:43:37.6264006Z E comparison failed
2024-11-29T07:43:37.6264231Z E Obtained: 0.16666666666666666
2024-11-29T07:43:37.6264542Z E Expected: 0.004907190695349086 ± 4.9e-07
2024-11-29T07:43:37.6264722Z
2024-11-29T07:43:37.6264821Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6265233Z _ test_model_prediction[gpt2_lite_harness|bigbench:movie_recommendation|3_acc_norm_stderr] _
2024-11-29T07:43:37.6265565Z
2024-11-29T07:43:37.6266261Z model_input = ('gpt2', 'lite', 'harness|bigbench:movie_recommendation|3', 'acc_norm_stderr', functools.partial(<functools._lru_cache...fled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.15275252316519464)
2024-11-29T07:43:37.6267056Z
2024-11-29T07:43:37.6270760Z E AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:movie_recommendation|3, metric acc_norm_stderr incorrect
2024-11-29T07:43:37.6271508Z E assert 0.15275252316519464 == 0.004703372376466875 ± 4.7e-07
2024-11-29T07:43:37.6271815Z E comparison failed
2024-11-29T07:43:37.6272037Z E Obtained: 0.15275252316519464
2024-11-29T07:43:37.6272342Z E Expected: 0.004703372376466875 ± 4.7e-07
2024-11-29T07:43:37.6272530Z
2024-11-29T07:43:37.6272623Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6272993Z __ test_model_prediction[gpt2_lite_lighteval|bigbench:navigate|3_acc_stderr] ___
2024-11-29T07:43:37.6273293Z
2024-11-29T07:43:37.6274000Z model_input = ('gpt2', 'lite', 'lighteval|bigbench:navigate|3', 'acc_stderr', functools.partial(<functools._lru_cache_wrapper object...tracking_shuffled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.1633)
2024-11-29T07:43:37.6274805Z
2024-11-29T07:43:37.6278586Z E AssertionError: Model gpt2 on lite samples, for eval lighteval|bigbench:navigate|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6279140Z E assert 0.1633 == 0.00471801862010947 ± 4.7e-07
2024-11-29T07:43:37.6279414Z E comparison failed
2024-11-29T07:43:37.6279644Z E Obtained: 0.1633
2024-11-29T07:43:37.6279923Z E Expected: 0.00471801862010947 ± 4.7e-07
2024-11-29T07:43:37.6280108Z
2024-11-29T07:43:37.6280201Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6280572Z ___ test_model_prediction[gpt2_lite_harness|bigbench:navigate|3_acc_stderr] ____
2024-11-29T07:43:37.6280872Z
2024-11-29T07:43:37.6281571Z model_input = ('gpt2', 'lite', 'harness|bigbench:navigate|3', 'acc_stderr', functools.partial(<functools._lru_cache_wrapper object a...tracking_shuffled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.1633)
2024-11-29T07:43:37.6282354Z
2024-11-29T07:43:37.6285245Z E AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:navigate|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6285569Z E assert 0.1633 == 0.00471801862010947 ± 4.7e-07
2024-11-29T07:43:37.6285658Z E comparison failed
2024-11-29T07:43:37.6285744Z E Obtained: 0.1633
2024-11-29T07:43:37.6285889Z E Expected: 0.00471801862010947 ± 4.7e-07
2024-11-29T07:43:37.6285895Z
2024-11-29T07:43:37.6285994Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6286199Z _ test_model_prediction[gpt2_lite_harness|bigbench:navigate|3_acc_norm_stderr] _
2024-11-29T07:43:37.6286337Z
2024-11-29T07:43:37.6287035Z model_input = ('gpt2', 'lite', 'harness|bigbench:navigate|3', 'acc_norm_stderr', functools.partial(<functools._lru_cache_wrapper obj...tracking_shuffled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.1633)
2024-11-29T07:43:37.6287040Z
2024-11-29T07:43:37.6288660Z E AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:navigate|3, metric acc_norm_stderr incorrect
2024-11-29T07:43:37.6288826Z E assert 0.1633 == 0.00471801862010947 ± 4.7e-07
2024-11-29T07:43:37.6288910Z E comparison failed
2024-11-29T07:43:37.6288996Z E Obtained: 0.1633
2024-11-29T07:43:37.6289137Z E Expected: 0.00471801862010947 ± 4.7e-07
2024-11-29T07:43:37.6289148Z
2024-11-29T07:43:37.6289248Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6289514Z _ test_model_prediction[gpt2_lite_lighteval|bigbench:reasoning_about_colored_objects|3_acc_stderr] _
2024-11-29T07:43:37.6289518Z
2024-11-29T07:43:37.6290212Z model_input = ('gpt2', 'lite', 'lighteval|bigbench:reasoning_about_colored_objects|3', 'acc_stderr', functools.partial(<functools._l...fled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.13333333333333333)
2024-11-29T07:43:37.6290224Z
2024-11-29T07:43:37.6291926Z E AssertionError: Model gpt2 on lite samples, for eval lighteval|bigbench:reasoning_about_colored_objects|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6292138Z E assert 0.13333333333333333 == 0.004000255247111385 ± 4.0e-07
2024-11-29T07:43:37.6292223Z E comparison failed
2024-11-29T07:43:37.6292313Z E Obtained: 0.13333333333333333
2024-11-29T07:43:37.6292455Z E Expected: 0.004000255247111385 ± 4.0e-07
2024-11-29T07:43:37.6292460Z
2024-11-29T07:43:37.6292562Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6292945Z _ test_model_prediction[gpt2_lite_harness|bigbench:reasoning_about_colored_objects|3_acc_stderr] _
2024-11-29T07:43:37.6292951Z
2024-11-29T07:43:37.6293658Z model_input = ('gpt2', 'lite', 'harness|bigbench:reasoning_about_colored_objects|3', 'acc_stderr', functools.partial(<functools._lru...fled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.13333333333333333)
2024-11-29T07:43:37.6293663Z
2024-11-29T07:43:37.6295465Z E AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:reasoning_about_colored_objects|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6295673Z E assert 0.13333333333333333 == 0.00405961457...4385 ± 4.1e-07
2024-11-29T07:43:37.6295764Z E comparison failed
2024-11-29T07:43:37.6295855Z E Obtained: 0.13333333333333333
2024-11-29T07:43:37.6295998Z E Expected: 0.0040596145716644385 ± 4.1e-07
2024-11-29T07:43:37.6296004Z
2024-11-29T07:43:37.6296102Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6296373Z _ test_model_prediction[gpt2_lite_harness|bigbench:reasoning_about_colored_objects|3_acc_norm_stderr] _
2024-11-29T07:43:37.6296378Z
2024-11-29T07:43:37.6297082Z model_input = ('gpt2', 'lite', 'harness|bigbench:reasoning_about_colored_objects|3', 'acc_norm_stderr', functools.partial(<functools...ch:tracking_shuffled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.1)
2024-11-29T07:43:37.6297087Z
2024-11-29T07:43:37.6298763Z E AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:reasoning_about_colored_objects|3, metric acc_norm_stderr incorrect
2024-11-29T07:43:37.6298919Z E assert 0.1 == 0.002971327782118411 ± 3.0e-07
2024-11-29T07:43:37.6299003Z E comparison failed
2024-11-29T07:43:37.6299090Z E Obtained: 0.1
2024-11-29T07:43:37.6299231Z E Expected: 0.002971327782118411 ± 3.0e-07
2024-11-29T07:43:37.6299236Z
2024-11-29T07:43:37.6299341Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6299543Z _ test_model_prediction[gpt2_lite_lighteval|bigbench:ruin_names|3_acc_stderr] __
2024-11-29T07:43:37.6299549Z
2024-11-29T07:43:37.6300231Z model_input = ('gpt2', 'lite', 'lighteval|bigbench:ruin_names|3', 'acc_stderr', functools.partial(<functools._lru_cache_wrapper obje...fled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.15275252316519464)
2024-11-29T07:43:37.6300236Z
2024-11-29T07:43:37.6302075Z E AssertionError: Model gpt2 on lite samples, for eval lighteval|bigbench:ruin_names|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6302282Z E assert 0.15275252316519464 == 0.00459225857...0545 ± 4.6e-07
2024-11-29T07:43:37.6302366Z E comparison failed
2024-11-29T07:43:37.6302463Z E Obtained: 0.15275252316519464
2024-11-29T07:43:37.6302607Z E Expected: 0.0045922585770880545 ± 4.6e-07
2024-11-29T07:43:37.6302612Z
2024-11-29T07:43:37.6302710Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6302913Z __ test_model_prediction[gpt2_lite_harness|bigbench:ruin_names|3_acc_stderr] ___
2024-11-29T07:43:37.6302918Z
2024-11-29T07:43:37.6303604Z model_input = ('gpt2', 'lite', 'harness|bigbench:ruin_names|3', 'acc_stderr', functools.partial(<functools._lru_cache_wrapper object...fled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.13333333333333333)
2024-11-29T07:43:37.6303617Z
2024-11-29T07:43:37.6305214Z E AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:ruin_names|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6305414Z E assert 0.13333333333333333 == 0.004037827888116828 ± 4.0e-07
2024-11-29T07:43:37.6305498Z E comparison failed
2024-11-29T07:43:37.6305589Z E Obtained: 0.13333333333333333
2024-11-29T07:43:37.6305730Z E Expected: 0.004037827888116828 ± 4.0e-07
2024-11-29T07:43:37.6305741Z
2024-11-29T07:43:37.6305834Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6306045Z _ test_model_prediction[gpt2_lite_harness|bigbench:ruin_names|3_acc_norm_stderr] _
2024-11-29T07:43:37.6306055Z
2024-11-29T07:43:37.6306723Z model_input = ('gpt2', 'lite', 'harness|bigbench:ruin_names|3', 'acc_norm_stderr', functools.partial(<functools._lru_cache_wrapper o...fled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.13333333333333333)
2024-11-29T07:43:37.6306737Z
2024-11-29T07:43:37.6308469Z E AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:ruin_names|3, metric acc_norm_stderr incorrect
2024-11-29T07:43:37.6308669Z E assert 0.13333333333333333 == 0.004037827888116828 ± 4.0e-07
2024-11-29T07:43:37.6308873Z E comparison failed
2024-11-29T07:43:37.6308958Z E Obtained: 0.13333333333333333
2024-11-29T07:43:37.6309103Z E Expected: 0.004037827888116828 ± 4.0e-07
2024-11-29T07:43:37.6309113Z
2024-11-29T07:43:37.6309207Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6309483Z _ test_model_prediction[gpt2_lite_lighteval|bigbench:salient_translation_error_detection|3_acc_stderr] _
2024-11-29T07:43:37.6309493Z
2024-11-29T07:43:37.6310203Z model_input = ('gpt2', 'lite', 'lighteval|bigbench:salient_translation_error_detection|3', 'acc_stderr', functools.partial(<functool...tracking_shuffled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.1633)
2024-11-29T07:43:37.6310208Z
2024-11-29T07:43:37.6311909Z E AssertionError: Model gpt2 on lite samples, for eval lighteval|bigbench:salient_translation_error_detection|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6312068Z E assert 0.1633 == 0.00497231172432741 ± 5.0e-07
2024-11-29T07:43:37.6312157Z E comparison failed
2024-11-29T07:43:37.6312238Z E Obtained: 0.1633
2024-11-29T07:43:37.6312384Z E Expected: 0.00497231172432741 ± 5.0e-07
2024-11-29T07:43:37.6312395Z
2024-11-29T07:43:37.6312488Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6312762Z _ test_model_prediction[gpt2_lite_harness|bigbench:salient_translation_error_detection|3_acc_stderr] _
2024-11-29T07:43:37.6312766Z
2024-11-29T07:43:37.6313464Z model_input = ('gpt2', 'lite', 'harness|bigbench:salient_translation_error_detection|3', 'acc_stderr', functools.partial(<functools....ch:tracking_shuffled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.1)
2024-11-29T07:43:37.6313476Z
2024-11-29T07:43:37.6315147Z E AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:salient_translation_error_detection|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6315418Z E assert 0.1 == 0.00304553201...4616 ± 3.0e-07
2024-11-29T07:43:37.6315514Z E comparison failed
2024-11-29T07:43:37.6315593Z E Obtained: 0.1
2024-11-29T07:43:37.6315743Z E Expected: 0.0030455320167854616 ± 3.0e-07
2024-11-29T07:43:37.6315748Z
2024-11-29T07:43:37.6315842Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6316135Z _ test_model_prediction[gpt2_lite_harness|bigbench:salient_translation_error_detection|3_acc_norm_stderr] _
2024-11-29T07:43:37.6316250Z
2024-11-29T07:43:37.6316999Z model_input = ('gpt2', 'lite', 'harness|bigbench:salient_translation_error_detection|3', 'acc_norm_stderr', functools.partial(<funct...ch:tracking_shuffled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.1)
2024-11-29T07:43:37.6317004Z
2024-11-29T07:43:37.6318927Z E AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:salient_translation_error_detection|3, metric acc_norm_stderr incorrect
2024-11-29T07:43:37.6319091Z E assert 0.1 == 0.00304553201...4616 ± 3.0e-07
2024-11-29T07:43:37.6319178Z E comparison failed
2024-11-29T07:43:37.6319258Z E Obtained: 0.1
2024-11-29T07:43:37.6319409Z E Expected: 0.0030455320167854616 ± 3.0e-07
2024-11-29T07:43:37.6319414Z
2024-11-29T07:43:37.6319509Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6319712Z ___ test_model_prediction[gpt2_lite_lighteval|bigbench:snarks|3_acc_stderr] ____
2024-11-29T07:43:37.6319716Z
2024-11-29T07:43:37.6320412Z model_input = ('gpt2', 'lite', 'lighteval|bigbench:snarks|3', 'acc_stderr', functools.partial(<functools._lru_cache_wrapper object a...tracking_shuffled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.1633)
2024-11-29T07:43:37.6320423Z
2024-11-29T07:43:37.6322020Z E AssertionError: Model gpt2 on lite samples, for eval lighteval|bigbench:snarks|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6322180Z E assert 0.1633 == 0.004897013451149668 ± 4.9e-07
2024-11-29T07:43:37.6322279Z E comparison failed
2024-11-29T07:43:37.6322360Z E Obtained: 0.1633
2024-11-29T07:43:37.6322506Z E Expected: 0.004897013451149668 ± 4.9e-07
2024-11-29T07:43:37.6322511Z
2024-11-29T07:43:37.6322603Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6322805Z ____ test_model_prediction[gpt2_lite_harness|bigbench:snarks|3_acc_stderr] _____
2024-11-29T07:43:37.6322810Z
2024-11-29T07:43:37.6323667Z model_input = ('gpt2', 'lite', 'harness|bigbench:snarks|3', 'acc_stderr', functools.partial(<functools._lru_cache_wrapper object at ...tracking_shuffled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.1633)
2024-11-29T07:43:37.6323674Z
2024-11-29T07:43:37.6325428Z E AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:snarks|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6325589Z E assert 0.1633 == 0.004897013451149668 ± 4.9e-07
2024-11-29T07:43:37.6325677Z E comparison failed
2024-11-29T07:43:37.6325757Z E Obtained: 0.1633
2024-11-29T07:43:37.6325910Z E Expected: 0.004897013451149668 ± 4.9e-07
2024-11-29T07:43:37.6325915Z
2024-11-29T07:43:37.6326008Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6326217Z __ test_model_prediction[gpt2_lite_harness|bigbench:snarks|3_acc_norm_stderr] __
2024-11-29T07:43:37.6326223Z
2024-11-29T07:43:37.6326914Z model_input = ('gpt2', 'lite', 'harness|bigbench:snarks|3', 'acc_norm_stderr', functools.partial(<functools._lru_cache_wrapper objec...tracking_shuffled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.1633)
2024-11-29T07:43:37.6326928Z
2024-11-29T07:43:37.6328527Z E AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:snarks|3, metric acc_norm_stderr incorrect
2024-11-29T07:43:37.6328689Z E assert 0.1633 == 0.004897013451149668 ± 4.9e-07
2024-11-29T07:43:37.6328780Z E comparison failed
2024-11-29T07:43:37.6328860Z E Obtained: 0.1633
2024-11-29T07:43:37.6329004Z E Expected: 0.004897013451149668 ± 4.9e-07
2024-11-29T07:43:37.6329010Z
2024-11-29T07:43:37.6329102Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6329337Z _ test_model_prediction[gpt2_lite_lighteval|bigbench:sports_understanding|3_acc_stderr] _
2024-11-29T07:43:37.6329347Z
2024-11-29T07:43:37.6330043Z model_input = ('gpt2', 'lite', 'lighteval|bigbench:sports_understanding|3', 'acc_stderr', functools.partial(<functools._lru_cache_wr...tracking_shuffled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.1633)
2024-11-29T07:43:37.6330048Z
2024-11-29T07:43:37.6331909Z E AssertionError: Model gpt2 on lite samples, for eval lighteval|bigbench:sports_understanding|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6332183Z E assert 0.1633 == 0.005037214858781963 ± 5.0e-07
2024-11-29T07:43:37.6332271Z E comparison failed
2024-11-29T07:43:37.6332352Z E Obtained: 0.1633
2024-11-29T07:43:37.6332500Z E Expected: 0.005037214858781963 ± 5.0e-07
2024-11-29T07:43:37.6332505Z
2024-11-29T07:43:37.6332608Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6332843Z _ test_model_prediction[gpt2_lite_harness|bigbench:sports_understanding|3_acc_stderr] _
2024-11-29T07:43:37.6332847Z
2024-11-29T07:43:37.6333542Z model_input = ('gpt2', 'lite', 'harness|bigbench:sports_understanding|3', 'acc_stderr', functools.partial(<functools._lru_cache_wrap...tracking_shuffled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.1633)
2024-11-29T07:43:37.6333555Z
2024-11-29T07:43:37.6335174Z E AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:sports_understanding|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6335334Z E assert 0.1633 == 0.0049194014382352 ± 4.9e-07
2024-11-29T07:43:37.6335426Z E comparison failed
2024-11-29T07:43:37.6335505Z E Obtained: 0.1633
2024-11-29T07:43:37.6335651Z E Expected: 0.0049194014382352 ± 4.9e-07
2024-11-29T07:43:37.6335656Z
2024-11-29T07:43:37.6335751Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6335995Z _ test_model_prediction[gpt2_lite_harness|bigbench:sports_understanding|3_acc_norm_stderr] _
2024-11-29T07:43:37.6336000Z
2024-11-29T07:43:37.6336699Z model_input = ('gpt2', 'lite', 'harness|bigbench:sports_understanding|3', 'acc_norm_stderr', functools.partial(<functools._lru_cache...tracking_shuffled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.1633)
2024-11-29T07:43:37.6336705Z
2024-11-29T07:43:37.6338493Z E AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:sports_understanding|3, metric acc_norm_stderr incorrect
2024-11-29T07:43:37.6338654Z E assert 0.1633 == 0.0049194014382352 ± 4.9e-07
2024-11-29T07:43:37.6338742Z E comparison failed
2024-11-29T07:43:37.6338823Z E Obtained: 0.1633
2024-11-29T07:43:37.6339081Z E Expected: 0.0049194014382352 ± 4.9e-07
2024-11-29T07:43:37.6339086Z
2024-11-29T07:43:37.6339181Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6339400Z _ test_model_prediction[gpt2_lite_harness|bigbench:temporal_sequences|3_acc_stderr] _
2024-11-29T07:43:37.6339405Z
2024-11-29T07:43:37.6340097Z model_input = ('gpt2', 'lite', 'harness|bigbench:temporal_sequences|3', 'acc_stderr', functools.partial(<functools._lru_cache_wrappe...ch:tracking_shuffled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.1)
2024-11-29T07:43:37.6340110Z
2024-11-29T07:43:37.6341746Z E AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:temporal_sequences|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6341904Z E assert 0.1 == 0.00294961187...1973 ± 2.9e-07
2024-11-29T07:43:37.6341993Z E comparison failed
2024-11-29T07:43:37.6342072Z E Obtained: 0.1
2024-11-29T07:43:37.6342221Z E Expected: 0.0029496118745031973 ± 2.9e-07
2024-11-29T07:43:37.6342225Z
2024-11-29T07:43:37.6342320Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6342560Z _ test_model_prediction[gpt2_lite_harness|bigbench:temporal_sequences|3_acc_norm_stderr] _
2024-11-29T07:43:37.6342571Z
2024-11-29T07:43:37.6343261Z model_input = ('gpt2', 'lite', 'harness|bigbench:temporal_sequences|3', 'acc_norm_stderr', functools.partial(<functools._lru_cache_w...ch:tracking_shuffled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.1)
2024-11-29T07:43:37.6343266Z
2024-11-29T07:43:37.6344913Z E AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:temporal_sequences|3, metric acc_norm_stderr incorrect
2024-11-29T07:43:37.6345063Z E assert 0.1 == 0.00294961187...1973 ± 2.9e-07
2024-11-29T07:43:37.6345152Z E comparison failed
2024-11-29T07:43:37.6345229Z E Obtained: 0.1
2024-11-29T07:43:37.6345379Z E Expected: 0.0029496118745031973 ± 2.9e-07
2024-11-29T07:43:37.6345384Z
2024-11-29T07:43:37.6345602Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6345905Z _ test_model_prediction[gpt2_lite_lighteval|bigbench:tracking_shuffled_objects_five_objects|3_acc_stderr] _
2024-11-29T07:43:37.6345910Z
2024-11-29T07:43:37.6346593Z model_input = ('gpt2', 'lite', 'lighteval|bigbench:tracking_shuffled_objects_five_objects|3', 'acc_stderr', functools.partial(<funct...fled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.13333333333333333)
2024-11-29T07:43:37.6346709Z
2024-11-29T07:43:37.6348410Z E AssertionError: Model gpt2 on lite samples, for eval lighteval|bigbench:tracking_shuffled_objects_five_objects|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6348622Z E assert 0.13333333333333333 == 0.004030304374777823 ± 4.0e-07
2024-11-29T07:43:37.6348710Z E comparison failed
2024-11-29T07:43:37.6348797Z E Obtained: 0.13333333333333333
2024-11-29T07:43:37.6348944Z E Expected: 0.004030304374777823 ± 4.0e-07
2024-11-29T07:43:37.6348949Z
2024-11-29T07:43:37.6349041Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6349327Z _ test_model_prediction[gpt2_lite_harness|bigbench:tracking_shuffled_objects_five_objects|3_acc_stderr] _
2024-11-29T07:43:37.6349338Z
2024-11-29T07:43:37.6350016Z model_input = ('gpt2', 'lite', 'harness|bigbench:tracking_shuffled_objects_five_objects|3', 'acc_stderr', functools.partial(<functoo...fled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.13333333333333333)
2024-11-29T07:43:37.6350021Z
2024-11-29T07:43:37.6351713Z E AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:tracking_shuffled_objects_five_objects|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6351912Z E assert 0.13333333333333333 == 0.003987436475939113 ± 4.0e-07
2024-11-29T07:43:37.6352006Z E comparison failed
2024-11-29T07:43:37.6352093Z E Obtained: 0.13333333333333333
2024-11-29T07:43:37.6352238Z E Expected: 0.003987436475939113 ± 4.0e-07
2024-11-29T07:43:37.6352243Z
2024-11-29T07:43:37.6352335Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6352640Z _ test_model_prediction[gpt2_lite_harness|bigbench:tracking_shuffled_objects_five_objects|3_acc_norm_stderr] _
2024-11-29T07:43:37.6352645Z
2024-11-29T07:43:37.6353483Z model_input = ('gpt2', 'lite', 'harness|bigbench:tracking_shuffled_objects_five_objects|3', 'acc_norm_stderr', functools.partial(<fu...ch:tracking_shuffled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.1)
2024-11-29T07:43:37.6353495Z
2024-11-29T07:43:37.6355305Z E AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:tracking_shuffled_objects_five_objects|3, metric acc_norm_stderr incorrect
2024-11-29T07:43:37.6355460Z E assert 0.1 == 0.00294083125...9783 ± 2.9e-07
2024-11-29T07:43:37.6355549Z E comparison failed
2024-11-29T07:43:37.6355628Z E Obtained: 0.1
2024-11-29T07:43:37.6355777Z E Expected: 0.0029408312580779783 ± 2.9e-07
2024-11-29T07:43:37.6355781Z
2024-11-29T07:43:37.6355883Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6356179Z _ test_model_prediction[gpt2_lite_lighteval|bigbench:tracking_shuffled_objects_seven_objects|3_acc_stderr] _
2024-11-29T07:43:37.6356184Z
2024-11-29T07:43:37.6356865Z model_input = ('gpt2', 'lite', 'lighteval|bigbench:tracking_shuffled_objects_seven_objects|3', 'acc_stderr', functools.partial(<func...fled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.15275252316519464)
2024-11-29T07:43:37.6356875Z
2024-11-29T07:43:37.6358793Z E AssertionError: Model gpt2 on lite samples, for eval lighteval|bigbench:tracking_shuffled_objects_seven_objects|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6358997Z E assert 0.15275252316519464 == 0.004588830718970504 ± 4.6e-07
2024-11-29T07:43:37.6359091Z E comparison failed
2024-11-29T07:43:37.6359175Z E Obtained: 0.15275252316519464
2024-11-29T07:43:37.6359325Z E Expected: 0.004588830718970504 ± 4.6e-07
2024-11-29T07:43:37.6359330Z
2024-11-29T07:43:37.6359428Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6359719Z _ test_model_prediction[gpt2_lite_harness|bigbench:tracking_shuffled_objects_seven_objects|3_acc_stderr] _
2024-11-29T07:43:37.6359730Z
2024-11-29T07:43:37.6360424Z model_input = ('gpt2', 'lite', 'harness|bigbench:tracking_shuffled_objects_seven_objects|3', 'acc_stderr', functools.partial(<functo...ch:tracking_shuffled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.1)
2024-11-29T07:43:37.6360429Z
2024-11-29T07:43:37.6362433Z E AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:tracking_shuffled_objects_seven_objects|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6362592Z E assert 0.1 == 0.00301340183...0794 ± 3.0e-07
2024-11-29T07:43:37.6362675Z E comparison failed
2024-11-29T07:43:37.6362753Z E Obtained: 0.1
2024-11-29T07:43:37.6362901Z E Expected: 0.0030134018302560794 ± 3.0e-07
2024-11-29T07:43:37.6362913Z
2024-11-29T07:43:37.6363005Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6363299Z _ test_model_prediction[gpt2_lite_lighteval|bigbench:tracking_shuffled_objects_three_objects|3_acc_stderr] _
2024-11-29T07:43:37.6363304Z
2024-11-29T07:43:37.6364022Z model_input = ('gpt2', 'lite', 'lighteval|bigbench:tracking_shuffled_objects_three_objects|3', 'acc_stderr', functools.partial(<func...tracking_shuffled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.1633)
2024-11-29T07:43:37.6364034Z
2024-11-29T07:43:37.6365722Z E AssertionError: Model gpt2 on lite samples, for eval lighteval|bigbench:tracking_shuffled_objects_three_objects|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6365890Z E assert 0.1633 == 0.00504952053...2955 ± 5.0e-07
2024-11-29T07:43:37.6365976Z E comparison failed
2024-11-29T07:43:37.6366062Z E Obtained: 0.1633
2024-11-29T07:43:37.6366203Z E Expected: 0.0050495205374032955 ± 5.0e-07
2024-11-29T07:43:37.6366208Z
2024-11-29T07:43:37.6366307Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6366593Z _ test_model_prediction[gpt2_lite_harness|bigbench:tracking_shuffled_objects_three_objects|3_acc_stderr] _
2024-11-29T07:43:37.6366604Z
2024-11-29T07:43:37.6367291Z model_input = ('gpt2', 'lite', 'harness|bigbench:tracking_shuffled_objects_three_objects|3', 'acc_stderr', functools.partial(<functo...fled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.13333333333333333)
2024-11-29T07:43:37.6367295Z
2024-11-29T07:43:37.6369115Z E AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:tracking_shuffled_objects_three_objects|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6369321Z E assert 0.13333333333333333 == 0.00393258505632432 ± 3.9e-07
2024-11-29T07:43:37.6369518Z E comparison failed
2024-11-29T07:43:37.6369609Z E Obtained: 0.13333333333333333
2024-11-29T07:43:37.6369753Z E Expected: 0.00393258505632432 ± 3.9e-07
2024-11-29T07:43:37.6369759Z
2024-11-29T07:43:37.6369857Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6370159Z _ test_model_prediction[gpt2_lite_harness|bigbench:tracking_shuffled_objects_three_objects|3_acc_norm_stderr] _
2024-11-29T07:43:37.6370164Z
2024-11-29T07:43:37.6370876Z model_input = ('gpt2', 'lite', 'harness|bigbench:tracking_shuffled_objects_three_objects|3', 'acc_norm_stderr', functools.partial(<f...fled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.15275252316519466)
2024-11-29T07:43:37.6370882Z
2024-11-29T07:43:37.6372595Z E AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:tracking_shuffled_objects_three_objects|3, metric acc_norm_stderr incorrect
2024-11-29T07:43:37.6372800Z E assert 0.15275252316519466 == 0.004569520819574033 ± 4.6e-07
2024-11-29T07:43:37.6372884Z E comparison failed
2024-11-29T07:43:37.6372973Z E Obtained: 0.15275252316519466
2024-11-29T07:43:37.6373113Z E Expected: 0.004569520819574033 ± 4.6e-07
2024-11-29T07:43:37.6373124Z
2024-11-29T07:43:37.6373222Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6373351Z =========================== short test summary info ============================
2024-11-29T07:43:37.6374021Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_leaderboard|arc:challenge|25_acc_stderr] - AssertionError: Model gpt2 on lite samples, for eval leaderboard|arc:challenge|25, metric acc_stderr incorrect
2024-11-29T07:43:37.6374212Z assert 0.15275252316519466 == 0.004569520819574033 ± 4.6e-07
2024-11-29T07:43:37.6374300Z comparison failed
2024-11-29T07:43:37.6374392Z Obtained: 0.15275252316519466
2024-11-29T07:43:37.6374534Z Expected: 0.004569520819574033 ± 4.6e-07
2024-11-29T07:43:37.6375208Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_leaderboard|truthfulqa:mc|0_truthfulqa_mc1_stderr] - AssertionError: Model gpt2 on lite samples, for eval leaderboard|truthfulqa:mc|0, metric truthfulqa_mc1_stderr incorrect
2024-11-29T07:43:37.6375400Z assert 0.15275252316519466 == 0.004619651629850591 ± 4.6e-07
2024-11-29T07:43:37.6375485Z comparison failed
2024-11-29T07:43:37.6375573Z Obtained: 0.15275252316519466
2024-11-29T07:43:37.6375705Z Expected: 0.004619651629850591 ± 4.6e-07
2024-11-29T07:43:37.6376363Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_leaderboard|truthfulqa:mc|0_truthfulqa_mc2_stderr] - AssertionError: Model gpt2 on lite samples, for eval leaderboard|truthfulqa:mc|0, metric truthfulqa_mc2_stderr incorrect
2024-11-29T07:43:37.6376549Z assert 0.14105533101540416 == 0.004258753966872427 ± 4.3e-07
2024-11-29T07:43:37.6376759Z comparison failed
2024-11-29T07:43:37.6376845Z Obtained: 0.14105533101540416
2024-11-29T07:43:37.6376978Z Expected: 0.004258753966872427 ± 4.3e-07
2024-11-29T07:43:37.6377560Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_leaderboard|hellaswag|10_acc_stderr] - AssertionError: Model gpt2 on lite samples, for eval leaderboard|hellaswag|10, metric acc_stderr incorrect
2024-11-29T07:43:37.6377879Z assert 0.16329931618554522 == 0.004968770338693327 ± 5.0e-07
2024-11-29T07:43:37.6377959Z comparison failed
2024-11-29T07:43:37.6378038Z Obtained: 0.16329931618554522
2024-11-29T07:43:37.6378174Z Expected: 0.004968770338693327 ± 5.0e-07
2024-11-29T07:43:37.6378782Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_leaderboard|hellaswag|10_acc_norm_stderr] - AssertionError: Model gpt2 on lite samples, for eval leaderboard|hellaswag|10, metric acc_norm_stderr incorrect
2024-11-29T07:43:37.6378972Z assert 0.16329931618554522 == 0.004785693561320304 ± 4.8e-07
2024-11-29T07:43:37.6379058Z comparison failed
2024-11-29T07:43:37.6379143Z Obtained: 0.16329931618554522
2024-11-29T07:43:37.6379271Z Expected: 0.004785693561320304 ± 4.8e-07
2024-11-29T07:43:37.6379915Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_leaderboard|mmlu:abstract_algebra|5_acc_stderr] - AssertionError: Model gpt2 on lite samples, for eval leaderboard|mmlu:abstract_algebra|5, metric acc_stderr incorrect
2024-11-29T07:43:37.6380103Z assert 0.16329931618554522 == 0.00497641685...3716 ± 5.0e-07
2024-11-29T07:43:37.6380192Z comparison failed
2024-11-29T07:43:37.6380271Z Obtained: 0.16329931618554522
2024-11-29T07:43:37.6380412Z Expected: 0.0049764168560043716 ± 5.0e-07
2024-11-29T07:43:37.6381060Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_leaderboard|mmlu:college_chemistry|5_acc_stderr] - AssertionError: Model gpt2 on lite samples, for eval leaderboard|mmlu:college_chemistry|5, metric acc_stderr incorrect
2024-11-29T07:43:37.6381255Z assert 0.09999999999999999 == 0.002971327782118411 ± 3.0e-07
2024-11-29T07:43:37.6381334Z comparison failed
2024-11-29T07:43:37.6381419Z Obtained: 0.09999999999999999
2024-11-29T07:43:37.6381549Z Expected: 0.002971327782118411 ± 3.0e-07
2024-11-29T07:43:37.6382200Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_leaderboard|mmlu:computer_security|5_acc_stderr] - AssertionError: Model gpt2 on lite samples, for eval leaderboard|mmlu:computer_security|5, metric acc_stderr incorrect
2024-11-29T07:43:37.6382386Z assert 0.09999999999999999 == 0.00301340183...0794 ± 3.0e-07
2024-11-29T07:43:37.6382469Z comparison failed
2024-11-29T07:43:37.6382548Z Obtained: 0.09999999999999999
2024-11-29T07:43:37.6382684Z Expected: 0.0030134018302560794 ± 3.0e-07
2024-11-29T07:43:37.6383319Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_leaderboard|mmlu:us_foreign_policy|5_acc_stderr] - AssertionError: Model gpt2 on lite samples, for eval leaderboard|mmlu:us_foreign_policy|5, metric acc_stderr incorrect
2024-11-29T07:43:37.6383510Z assert 0.15275252316519466 == 0.004633704913049727 ± 4.6e-07
2024-11-29T07:43:37.6383588Z comparison failed
2024-11-29T07:43:37.6383673Z Obtained: 0.15275252316519466
2024-11-29T07:43:37.6383802Z Expected: 0.004633704913049727 ± 4.6e-07
2024-11-29T07:43:37.6384380Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_helm|mmlu:abstract_algebra|5_em_stderr] - AssertionError: Model gpt2 on lite samples, for eval helm|mmlu:abstract_algebra|5, metric em_stderr incorrect
2024-11-29T07:43:37.6384568Z assert 0.16329931618554522 == 0.00497641685...3716 ± 5.0e-07
2024-11-29T07:43:37.6384652Z comparison failed
2024-11-29T07:43:37.6384731Z Obtained: 0.16329931618554522
2024-11-29T07:43:37.6384866Z Expected: 0.0049764168560043716 ± 5.0e-07
2024-11-29T07:43:37.6385457Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_helm|mmlu:abstract_algebra|5_pqem_stderr] - AssertionError: Model gpt2 on lite samples, for eval helm|mmlu:abstract_algebra|5, metric pqem_stderr incorrect
2024-11-29T07:43:37.6385768Z assert 0.16329931618554522 == 0.00497641685...3716 ± 5.0e-07
2024-11-29T07:43:37.6385853Z comparison failed
2024-11-29T07:43:37.6385940Z Obtained: 0.16329931618554522
2024-11-29T07:43:37.6386075Z Expected: 0.0049764168560043716 ± 5.0e-07
2024-11-29T07:43:37.6386663Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_helm|mmlu:college_chemistry|5_em_stderr] - AssertionError: Model gpt2 on lite samples, for eval helm|mmlu:college_chemistry|5, metric em_stderr incorrect
2024-11-29T07:43:37.6386960Z assert 0.15275252316519466 == 0.00457283509661358 ± 4.6e-07
2024-11-29T07:43:37.6387043Z comparison failed
2024-11-29T07:43:37.6387122Z Obtained: 0.15275252316519466
2024-11-29T07:43:37.6387256Z Expected: 0.00457283509661358 ± 4.6e-07
2024-11-29T07:43:37.6387847Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_helm|mmlu:college_chemistry|5_pqem_stderr] - AssertionError: Model gpt2 on lite samples, for eval helm|mmlu:college_chemistry|5, metric pqem_stderr incorrect
2024-11-29T07:43:37.6388043Z assert 0.16329931618554522 == 0.004802280906184263 ± 4.8e-07
2024-11-29T07:43:37.6388121Z comparison failed
2024-11-29T07:43:37.6388206Z Obtained: 0.16329931618554522
2024-11-29T07:43:37.6388334Z Expected: 0.004802280906184263 ± 4.8e-07
2024-11-29T07:43:37.6388931Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_helm|mmlu:computer_security|5_em_stderr] - AssertionError: Model gpt2 on lite samples, for eval helm|mmlu:computer_security|5, metric em_stderr incorrect
2024-11-29T07:43:37.6389119Z assert 0.09999999999999999 == 0.00301340183...0794 ± 3.0e-07
2024-11-29T07:43:37.6389202Z comparison failed
2024-11-29T07:43:37.6389282Z Obtained: 0.09999999999999999
2024-11-29T07:43:37.6389413Z Expected: 0.0030134018302560794 ± 3.0e-07
2024-11-29T07:43:37.6390014Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_helm|mmlu:computer_security|5_pqem_stderr] - AssertionError: Model gpt2 on lite samples, for eval helm|mmlu:computer_security|5, metric pqem_stderr incorrect
2024-11-29T07:43:37.6390209Z assert 0.15275252316519464 == 0.004774519456032858 ± 4.8e-07
2024-11-29T07:43:37.6390290Z comparison failed
2024-11-29T07:43:37.6390370Z Obtained: 0.15275252316519464
2024-11-29T07:43:37.6390506Z Expected: 0.004774519456032858 ± 4.8e-07
2024-11-29T07:43:37.6391078Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_helm|mmlu:us_foreign_policy|5_em_stderr] - AssertionError: Model gpt2 on lite samples, for eval helm|mmlu:us_foreign_policy|5, metric em_stderr incorrect
2024-11-29T07:43:37.6391269Z assert 0.15275252316519466 == 0.004633704913049727 ± 4.6e-07
2024-11-29T07:43:37.6391348Z comparison failed
2024-11-29T07:43:37.6391434Z Obtained: 0.15275252316519466
2024-11-29T07:43:37.6391565Z Expected: 0.004633704913049727 ± 4.6e-07
2024-11-29T07:43:37.6392188Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_helm|mmlu:us_foreign_policy|5_pqem_stderr] - AssertionError: Model gpt2 on lite samples, for eval helm|mmlu:us_foreign_policy|5, metric pqem_stderr incorrect
2024-11-29T07:43:37.6392376Z assert 0.16329931618554522 == 0.004872014627084626 ± 4.9e-07
2024-11-29T07:43:37.6392463Z comparison failed
2024-11-29T07:43:37.6392543Z Obtained: 0.16329931618554522
2024-11-29T07:43:37.6392677Z Expected: 0.004872014627084626 ± 4.9e-07
2024-11-29T07:43:37.6393206Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_lighteval|anli:r1|0_acc_stderr] - AssertionError: Model gpt2 on lite samples, for eval lighteval|anli:r1|0, metric acc_stderr incorrect
2024-11-29T07:43:37.6393398Z assert 0.16666666666666666 == 0.00514299138248941 ± 5.1e-07
2024-11-29T07:43:37.6393477Z comparison failed
2024-11-29T07:43:37.6393562Z Obtained: 0.16666666666666666
2024-11-29T07:43:37.6393691Z Expected: 0.00514299138248941 ± 5.1e-07
2024-11-29T07:43:37.6394314Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_lighteval|blimp:adjunct_island|0_acc_stderr] - AssertionError: Model gpt2 on lite samples, for eval lighteval|blimp:adjunct_island|0, metric acc_stderr incorrect
2024-11-29T07:43:37.6394620Z assert 0.13333333333333333 == 0.003921139545506534 ± 3.9e-07
2024-11-29T07:43:37.6394706Z comparison failed
2024-11-29T07:43:37.6394786Z Obtained: 0.13333333333333333
2024-11-29T07:43:37.6394936Z Expected: 0.003921139545506534 ± 3.9e-07
2024-11-29T07:43:37.6395556Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_lighteval|blimp:ellipsis_n_bar_1|0_acc_stderr] - AssertionError: Model gpt2 on lite samples, for eval lighteval|blimp:ellipsis_n_bar_1|0, metric acc_stderr incorrect
2024-11-29T07:43:37.6395862Z assert 0.15275252316519466 == 0.004709524351738684 ± 4.7e-07
2024-11-29T07:43:37.6395940Z comparison failed
2024-11-29T07:43:37.6396024Z Obtained: 0.15275252316519466
2024-11-29T07:43:37.6396154Z Expected: 0.004709524351738684 ± 4.7e-07
2024-11-29T07:43:37.6396643Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_helm|boolq|5_em_stderr] - AssertionError: Model gpt2 on lite samples, for eval helm|boolq|5, metric em_stderr incorrect
2024-11-29T07:43:37.6396833Z assert 0.16329931618554522 == 0.004922599108396079 ± 4.9e-07
2024-11-29T07:43:37.6396915Z comparison failed
2024-11-29T07:43:37.6396993Z Obtained: 0.16329931618554522
2024-11-29T07:43:37.6397128Z Expected: 0.004922599108396079 ± 4.9e-07
2024-11-29T07:43:37.6397841Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_helm|boolq|5_qem_stderr] - AssertionError: Model gpt2 on lite samples, for eval helm|boolq|5, metric qem_stderr incorrect
2024-11-29T07:43:37.6398053Z assert 0.16329931618554522 == 0.004922599108396079 ± 4.9e-07
2024-11-29T07:43:37.6398133Z comparison failed
2024-11-29T07:43:37.6398217Z Obtained: 0.16329931618554522
2024-11-29T07:43:37.6398346Z Expected: 0.004922599108396079 ± 4.9e-07
2024-11-29T07:43:37.6398841Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_helm|boolq|5_pem_stderr] - AssertionError: Model gpt2 on lite samples, for eval helm|boolq|5, metric pem_stderr incorrect
2024-11-29T07:43:37.6399021Z assert 0.16329931618554522 == 0.004922599108396079 ± 4.9e-07
2024-11-29T07:43:37.6399112Z comparison failed
2024-11-29T07:43:37.6399192Z Obtained: 0.16329931618554522
2024-11-29T07:43:37.6399338Z Expected: 0.004922599108396079 ± 4.9e-07
2024-11-29T07:43:37.6399834Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_helm|boolq|5_pqem_stderr] - AssertionError: Model gpt2 on lite samples, for eval helm|boolq|5, metric pqem_stderr incorrect
2024-11-29T07:43:37.6400019Z assert 0.16329931618554522 == 0.004922599108396079 ± 4.9e-07
2024-11-29T07:43:37.6400104Z comparison failed
2024-11-29T07:43:37.6400183Z Obtained: 0.16329931618554522
2024-11-29T07:43:37.6400316Z Expected: 0.004922599108396079 ± 4.9e-07
2024-11-29T07:43:37.6400908Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_lighteval|agieval:aqua-rat|0_acc_stderr] - AssertionError: Model gpt2 on lite samples, for eval lighteval|agieval:aqua-rat|0, metric acc_stderr incorrect
2024-11-29T07:43:37.6401065Z assert 0.15275 == 0.004774519456032858 ± 4.8e-07
2024-11-29T07:43:37.6401143Z comparison failed
2024-11-29T07:43:37.6401230Z Obtained: 0.15275
2024-11-29T07:43:37.6401359Z Expected: 0.004774519456032858 ± 4.8e-07
2024-11-29T07:43:37.6401990Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_lighteval|agieval:aqua-rat|0_acc_norm_stderr] - AssertionError: Model gpt2 on lite samples, for eval lighteval|agieval:aqua-rat|0, metric acc_norm_stderr incorrect
2024-11-29T07:43:37.6402137Z assert 0.15275 == 0.004774519456032858 ± 4.8e-07
2024-11-29T07:43:37.6402227Z comparison failed
2024-11-29T07:43:37.6402302Z Obtained: 0.15275
2024-11-29T07:43:37.6402436Z Expected: 0.004774519456032858 ± 4.8e-07
2024-11-29T07:43:37.6403034Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_lighteval|agieval:logiqa-en|0_acc_stderr] - AssertionError: Model gpt2 on lite samples, for eval lighteval|agieval:logiqa-en|0, metric acc_stderr incorrect
2024-11-29T07:43:37.6403175Z assert 0.1 == 0.00309049205...1304 ± 3.1e-07
2024-11-29T07:43:37.6403253Z comparison failed
2024-11-29T07:43:37.6403523Z Obtained: 0.1
2024-11-29T07:43:37.6403663Z Expected: 0.0030904920548581304 ± 3.1e-07
2024-11-29T07:43:37.6404293Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_lighteval|agieval:logiqa-en|0_acc_norm_stderr] - AssertionError: Model gpt2 on lite samples, for eval lighteval|agieval:logiqa-en|0, metric acc_norm_stderr incorrect
2024-11-29T07:43:37.6404440Z assert 0.15275 == 0.00457742206...5185 ± 4.6e-07
2024-11-29T07:43:37.6404663Z comparison failed
2024-11-29T07:43:37.6404739Z Obtained: 0.15275
2024-11-29T07:43:37.6404882Z Expected: 0.0045774220684565185 ± 4.6e-07
2024-11-29T07:43:37.6405462Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_lighteval|agieval:lsat-ar|0_acc_stderr] - AssertionError: Model gpt2 on lite samples, for eval lighteval|agieval:lsat-ar|0, metric acc_stderr incorrect
2024-11-29T07:43:37.6405602Z assert 0.1 == 0.00301340183...0794 ± 3.0e-07
2024-11-29T07:43:37.6405683Z comparison failed
2024-11-29T07:43:37.6405763Z Obtained: 0.1
2024-11-29T07:43:37.6405903Z Expected: 0.0030134018302560794 ± 3.0e-07
2024-11-29T07:43:37.6406551Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_lighteval|agieval:lsat-ar|0_acc_norm_stderr] - AssertionError: Model gpt2 on lite samples, for eval lighteval|agieval:lsat-ar|0, metric acc_norm_stderr incorrect
2024-11-29T07:43:37.6406686Z assert 0.1 == 0.00301340183...0794 ± 3.0e-07
2024-11-29T07:43:37.6406770Z comparison failed
2024-11-29T07:43:37.6406855Z Obtained: 0.1
2024-11-29T07:43:37.6406992Z Expected: 0.0030134018302560794 ± 3.0e-07
2024-11-29T07:43:37.6407568Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_lighteval|agieval:lsat-lr|0_acc_stderr] - AssertionError: Model gpt2 on lite samples, for eval lighteval|agieval:lsat-lr|0, metric acc_stderr incorrect
2024-11-29T07:43:37.6407719Z assert 0.13333 == 0.004077368628777015 ± 4.1e-07
2024-11-29T07:43:37.6407797Z comparison failed
2024-11-29T07:43:37.6407876Z Obtained: 0.13333
2024-11-29T07:43:37.6408006Z Expected: 0.004077368628777015 ± 4.1e-07
2024-11-29T07:43:37.6408624Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_lighteval|agieval:lsat-lr|0_acc_norm_stderr] - AssertionError: Model gpt2 on lite samples, for eval lighteval|agieval:lsat-lr|0, metric acc_norm_stderr incorrect
2024-11-29T07:43:37.6408770Z assert 0.13333 == 0.004077368628777015 ± 4.1e-07
2024-11-29T07:43:37.6408863Z comparison failed
2024-11-29T07:43:37.6408938Z Obtained: 0.13333
2024-11-29T07:43:37.6409081Z Expected: 0.004077368628777015 ± 4.1e-07
2024-11-29T07:43:37.6409655Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_lighteval|agieval:lsat-rc|0_acc_stderr] - AssertionError: Model gpt2 on lite samples, for eval lighteval|agieval:lsat-rc|0, metric acc_stderr incorrect
2024-11-29T07:43:37.6409804Z assert 0.15275 == 0.004582352884486063 ± 4.6e-07
2024-11-29T07:43:37.6409883Z comparison failed
2024-11-29T07:43:37.6409957Z Obtained: 0.15275
2024-11-29T07:43:37.6410092Z Expected: 0.004582352884486063 ± 4.6e-07
2024-11-29T07:43:37.6410702Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_lighteval|agieval:lsat-rc|0_acc_norm_stderr] - AssertionError: Model gpt2 on lite samples, for eval lighteval|agieval:lsat-rc|0, metric acc_norm_stderr incorrect
2024-11-29T07:43:37.6410852Z assert 0.13333 == 0.004080117321518739 ± 4.1e-07
2024-11-29T07:43:37.6410928Z comparison failed
2024-11-29T07:43:37.6411009Z Obtained: 0.13333
2024-11-29T07:43:37.6411137Z Expected: 0.004080117321518739 ± 4.1e-07
2024-11-29T07:43:37.6411844Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_lighteval|agieval:sat-en-without-passage|0_acc_stderr] - AssertionError: Model gpt2 on lite samples, for eval lighteval|agieval:sat-en-without-passage|0, metric acc_stderr incorrect
2024-11-29T07:43:37.6411987Z assert 0.13333 == 0.00388625951...6192 ± 3.9e-07
2024-11-29T07:43:37.6412070Z comparison failed
2024-11-29T07:43:37.6412145Z Obtained: 0.13333
2024-11-29T07:43:37.6412281Z Expected: 0.0038862595143676192 ± 3.9e-07
2024-11-29T07:43:37.6413126Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_lighteval|agieval:sat-en-without-passage|0_acc_norm_stderr] - AssertionError: Model gpt2 on lite samples, for eval lighteval|agieval:sat-en-without-passage|0, metric acc_norm_stderr incorrect
2024-11-29T07:43:37.6413281Z assert 0.15275 == 0.004538042951960014 ± 4.5e-07
2024-11-29T07:43:37.6413358Z comparison failed
2024-11-29T07:43:37.6413438Z Obtained: 0.15275
2024-11-29T07:43:37.6413704Z Expected: 0.004538042951960014 ± 4.5e-07
2024-11-29T07:43:37.6414280Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_lighteval|agieval:sat-en|0_acc_stderr] - AssertionError: Model gpt2 on lite samples, for eval lighteval|agieval:sat-en|0, metric acc_stderr incorrect
2024-11-29T07:43:37.6414426Z assert 0.13333 == 0.00388625951...6192 ± 3.9e-07
2024-11-29T07:43:37.6414509Z comparison failed
2024-11-29T07:43:37.6414585Z Obtained: 0.13333
2024-11-29T07:43:37.6414723Z Expected: 0.0038862595143676192 ± 3.9e-07
2024-11-29T07:43:37.6415338Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_lighteval|agieval:sat-en|0_acc_norm_stderr] - AssertionError: Model gpt2 on lite samples, for eval lighteval|agieval:sat-en|0, metric acc_norm_stderr incorrect
2024-11-29T07:43:37.6415490Z assert 0.15275 == 0.004538042951960014 ± 4.5e-07
2024-11-29T07:43:37.6415566Z comparison failed
2024-11-29T07:43:37.6415652Z Obtained: 0.15275
2024-11-29T07:43:37.6415783Z Expected: 0.004538042951960014 ± 4.5e-07
2024-11-29T07:43:37.6416384Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_lighteval|agieval:sat-math|0_acc_stderr] - AssertionError: Model gpt2 on lite samples, for eval lighteval|agieval:sat-math|0, metric acc_stderr incorrect
2024-11-29T07:43:37.6416528Z assert 0.15275 == 0.004664521171326971 ± 4.7e-07
2024-11-29T07:43:37.6416610Z comparison failed
2024-11-29T07:43:37.6416685Z Obtained: 0.15275
2024-11-29T07:43:37.6416820Z Expected: 0.004664521171326971 ± 4.7e-07
2024-11-29T07:43:37.6417442Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_lighteval|agieval:sat-math|0_acc_norm_stderr] - AssertionError: Model gpt2 on lite samples, for eval lighteval|agieval:sat-math|0, metric acc_norm_stderr incorrect
2024-11-29T07:43:37.6417581Z assert 0.1 == 0.00301340183...0794 ± 3.0e-07
2024-11-29T07:43:37.6417657Z comparison failed
2024-11-29T07:43:37.6417737Z Obtained: 0.1
2024-11-29T07:43:37.6417867Z Expected: 0.0030134018302560794 ± 3.0e-07
2024-11-29T07:43:37.6418526Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_lighteval|bigbench:causal_judgment|3_acc_stderr] - AssertionError: Model gpt2 on lite samples, for eval lighteval|bigbench:causal_judgment|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6418718Z assert 0.16666666666666666 == 0.004861068811484776 ± 4.9e-07
2024-11-29T07:43:37.6418800Z comparison failed
2024-11-29T07:43:37.6418880Z Obtained: 0.16666666666666666
2024-11-29T07:43:37.6419015Z Expected: 0.004861068811484776 ± 4.9e-07
2024-11-29T07:43:37.6419644Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_harness|bigbench:causal_judgment|3_acc_stderr] - AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:causal_judgment|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6419797Z assert 0.1633 == 0.004699965923246645 ± 4.7e-07
2024-11-29T07:43:37.6419874Z comparison failed
2024-11-29T07:43:37.6419955Z Obtained: 0.1633
2024-11-29T07:43:37.6420084Z Expected: 0.004699965923246645 ± 4.7e-07
2024-11-29T07:43:37.6420754Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_harness|bigbench:causal_judgment|3_acc_norm_stderr] - AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:causal_judgment|3, metric acc_norm_stderr incorrect
2024-11-29T07:43:37.6420939Z assert 0.16666666666666666 == 0.004861068811484776 ± 4.9e-07
2024-11-29T07:43:37.6421022Z comparison failed
2024-11-29T07:43:37.6421102Z Obtained: 0.16666666666666666
2024-11-29T07:43:37.6421231Z Expected: 0.004861068811484776 ± 4.9e-07
2024-11-29T07:43:37.6422015Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_harness|bigbench:date_understanding|3_acc_stderr] - AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:date_understanding|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6422202Z assert 0.13333333333333333 == 0.003987436475939113 ± 4.0e-07
2024-11-29T07:43:37.6422286Z comparison failed
2024-11-29T07:43:37.6422366Z Obtained: 0.13333333333333333
2024-11-29T07:43:37.6422500Z Expected: 0.003987436475939113 ± 4.0e-07
2024-11-29T07:43:37.6423292Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_harness|bigbench:date_understanding|3_acc_norm_stderr] - AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:date_understanding|3, metric acc_norm_stderr incorrect
2024-11-29T07:43:37.6423484Z assert 0.13333333333333333 == 0.003987436475939113 ± 4.0e-07
2024-11-29T07:43:37.6423563Z comparison failed
2024-11-29T07:43:37.6423646Z Obtained: 0.13333333333333333
2024-11-29T07:43:37.6423775Z Expected: 0.003987436475939113 ± 4.0e-07
2024-11-29T07:43:37.6424449Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_lighteval|bigbench:disambiguation_qa|3_acc_stderr] - AssertionError: Model gpt2 on lite samples, for eval lighteval|bigbench:disambiguation_qa|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6424629Z assert 0.15275252316519466 == 0.004650507199996266 ± 4.7e-07
2024-11-29T07:43:37.6424713Z comparison failed
2024-11-29T07:43:37.6424792Z Obtained: 0.15275252316519466
2024-11-29T07:43:37.6424933Z Expected: 0.004650507199996266 ± 4.7e-07
2024-11-29T07:43:37.6425575Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_harness|bigbench:disambiguation_qa|3_acc_stderr] - AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:disambiguation_qa|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6425761Z assert 0.15275252316519466 == 0.004650507199996266 ± 4.7e-07
2024-11-29T07:43:37.6425840Z comparison failed
2024-11-29T07:43:37.6425926Z Obtained: 0.15275252316519466
2024-11-29T07:43:37.6426055Z Expected: 0.004650507199996266 ± 4.7e-07
2024-11-29T07:43:37.6426738Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_harness|bigbench:disambiguation_qa|3_acc_norm_stderr] - AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:disambiguation_qa|3, metric acc_norm_stderr incorrect
2024-11-29T07:43:37.6426920Z assert 0.15275252316519464 == 0.004582439170218064 ± 4.6e-07
2024-11-29T07:43:37.6427005Z comparison failed
2024-11-29T07:43:37.6427091Z Obtained: 0.15275252316519464
2024-11-29T07:43:37.6427225Z Expected: 0.004582439170218064 ± 4.6e-07
2024-11-29T07:43:37.6427880Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_lighteval|bigbench:geometric_shapes|3_acc_stderr] - AssertionError: Model gpt2 on lite samples, for eval lighteval|bigbench:geometric_shapes|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6428065Z assert 0.13333333333333333 == 0.004000255247111385 ± 4.0e-07
2024-11-29T07:43:37.6428144Z comparison failed
2024-11-29T07:43:37.6428227Z Obtained: 0.13333333333333333
2024-11-29T07:43:37.6428363Z Expected: 0.004000255247111385 ± 4.0e-07
2024-11-29T07:43:37.6429040Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_harness|bigbench:geometric_shapes|3_acc_norm_stderr] - AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:geometric_shapes|3, metric acc_norm_stderr incorrect
2024-11-29T07:43:37.6429220Z assert 0.13333333333333333 == 0.004041744140305727 ± 4.0e-07
2024-11-29T07:43:37.6429309Z comparison failed
2024-11-29T07:43:37.6429388Z Obtained: 0.13333333333333333
2024-11-29T07:43:37.6429522Z Expected: 0.004041744140305727 ± 4.0e-07
2024-11-29T07:43:37.6430267Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_lighteval|bigbench:logical_deduction_five_objects|3_acc_stderr] - AssertionError: Model gpt2 on lite samples, for eval lighteval|bigbench:logical_deduction_five_objects|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6430407Z assert 0.1 == 0.003030251214408201 ± 3.0e-07
2024-11-29T07:43:37.6430484Z comparison failed
2024-11-29T07:43:37.6430690Z Obtained: 0.1
2024-11-29T07:43:37.6430826Z Expected: 0.003030251214408201 ± 3.0e-07
2024-11-29T07:43:37.6431557Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_harness|bigbench:logical_deduction_five_objects|3_acc_stderr] - AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:logical_deduction_five_objects|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6431737Z assert 0.15275252316519464 == 0.004381117916034022 ± 4.4e-07
2024-11-29T07:43:37.6431937Z comparison failed
2024-11-29T07:43:37.6432018Z Obtained: 0.15275252316519464
2024-11-29T07:43:37.6432154Z Expected: 0.004381117916034022 ± 4.4e-07
2024-11-29T07:43:37.6432909Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_harness|bigbench:logical_deduction_five_objects|3_acc_norm_stderr] - AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:logical_deduction_five_objects|3, metric acc_norm_stderr incorrect
2024-11-29T07:43:37.6433103Z assert 0.15275252316519464 == 0.004480319549235682 ± 4.5e-07
2024-11-29T07:43:37.6433182Z comparison failed
2024-11-29T07:43:37.6433267Z Obtained: 0.15275252316519464
2024-11-29T07:43:37.6433396Z Expected: 0.004480319549235682 ± 4.5e-07
2024-11-29T07:43:37.6434148Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_lighteval|bigbench:logical_deduction_seven_objects|3_acc_stderr] - AssertionError: Model gpt2 on lite samples, for eval lighteval|bigbench:logical_deduction_seven_objects|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6434339Z assert 0.13333333333333333 == 0.00393258505632432 ± 3.9e-07
2024-11-29T07:43:37.6434423Z comparison failed
2024-11-29T07:43:37.6434502Z Obtained: 0.13333333333333333
2024-11-29T07:43:37.6434634Z Expected: 0.00393258505632432 ± 3.9e-07
2024-11-29T07:43:37.6435374Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_harness|bigbench:logical_deduction_seven_objects|3_acc_stderr] - AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:logical_deduction_seven_objects|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6435565Z assert 0.13333333333333333 == 0.004006756056224812 ± 4.0e-07
2024-11-29T07:43:37.6435646Z comparison failed
2024-11-29T07:43:37.6435730Z Obtained: 0.13333333333333333
2024-11-29T07:43:37.6435858Z Expected: 0.004006756056224812 ± 4.0e-07
2024-11-29T07:43:37.6436626Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_harness|bigbench:logical_deduction_seven_objects|3_acc_norm_stderr] - AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:logical_deduction_seven_objects|3, metric acc_norm_stderr incorrect
2024-11-29T07:43:37.6436813Z assert 0.13333333333333333 == 0.004006756056224812 ± 4.0e-07
2024-11-29T07:43:37.6436897Z comparison failed
2024-11-29T07:43:37.6436975Z Obtained: 0.13333333333333333
2024-11-29T07:43:37.6437109Z Expected: 0.004006756056224812 ± 4.0e-07
2024-11-29T07:43:37.6438097Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_lighteval|bigbench:logical_deduction_three_objects|3_acc_stderr] - AssertionError: Model gpt2 on lite samples, for eval lighteval|bigbench:logical_deduction_three_objects|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6438261Z assert 0.1633 == 0.00504952053...2955 ± 5.0e-07
2024-11-29T07:43:37.6438343Z comparison failed
2024-11-29T07:43:37.6438425Z Obtained: 0.1633
2024-11-29T07:43:37.6438557Z Expected: 0.0050495205374032955 ± 5.0e-07
2024-11-29T07:43:37.6439295Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_harness|bigbench:logical_deduction_three_objects|3_acc_stderr] - AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:logical_deduction_three_objects|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6439483Z assert 0.13333333333333333 == 0.00393258505632432 ± 3.9e-07
2024-11-29T07:43:37.6439566Z comparison failed
2024-11-29T07:43:37.6439644Z Obtained: 0.13333333333333333
2024-11-29T07:43:37.6439771Z Expected: 0.00393258505632432 ± 3.9e-07
2024-11-29T07:43:37.6440695Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_harness|bigbench:logical_deduction_three_objects|3_acc_norm_stderr] - AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:logical_deduction_three_objects|3, metric acc_norm_stderr incorrect
2024-11-29T07:43:37.6440890Z assert 0.15275252316519466 == 0.004569520819574033 ± 4.6e-07
2024-11-29T07:43:37.6440969Z comparison failed
2024-11-29T07:43:37.6441048Z Obtained: 0.15275252316519466
2024-11-29T07:43:37.6441328Z Expected: 0.004569520819574033 ± 4.6e-07
2024-11-29T07:43:37.6442012Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_lighteval|bigbench:movie_recommendation|3_acc_stderr] - AssertionError: Model gpt2 on lite samples, for eval lighteval|bigbench:movie_recommendation|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6442199Z assert 0.15275252316519466 == 0.004428245629971239 ± 4.4e-07
2024-11-29T07:43:37.6442276Z comparison failed
2024-11-29T07:43:37.6442361Z Obtained: 0.15275252316519466
2024-11-29T07:43:37.6442489Z Expected: 0.004428245629971239 ± 4.4e-07
2024-11-29T07:43:37.6443171Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_harness|bigbench:movie_recommendation|3_acc_stderr] - AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:movie_recommendation|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6443353Z assert 0.16666666666666666 == 0.004907190695349086 ± 4.9e-07
2024-11-29T07:43:37.6443437Z comparison failed
2024-11-29T07:43:37.6443516Z Obtained: 0.16666666666666666
2024-11-29T07:43:37.6443659Z Expected: 0.004907190695349086 ± 4.9e-07
2024-11-29T07:43:37.6444350Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_harness|bigbench:movie_recommendation|3_acc_norm_stderr] - AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:movie_recommendation|3, metric acc_norm_stderr incorrect
2024-11-29T07:43:37.6444536Z assert 0.15275252316519464 == 0.004703372376466875 ± 4.7e-07
2024-11-29T07:43:37.6444615Z comparison failed
2024-11-29T07:43:37.6444700Z Obtained: 0.15275252316519464
2024-11-29T07:43:37.6444835Z Expected: 0.004703372376466875 ± 4.7e-07
2024-11-29T07:43:37.6445445Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_lighteval|bigbench:navigate|3_acc_stderr] - AssertionError: Model gpt2 on lite samples, for eval lighteval|bigbench:navigate|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6445589Z assert 0.1633 == 0.00471801862010947 ± 4.7e-07
2024-11-29T07:43:37.6445673Z comparison failed
2024-11-29T07:43:37.6445755Z Obtained: 0.1633
2024-11-29T07:43:37.6445888Z Expected: 0.00471801862010947 ± 4.7e-07
2024-11-29T07:43:37.6446467Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_harness|bigbench:navigate|3_acc_stderr] - AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:navigate|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6446614Z assert 0.1633 == 0.00471801862010947 ± 4.7e-07
2024-11-29T07:43:37.6446692Z comparison failed
2024-11-29T07:43:37.6446775Z Obtained: 0.1633
2024-11-29T07:43:37.6446901Z Expected: 0.00471801862010947 ± 4.7e-07
2024-11-29T07:43:37.6447522Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_harness|bigbench:navigate|3_acc_norm_stderr] - AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:navigate|3, metric acc_norm_stderr incorrect
2024-11-29T07:43:37.6447664Z assert 0.1633 == 0.00471801862010947 ± 4.7e-07
2024-11-29T07:43:37.6447747Z comparison failed
2024-11-29T07:43:37.6447822Z Obtained: 0.1633
2024-11-29T07:43:37.6447960Z Expected: 0.00471801862010947 ± 4.7e-07
2024-11-29T07:43:37.6448715Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_lighteval|bigbench:reasoning_about_colored_objects|3_acc_stderr] - AssertionError: Model gpt2 on lite samples, for eval lighteval|bigbench:reasoning_about_colored_objects|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6448901Z assert 0.13333333333333333 == 0.004000255247111385 ± 4.0e-07
2024-11-29T07:43:37.6448979Z comparison failed
2024-11-29T07:43:37.6449064Z Obtained: 0.13333333333333333
2024-11-29T07:43:37.6449338Z Expected: 0.004000255247111385 ± 4.0e-07
2024-11-29T07:43:37.6450074Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_harness|bigbench:reasoning_about_colored_objects|3_acc_stderr] - AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:reasoning_about_colored_objects|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6450257Z assert 0.13333333333333333 == 0.00405961457...4385 ± 4.1e-07
2024-11-29T07:43:37.6450460Z comparison failed
2024-11-29T07:43:37.6450541Z Obtained: 0.13333333333333333
2024-11-29T07:43:37.6450681Z Expected: 0.0040596145716644385 ± 4.1e-07
2024-11-29T07:43:37.6451436Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_harness|bigbench:reasoning_about_colored_objects|3_acc_norm_stderr] - AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:reasoning_about_colored_objects|3, metric acc_norm_stderr incorrect
2024-11-29T07:43:37.6451576Z assert 0.1 == 0.002971327782118411 ± 3.0e-07
2024-11-29T07:43:37.6451653Z comparison failed
2024-11-29T07:43:37.6451744Z Obtained: 0.1
2024-11-29T07:43:37.6451876Z Expected: 0.002971327782118411 ± 3.0e-07
2024-11-29T07:43:37.6452488Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_lighteval|bigbench:ruin_names|3_acc_stderr] - AssertionError: Model gpt2 on lite samples, for eval lighteval|bigbench:ruin_names|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6452667Z assert 0.15275252316519464 == 0.00459225857...0545 ± 4.6e-07
2024-11-29T07:43:37.6452756Z comparison failed
2024-11-29T07:43:37.6452836Z Obtained: 0.15275252316519464
2024-11-29T07:43:37.6452971Z Expected: 0.0045922585770880545 ± 4.6e-07
2024-11-29T07:43:37.6453558Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_harness|bigbench:ruin_names|3_acc_stderr] - AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:ruin_names|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6453742Z assert 0.13333333333333333 == 0.004037827888116828 ± 4.0e-07
2024-11-29T07:43:37.6453820Z comparison failed
2024-11-29T07:43:37.6453912Z Obtained: 0.13333333333333333
2024-11-29T07:43:37.6454041Z Expected: 0.004037827888116828 ± 4.0e-07
2024-11-29T07:43:37.6454664Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_harness|bigbench:ruin_names|3_acc_norm_stderr] - AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:ruin_names|3, metric acc_norm_stderr incorrect
2024-11-29T07:43:37.6454858Z assert 0.13333333333333333 == 0.004037827888116828 ± 4.0e-07
2024-11-29T07:43:37.6454958Z comparison failed
2024-11-29T07:43:37.6455038Z Obtained: 0.13333333333333333
2024-11-29T07:43:37.6455166Z Expected: 0.004037827888116828 ± 4.0e-07
2024-11-29T07:43:37.6455946Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_lighteval|bigbench:salient_translation_error_detection|3_acc_stderr] - AssertionError: Model gpt2 on lite samples, for eval lighteval|bigbench:salient_translation_error_detection|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6456092Z assert 0.1633 == 0.00497231172432741 ± 5.0e-07
2024-11-29T07:43:37.6456176Z comparison failed
2024-11-29T07:43:37.6456253Z Obtained: 0.1633
2024-11-29T07:43:37.6456386Z Expected: 0.00497231172432741 ± 5.0e-07
2024-11-29T07:43:37.6457157Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_harness|bigbench:salient_translation_error_detection|3_acc_stderr] - AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:salient_translation_error_detection|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6457305Z assert 0.1 == 0.00304553201...4616 ± 3.0e-07
2024-11-29T07:43:37.6457383Z comparison failed
2024-11-29T07:43:37.6457463Z Obtained: 0.1
2024-11-29T07:43:37.6457594Z Expected: 0.0030455320167854616 ± 3.0e-07
2024-11-29T07:43:37.6458387Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_harness|bigbench:salient_translation_error_detection|3_acc_norm_stderr] - AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:salient_translation_error_detection|3, metric acc_norm_stderr incorrect
2024-11-29T07:43:37.6458646Z assert 0.1 == 0.00304553201...4616 ± 3.0e-07
2024-11-29T07:43:37.6458743Z comparison failed
2024-11-29T07:43:37.6458818Z Obtained: 0.1
2024-11-29T07:43:37.6458958Z Expected: 0.0030455320167854616 ± 3.0e-07
2024-11-29T07:43:37.6459546Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_lighteval|bigbench:snarks|3_acc_stderr] - AssertionError: Model gpt2 on lite samples, for eval lighteval|bigbench:snarks|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6459819Z assert 0.1633 == 0.004897013451149668 ± 4.9e-07
2024-11-29T07:43:37.6459898Z comparison failed
2024-11-29T07:43:37.6459977Z Obtained: 0.1633
2024-11-29T07:43:37.6460107Z Expected: 0.004897013451149668 ± 4.9e-07
2024-11-29T07:43:37.6460680Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_harness|bigbench:snarks|3_acc_stderr] - AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:snarks|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6460824Z assert 0.1633 == 0.004897013451149668 ± 4.9e-07
2024-11-29T07:43:37.6460916Z comparison failed
2024-11-29T07:43:37.6460993Z Obtained: 0.1633
2024-11-29T07:43:37.6461129Z Expected: 0.004897013451149668 ± 4.9e-07
2024-11-29T07:43:37.6461720Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_harness|bigbench:snarks|3_acc_norm_stderr] - AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:snarks|3, metric acc_norm_stderr incorrect
2024-11-29T07:43:37.6461881Z assert 0.1633 == 0.004897013451149668 ± 4.9e-07
2024-11-29T07:43:37.6461960Z comparison failed
2024-11-29T07:43:37.6462040Z Obtained: 0.1633
2024-11-29T07:43:37.6462168Z Expected: 0.004897013451149668 ± 4.9e-07
2024-11-29T07:43:37.6462856Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_lighteval|bigbench:sports_understanding|3_acc_stderr] - AssertionError: Model gpt2 on lite samples, for eval lighteval|bigbench:sports_understanding|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6462999Z assert 0.1633 == 0.005037214858781963 ± 5.0e-07
2024-11-29T07:43:37.6463089Z comparison failed
2024-11-29T07:43:37.6463164Z Obtained: 0.1633
2024-11-29T07:43:37.6463298Z Expected: 0.005037214858781963 ± 5.0e-07
2024-11-29T07:43:37.6463959Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_harness|bigbench:sports_understanding|3_acc_stderr] - AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:sports_understanding|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6464113Z assert 0.1633 == 0.0049194014382352 ± 4.9e-07
2024-11-29T07:43:37.6464191Z comparison failed
2024-11-29T07:43:37.6464271Z Obtained: 0.1633
2024-11-29T07:43:37.6464398Z Expected: 0.0049194014382352 ± 4.9e-07
2024-11-29T07:43:37.6465099Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_harness|bigbench:sports_understanding|3_acc_norm_stderr] - AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:sports_understanding|3, metric acc_norm_stderr incorrect
2024-11-29T07:43:37.6465239Z assert 0.1633 == 0.0049194014382352 ± 4.9e-07
2024-11-29T07:43:37.6465328Z comparison failed
2024-11-29T07:43:37.6465406Z Obtained: 0.1633
2024-11-29T07:43:37.6465537Z Expected: 0.0049194014382352 ± 4.9e-07
2024-11-29T07:43:37.6466189Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_harness|bigbench:temporal_sequences|3_acc_stderr] - AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:temporal_sequences|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6466328Z assert 0.1 == 0.00294961187...1973 ± 2.9e-07
2024-11-29T07:43:37.6466414Z comparison failed
2024-11-29T07:43:37.6466495Z Obtained: 0.1
2024-11-29T07:43:37.6466629Z Expected: 0.0029496118745031973 ± 2.9e-07
2024-11-29T07:43:37.6467308Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_harness|bigbench:temporal_sequences|3_acc_norm_stderr] - AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:temporal_sequences|3, metric acc_norm_stderr incorrect
2024-11-29T07:43:37.6467442Z assert 0.1 == 0.00294961187...1973 ± 2.9e-07
2024-11-29T07:43:37.6467525Z comparison failed
2024-11-29T07:43:37.6467727Z Obtained: 0.1
2024-11-29T07:43:37.6467866Z Expected: 0.0029496118745031973 ± 2.9e-07
2024-11-29T07:43:37.6468663Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_lighteval|bigbench:tracking_shuffled_objects_five_objects|3_acc_stderr] - AssertionError: Model gpt2 on lite samples, for eval lighteval|bigbench:tracking_shuffled_objects_five_objects|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6468973Z assert 0.13333333333333333 == 0.004030304374777823 ± 4.0e-07
2024-11-29T07:43:37.6469052Z comparison failed
2024-11-29T07:43:37.6469132Z Obtained: 0.13333333333333333
2024-11-29T07:43:37.6469270Z Expected: 0.004030304374777823 ± 4.0e-07
2024-11-29T07:43:37.6470041Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_harness|bigbench:tracking_shuffled_objects_five_objects|3_acc_stderr] - AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:tracking_shuffled_objects_five_objects|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6470237Z assert 0.13333333333333333 == 0.003987436475939113 ± 4.0e-07
2024-11-29T07:43:37.6470316Z comparison failed
2024-11-29T07:43:37.6470404Z Obtained: 0.13333333333333333
2024-11-29T07:43:37.6470534Z Expected: 0.003987436475939113 ± 4.0e-07
2024-11-29T07:43:37.6471346Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_harness|bigbench:tracking_shuffled_objects_five_objects|3_acc_norm_stderr] - AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:tracking_shuffled_objects_five_objects|3, metric acc_norm_stderr incorrect
2024-11-29T07:43:37.6471489Z assert 0.1 == 0.00294083125...9783 ± 2.9e-07
2024-11-29T07:43:37.6471576Z comparison failed
2024-11-29T07:43:37.6471651Z Obtained: 0.1
2024-11-29T07:43:37.6471791Z Expected: 0.0029408312580779783 ± 2.9e-07
2024-11-29T07:43:37.6472593Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_lighteval|bigbench:tracking_shuffled_objects_seven_objects|3_acc_stderr] - AssertionError: Model gpt2 on lite samples, for eval lighteval|bigbench:tracking_shuffled_objects_seven_objects|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6472779Z assert 0.15275252316519464 == 0.004588830718970504 ± 4.6e-07
2024-11-29T07:43:37.6472856Z comparison failed
2024-11-29T07:43:37.6472940Z Obtained: 0.15275252316519464
2024-11-29T07:43:37.6473070Z Expected: 0.004588830718970504 ± 4.6e-07
2024-11-29T07:43:37.6473854Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_harness|bigbench:tracking_shuffled_objects_seven_objects|3_acc_stderr] - AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:tracking_shuffled_objects_seven_objects|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6473996Z assert 0.1 == 0.00301340183...0794 ± 3.0e-07
2024-11-29T07:43:37.6474078Z comparison failed
2024-11-29T07:43:37.6474153Z Obtained: 0.1
2024-11-29T07:43:37.6474289Z Expected: 0.0030134018302560794 ± 3.0e-07
2024-11-29T07:43:37.6475087Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_lighteval|bigbench:tracking_shuffled_objects_three_objects|3_acc_stderr] - AssertionError: Model gpt2 on lite samples, for eval lighteval|bigbench:tracking_shuffled_objects_three_objects|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6475260Z assert 0.1633 == 0.00504952053...2955 ± 5.0e-07
2024-11-29T07:43:37.6475338Z comparison failed
2024-11-29T07:43:37.6475419Z Obtained: 0.1633
2024-11-29T07:43:37.6475548Z Expected: 0.0050495205374032955 ± 5.0e-07
2024-11-29T07:43:37.6476334Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_harness|bigbench:tracking_shuffled_objects_three_objects|3_acc_stderr] - AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:tracking_shuffled_objects_three_objects|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6476515Z assert 0.13333333333333333 == 0.00393258505632432 ± 3.9e-07
2024-11-29T07:43:37.6476598Z comparison failed
2024-11-29T07:43:37.6476678Z Obtained: 0.13333333333333333
2024-11-29T07:43:37.6476812Z Expected: 0.00393258505632432 ± 3.9e-07
2024-11-29T07:43:37.6478003Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_harness|bigbench:tracking_shuffled_objects_three_objects|3_acc_norm_stderr] - AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:tracking_shuffled_objects_three_objects|3, metric acc_norm_stderr incorrect
2024-11-29T07:43:37.6478222Z assert 0.15275252316519466 == 0.004569520819574033 ± 4.6e-07
2024-11-29T07:43:37.6478303Z comparison failed
2024-11-29T07:43:37.6478528Z Obtained: 0.15275252316519466
2024-11-29T07:43:37.6478670Z Expected: 0.004569520819574033 ± 4.6e-07
2024-11-29T07:43:37.6478847Z ====== 90 failed, 482 passed, 4 skipped, 4 warnings in 2109.48s (0:35:09) ======
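For what it's worth, to debug one of these locally you can usually select a single failing case with pytest's -k filter instead of re-running the full suite, e.g. something like (test name abbreviated, adjust as needed):

python -m pytest tests/test_main.py -k "tracking_shuffled_objects_three_objects and acc_norm_stderr"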
Fixes issue #408.
Added a fix for the heavy recomputation of sample-level metrics.
There is no change for corpus-level metrics, but for sample-level metrics we can simply look up the already-computed values instead of recomputing them for each bootstrap sample, which can be prohibitively expensive for heavier metrics such as XCOMET-XXL.
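As a rough sketch of the idea (not the actual lighteval implementation; the function and variable names below are illustrative only), the bootstrap stderr of a sample-level metric only needs the per-sample scores that were already computed once, so each bootstrap iteration becomes a cheap lookup-and-average rather than a fresh metric call:

import random
import statistics

def bootstrap_stderr_from_scores(per_sample_scores, n_resamples=1000, seed=0):
    # per_sample_scores: the metric value already computed once per sample.
    # Resampling only indexes into this list; the (potentially expensive)
    # metric, e.g. XCOMET-XXL, is never invoked again.
    rng = random.Random(seed)
    n = len(per_sample_scores)
    resample_means = []
    for _ in range(n_resamples):
        resample = [per_sample_scores[rng.randrange(n)] for _ in range(n)]
        resample_means.append(sum(resample) / n)
    # The spread of the resampled means approximates the stderr of the mean.
    return statistics.stdev(resample_means)

Corpus-level metrics (e.g. BLEU, CHRF) do not benefit from this caching, since the metric itself has to be re-run on every resampled corpus.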