huggingface / lighteval

Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends
MIT License

Speed up Bootstrapping Computation #409

Open JoelNiklaus opened 3 days ago

JoelNiklaus commented 3 days ago

Fixes Issue 408.

Added a fix for the heavy recomputation of sample-level metrics.

No change for corpus-level metrics, but for sample-level metrics we can simply look up the precomputed values instead of recomputing them for each bootstrap sample, which can be prohibitively expensive for heavier metrics such as XCOMET-XXL.
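A minimal sketch of the idea (illustrative only, with hypothetical names rather than the actual lighteval code): the per-sample scores are computed once, and each bootstrap iteration only resamples those cached values instead of re-running the metric.

```python
import random
import statistics


def bootstrap_stderr_from_samples(sample_scores, iters=1000, seed=0):
    """Bootstrap stderr of the mean using precomputed per-sample scores.

    Each iteration only looks up cached values; the (potentially expensive)
    metric itself is never re-run.
    """
    rng = random.Random(seed)
    n = len(sample_scores)
    means = []
    for _ in range(iters):
        resample = [sample_scores[rng.randrange(n)] for _ in range(n)]  # lookup only
        means.append(sum(resample) / n)
    return statistics.stdev(means)


# Scores from an expensive sample-level metric (e.g. XCOMET-XXL), computed once per sample.
scores = [0.81, 0.92, 0.75, 0.88, 0.67]
print(bootstrap_stderr_from_samples(scores))
```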

clefourrier commented 3 days ago

Hi @JoelNiklaus, thanks a lot for this PR, I'll take a deeper look, hopefully before Monday.

clefourrier commented 3 days ago

(it's looking good from a first glance but I want to take some time to test it deeply)

clefourrier commented 3 days ago

Hi! You're getting quite a big difference in the bootstrap stderr compared to the results we hardcoded in our test suite (about an order of magnitude) - can you check why? (I was expecting a difference in a few decimal places, not something this large.)

HuggingFaceDocBuilderDev commented 3 days ago

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

JoelNiklaus commented 2 days ago

I am trying to run your tests with python -m pytest tests/test_main.py but it just hangs.

clefourrier commented 2 days ago

It should take some time (around 30 min if you're on CPU), as it first needs to generate a bunch of predictions using a gpt2 model. It will be much faster if you have a GPU available.

JoelNiklaus commented 2 days ago

I aborted it after 30 min on an A100 GPU, and I only ran the lite version.

JoelNiklaus commented 2 days ago

Which metrics did you check? BLEU or CHRF? For these, the original version computes corpus-level metrics; I switched to sample-level metrics to speed up computation. There it would make sense to me that the stderr differs, since the corpora differ across the sample distributions drawn during bootstrapping. For metrics that are already sample-level, I don't see a reason why they should differ.
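To illustrate the distinction (a rough sketch with made-up helper names, not the actual lighteval code): for a corpus-level metric each bootstrap iteration has to re-run the metric on the resampled corpus, whereas a sample-level metric only needs the mean of the cached per-sample scores, so the two stderr estimates need not agree.

```python
import random
import statistics


def bootstrap_stderr(items, statistic, iters=1000, seed=0):
    """Generic bootstrap stderr: `statistic` is applied to every resampled corpus."""
    rng = random.Random(seed)
    n = len(items)
    stats = []
    for _ in range(iters):
        resample = [items[rng.randrange(n)] for _ in range(n)]
        stats.append(statistic(resample))  # corpus-level metrics recompute here
    return statistics.stdev(stats)


# Corpus-level style: statistic would be something like corpus_bleu over (hyp, ref) pairs,
# recomputed on every resample. Sample-level style: statistic is just the mean of cached scores.
cached_scores = [0.41, 0.35, 0.52, 0.48]
print(bootstrap_stderr(cached_scores, statistics.mean))
```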

clefourrier commented 2 days ago

30 min on an A100 is not normal. I wonder if there's an issue with the command you're running. Let me share the raw logs with you. (Do you have the rights to access them, by clicking on "Details" next to the failing test in the checks list?)


2024-11-29T07:08:27.7720188Z ##[endgroup]
2024-11-29T07:08:34.7243761Z ============================= test session starts ==============================
2024-11-29T07:08:34.7245155Z platform linux -- Python 3.10.15, pytest-7.4.0, pluggy-1.5.0
2024-11-29T07:08:34.7245990Z rootdir: /home/runner/work/lighteval/lighteval
2024-11-29T07:08:34.7246821Z plugins: anyio-4.6.2.post1
2024-11-29T07:08:34.7247425Z collected 576 items
2024-11-29T07:08:34.7247713Z 
2024-11-29T07:43:28.1461575Z tests/test_main.py .F...F.F.F.F.F.F.F.F...F.F.F.F.F.F.F.F.F.F.F.F.F.F.F. [  9%]
2024-11-29T07:43:28.2989743Z F.F.F.F.F.F.F.F.F.F.F.F.F.F.F.F.F.F.F...F.F.F.F.F.F...F.F.F.F.F.F.F.F.F. [ 21%]
2024-11-29T07:43:28.4408527Z F.F.F.F.F.F.F.F.F.F.F.F.F.F.F.F.F.F.F.F.F.F...F.F.F.F.F.F.F...F.F.F      [ 33%]
2024-11-29T07:43:28.6269725Z tests/test_unit_base_metrics.py .........sss                             [ 35%]
2024-11-29T07:43:28.6296287Z tests/test_unit_harness_metrics.py .                                     [ 35%]
2024-11-29T07:43:28.6317847Z tests/test_unit_harness_prompts.py .                                     [ 35%]
2024-11-29T07:43:28.6340754Z tests/test_unit_harness_metrics.py .                                     [ 35%]
2024-11-29T07:43:28.6357885Z tests/test_unit_harness_prompts.py .                                     [ 36%]
2024-11-29T07:43:28.6380080Z tests/test_unit_harness_metrics.py .                                     [ 36%]
2024-11-29T07:43:28.6417832Z tests/test_unit_harness_prompts.py .                                     [ 36%]
2024-11-29T07:43:28.6442610Z tests/test_unit_harness_metrics.py .                                     [ 36%]
2024-11-29T07:43:28.6636677Z tests/test_unit_harness_prompts.py .                                     [ 36%]
2024-11-29T07:43:28.6657953Z tests/test_unit_harness_metrics.py .                                     [ 36%]
2024-11-29T07:43:28.6671555Z tests/test_unit_harness_prompts.py .                                     [ 37%]
2024-11-29T07:43:28.6693571Z tests/test_unit_harness_metrics.py .                                     [ 37%]
2024-11-29T07:43:28.6709853Z tests/test_unit_harness_prompts.py .                                     [ 37%]
2024-11-29T07:43:28.6733091Z tests/test_unit_harness_metrics.py .                                     [ 37%]
2024-11-29T07:43:28.6745199Z tests/test_unit_harness_prompts.py .                                     [ 37%]
2024-11-29T07:43:28.6766346Z tests/test_unit_harness_metrics.py .                                     [ 38%]
2024-11-29T07:43:28.6780038Z tests/test_unit_harness_prompts.py .                                     [ 38%]
2024-11-29T07:43:28.6801796Z tests/test_unit_harness_metrics.py .                                     [ 38%]
2024-11-29T07:43:28.6814506Z tests/test_unit_harness_prompts.py .                                     [ 38%]
2024-11-29T07:43:28.6837702Z tests/test_unit_harness_metrics.py .                                     [ 38%]
2024-11-29T07:43:28.6850587Z tests/test_unit_harness_prompts.py .                                     [ 38%]
2024-11-29T07:43:28.6871904Z tests/test_unit_harness_metrics.py .                                     [ 39%]
2024-11-29T07:43:28.6885430Z tests/test_unit_harness_prompts.py .                                     [ 39%]
2024-11-29T07:43:28.6908359Z tests/test_unit_harness_metrics.py .                                     [ 39%]
2024-11-29T07:43:28.6921181Z tests/test_unit_harness_prompts.py .                                     [ 39%]
2024-11-29T07:43:28.6942161Z tests/test_unit_harness_metrics.py .                                     [ 39%]
2024-11-29T07:43:28.6958442Z tests/test_unit_harness_prompts.py .                                     [ 39%]
2024-11-29T07:43:28.6980167Z tests/test_unit_harness_metrics.py .                                     [ 40%]
2024-11-29T07:43:28.6992955Z tests/test_unit_harness_prompts.py .                                     [ 40%]
2024-11-29T07:43:28.7016468Z tests/test_unit_harness_metrics.py .                                     [ 40%]
2024-11-29T07:43:28.7029699Z tests/test_unit_harness_prompts.py .                                     [ 40%]
2024-11-29T07:43:28.7052872Z tests/test_unit_harness_metrics.py .                                     [ 40%]
2024-11-29T07:43:28.7065482Z tests/test_unit_harness_prompts.py .                                     [ 40%]
2024-11-29T07:43:28.7089440Z tests/test_unit_harness_metrics.py .                                     [ 41%]
2024-11-29T07:43:28.7102445Z tests/test_unit_harness_prompts.py .                                     [ 41%]
2024-11-29T07:43:28.7126781Z tests/test_unit_harness_metrics.py .                                     [ 41%]
2024-11-29T07:43:28.7139341Z tests/test_unit_harness_prompts.py .                                     [ 41%]
2024-11-29T07:43:28.7165058Z tests/test_unit_harness_metrics.py .                                     [ 41%]
2024-11-29T07:43:28.7178028Z tests/test_unit_harness_prompts.py .                                     [ 42%]
2024-11-29T07:43:28.7202080Z tests/test_unit_harness_metrics.py .                                     [ 42%]
2024-11-29T07:43:28.7218747Z tests/test_unit_harness_prompts.py .                                     [ 42%]
2024-11-29T07:43:28.7241644Z tests/test_unit_harness_metrics.py .                                     [ 42%]
2024-11-29T07:43:28.7255891Z tests/test_unit_harness_prompts.py .                                     [ 42%]
2024-11-29T07:43:28.7278849Z tests/test_unit_harness_metrics.py .                                     [ 42%]
2024-11-29T07:43:28.7294533Z tests/test_unit_harness_prompts.py .                                     [ 43%]
2024-11-29T07:43:28.7330266Z tests/test_unit_harness_metrics.py .                                     [ 43%]
2024-11-29T07:43:28.7346074Z tests/test_unit_harness_prompts.py .                                     [ 43%]
2024-11-29T07:43:28.7381971Z tests/test_unit_harness_metrics.py .                                     [ 43%]
2024-11-29T07:43:28.7395807Z tests/test_unit_harness_prompts.py .                                     [ 43%]
2024-11-29T07:43:28.7431748Z tests/test_unit_harness_metrics.py .                                     [ 43%]
2024-11-29T07:43:28.7464638Z tests/test_unit_harness_prompts.py .                                     [ 44%]
2024-11-29T07:43:28.7488992Z tests/test_unit_harness_metrics.py .                                     [ 44%]
2024-11-29T07:43:28.7504537Z tests/test_unit_harness_prompts.py .                                     [ 44%]
2024-11-29T07:43:28.7527730Z tests/test_unit_harness_metrics.py .                                     [ 44%]
2024-11-29T07:43:28.7540260Z tests/test_unit_harness_prompts.py .                                     [ 44%]
2024-11-29T07:43:28.7562518Z tests/test_unit_harness_metrics.py .                                     [ 44%]
2024-11-29T07:43:28.7574965Z tests/test_unit_harness_prompts.py .                                     [ 45%]
2024-11-29T07:43:28.7596202Z tests/test_unit_harness_metrics.py .                                     [ 45%]
2024-11-29T07:43:28.7609130Z tests/test_unit_harness_prompts.py .                                     [ 45%]
2024-11-29T07:43:28.7630209Z tests/test_unit_harness_metrics.py .                                     [ 45%]
2024-11-29T07:43:28.7643498Z tests/test_unit_harness_prompts.py .                                     [ 45%]
2024-11-29T07:43:28.7665104Z tests/test_unit_harness_metrics.py .                                     [ 46%]
2024-11-29T07:43:28.7678031Z tests/test_unit_harness_prompts.py .                                     [ 46%]
2024-11-29T07:43:28.7699259Z tests/test_unit_harness_metrics.py .                                     [ 46%]
2024-11-29T07:43:28.7713917Z tests/test_unit_harness_prompts.py .                                     [ 46%]
2024-11-29T07:43:28.7735261Z tests/test_unit_harness_metrics.py .                                     [ 46%]
2024-11-29T07:43:28.7748116Z tests/test_unit_harness_prompts.py .                                     [ 46%]
2024-11-29T07:43:28.7769310Z tests/test_unit_harness_metrics.py .                                     [ 47%]
2024-11-29T07:43:28.7782113Z tests/test_unit_harness_prompts.py .                                     [ 47%]
2024-11-29T07:43:28.7805247Z tests/test_unit_harness_metrics.py .                                     [ 47%]
2024-11-29T07:43:28.7818063Z tests/test_unit_harness_prompts.py .                                     [ 47%]
2024-11-29T07:43:28.7839104Z tests/test_unit_harness_metrics.py .                                     [ 47%]
2024-11-29T07:43:28.7852034Z tests/test_unit_harness_prompts.py .                                     [ 47%]
2024-11-29T07:43:28.7873028Z tests/test_unit_harness_metrics.py .                                     [ 48%]
2024-11-29T07:43:28.7886250Z tests/test_unit_harness_prompts.py .                                     [ 48%]
2024-11-29T07:43:28.7912642Z tests/test_unit_harness_metrics.py .                                     [ 48%]
2024-11-29T07:43:28.8116588Z tests/test_unit_harness_prompts.py .                                     [ 48%]
2024-11-29T07:43:28.8139064Z tests/test_unit_harness_metrics.py .                                     [ 48%]
2024-11-29T07:43:28.8160162Z tests/test_unit_harness_prompts.py .                                     [ 48%]
2024-11-29T07:43:28.8181821Z tests/test_unit_harness_metrics.py .                                     [ 49%]
2024-11-29T07:43:28.8194800Z tests/test_unit_harness_prompts.py .                                     [ 49%]
2024-11-29T07:43:28.8216495Z tests/test_unit_harness_metrics.py .                                     [ 49%]
2024-11-29T07:43:28.8229316Z tests/test_unit_harness_prompts.py .                                     [ 49%]
2024-11-29T07:43:28.8250711Z tests/test_unit_harness_metrics.py .                                     [ 49%]
2024-11-29T07:43:28.8263436Z tests/test_unit_harness_prompts.py .                                     [ 50%]
2024-11-29T07:43:28.8285262Z tests/test_unit_harness_metrics.py .                                     [ 50%]
2024-11-29T07:43:28.8299482Z tests/test_unit_harness_prompts.py .                                     [ 50%]
2024-11-29T07:43:28.8321526Z tests/test_unit_harness_metrics.py .                                     [ 50%]
2024-11-29T07:43:28.8356679Z tests/test_unit_harness_prompts.py .                                     [ 50%]
2024-11-29T07:43:28.8378065Z tests/test_unit_harness_metrics.py .                                     [ 50%]
2024-11-29T07:43:28.8398278Z tests/test_unit_harness_prompts.py .                                     [ 51%]
2024-11-29T07:43:28.8419146Z tests/test_unit_harness_metrics.py .                                     [ 51%]
2024-11-29T07:43:28.8433857Z tests/test_unit_harness_prompts.py .                                     [ 51%]
2024-11-29T07:43:28.8455390Z tests/test_unit_harness_metrics.py .                                     [ 51%]
2024-11-29T07:43:28.8468444Z tests/test_unit_harness_prompts.py .                                     [ 51%]
2024-11-29T07:43:28.8490077Z tests/test_unit_harness_metrics.py .                                     [ 51%]
2024-11-29T07:43:28.8502997Z tests/test_unit_harness_prompts.py .                                     [ 52%]
2024-11-29T07:43:28.8524713Z tests/test_unit_harness_metrics.py .                                     [ 52%]
2024-11-29T07:43:28.8537241Z tests/test_unit_harness_prompts.py .                                     [ 52%]
2024-11-29T07:43:28.8558792Z tests/test_unit_harness_metrics.py .                                     [ 52%]
2024-11-29T07:43:28.8575050Z tests/test_unit_harness_prompts.py .                                     [ 52%]
2024-11-29T07:43:28.8596437Z tests/test_unit_harness_metrics.py .                                     [ 52%]
2024-11-29T07:43:28.8610039Z tests/test_unit_harness_prompts.py .                                     [ 53%]
2024-11-29T07:43:28.8631025Z tests/test_unit_harness_metrics.py .                                     [ 53%]
2024-11-29T07:43:28.8644056Z tests/test_unit_harness_prompts.py .                                     [ 53%]
2024-11-29T07:43:28.8665090Z tests/test_unit_harness_metrics.py .                                     [ 53%]
2024-11-29T07:43:28.8678070Z tests/test_unit_harness_prompts.py .                                     [ 53%]
2024-11-29T07:43:28.8699420Z tests/test_unit_harness_metrics.py .                                     [ 53%]
2024-11-29T07:43:28.8712319Z tests/test_unit_harness_prompts.py .                                     [ 54%]
2024-11-29T07:43:28.8733862Z tests/test_unit_harness_metrics.py .                                     [ 54%]
2024-11-29T07:43:28.9868071Z tests/test_unit_harness_prompts.py .                                     [ 54%]
2024-11-29T07:43:28.9890873Z tests/test_unit_harness_metrics.py .                                     [ 54%]
2024-11-29T07:43:28.9903844Z tests/test_unit_harness_prompts.py .                                     [ 54%]
2024-11-29T07:43:28.9927274Z tests/test_unit_harness_metrics.py .                                     [ 55%]
2024-11-29T07:43:28.9942995Z tests/test_unit_harness_prompts.py .                                     [ 55%]
2024-11-29T07:43:28.9964596Z tests/test_unit_harness_metrics.py .                                     [ 55%]
2024-11-29T07:43:28.9981129Z tests/test_unit_harness_prompts.py .                                     [ 55%]
2024-11-29T07:43:29.0002443Z tests/test_unit_harness_metrics.py .                                     [ 55%]
2024-11-29T07:43:29.0024719Z tests/test_unit_harness_prompts.py .                                     [ 55%]
2024-11-29T07:43:29.0046243Z tests/test_unit_harness_metrics.py .                                     [ 56%]
2024-11-29T07:43:29.0059105Z tests/test_unit_harness_prompts.py .                                     [ 56%]
2024-11-29T07:43:29.0080273Z tests/test_unit_harness_metrics.py .                                     [ 56%]
2024-11-29T07:43:29.0165262Z tests/test_unit_harness_prompts.py .                                     [ 56%]
2024-11-29T07:43:29.0186303Z tests/test_unit_harness_metrics.py .                                     [ 56%]
2024-11-29T07:43:29.0216954Z tests/test_unit_harness_prompts.py .                                     [ 56%]
2024-11-29T07:43:29.0238102Z tests/test_unit_harness_metrics.py .                                     [ 57%]
2024-11-29T07:43:29.0251189Z tests/test_unit_harness_prompts.py .                                     [ 57%]
2024-11-29T07:43:29.0272105Z tests/test_unit_harness_metrics.py .                                     [ 57%]
2024-11-29T07:43:29.0285383Z tests/test_unit_harness_prompts.py .                                     [ 57%]
2024-11-29T07:43:29.0306486Z tests/test_unit_harness_metrics.py .                                     [ 57%]
2024-11-29T07:43:29.0319497Z tests/test_unit_harness_prompts.py .                                     [ 57%]
2024-11-29T07:43:29.0340447Z tests/test_unit_harness_metrics.py .                                     [ 58%]
2024-11-29T07:43:29.0353097Z tests/test_unit_harness_prompts.py .                                     [ 58%]
2024-11-29T07:43:29.0374131Z tests/test_unit_harness_metrics.py .                                     [ 58%]
2024-11-29T07:43:29.0386914Z tests/test_unit_harness_prompts.py .                                     [ 58%]
2024-11-29T07:43:29.0408470Z tests/test_unit_harness_metrics.py .                                     [ 58%]
2024-11-29T07:43:29.0420934Z tests/test_unit_harness_prompts.py .                                     [ 59%]
2024-11-29T07:43:29.0441956Z tests/test_unit_harness_metrics.py .                                     [ 59%]
2024-11-29T07:43:29.0454504Z tests/test_unit_harness_prompts.py .                                     [ 59%]
2024-11-29T07:43:29.0475302Z tests/test_unit_harness_metrics.py .                                     [ 59%]
2024-11-29T07:43:29.0488403Z tests/test_unit_harness_prompts.py .                                     [ 59%]
2024-11-29T07:43:29.0510043Z tests/test_unit_harness_metrics.py .                                     [ 59%]
2024-11-29T07:43:29.0523273Z tests/test_unit_harness_prompts.py .                                     [ 60%]
2024-11-29T07:43:29.0543852Z tests/test_unit_harness_metrics.py .                                     [ 60%]
2024-11-29T07:43:29.0556996Z tests/test_unit_harness_prompts.py .                                     [ 60%]
2024-11-29T07:43:29.0577935Z tests/test_unit_harness_metrics.py .                                     [ 60%]
2024-11-29T07:43:29.0590646Z tests/test_unit_harness_prompts.py .                                     [ 60%]
2024-11-29T07:43:29.0611937Z tests/test_unit_harness_metrics.py .                                     [ 60%]
2024-11-29T07:43:29.0625526Z tests/test_unit_harness_prompts.py .                                     [ 61%]
2024-11-29T07:43:29.0646575Z tests/test_unit_harness_metrics.py .                                     [ 61%]
2024-11-29T07:43:29.0691670Z tests/test_unit_harness_prompts.py .                                     [ 61%]
2024-11-29T07:43:29.1441540Z tests/test_unit_harness_metrics.py ..................................... [ 67%]
2024-11-29T07:43:29.2921174Z ........................................................................ [ 80%]
2024-11-29T07:43:29.6749320Z ...................................................................      [ 92%]
2024-11-29T07:43:29.8402245Z tests/test_unit_reorder.py ..                                            [ 92%]
2024-11-29T07:43:29.8901694Z tests/logging/test_evaluation_tracker.py ...s                            [ 93%]
2024-11-29T07:43:31.7466737Z tests/metrics/test_metric_requests.py ...                                [ 93%]
2024-11-29T07:43:31.7499546Z tests/metrics/test_normalizations.py ....                                [ 94%]
2024-11-29T07:43:35.1781251Z tests/models/test_abstract_model.py .                                    [ 94%]
2024-11-29T07:43:36.2062836Z tests/models/test_base_model.py .                                        [ 94%]
2024-11-29T07:43:37.5228883Z tests/tasks/test_lighteval_task.py ..                                    [ 94%]
2024-11-29T07:43:37.5541528Z tests/tasks/test_registry.py ........                                    [ 96%]
2024-11-29T07:43:37.5572620Z tests/tasks/templates/test_continuation.py ....                          [ 97%]
2024-11-29T07:43:37.5591099Z tests/tasks/templates/test_copa.py ..                                    [ 97%]
2024-11-29T07:43:37.5622767Z tests/tasks/templates/test_hellaswag.py ....                             [ 98%]
2024-11-29T07:43:37.5659938Z tests/tasks/templates/test_multichoice.py .....                          [ 98%]
2024-11-29T07:43:37.5683530Z tests/tasks/templates/test_nli.py ...                                    [ 99%]
2024-11-29T07:43:37.5771004Z tests/tasks/templates/test_translation.py ...                            [100%]
2024-11-29T07:43:37.5771703Z 
2024-11-29T07:43:37.5771957Z =================================== FAILURES ===================================
2024-11-29T07:43:37.5772928Z ___ test_model_prediction[gpt2_lite_leaderboard|arc:challenge|25_acc_stderr] ___
2024-11-29T07:43:37.5773560Z 
2024-11-29T07:43:37.5775505Z model_input = ('gpt2', 'lite', 'leaderboard|arc:challenge|25', 'acc_stderr', functools.partial(<functools._lru_cache_wrapper object ...fled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.15275252316519466)
2024-11-29T07:43:37.5777471Z 
2024-11-29T07:43:37.5777880Z     @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.5778693Z     def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.5779605Z         """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.5780707Z         model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.5782007Z         prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.5782843Z >       assert reference == approx(
2024-11-29T07:43:37.5783380Z             prediction, rel=1e-4
2024-11-29T07:43:37.5784202Z         ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.5785593Z E       AssertionError: Model gpt2 on lite samples, for eval leaderboard|arc:challenge|25, metric acc_stderr incorrect
2024-11-29T07:43:37.5787101Z E       assert 0.15275252316519466 == 0.004569520819574033 ± 4.6e-07
2024-11-29T07:43:37.5787772Z E         comparison failed
2024-11-29T07:43:37.5788257Z E         Obtained: 0.15275252316519466
2024-11-29T07:43:37.5788826Z E         Expected: 0.004569520819574033 ± 4.6e-07
2024-11-29T07:43:37.5789153Z 
2024-11-29T07:43:37.5789319Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.5790029Z _ test_model_prediction[gpt2_lite_leaderboard|truthfulqa:mc|0_truthfulqa_mc1_stderr] _
2024-11-29T07:43:37.5790612Z 
2024-11-29T07:43:37.5791970Z model_input = ('gpt2', 'lite', 'leaderboard|truthfulqa:mc|0', 'truthfulqa_mc1_stderr', functools.partial(<functools._lru_cache_wrapp...fled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.15275252316519466)
2024-11-29T07:43:37.5793419Z 
2024-11-29T07:43:37.5793699Z     @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.5794349Z     def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.5795072Z         """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.5795897Z         model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.5796415Z         prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.5796803Z >       assert reference == approx(
2024-11-29T07:43:37.5797060Z             prediction, rel=1e-4
2024-11-29T07:43:37.5797720Z         ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.5798396Z E       AssertionError: Model gpt2 on lite samples, for eval leaderboard|truthfulqa:mc|0, metric truthfulqa_mc1_stderr incorrect
2024-11-29T07:43:37.5799037Z E       assert 0.15275252316519466 == 0.004619651629850591 ± 4.6e-07
2024-11-29T07:43:37.5799345Z E         comparison failed
2024-11-29T07:43:37.5799570Z E         Obtained: 0.15275252316519466
2024-11-29T07:43:37.5799885Z E         Expected: 0.004619651629850591 ± 4.6e-07
2024-11-29T07:43:37.5800075Z 
2024-11-29T07:43:37.5800173Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.5800564Z _ test_model_prediction[gpt2_lite_leaderboard|truthfulqa:mc|0_truthfulqa_mc2_stderr] _
2024-11-29T07:43:37.5800881Z 
2024-11-29T07:43:37.5801581Z model_input = ('gpt2', 'lite', 'leaderboard|truthfulqa:mc|0', 'truthfulqa_mc2_stderr', functools.partial(<functools._lru_cache_wrapp...fled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.14105533101540416)
2024-11-29T07:43:37.5802385Z 
2024-11-29T07:43:37.5802549Z     @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.5802923Z     def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.5803340Z         """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.5804015Z         model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.5804517Z         prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.5804889Z >       assert reference == approx(
2024-11-29T07:43:37.5805136Z             prediction, rel=1e-4
2024-11-29T07:43:37.5805516Z         ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.5806164Z E       AssertionError: Model gpt2 on lite samples, for eval leaderboard|truthfulqa:mc|0, metric truthfulqa_mc2_stderr incorrect
2024-11-29T07:43:37.5806922Z E       assert 0.14105533101540416 == 0.004258753966872427 ± 4.3e-07
2024-11-29T07:43:37.5807230Z E         comparison failed
2024-11-29T07:43:37.5807468Z E         Obtained: 0.14105533101540416
2024-11-29T07:43:37.5807776Z E         Expected: 0.004258753966872427 ± 4.3e-07
2024-11-29T07:43:37.5807958Z 
2024-11-29T07:43:37.5808056Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.5808427Z _____ test_model_prediction[gpt2_lite_leaderboard|hellaswag|10_acc_stderr] _____
2024-11-29T07:43:37.5808713Z 
2024-11-29T07:43:37.5809414Z model_input = ('gpt2', 'lite', 'leaderboard|hellaswag|10', 'acc_stderr', functools.partial(<functools._lru_cache_wrapper object at 0...fled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.16329931618554522)
2024-11-29T07:43:37.5810184Z 
2024-11-29T07:43:37.5810346Z     @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.5810721Z     def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.5811126Z         """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.5811619Z         model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.5812110Z         prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.5812527Z >       assert reference == approx(
2024-11-29T07:43:37.5812870Z             prediction, rel=1e-4
2024-11-29T07:43:37.5813529Z         ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.5814260Z E       AssertionError: Model gpt2 on lite samples, for eval leaderboard|hellaswag|10, metric acc_stderr incorrect
2024-11-29T07:43:37.5814851Z E       assert 0.16329931618554522 == 0.004968770338693327 ± 5.0e-07
2024-11-29T07:43:37.5815164Z E         comparison failed
2024-11-29T07:43:37.5815391Z E         Obtained: 0.16329931618554522
2024-11-29T07:43:37.5815698Z E         Expected: 0.004968770338693327 ± 5.0e-07
2024-11-29T07:43:37.5815879Z 
2024-11-29T07:43:37.5815978Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.5816346Z __ test_model_prediction[gpt2_lite_leaderboard|hellaswag|10_acc_norm_stderr] ___
2024-11-29T07:43:37.5816629Z 
2024-11-29T07:43:37.5817340Z model_input = ('gpt2', 'lite', 'leaderboard|hellaswag|10', 'acc_norm_stderr', functools.partial(<functools._lru_cache_wrapper object...fled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.16329931618554522)
2024-11-29T07:43:37.5818107Z 
2024-11-29T07:43:37.5818272Z     @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.5818638Z     def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.5819050Z         """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.5819553Z         model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.5820043Z         prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.5820411Z >       assert reference == approx(
2024-11-29T07:43:37.5820658Z             prediction, rel=1e-4
2024-11-29T07:43:37.5821036Z         ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.5821794Z E       AssertionError: Model gpt2 on lite samples, for eval leaderboard|hellaswag|10, metric acc_norm_stderr incorrect
2024-11-29T07:43:37.5822396Z E       assert 0.16329931618554522 == 0.004785693561320304 ± 4.8e-07
2024-11-29T07:43:37.5822701Z E         comparison failed
2024-11-29T07:43:37.5822926Z E         Obtained: 0.16329931618554522
2024-11-29T07:43:37.5823234Z E         Expected: 0.004785693561320304 ± 4.8e-07
2024-11-29T07:43:37.5823423Z 
2024-11-29T07:43:37.5823517Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.5824022Z _ test_model_prediction[gpt2_lite_leaderboard|mmlu:abstract_algebra|5_acc_stderr] _
2024-11-29T07:43:37.5824328Z 
2024-11-29T07:43:37.5825018Z model_input = ('gpt2', 'lite', 'leaderboard|mmlu:abstract_algebra|5', 'acc_stderr', functools.partial(<functools._lru_cache_wrapper ...fled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.16329931618554522)
2024-11-29T07:43:37.5825793Z 
2024-11-29T07:43:37.5825950Z     @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.5826324Z     def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.5826733Z         """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.5827223Z         model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.5827721Z         prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.5828098Z >       assert reference == approx(
2024-11-29T07:43:37.5828344Z             prediction, rel=1e-4
2024-11-29T07:43:37.5828722Z         ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.5829365Z E       AssertionError: Model gpt2 on lite samples, for eval leaderboard|mmlu:abstract_algebra|5, metric acc_stderr incorrect
2024-11-29T07:43:37.5829988Z E       assert 0.16329931618554522 == 0.00497641685...3716 ± 5.0e-07
2024-11-29T07:43:37.5830294Z E         comparison failed
2024-11-29T07:43:37.5830524Z E         Obtained: 0.16329931618554522
2024-11-29T07:43:37.5830835Z E         Expected: 0.0049764168560043716 ± 5.0e-07
2024-11-29T07:43:37.5831016Z 
2024-11-29T07:43:37.5831115Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.5831500Z _ test_model_prediction[gpt2_lite_leaderboard|mmlu:college_chemistry|5_acc_stderr] _
2024-11-29T07:43:37.5831802Z 
2024-11-29T07:43:37.5832518Z model_input = ('gpt2', 'lite', 'leaderboard|mmlu:college_chemistry|5', 'acc_stderr', functools.partial(<functools._lru_cache_wrapper...fled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.09999999999999999)
2024-11-29T07:43:37.5833789Z 
2024-11-29T07:43:37.5833958Z     @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.5834323Z     def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.5834731Z         """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.5835285Z         model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.5835772Z         prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.5836148Z >       assert reference == approx(
2024-11-29T07:43:37.5836401Z             prediction, rel=1e-4
2024-11-29T07:43:37.5836777Z         ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.5837691Z E       AssertionError: Model gpt2 on lite samples, for eval leaderboard|mmlu:college_chemistry|5, metric acc_stderr incorrect
2024-11-29T07:43:37.5838334Z E       assert 0.09999999999999999 == 0.002971327782118411 ± 3.0e-07
2024-11-29T07:43:37.5838650Z E         comparison failed
2024-11-29T07:43:37.5838879Z E         Obtained: 0.09999999999999999
2024-11-29T07:43:37.5839189Z E         Expected: 0.002971327782118411 ± 3.0e-07
2024-11-29T07:43:37.5839377Z 
2024-11-29T07:43:37.5839473Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.5840039Z _ test_model_prediction[gpt2_lite_leaderboard|mmlu:computer_security|5_acc_stderr] _
2024-11-29T07:43:37.5840358Z 
2024-11-29T07:43:37.5841054Z model_input = ('gpt2', 'lite', 'leaderboard|mmlu:computer_security|5', 'acc_stderr', functools.partial(<functools._lru_cache_wrapper...fled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.09999999999999999)
2024-11-29T07:43:37.5842026Z 
2024-11-29T07:43:37.5842188Z     @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.5842559Z     def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.5842968Z         """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.5843463Z         model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.5843953Z         prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.5844335Z >       assert reference == approx(
2024-11-29T07:43:37.5844592Z             prediction, rel=1e-4
2024-11-29T07:43:37.5844981Z         ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.5845630Z E       AssertionError: Model gpt2 on lite samples, for eval leaderboard|mmlu:computer_security|5, metric acc_stderr incorrect
2024-11-29T07:43:37.5846251Z E       assert 0.09999999999999999 == 0.00301340183...0794 ± 3.0e-07
2024-11-29T07:43:37.5846564Z E         comparison failed
2024-11-29T07:43:37.5846787Z E         Obtained: 0.09999999999999999
2024-11-29T07:43:37.5847095Z E         Expected: 0.0030134018302560794 ± 3.0e-07
2024-11-29T07:43:37.5847280Z 
2024-11-29T07:43:37.5847384Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.5847771Z _ test_model_prediction[gpt2_lite_leaderboard|mmlu:us_foreign_policy|5_acc_stderr] _
2024-11-29T07:43:37.5848079Z 
2024-11-29T07:43:37.5848770Z model_input = ('gpt2', 'lite', 'leaderboard|mmlu:us_foreign_policy|5', 'acc_stderr', functools.partial(<functools._lru_cache_wrapper...fled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.15275252316519466)
2024-11-29T07:43:37.5849541Z 
2024-11-29T07:43:37.5849700Z     @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.5850066Z     def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.5850472Z         """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.5851029Z         model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.5851903Z         prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.5852548Z >       assert reference == approx(
2024-11-29T07:43:37.5852798Z             prediction, rel=1e-4
2024-11-29T07:43:37.5853178Z         ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.5853825Z E       AssertionError: Model gpt2 on lite samples, for eval leaderboard|mmlu:us_foreign_policy|5, metric acc_stderr incorrect
2024-11-29T07:43:37.5854452Z E       assert 0.15275252316519466 == 0.004633704913049727 ± 4.6e-07
2024-11-29T07:43:37.5854757Z E         comparison failed
2024-11-29T07:43:37.5854986Z E         Obtained: 0.15275252316519466
2024-11-29T07:43:37.5855297Z E         Expected: 0.004633704913049727 ± 4.6e-07
2024-11-29T07:43:37.5855486Z 
2024-11-29T07:43:37.5855585Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.5855954Z ___ test_model_prediction[gpt2_lite_helm|mmlu:abstract_algebra|5_em_stderr] ____
2024-11-29T07:43:37.5856244Z 
2024-11-29T07:43:37.5856924Z model_input = ('gpt2', 'lite', 'helm|mmlu:abstract_algebra|5', 'em_stderr', functools.partial(<functools._lru_cache_wrapper object a...fled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.16329931618554522)
2024-11-29T07:43:37.5857697Z 
2024-11-29T07:43:37.5858009Z     @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.5858382Z     def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.5858783Z         """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.5859278Z         model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.5859767Z         prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.5860443Z >       assert reference == approx(
2024-11-29T07:43:37.5860690Z             prediction, rel=1e-4
2024-11-29T07:43:37.5861067Z         ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.5861671Z E       AssertionError: Model gpt2 on lite samples, for eval helm|mmlu:abstract_algebra|5, metric em_stderr incorrect
2024-11-29T07:43:37.5862260Z E       assert 0.16329931618554522 == 0.00497641685...3716 ± 5.0e-07
2024-11-29T07:43:37.5862574Z E         comparison failed
2024-11-29T07:43:37.5862802Z E         Obtained: 0.16329931618554522
2024-11-29T07:43:37.5863110Z E         Expected: 0.0049764168560043716 ± 5.0e-07
2024-11-29T07:43:37.5863292Z 
2024-11-29T07:43:37.5863390Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.5863760Z __ test_model_prediction[gpt2_lite_helm|mmlu:abstract_algebra|5_pqem_stderr] ___
2024-11-29T07:43:37.5864053Z 
2024-11-29T07:43:37.5864743Z model_input = ('gpt2', 'lite', 'helm|mmlu:abstract_algebra|5', 'pqem_stderr', functools.partial(<functools._lru_cache_wrapper object...fled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.16329931618554522)
2024-11-29T07:43:37.5865529Z 
2024-11-29T07:43:37.5865689Z     @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.5866054Z     def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.5866464Z         """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.5866956Z         model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.5867445Z         prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.5867820Z >       assert reference == approx(
2024-11-29T07:43:37.5868061Z             prediction, rel=1e-4
2024-11-29T07:43:37.5868436Z         ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.5869052Z E       AssertionError: Model gpt2 on lite samples, for eval helm|mmlu:abstract_algebra|5, metric pqem_stderr incorrect
2024-11-29T07:43:37.5869640Z E       assert 0.16329931618554522 == 0.00497641685...3716 ± 5.0e-07
2024-11-29T07:43:37.5869948Z E         comparison failed
2024-11-29T07:43:37.5870174Z E         Obtained: 0.16329931618554522
2024-11-29T07:43:37.5870481Z E         Expected: 0.0049764168560043716 ± 5.0e-07
2024-11-29T07:43:37.5870669Z 
2024-11-29T07:43:37.5870772Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.5871224Z ___ test_model_prediction[gpt2_lite_helm|mmlu:college_chemistry|5_em_stderr] ___
2024-11-29T07:43:37.5871743Z 
2024-11-29T07:43:37.5872652Z model_input = ('gpt2', 'lite', 'helm|mmlu:college_chemistry|5', 'em_stderr', functools.partial(<functools._lru_cache_wrapper object ...fled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.15275252316519466)
2024-11-29T07:43:37.5873425Z 
2024-11-29T07:43:37.5873581Z     @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.5873947Z     def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.5874352Z         """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.5874845Z         model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.5875468Z         prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.5875857Z >       assert reference == approx(
2024-11-29T07:43:37.5876102Z             prediction, rel=1e-4
2024-11-29T07:43:37.5876476Z         ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.5877083Z E       AssertionError: Model gpt2 on lite samples, for eval helm|mmlu:college_chemistry|5, metric em_stderr incorrect
2024-11-29T07:43:37.5878044Z E       assert 0.15275252316519466 == 0.00457283509661358 ± 4.6e-07
2024-11-29T07:43:37.5878353Z E         comparison failed
2024-11-29T07:43:37.5878580Z E         Obtained: 0.15275252316519466
2024-11-29T07:43:37.5878887Z E         Expected: 0.00457283509661358 ± 4.6e-07
2024-11-29T07:43:37.5879074Z 
2024-11-29T07:43:37.5879170Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.5879541Z __ test_model_prediction[gpt2_lite_helm|mmlu:college_chemistry|5_pqem_stderr] __
2024-11-29T07:43:37.5879830Z 
2024-11-29T07:43:37.5880521Z model_input = ('gpt2', 'lite', 'helm|mmlu:college_chemistry|5', 'pqem_stderr', functools.partial(<functools._lru_cache_wrapper objec...fled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.16329931618554522)
2024-11-29T07:43:37.5881299Z 
2024-11-29T07:43:37.5881458Z     @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.5881825Z     def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.5882237Z         """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.5882728Z         model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.5883213Z         prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.5883588Z >       assert reference == approx(
2024-11-29T07:43:37.5883836Z             prediction, rel=1e-4
2024-11-29T07:43:37.5884220Z         ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.5884831Z E       AssertionError: Model gpt2 on lite samples, for eval helm|mmlu:college_chemistry|5, metric pqem_stderr incorrect
2024-11-29T07:43:37.5885422Z E       assert 0.16329931618554522 == 0.004802280906184263 ± 4.8e-07
2024-11-29T07:43:37.5885732Z E         comparison failed
2024-11-29T07:43:37.5885956Z E         Obtained: 0.16329931618554522
2024-11-29T07:43:37.5886267Z E         Expected: 0.004802280906184263 ± 4.8e-07
2024-11-29T07:43:37.5886457Z 
2024-11-29T07:43:37.5886551Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.5886918Z ___ test_model_prediction[gpt2_lite_helm|mmlu:computer_security|5_em_stderr] ___
2024-11-29T07:43:37.5887212Z 
2024-11-29T07:43:37.5887893Z model_input = ('gpt2', 'lite', 'helm|mmlu:computer_security|5', 'em_stderr', functools.partial(<functools._lru_cache_wrapper object ...fled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.09999999999999999)
2024-11-29T07:43:37.5888674Z 
2024-11-29T07:43:37.5888958Z     @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.5889593Z     def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.5890150Z         """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.5890644Z         model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.5891136Z         prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.5891542Z >       assert reference == approx(
2024-11-29T07:43:37.5891788Z             prediction, rel=1e-4
2024-11-29T07:43:37.5892167Z         ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.5892776Z E       AssertionError: Model gpt2 on lite samples, for eval helm|mmlu:computer_security|5, metric em_stderr incorrect
2024-11-29T07:43:37.5893541Z E       assert 0.09999999999999999 == 0.00301340183...0794 ± 3.0e-07
2024-11-29T07:43:37.5893854Z E         comparison failed
2024-11-29T07:43:37.5894080Z E         Obtained: 0.09999999999999999
2024-11-29T07:43:37.5894393Z E         Expected: 0.0030134018302560794 ± 3.0e-07
2024-11-29T07:43:37.5894575Z 
2024-11-29T07:43:37.5894675Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.5895043Z __ test_model_prediction[gpt2_lite_helm|mmlu:computer_security|5_pqem_stderr] __
2024-11-29T07:43:37.5895464Z 
2024-11-29T07:43:37.5896154Z model_input = ('gpt2', 'lite', 'helm|mmlu:computer_security|5', 'pqem_stderr', functools.partial(<functools._lru_cache_wrapper objec...fled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.15275252316519464)
2024-11-29T07:43:37.5896915Z 
2024-11-29T07:43:37.5897075Z     @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.5897448Z     def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.5897862Z         """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.5898352Z         model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.5898841Z         prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.5899213Z >       assert reference == approx(
2024-11-29T07:43:37.5899462Z             prediction, rel=1e-4
2024-11-29T07:43:37.5899846Z         ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.5900461Z E       AssertionError: Model gpt2 on lite samples, for eval helm|mmlu:computer_security|5, metric pqem_stderr incorrect
2024-11-29T07:43:37.5901053Z E       assert 0.15275252316519464 == 0.004774519456032858 ± 4.8e-07
2024-11-29T07:43:37.5901360Z E         comparison failed
2024-11-29T07:43:37.5901589Z E         Obtained: 0.15275252316519464
2024-11-29T07:43:37.5901901Z E         Expected: 0.004774519456032858 ± 4.8e-07
2024-11-29T07:43:37.5902091Z 
2024-11-29T07:43:37.5902193Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.5902558Z ___ test_model_prediction[gpt2_lite_helm|mmlu:us_foreign_policy|5_em_stderr] ___
2024-11-29T07:43:37.5902842Z 
2024-11-29T07:43:37.5903512Z model_input = ('gpt2', 'lite', 'helm|mmlu:us_foreign_policy|5', 'em_stderr', functools.partial(<functools._lru_cache_wrapper object ...fled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.15275252316519466)
2024-11-29T07:43:37.5904269Z 
2024-11-29T07:43:37.5904431Z     @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.5904795Z     def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.5905199Z         """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.5905684Z         model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.5906174Z         prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.5906553Z >       assert reference == approx(
2024-11-29T07:43:37.5906888Z             prediction, rel=1e-4
2024-11-29T07:43:37.5907289Z         ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.5907895Z E       AssertionError: Model gpt2 on lite samples, for eval helm|mmlu:us_foreign_policy|5, metric em_stderr incorrect
2024-11-29T07:43:37.5908875Z E       assert 0.15275252316519466 == 0.004633704913049727 ± 4.6e-07
2024-11-29T07:43:37.5909278Z E         comparison failed
2024-11-29T07:43:37.5909502Z E         Obtained: 0.15275252316519466
2024-11-29T07:43:37.5909815Z E         Expected: 0.004633704913049727 ± 4.6e-07
2024-11-29T07:43:37.5910000Z 
2024-11-29T07:43:37.5910093Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.5910460Z __ test_model_prediction[gpt2_lite_helm|mmlu:us_foreign_policy|5_pqem_stderr] __
2024-11-29T07:43:37.5910763Z 
2024-11-29T07:43:37.5911603Z model_input = ('gpt2', 'lite', 'helm|mmlu:us_foreign_policy|5', 'pqem_stderr', functools.partial(<functools._lru_cache_wrapper objec...fled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.16329931618554522)
2024-11-29T07:43:37.5912371Z 
2024-11-29T07:43:37.5912529Z     @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.5912896Z     def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.5913414Z         """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.5913900Z         model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.5914388Z         prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.5916656Z >       assert reference == approx(
2024-11-29T07:43:37.5917064Z             prediction, rel=1e-4
2024-11-29T07:43:37.5918084Z         ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.5919133Z E       AssertionError: Model gpt2 on lite samples, for eval helm|mmlu:us_foreign_policy|5, metric pqem_stderr incorrect
2024-11-29T07:43:37.5920179Z E       assert 0.16329931618554522 == 0.004872014627084626 ± 4.9e-07
2024-11-29T07:43:37.5920679Z E         comparison failed
2024-11-29T07:43:37.5920915Z E         Obtained: 0.16329931618554522
2024-11-29T07:43:37.5921263Z E         Expected: 0.004872014627084626 ± 4.9e-07
2024-11-29T07:43:37.5921462Z 
2024-11-29T07:43:37.5921562Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.5921935Z _______ test_model_prediction[gpt2_lite_lighteval|anli:r1|0_acc_stderr] ________
2024-11-29T07:43:37.5922223Z 
2024-11-29T07:43:37.5922924Z model_input = ('gpt2', 'lite', 'lighteval|anli:r1|0', 'acc_stderr', functools.partial(<functools._lru_cache_wrapper object at 0x7f63...fled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.16666666666666666)
2024-11-29T07:43:37.5923695Z 
2024-11-29T07:43:37.5923860Z     @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.5924234Z     def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.5924653Z         """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.5925153Z         model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.5925689Z         prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.5926348Z >       assert reference == approx(
2024-11-29T07:43:37.5926770Z             prediction, rel=1e-4
2024-11-29T07:43:37.5927162Z         ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.5927752Z E       AssertionError: Model gpt2 on lite samples, for eval lighteval|anli:r1|0, metric acc_stderr incorrect
2024-11-29T07:43:37.5928323Z E       assert 0.16666666666666666 == 0.00514299138248941 ± 5.1e-07
2024-11-29T07:43:37.5928631Z E         comparison failed
2024-11-29T07:43:37.5928857Z E         Obtained: 0.16666666666666666
2024-11-29T07:43:37.5929167Z E         Expected: 0.00514299138248941 ± 5.1e-07
2024-11-29T07:43:37.5929345Z 
2024-11-29T07:43:37.5929446Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.5929815Z _ test_model_prediction[gpt2_lite_lighteval|blimp:adjunct_island|0_acc_stderr] _
2024-11-29T07:43:37.5930114Z 
2024-11-29T07:43:37.5930809Z model_input = ('gpt2', 'lite', 'lighteval|blimp:adjunct_island|0', 'acc_stderr', functools.partial(<functools._lru_cache_wrapper obj...fled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.13333333333333333)
2024-11-29T07:43:37.5931601Z 
2024-11-29T07:43:37.5931763Z     @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.5932136Z     def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.5932756Z         """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.5933259Z         model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.5933742Z         prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.5934112Z >       assert reference == approx(
2024-11-29T07:43:37.5934362Z             prediction, rel=1e-4
2024-11-29T07:43:37.5934878Z         ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.5935493Z E       AssertionError: Model gpt2 on lite samples, for eval lighteval|blimp:adjunct_island|0, metric acc_stderr incorrect
2024-11-29T07:43:37.5936097Z E       assert 0.13333333333333333 == 0.003921139545506534 ± 3.9e-07
2024-11-29T07:43:37.5936407Z E         comparison failed
2024-11-29T07:43:37.5936634Z E         Obtained: 0.13333333333333333
2024-11-29T07:43:37.5936950Z E         Expected: 0.003921139545506534 ± 3.9e-07
2024-11-29T07:43:37.5937132Z 
2024-11-29T07:43:37.5937231Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.5937609Z _ test_model_prediction[gpt2_lite_lighteval|blimp:ellipsis_n_bar_1|0_acc_stderr] _
2024-11-29T07:43:37.5937912Z 
2024-11-29T07:43:37.5938588Z model_input = ('gpt2', 'lite', 'lighteval|blimp:ellipsis_n_bar_1|0', 'acc_stderr', functools.partial(<functools._lru_cache_wrapper o...fled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.15275252316519466)
2024-11-29T07:43:37.5939340Z 
2024-11-29T07:43:37.5939501Z     @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.5939864Z     def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.5940267Z         """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.5940751Z         model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.5941238Z         prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.5941606Z >       assert reference == approx(
2024-11-29T07:43:37.5941854Z             prediction, rel=1e-4
2024-11-29T07:43:37.5942231Z         ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.5942855Z E       AssertionError: Model gpt2 on lite samples, for eval lighteval|blimp:ellipsis_n_bar_1|0, metric acc_stderr incorrect
2024-11-29T07:43:37.5943750Z E       assert 0.15275252316519466 == 0.004709524351738684 ± 4.7e-07
2024-11-29T07:43:37.5944254Z E         comparison failed
2024-11-29T07:43:37.5944480Z E         Obtained: 0.15275252316519466
2024-11-29T07:43:37.5944790Z E         Expected: 0.004709524351738684 ± 4.7e-07
2024-11-29T07:43:37.5944969Z 
2024-11-29T07:43:37.5945068Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.5945421Z ___________ test_model_prediction[gpt2_lite_helm|boolq|5_em_stderr] ____________
2024-11-29T07:43:37.5945695Z 
2024-11-29T07:43:37.5946392Z model_input = ('gpt2', 'lite', 'helm|boolq|5', 'em_stderr', functools.partial(<functools._lru_cache_wrapper object at 0x7f6332c883b0...fled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.16329931618554522)
2024-11-29T07:43:37.5947153Z 
2024-11-29T07:43:37.5947321Z     @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.5947704Z     def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.5948126Z         """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.5948620Z         model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.5949106Z         prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.5949481Z >       assert reference == approx(
2024-11-29T07:43:37.5949729Z             prediction, rel=1e-4
2024-11-29T07:43:37.5950234Z         ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.5950802Z E       AssertionError: Model gpt2 on lite samples, for eval helm|boolq|5, metric em_stderr incorrect
2024-11-29T07:43:37.5951344Z E       assert 0.16329931618554522 == 0.004922599108396079 ± 4.9e-07
2024-11-29T07:43:37.5951646Z E         comparison failed
2024-11-29T07:43:37.5951872Z E         Obtained: 0.16329931618554522
2024-11-29T07:43:37.5952301Z E         Expected: 0.004922599108396079 ± 4.9e-07
2024-11-29T07:43:37.5952485Z 
2024-11-29T07:43:37.5952579Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.5952930Z ___________ test_model_prediction[gpt2_lite_helm|boolq|5_qem_stderr] ___________
2024-11-29T07:43:37.5953203Z 
2024-11-29T07:43:37.5953893Z model_input = ('gpt2', 'lite', 'helm|boolq|5', 'qem_stderr', functools.partial(<functools._lru_cache_wrapper object at 0x7f6332c883b...fled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.16329931618554522)
2024-11-29T07:43:37.5954653Z 
2024-11-29T07:43:37.5954810Z     @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.5955176Z     def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.5955574Z         """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.5956067Z         model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.5956567Z         prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.5956939Z >       assert reference == approx(
2024-11-29T07:43:37.5957184Z             prediction, rel=1e-4
2024-11-29T07:43:37.5957832Z         ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.5958395Z E       AssertionError: Model gpt2 on lite samples, for eval helm|boolq|5, metric qem_stderr incorrect
2024-11-29T07:43:37.5958946Z E       assert 0.16329931618554522 == 0.004922599108396079 ± 4.9e-07
2024-11-29T07:43:37.5959248Z E         comparison failed
2024-11-29T07:43:37.5959470Z E         Obtained: 0.16329931618554522
2024-11-29T07:43:37.5959777Z E         Expected: 0.004922599108396079 ± 4.9e-07
2024-11-29T07:43:37.5959961Z 
2024-11-29T07:43:37.5960055Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.5960406Z ___________ test_model_prediction[gpt2_lite_helm|boolq|5_pem_stderr] ___________
2024-11-29T07:43:37.5960693Z 
2024-11-29T07:43:37.5961374Z model_input = ('gpt2', 'lite', 'helm|boolq|5', 'pem_stderr', functools.partial(<functools._lru_cache_wrapper object at 0x7f6332c883b...fled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.16329931618554522)
2024-11-29T07:43:37.5962133Z 
2024-11-29T07:43:37.5962291Z     @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.5962659Z     def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.5963313Z         """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.5964188Z         model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.5964689Z         prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.5965067Z >       assert reference == approx(
2024-11-29T07:43:37.5965320Z             prediction, rel=1e-4
2024-11-29T07:43:37.5965701Z         ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.5966267Z E       AssertionError: Model gpt2 on lite samples, for eval helm|boolq|5, metric pem_stderr incorrect
2024-11-29T07:43:37.5966818Z E       assert 0.16329931618554522 == 0.004922599108396079 ± 4.9e-07
2024-11-29T07:43:37.5967128Z E         comparison failed
2024-11-29T07:43:37.5967355Z E         Obtained: 0.16329931618554522
2024-11-29T07:43:37.5967843Z E         Expected: 0.004922599108396079 ± 4.9e-07
2024-11-29T07:43:37.5968029Z 
2024-11-29T07:43:37.5968129Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.5968477Z __________ test_model_prediction[gpt2_lite_helm|boolq|5_pqem_stderr] ___________
2024-11-29T07:43:37.5968758Z 
2024-11-29T07:43:37.5969434Z model_input = ('gpt2', 'lite', 'helm|boolq|5', 'pqem_stderr', functools.partial(<functools._lru_cache_wrapper object at 0x7f6332c883...fled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.16329931618554522)
2024-11-29T07:43:37.5970369Z 
2024-11-29T07:43:37.5970539Z     @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.5970908Z     def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.5971320Z         """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.5971814Z         model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.5972313Z         prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.5972692Z >       assert reference == approx(
2024-11-29T07:43:37.5972940Z             prediction, rel=1e-4
2024-11-29T07:43:37.5973323Z         ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.5973887Z E       AssertionError: Model gpt2 on lite samples, for eval helm|boolq|5, metric pqem_stderr incorrect
2024-11-29T07:43:37.5974436Z E       assert 0.16329931618554522 == 0.004922599108396079 ± 4.9e-07
2024-11-29T07:43:37.5974748Z E         comparison failed
2024-11-29T07:43:37.5974977Z E         Obtained: 0.16329931618554522
2024-11-29T07:43:37.5975287Z E         Expected: 0.004922599108396079 ± 4.9e-07
2024-11-29T07:43:37.5975468Z 
2024-11-29T07:43:37.5975567Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.5975939Z ___ test_model_prediction[gpt2_lite_lighteval|agieval:aqua-rat|0_acc_stderr] ___
2024-11-29T07:43:37.5976228Z 
2024-11-29T07:43:37.5976938Z model_input = ('gpt2', 'lite', 'lighteval|agieval:aqua-rat|0', 'acc_stderr', functools.partial(<functools._lru_cache_wrapper object ...racking_shuffled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.15275)
2024-11-29T07:43:37.5977707Z 
2024-11-29T07:43:37.5977870Z     @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.5978250Z     def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.5978655Z         """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.5979154Z         model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.5979649Z         prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.5980019Z >       assert reference == approx(
2024-11-29T07:43:37.5980383Z             prediction, rel=1e-4
2024-11-29T07:43:37.5981044Z         ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.5981772Z E       AssertionError: Model gpt2 on lite samples, for eval lighteval|agieval:aqua-rat|0, metric acc_stderr incorrect
2024-11-29T07:43:37.5982342Z E       assert 0.15275 == 0.004774519456032858 ± 4.8e-07
2024-11-29T07:43:37.5982620Z E         comparison failed
2024-11-29T07:43:37.5982841Z E         Obtained: 0.15275
2024-11-29T07:43:37.5983125Z E         Expected: 0.004774519456032858 ± 4.8e-07
2024-11-29T07:43:37.5983305Z 
2024-11-29T07:43:37.5983407Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.5983791Z _ test_model_prediction[gpt2_lite_lighteval|agieval:aqua-rat|0_acc_norm_stderr] _
2024-11-29T07:43:37.5984080Z 
2024-11-29T07:43:37.5984781Z model_input = ('gpt2', 'lite', 'lighteval|agieval:aqua-rat|0', 'acc_norm_stderr', functools.partial(<functools._lru_cache_wrapper ob...racking_shuffled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.15275)
2024-11-29T07:43:37.5985683Z 
2024-11-29T07:43:37.5985853Z     @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.5986220Z     def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.5986624Z         """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.5987118Z         model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.5987721Z         prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.5988105Z >       assert reference == approx(
2024-11-29T07:43:37.5988351Z             prediction, rel=1e-4
2024-11-29T07:43:37.5988735Z         ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.5989360Z E       AssertionError: Model gpt2 on lite samples, for eval lighteval|agieval:aqua-rat|0, metric acc_norm_stderr incorrect
2024-11-29T07:43:37.5989936Z E       assert 0.15275 == 0.004774519456032858 ± 4.8e-07
2024-11-29T07:43:37.5990211Z E         comparison failed
2024-11-29T07:43:37.5990430Z E         Obtained: 0.15275
2024-11-29T07:43:37.5990709Z E         Expected: 0.004774519456032858 ± 4.8e-07
2024-11-29T07:43:37.5990892Z 
2024-11-29T07:43:37.5990986Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.5991518Z __ test_model_prediction[gpt2_lite_lighteval|agieval:logiqa-en|0_acc_stderr] ___
2024-11-29T07:43:37.5991818Z 
2024-11-29T07:43:37.5992516Z model_input = ('gpt2', 'lite', 'lighteval|agieval:logiqa-en|0', 'acc_stderr', functools.partial(<functools._lru_cache_wrapper object...ch:tracking_shuffled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.1)
2024-11-29T07:43:37.5993285Z 
2024-11-29T07:43:37.5993443Z     @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.5993809Z     def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.5994216Z         """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.5994701Z         model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.5995188Z         prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.5995563Z >       assert reference == approx(
2024-11-29T07:43:37.5995817Z             prediction, rel=1e-4
2024-11-29T07:43:37.5996203Z         ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.5996818Z E       AssertionError: Model gpt2 on lite samples, for eval lighteval|agieval:logiqa-en|0, metric acc_stderr incorrect
2024-11-29T07:43:37.5997608Z E       assert 0.1 == 0.00309049205...1304 ± 3.1e-07
2024-11-29T07:43:37.5997879Z E         comparison failed
2024-11-29T07:43:37.5998090Z E         Obtained: 0.1
2024-11-29T07:43:37.5998367Z E         Expected: 0.0030904920548581304 ± 3.1e-07
2024-11-29T07:43:37.5998554Z 
2024-11-29T07:43:37.5998655Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.5999036Z _ test_model_prediction[gpt2_lite_lighteval|agieval:logiqa-en|0_acc_norm_stderr] _
2024-11-29T07:43:37.5999337Z 
2024-11-29T07:43:37.6000059Z model_input = ('gpt2', 'lite', 'lighteval|agieval:logiqa-en|0', 'acc_norm_stderr', functools.partial(<functools._lru_cache_wrapper o...racking_shuffled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.15275)
2024-11-29T07:43:37.6001390Z 
2024-11-29T07:43:37.6001548Z     @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.6001914Z     def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.6002319Z         """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.6002811Z         model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.6003470Z         prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.6003857Z >       assert reference == approx(
2024-11-29T07:43:37.6004102Z             prediction, rel=1e-4
2024-11-29T07:43:37.6004478Z         ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.6005116Z E       AssertionError: Model gpt2 on lite samples, for eval lighteval|agieval:logiqa-en|0, metric acc_norm_stderr incorrect
2024-11-29T07:43:37.6005829Z E       assert 0.15275 == 0.00457742206...5185 ± 4.6e-07
2024-11-29T07:43:37.6006101Z E         comparison failed
2024-11-29T07:43:37.6006319Z E         Obtained: 0.15275
2024-11-29T07:43:37.6006597Z E         Expected: 0.0045774220684565185 ± 4.6e-07
2024-11-29T07:43:37.6006784Z 
2024-11-29T07:43:37.6006957Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6007517Z ___ test_model_prediction[gpt2_lite_lighteval|agieval:lsat-ar|0_acc_stderr] ____
2024-11-29T07:43:37.6007826Z 
2024-11-29T07:43:37.6008536Z model_input = ('gpt2', 'lite', 'lighteval|agieval:lsat-ar|0', 'acc_stderr', functools.partial(<functools._lru_cache_wrapper object a...ch:tracking_shuffled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.1)
2024-11-29T07:43:37.6009298Z 
2024-11-29T07:43:37.6009458Z     @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.6009822Z     def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.6010232Z         """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.6010718Z         model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.6011205Z         prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.6011579Z >       assert reference == approx(
2024-11-29T07:43:37.6011822Z             prediction, rel=1e-4
2024-11-29T07:43:37.6012203Z         ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.6012835Z E       AssertionError: Model gpt2 on lite samples, for eval lighteval|agieval:lsat-ar|0, metric acc_stderr incorrect
2024-11-29T07:43:37.6013389Z E       assert 0.1 == 0.00301340183...0794 ± 3.0e-07
2024-11-29T07:43:37.6013659Z E         comparison failed
2024-11-29T07:43:37.6013876Z E         Obtained: 0.1
2024-11-29T07:43:37.6014140Z E         Expected: 0.0030134018302560794 ± 3.0e-07
2024-11-29T07:43:37.6014325Z 
2024-11-29T07:43:37.6014423Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6014801Z _ test_model_prediction[gpt2_lite_lighteval|agieval:lsat-ar|0_acc_norm_stderr] _
2024-11-29T07:43:37.6015092Z 
2024-11-29T07:43:37.6015797Z model_input = ('gpt2', 'lite', 'lighteval|agieval:lsat-ar|0', 'acc_norm_stderr', functools.partial(<functools._lru_cache_wrapper obj...ch:tracking_shuffled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.1)
2024-11-29T07:43:37.6016560Z 
2024-11-29T07:43:37.6016735Z     @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.6017115Z     def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.6017530Z         """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.6018043Z         model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.6018542Z         prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.6018922Z >       assert reference == approx(
2024-11-29T07:43:37.6019175Z             prediction, rel=1e-4
2024-11-29T07:43:37.6019565Z         ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.6020194Z E       AssertionError: Model gpt2 on lite samples, for eval lighteval|agieval:lsat-ar|0, metric acc_norm_stderr incorrect
2024-11-29T07:43:37.6020749Z E       assert 0.1 == 0.00301340183...0794 ± 3.0e-07
2024-11-29T07:43:37.6021169Z E         comparison failed
2024-11-29T07:43:37.6021399Z E         Obtained: 0.1
2024-11-29T07:43:37.6021674Z E         Expected: 0.0030134018302560794 ± 3.0e-07
2024-11-29T07:43:37.6021855Z 
2024-11-29T07:43:37.6021956Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6022328Z ___ test_model_prediction[gpt2_lite_lighteval|agieval:lsat-lr|0_acc_stderr] ____
2024-11-29T07:43:37.6022615Z 
2024-11-29T07:43:37.6023428Z model_input = ('gpt2', 'lite', 'lighteval|agieval:lsat-lr|0', 'acc_stderr', functools.partial(<functools._lru_cache_wrapper object a...racking_shuffled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.13333)
2024-11-29T07:43:37.6024220Z 
2024-11-29T07:43:37.6024377Z     @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.6024745Z     def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.6025152Z         """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.6025652Z         model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.6026139Z         prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.6026514Z >       assert reference == approx(
2024-11-29T07:43:37.6026763Z             prediction, rel=1e-4
2024-11-29T07:43:37.6027141Z         ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.6027754Z E       AssertionError: Model gpt2 on lite samples, for eval lighteval|agieval:lsat-lr|0, metric acc_stderr incorrect
2024-11-29T07:43:37.6028311Z E       assert 0.13333 == 0.004077368628777015 ± 4.1e-07
2024-11-29T07:43:37.6028585Z E         comparison failed
2024-11-29T07:43:37.6028805Z E         Obtained: 0.13333
2024-11-29T07:43:37.6029088Z E         Expected: 0.004077368628777015 ± 4.1e-07
2024-11-29T07:43:37.6029279Z 
2024-11-29T07:43:37.6029377Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6029761Z _ test_model_prediction[gpt2_lite_lighteval|agieval:lsat-lr|0_acc_norm_stderr] _
2024-11-29T07:43:37.6030051Z 
2024-11-29T07:43:37.6030738Z model_input = ('gpt2', 'lite', 'lighteval|agieval:lsat-lr|0', 'acc_norm_stderr', functools.partial(<functools._lru_cache_wrapper obj...racking_shuffled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.13333)
2024-11-29T07:43:37.6031527Z 
2024-11-29T07:43:37.6031685Z     @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.6032055Z     def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.6032459Z         """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.6032947Z         model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.6033434Z         prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.6033812Z >       assert reference == approx(
2024-11-29T07:43:37.6034060Z             prediction, rel=1e-4
2024-11-29T07:43:37.6034438Z         ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.6035061Z E       AssertionError: Model gpt2 on lite samples, for eval lighteval|agieval:lsat-lr|0, metric acc_norm_stderr incorrect
2024-11-29T07:43:37.6035623Z E       assert 0.13333 == 0.004077368628777015 ± 4.1e-07
2024-11-29T07:43:37.6036009Z E         comparison failed
2024-11-29T07:43:37.6036359Z E         Obtained: 0.13333
2024-11-29T07:43:37.6036664Z E         Expected: 0.004077368628777015 ± 4.1e-07
2024-11-29T07:43:37.6036986Z 
2024-11-29T07:43:37.6037146Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6037957Z ___ test_model_prediction[gpt2_lite_lighteval|agieval:lsat-rc|0_acc_stderr] ____
2024-11-29T07:43:37.6038311Z 
2024-11-29T07:43:37.6039835Z model_input = ('gpt2', 'lite', 'lighteval|agieval:lsat-rc|0', 'acc_stderr', functools.partial(<functools._lru_cache_wrapper object a...racking_shuffled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.15275)
2024-11-29T07:43:37.6041353Z 
2024-11-29T07:43:37.6041630Z     @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.6042266Z     def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.6042756Z         """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.6043426Z         model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.6043918Z         prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.6044294Z >       assert reference == approx(
2024-11-29T07:43:37.6044543Z             prediction, rel=1e-4
2024-11-29T07:43:37.6044925Z         ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.6045539Z E       AssertionError: Model gpt2 on lite samples, for eval lighteval|agieval:lsat-rc|0, metric acc_stderr incorrect
2024-11-29T07:43:37.6046115Z E       assert 0.15275 == 0.004582352884486063 ± 4.6e-07
2024-11-29T07:43:37.6046391Z E         comparison failed
2024-11-29T07:43:37.6046613Z E         Obtained: 0.15275
2024-11-29T07:43:37.6047101Z E         Expected: 0.004582352884486063 ± 4.6e-07
2024-11-29T07:43:37.6047412Z 
2024-11-29T07:43:37.6047594Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6047972Z _ test_model_prediction[gpt2_lite_lighteval|agieval:lsat-rc|0_acc_norm_stderr] _
2024-11-29T07:43:37.6048269Z 
2024-11-29T07:43:37.6048983Z model_input = ('gpt2', 'lite', 'lighteval|agieval:lsat-rc|0', 'acc_norm_stderr', functools.partial(<functools._lru_cache_wrapper obj...racking_shuffled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.13333)
2024-11-29T07:43:37.6049747Z 
2024-11-29T07:43:37.6049909Z     @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.6050282Z     def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.6050683Z         """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.6051170Z         model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.6051662Z         prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.6052041Z >       assert reference == approx(
2024-11-29T07:43:37.6052289Z             prediction, rel=1e-4
2024-11-29T07:43:37.6052665Z         ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.6053286Z E       AssertionError: Model gpt2 on lite samples, for eval lighteval|agieval:lsat-rc|0, metric acc_norm_stderr incorrect
2024-11-29T07:43:37.6053846Z E       assert 0.13333 == 0.004080117321518739 ± 4.1e-07
2024-11-29T07:43:37.6054122Z E         comparison failed
2024-11-29T07:43:37.6054345Z E         Obtained: 0.13333
2024-11-29T07:43:37.6054626Z E         Expected: 0.004080117321518739 ± 4.1e-07
2024-11-29T07:43:37.6054808Z 
2024-11-29T07:43:37.6054907Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6055614Z _ test_model_prediction[gpt2_lite_lighteval|agieval:sat-en-without-passage|0_acc_stderr] _
2024-11-29T07:43:37.6056149Z 
2024-11-29T07:43:37.6056861Z model_input = ('gpt2', 'lite', 'lighteval|agieval:sat-en-without-passage|0', 'acc_stderr', functools.partial(<functools._lru_cache_w...racking_shuffled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.13333)
2024-11-29T07:43:37.6057653Z 
2024-11-29T07:43:37.6057824Z     @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.6058190Z     def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.6058596Z         """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.6059245Z         model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.6059753Z         prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.6060127Z >       assert reference == approx(
2024-11-29T07:43:37.6060380Z             prediction, rel=1e-4
2024-11-29T07:43:37.6060763Z         ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.6061539Z E       AssertionError: Model gpt2 on lite samples, for eval lighteval|agieval:sat-en-without-passage|0, metric acc_stderr incorrect
2024-11-29T07:43:37.6062164Z E       assert 0.13333 == 0.00388625951...6192 ± 3.9e-07
2024-11-29T07:43:37.6062445Z E         comparison failed
2024-11-29T07:43:37.6062667Z E         Obtained: 0.13333
2024-11-29T07:43:37.6062946Z E         Expected: 0.0038862595143676192 ± 3.9e-07
2024-11-29T07:43:37.6063127Z 
2024-11-29T07:43:37.6063227Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6063669Z _ test_model_prediction[gpt2_lite_lighteval|agieval:sat-en-without-passage|0_acc_norm_stderr] _
2024-11-29T07:43:37.6064012Z 
2024-11-29T07:43:37.6064713Z model_input = ('gpt2', 'lite', 'lighteval|agieval:sat-en-without-passage|0', 'acc_norm_stderr', functools.partial(<functools._lru_ca...racking_shuffled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.15275)
2024-11-29T07:43:37.6065503Z 
2024-11-29T07:43:37.6065659Z     @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.6066028Z     def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.6066447Z         """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.6066945Z         model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.6067435Z         prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.6067815Z >       assert reference == approx(
2024-11-29T07:43:37.6068064Z             prediction, rel=1e-4
2024-11-29T07:43:37.6068448Z         ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.6069118Z E       AssertionError: Model gpt2 on lite samples, for eval lighteval|agieval:sat-en-without-passage|0, metric acc_norm_stderr incorrect
2024-11-29T07:43:37.6069728Z E       assert 0.15275 == 0.004538042951960014 ± 4.5e-07
2024-11-29T07:43:37.6070009Z E         comparison failed
2024-11-29T07:43:37.6070228Z E         Obtained: 0.15275
2024-11-29T07:43:37.6070504Z E         Expected: 0.004538042951960014 ± 4.5e-07
2024-11-29T07:43:37.6070689Z 
2024-11-29T07:43:37.6070784Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6071154Z ____ test_model_prediction[gpt2_lite_lighteval|agieval:sat-en|0_acc_stderr] ____
2024-11-29T07:43:37.6071441Z 
2024-11-29T07:43:37.6072141Z model_input = ('gpt2', 'lite', 'lighteval|agieval:sat-en|0', 'acc_stderr', functools.partial(<functools._lru_cache_wrapper object at...racking_shuffled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.13333)
2024-11-29T07:43:37.6072913Z 
2024-11-29T07:43:37.6073073Z     @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.6073440Z     def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.6073845Z         """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.6074341Z         model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.6074835Z         prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.6075212Z >       assert reference == approx(
2024-11-29T07:43:37.6075462Z             prediction, rel=1e-4
2024-11-29T07:43:37.6075840Z         ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.6076568Z E       AssertionError: Model gpt2 on lite samples, for eval lighteval|agieval:sat-en|0, metric acc_stderr incorrect
2024-11-29T07:43:37.6077117Z E       assert 0.13333 == 0.00388625951...6192 ± 3.9e-07
2024-11-29T07:43:37.6077672Z E         comparison failed
2024-11-29T07:43:37.6077892Z E         Obtained: 0.13333
2024-11-29T07:43:37.6078188Z E         Expected: 0.0038862595143676192 ± 3.9e-07
2024-11-29T07:43:37.6078376Z 
2024-11-29T07:43:37.6078638Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6079015Z _ test_model_prediction[gpt2_lite_lighteval|agieval:sat-en|0_acc_norm_stderr] __
2024-11-29T07:43:37.6079312Z 
2024-11-29T07:43:37.6080014Z model_input = ('gpt2', 'lite', 'lighteval|agieval:sat-en|0', 'acc_norm_stderr', functools.partial(<functools._lru_cache_wrapper obje...racking_shuffled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.15275)
2024-11-29T07:43:37.6080787Z 
2024-11-29T07:43:37.6080953Z     @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.6081325Z     def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.6081732Z         """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.6082302Z         model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.6082800Z         prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.6083183Z >       assert reference == approx(
2024-11-29T07:43:37.6083436Z             prediction, rel=1e-4
2024-11-29T07:43:37.6083819Z         ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.6084451Z E       AssertionError: Model gpt2 on lite samples, for eval lighteval|agieval:sat-en|0, metric acc_norm_stderr incorrect
2024-11-29T07:43:37.6085030Z E       assert 0.15275 == 0.004538042951960014 ± 4.5e-07
2024-11-29T07:43:37.6085351Z E         comparison failed
2024-11-29T07:43:37.6085761Z E         Obtained: 0.15275
2024-11-29T07:43:37.6086114Z E         Expected: 0.004538042951960014 ± 4.5e-07
2024-11-29T07:43:37.6086318Z 
2024-11-29T07:43:37.6086419Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6086787Z ___ test_model_prediction[gpt2_lite_lighteval|agieval:sat-math|0_acc_stderr] ___
2024-11-29T07:43:37.6087080Z 
2024-11-29T07:43:37.6087779Z model_input = ('gpt2', 'lite', 'lighteval|agieval:sat-math|0', 'acc_stderr', functools.partial(<functools._lru_cache_wrapper object ...racking_shuffled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.15275)
2024-11-29T07:43:37.6088561Z 
2024-11-29T07:43:37.6088725Z     @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.6089096Z     def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.6089506Z         """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.6090007Z         model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.6090497Z         prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.6090878Z >       assert reference == approx(
2024-11-29T07:43:37.6091129Z             prediction, rel=1e-4
2024-11-29T07:43:37.6091637Z         ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.6092259Z E       AssertionError: Model gpt2 on lite samples, for eval lighteval|agieval:sat-math|0, metric acc_stderr incorrect
2024-11-29T07:43:37.6092817Z E       assert 0.15275 == 0.004664521171326971 ± 4.7e-07
2024-11-29T07:43:37.6093091Z E         comparison failed
2024-11-29T07:43:37.6093712Z E         Obtained: 0.15275
2024-11-29T07:43:37.6094005Z E         Expected: 0.004664521171326971 ± 4.7e-07
2024-11-29T07:43:37.6094190Z 
2024-11-29T07:43:37.6094292Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6094849Z _ test_model_prediction[gpt2_lite_lighteval|agieval:sat-math|0_acc_norm_stderr] _
2024-11-29T07:43:37.6095151Z 
2024-11-29T07:43:37.6095842Z model_input = ('gpt2', 'lite', 'lighteval|agieval:sat-math|0', 'acc_norm_stderr', functools.partial(<functools._lru_cache_wrapper ob...ch:tracking_shuffled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.1)
2024-11-29T07:43:37.6096799Z 
2024-11-29T07:43:37.6097112Z     @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.6097485Z     def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.6097899Z         """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.6098403Z         model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.6098890Z         prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.6099269Z >       assert reference == approx(
2024-11-29T07:43:37.6099522Z             prediction, rel=1e-4
2024-11-29T07:43:37.6099908Z         ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.6100552Z E       AssertionError: Model gpt2 on lite samples, for eval lighteval|agieval:sat-math|0, metric acc_norm_stderr incorrect
2024-11-29T07:43:37.6101128Z E       assert 0.1 == 0.00301340183...0794 ± 3.0e-07
2024-11-29T07:43:37.6101406Z E         comparison failed
2024-11-29T07:43:37.6101624Z E         Obtained: 0.1
2024-11-29T07:43:37.6101896Z E         Expected: 0.0030134018302560794 ± 3.0e-07
2024-11-29T07:43:37.6102083Z 
2024-11-29T07:43:37.6102183Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6102571Z _ test_model_prediction[gpt2_lite_lighteval|bigbench:causal_judgment|3_acc_stderr] _
2024-11-29T07:43:37.6103084Z 
2024-11-29T07:43:37.6103794Z model_input = ('gpt2', 'lite', 'lighteval|bigbench:causal_judgment|3', 'acc_stderr', functools.partial(<functools._lru_cache_wrapper...fled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.16666666666666666)
2024-11-29T07:43:37.6104594Z 
2024-11-29T07:43:37.6104752Z     @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.6105116Z     def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.6105523Z         """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.6106021Z         model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.6106504Z         prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.6106879Z >       assert reference == approx(
2024-11-29T07:43:37.6107127Z             prediction, rel=1e-4
2024-11-29T07:43:37.6107505Z         ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.6108147Z E       AssertionError: Model gpt2 on lite samples, for eval lighteval|bigbench:causal_judgment|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6108769Z E       assert 0.16666666666666666 == 0.004861068811484776 ± 4.9e-07
2024-11-29T07:43:37.6109079Z E         comparison failed
2024-11-29T07:43:37.6109306Z E         Obtained: 0.16666666666666666
2024-11-29T07:43:37.6109616Z E         Expected: 0.004861068811484776 ± 4.9e-07
2024-11-29T07:43:37.6109803Z 
2024-11-29T07:43:37.6109903Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6110282Z _ test_model_prediction[gpt2_lite_harness|bigbench:causal_judgment|3_acc_stderr] _
2024-11-29T07:43:37.6110604Z 
2024-11-29T07:43:37.6111390Z model_input = ('gpt2', 'lite', 'harness|bigbench:causal_judgment|3', 'acc_stderr', functools.partial(<functools._lru_cache_wrapper o...tracking_shuffled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.1633)
2024-11-29T07:43:37.6112167Z 
2024-11-29T07:43:37.6112326Z     @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.6112855Z     def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.6113265Z         """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.6113759Z         model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.6114248Z         prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.6114740Z >       assert reference == approx(
2024-11-29T07:43:37.6114989Z             prediction, rel=1e-4
2024-11-29T07:43:37.6115369Z         ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.6116001Z E       AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:causal_judgment|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6116577Z E       assert 0.1633 == 0.004699965923246645 ± 4.7e-07
2024-11-29T07:43:37.6116853Z E         comparison failed
2024-11-29T07:43:37.6117079Z E         Obtained: 0.1633
2024-11-29T07:43:37.6117602Z E         Expected: 0.004699965923246645 ± 4.7e-07
2024-11-29T07:43:37.6117793Z 
2024-11-29T07:43:37.6117887Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6118276Z _ test_model_prediction[gpt2_lite_harness|bigbench:causal_judgment|3_acc_norm_stderr] _
2024-11-29T07:43:37.6118595Z 
2024-11-29T07:43:37.6119279Z model_input = ('gpt2', 'lite', 'harness|bigbench:causal_judgment|3', 'acc_norm_stderr', functools.partial(<functools._lru_cache_wrap...fled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.16666666666666666)
2024-11-29T07:43:37.6120057Z 
2024-11-29T07:43:37.6120212Z     @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.6120579Z     def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.6120987Z         """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.6121485Z         model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.6121973Z         prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.6122347Z >       assert reference == approx(
2024-11-29T07:43:37.6122594Z             prediction, rel=1e-4
2024-11-29T07:43:37.6122973Z         ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.6123630Z E       AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:causal_judgment|3, metric acc_norm_stderr incorrect
2024-11-29T07:43:37.6124246Z E       assert 0.16666666666666666 == 0.004861068811484776 ± 4.9e-07
2024-11-29T07:43:37.6124553Z E         comparison failed
2024-11-29T07:43:37.6124780Z E         Obtained: 0.16666666666666666
2024-11-29T07:43:37.6125086Z E         Expected: 0.004861068811484776 ± 4.9e-07
2024-11-29T07:43:37.6125265Z 
2024-11-29T07:43:37.6125362Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6125756Z _ test_model_prediction[gpt2_lite_harness|bigbench:date_understanding|3_acc_stderr] _
2024-11-29T07:43:37.6126064Z 
2024-11-29T07:43:37.6126756Z model_input = ('gpt2', 'lite', 'harness|bigbench:date_understanding|3', 'acc_stderr', functools.partial(<functools._lru_cache_wrappe...fled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.13333333333333333)
2024-11-29T07:43:37.6127527Z 
2024-11-29T07:43:37.6127697Z     @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.6128062Z     def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.6128461Z         """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.6128955Z         model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.6129443Z         prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.6129823Z >       assert reference == approx(
2024-11-29T07:43:37.6130228Z             prediction, rel=1e-4
2024-11-29T07:43:37.6130609Z         ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.6131249Z E       AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:date_understanding|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6131858Z E       assert 0.13333333333333333 == 0.003987436475939113 ± 4.0e-07
2024-11-29T07:43:37.6132302Z E         comparison failed
2024-11-29T07:43:37.6132530Z E         Obtained: 0.13333333333333333
2024-11-29T07:43:37.6132839Z E         Expected: 0.003987436475939113 ± 4.0e-07
2024-11-29T07:43:37.6133019Z 
2024-11-29T07:43:37.6133118Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6133519Z _ test_model_prediction[gpt2_lite_harness|bigbench:date_understanding|3_acc_norm_stderr] _
2024-11-29T07:43:37.6133837Z 
2024-11-29T07:43:37.6134530Z model_input = ('gpt2', 'lite', 'harness|bigbench:date_understanding|3', 'acc_norm_stderr', functools.partial(<functools._lru_cache_w...fled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.13333333333333333)
2024-11-29T07:43:37.6135321Z 
2024-11-29T07:43:37.6135482Z     @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.6135846Z     def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.6136249Z         """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.6136744Z         model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.6137232Z         prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.6137606Z >       assert reference == approx(
2024-11-29T07:43:37.6137849Z             prediction, rel=1e-4
2024-11-29T07:43:37.6138232Z         ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.6138891Z E       AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:date_understanding|3, metric acc_norm_stderr incorrect
2024-11-29T07:43:37.6139524Z E       assert 0.13333333333333333 == 0.003987436475939113 ± 4.0e-07
2024-11-29T07:43:37.6139828Z E         comparison failed
2024-11-29T07:43:37.6140052Z E         Obtained: 0.13333333333333333
2024-11-29T07:43:37.6140356Z E         Expected: 0.003987436475939113 ± 4.0e-07
2024-11-29T07:43:37.6140546Z 
2024-11-29T07:43:37.6140640Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6141028Z _ test_model_prediction[gpt2_lite_lighteval|bigbench:disambiguation_qa|3_acc_stderr] _
2024-11-29T07:43:37.6141338Z 
2024-11-29T07:43:37.6142028Z model_input = ('gpt2', 'lite', 'lighteval|bigbench:disambiguation_qa|3', 'acc_stderr', functools.partial(<functools._lru_cache_wrapp...fled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.15275252316519466)
2024-11-29T07:43:37.6142801Z 
2024-11-29T07:43:37.6142961Z     @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.6143326Z     def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.6143731Z         """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.6144220Z         model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.6144705Z         prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.6145083Z >       assert reference == approx(
2024-11-29T07:43:37.6145328Z             prediction, rel=1e-4
2024-11-29T07:43:37.6145705Z         ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.6146351Z E       AssertionError: Model gpt2 on lite samples, for eval lighteval|bigbench:disambiguation_qa|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6146974Z E       assert 0.15275252316519466 == 0.004650507199996266 ± 4.7e-07
2024-11-29T07:43:37.6147396Z E         comparison failed
2024-11-29T07:43:37.6147627Z E         Obtained: 0.15275252316519466
2024-11-29T07:43:37.6147935Z E         Expected: 0.004650507199996266 ± 4.7e-07
2024-11-29T07:43:37.6148120Z 
2024-11-29T07:43:37.6148215Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6148599Z _ test_model_prediction[gpt2_lite_harness|bigbench:disambiguation_qa|3_acc_stderr] _
2024-11-29T07:43:37.6149024Z 
2024-11-29T07:43:37.6149713Z model_input = ('gpt2', 'lite', 'harness|bigbench:disambiguation_qa|3', 'acc_stderr', functools.partial(<functools._lru_cache_wrapper...fled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.15275252316519466)
2024-11-29T07:43:37.6150485Z 
2024-11-29T07:43:37.6150642Z     @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.6151004Z     def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.6151416Z         """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.6151909Z         model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.6152391Z         prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.6152766Z >       assert reference == approx(
2024-11-29T07:43:37.6153012Z             prediction, rel=1e-4
2024-11-29T07:43:37.6153397Z         ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.6154039Z E       AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:disambiguation_qa|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6154651Z E       assert 0.15275252316519466 == 0.004650507199996266 ± 4.7e-07
2024-11-29T07:43:37.6154958Z E         comparison failed
2024-11-29T07:43:37.6155180Z E         Obtained: 0.15275252316519466
2024-11-29T07:43:37.6155480Z E         Expected: 0.004650507199996266 ± 4.7e-07
2024-11-29T07:43:37.6155671Z 
2024-11-29T07:43:37.6155771Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6156170Z _ test_model_prediction[gpt2_lite_harness|bigbench:disambiguation_qa|3_acc_norm_stderr] _
2024-11-29T07:43:37.6156492Z 
2024-11-29T07:43:37.6157172Z model_input = ('gpt2', 'lite', 'harness|bigbench:disambiguation_qa|3', 'acc_norm_stderr', functools.partial(<functools._lru_cache_wr...fled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.15275252316519464)
2024-11-29T07:43:37.6158198Z 
2024-11-29T07:43:37.6158358Z     @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.6158724Z     def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.6159130Z         """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.6159621Z         model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.6160110Z         prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.6160489Z >       assert reference == approx(
2024-11-29T07:43:37.6160804Z             prediction, rel=1e-4
2024-11-29T07:43:37.6161185Z         ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.6161857Z E       AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:disambiguation_qa|3, metric acc_norm_stderr incorrect
2024-11-29T07:43:37.6162500Z E       assert 0.15275252316519464 == 0.004582439170218064 ± 4.6e-07
2024-11-29T07:43:37.6162807Z E         comparison failed
2024-11-29T07:43:37.6163031Z E         Obtained: 0.15275252316519464
2024-11-29T07:43:37.6163334Z E         Expected: 0.004582439170218064 ± 4.6e-07
2024-11-29T07:43:37.6163514Z 
2024-11-29T07:43:37.6163614Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6163999Z _ test_model_prediction[gpt2_lite_lighteval|bigbench:geometric_shapes|3_acc_stderr] _
2024-11-29T07:43:37.6164308Z 
2024-11-29T07:43:37.6165169Z model_input = ('gpt2', 'lite', 'lighteval|bigbench:geometric_shapes|3', 'acc_stderr', functools.partial(<functools._lru_cache_wrappe...fled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.13333333333333333)
2024-11-29T07:43:37.6165949Z 
2024-11-29T07:43:37.6166113Z     @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.6166612Z     def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.6167021Z         """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.6167511Z         model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.6167997Z         prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.6168365Z >       assert reference == approx(
2024-11-29T07:43:37.6168610Z             prediction, rel=1e-4
2024-11-29T07:43:37.6168995Z         ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.6169641Z E       AssertionError: Model gpt2 on lite samples, for eval lighteval|bigbench:geometric_shapes|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6170268Z E       assert 0.13333333333333333 == 0.004000255247111385 ± 4.0e-07
2024-11-29T07:43:37.6170575Z E         comparison failed
2024-11-29T07:43:37.6170799Z E         Obtained: 0.13333333333333333
2024-11-29T07:43:37.6171110Z E         Expected: 0.004000255247111385 ± 4.0e-07
2024-11-29T07:43:37.6171292Z 
2024-11-29T07:43:37.6171393Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6171788Z _ test_model_prediction[gpt2_lite_harness|bigbench:geometric_shapes|3_acc_norm_stderr] _
2024-11-29T07:43:37.6172109Z 
2024-11-29T07:43:37.6172820Z model_input = ('gpt2', 'lite', 'harness|bigbench:geometric_shapes|3', 'acc_norm_stderr', functools.partial(<functools._lru_cache_wra...fled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.13333333333333333)
2024-11-29T07:43:37.6173581Z 
2024-11-29T07:43:37.6173744Z     @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.6174110Z     def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.6174517Z         """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.6175001Z         model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.6175492Z         prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.6175863Z >       assert reference == approx(
2024-11-29T07:43:37.6176109Z             prediction, rel=1e-4
2024-11-29T07:43:37.6176488Z         ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.6177135Z E       AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:geometric_shapes|3, metric acc_norm_stderr incorrect
2024-11-29T07:43:37.6177761Z E       assert 0.13333333333333333 == 0.004041744140305727 ± 4.0e-07
2024-11-29T07:43:37.6178069Z E         comparison failed
2024-11-29T07:43:37.6178301Z E         Obtained: 0.13333333333333333
2024-11-29T07:43:37.6178606Z E         Expected: 0.004041744140305727 ± 4.0e-07
2024-11-29T07:43:37.6178790Z 
2024-11-29T07:43:37.6178885Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6179325Z _ test_model_prediction[gpt2_lite_lighteval|bigbench:logical_deduction_five_objects|3_acc_stderr] _
2024-11-29T07:43:37.6179687Z 
2024-11-29T07:43:37.6180377Z model_input = ('gpt2', 'lite', 'lighteval|bigbench:logical_deduction_five_objects|3', 'acc_stderr', functools.partial(<functools._lr...ch:tracking_shuffled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.1)
2024-11-29T07:43:37.6181150Z 
2024-11-29T07:43:37.6181310Z     @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.6181818Z     def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.6182228Z         """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.6182717Z         model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.6183203Z         prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.6183717Z >       assert reference == approx(
2024-11-29T07:43:37.6183971Z             prediction, rel=1e-4
2024-11-29T07:43:37.6184347Z         ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.6185043Z E       AssertionError: Model gpt2 on lite samples, for eval lighteval|bigbench:logical_deduction_five_objects|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6185660Z E       assert 0.1 == 0.003030251214408201 ± 3.0e-07
2024-11-29T07:43:37.6185930Z E         comparison failed
2024-11-29T07:43:37.6186151Z E         Obtained: 0.1
2024-11-29T07:43:37.6186424Z E         Expected: 0.003030251214408201 ± 3.0e-07
2024-11-29T07:43:37.6186608Z 
2024-11-29T07:43:37.6186702Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6187125Z _ test_model_prediction[gpt2_lite_harness|bigbench:logical_deduction_five_objects|3_acc_stderr] _
2024-11-29T07:43:37.6187480Z 
2024-11-29T07:43:37.6188160Z model_input = ('gpt2', 'lite', 'harness|bigbench:logical_deduction_five_objects|3', 'acc_stderr', functools.partial(<functools._lru_...fled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.15275252316519464)
2024-11-29T07:43:37.6188934Z 
2024-11-29T07:43:37.6189090Z     @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.6189454Z     def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.6189857Z         """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.6190351Z         model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.6190836Z         prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.6191208Z >       assert reference == approx(
2024-11-29T07:43:37.6191456Z             prediction, rel=1e-4
2024-11-29T07:43:37.6191871Z         ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.6192549Z E       AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:logical_deduction_five_objects|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6193200Z E       assert 0.15275252316519464 == 0.004381117916034022 ± 4.4e-07
2024-11-29T07:43:37.6193507Z E         comparison failed
2024-11-29T07:43:37.6193735Z E         Obtained: 0.15275252316519464
2024-11-29T07:43:37.6194044Z E         Expected: 0.004381117916034022 ± 4.4e-07
2024-11-29T07:43:37.6194226Z 
2024-11-29T07:43:37.6194324Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6194767Z _ test_model_prediction[gpt2_lite_harness|bigbench:logical_deduction_five_objects|3_acc_norm_stderr] _
2024-11-29T07:43:37.6195122Z 
2024-11-29T07:43:37.6195831Z model_input = ('gpt2', 'lite', 'harness|bigbench:logical_deduction_five_objects|3', 'acc_norm_stderr', functools.partial(<functools....fled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.15275252316519464)
2024-11-29T07:43:37.6196601Z 
2024-11-29T07:43:37.6196763Z     @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.6197125Z     def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.6197921Z         """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.6198415Z         model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.6198897Z         prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.6199439Z >       assert reference == approx(
2024-11-29T07:43:37.6199695Z             prediction, rel=1e-4
2024-11-29T07:43:37.6200072Z         ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.6200769Z E       AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:logical_deduction_five_objects|3, metric acc_norm_stderr incorrect
2024-11-29T07:43:37.6201455Z E       assert 0.15275252316519464 == 0.004480319549235682 ± 4.5e-07
2024-11-29T07:43:37.6201929Z E         comparison failed
2024-11-29T07:43:37.6202157Z E         Obtained: 0.15275252316519464
2024-11-29T07:43:37.6202468Z E         Expected: 0.004480319549235682 ± 4.5e-07
2024-11-29T07:43:37.6202650Z 
2024-11-29T07:43:37.6202750Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6203184Z _ test_model_prediction[gpt2_lite_lighteval|bigbench:logical_deduction_seven_objects|3_acc_stderr] _
2024-11-29T07:43:37.6203543Z 
2024-11-29T07:43:37.6204246Z model_input = ('gpt2', 'lite', 'lighteval|bigbench:logical_deduction_seven_objects|3', 'acc_stderr', functools.partial(<functools._l...fled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.13333333333333333)
2024-11-29T07:43:37.6205014Z 
2024-11-29T07:43:37.6205176Z     @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.6205541Z     def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.6205955Z         """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.6206441Z         model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.6206929Z         prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.6207299Z >       assert reference == approx(
2024-11-29T07:43:37.6207546Z             prediction, rel=1e-4
2024-11-29T07:43:37.6207922Z         ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.6208624Z E       AssertionError: Model gpt2 on lite samples, for eval lighteval|bigbench:logical_deduction_seven_objects|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6209284Z E       assert 0.13333333333333333 == 0.00393258505632432 ± 3.9e-07
2024-11-29T07:43:37.6209590Z E         comparison failed
2024-11-29T07:43:37.6209814Z E         Obtained: 0.13333333333333333
2024-11-29T07:43:37.6210130Z E         Expected: 0.00393258505632432 ± 3.9e-07
2024-11-29T07:43:37.6210313Z 
2024-11-29T07:43:37.6210408Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6210838Z _ test_model_prediction[gpt2_lite_harness|bigbench:logical_deduction_seven_objects|3_acc_stderr] _
2024-11-29T07:43:37.6211194Z 
2024-11-29T07:43:37.6211880Z model_input = ('gpt2', 'lite', 'harness|bigbench:logical_deduction_seven_objects|3', 'acc_stderr', functools.partial(<functools._lru...fled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.13333333333333333)
2024-11-29T07:43:37.6212657Z 
2024-11-29T07:43:37.6212816Z     @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.6213180Z     def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.6213585Z         """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.6214073Z         model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.6214569Z         prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.6214948Z >       assert reference == approx(
2024-11-29T07:43:37.6215194Z             prediction, rel=1e-4
2024-11-29T07:43:37.6215570Z         ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.6216246Z E       AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:logical_deduction_seven_objects|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6217029Z E       assert 0.13333333333333333 == 0.004006756056224812 ± 4.0e-07
2024-11-29T07:43:37.6217340Z E         comparison failed
2024-11-29T07:43:37.6217561Z E         Obtained: 0.13333333333333333
2024-11-29T07:43:37.6217868Z E         Expected: 0.004006756056224812 ± 4.0e-07
2024-11-29T07:43:37.6218054Z 
2024-11-29T07:43:37.6218146Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6218588Z _ test_model_prediction[gpt2_lite_harness|bigbench:logical_deduction_seven_objects|3_acc_norm_stderr] _
2024-11-29T07:43:37.6219068Z 
2024-11-29T07:43:37.6219757Z model_input = ('gpt2', 'lite', 'harness|bigbench:logical_deduction_seven_objects|3', 'acc_norm_stderr', functools.partial(<functools...fled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.13333333333333333)
2024-11-29T07:43:37.6220527Z 
2024-11-29T07:43:37.6220683Z     @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.6221058Z     def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.6221463Z         """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.6221955Z         model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.6222442Z         prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.6222814Z >       assert reference == approx(
2024-11-29T07:43:37.6223064Z             prediction, rel=1e-4
2024-11-29T07:43:37.6223443Z         ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.6224140Z E       AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:logical_deduction_seven_objects|3, metric acc_norm_stderr incorrect
2024-11-29T07:43:37.6224811Z E       assert 0.13333333333333333 == 0.004006756056224812 ± 4.0e-07
2024-11-29T07:43:37.6225114Z E         comparison failed
2024-11-29T07:43:37.6225338Z E         Obtained: 0.13333333333333333
2024-11-29T07:43:37.6225650Z E         Expected: 0.004006756056224812 ± 4.0e-07
2024-11-29T07:43:37.6225834Z 
2024-11-29T07:43:37.6225932Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6226366Z _ test_model_prediction[gpt2_lite_lighteval|bigbench:logical_deduction_three_objects|3_acc_stderr] _
2024-11-29T07:43:37.6226720Z 
2024-11-29T07:43:37.6227418Z model_input = ('gpt2', 'lite', 'lighteval|bigbench:logical_deduction_three_objects|3', 'acc_stderr', functools.partial(<functools._l...tracking_shuffled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.1633)
2024-11-29T07:43:37.6228200Z 
2024-11-29T07:43:37.6228360Z     @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.6228724Z     def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.6229125Z         """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.6229617Z         model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.6230108Z         prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.6230484Z >       assert reference == approx(
2024-11-29T07:43:37.6230731Z             prediction, rel=1e-4
2024-11-29T07:43:37.6231109Z         ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.6231803Z E       AssertionError: Model gpt2 on lite samples, for eval lighteval|bigbench:logical_deduction_three_objects|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6232425Z E       assert 0.1633 == 0.00504952053...2955 ± 5.0e-07
2024-11-29T07:43:37.6232699Z E         comparison failed
2024-11-29T07:43:37.6232920Z E         Obtained: 0.1633
2024-11-29T07:43:37.6233202Z E         Expected: 0.0050495205374032955 ± 5.0e-07
2024-11-29T07:43:37.6233382Z 
2024-11-29T07:43:37.6233481Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6234031Z _ test_model_prediction[gpt2_lite_harness|bigbench:logical_deduction_three_objects|3_acc_stderr] _
2024-11-29T07:43:37.6234387Z 
2024-11-29T07:43:37.6235076Z model_input = ('gpt2', 'lite', 'harness|bigbench:logical_deduction_three_objects|3', 'acc_stderr', functools.partial(<functools._lru...fled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.13333333333333333)
2024-11-29T07:43:37.6235970Z 
2024-11-29T07:43:37.6236131Z     @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.6236492Z     def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.6236896Z         """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.6237734Z         model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.6238236Z         prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.6238625Z >       assert reference == approx(
2024-11-29T07:43:37.6238879Z             prediction, rel=1e-4
2024-11-29T07:43:37.6239258Z         ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.6239939Z E       AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:logical_deduction_three_objects|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6240612Z E       assert 0.13333333333333333 == 0.00393258505632432 ± 3.9e-07
2024-11-29T07:43:37.6240925Z E         comparison failed
2024-11-29T07:43:37.6241150Z E         Obtained: 0.13333333333333333
2024-11-29T07:43:37.6241456Z E         Expected: 0.00393258505632432 ± 3.9e-07
2024-11-29T07:43:37.6241642Z 
2024-11-29T07:43:37.6241736Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6242190Z _ test_model_prediction[gpt2_lite_harness|bigbench:logical_deduction_three_objects|3_acc_norm_stderr] _
2024-11-29T07:43:37.6242559Z 
2024-11-29T07:43:37.6243287Z model_input = ('gpt2', 'lite', 'harness|bigbench:logical_deduction_three_objects|3', 'acc_norm_stderr', functools.partial(<functools...fled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.15275252316519466)
2024-11-29T07:43:37.6244062Z 
2024-11-29T07:43:37.6244220Z     @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.6244584Z     def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.6244997Z         """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.6245492Z         model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.6245988Z         prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.6246363Z >       assert reference == approx(
2024-11-29T07:43:37.6246607Z             prediction, rel=1e-4
2024-11-29T07:43:37.6246982Z         ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.6247685Z E       AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:logical_deduction_three_objects|3, metric acc_norm_stderr incorrect
2024-11-29T07:43:37.6248367Z E       assert 0.15275252316519466 == 0.004569520819574033 ± 4.6e-07
2024-11-29T07:43:37.6248672Z E         comparison failed
2024-11-29T07:43:37.6248892Z E         Obtained: 0.15275252316519466
2024-11-29T07:43:37.6249199Z E         Expected: 0.004569520819574033 ± 4.6e-07
2024-11-29T07:43:37.6249391Z 
2024-11-29T07:43:37.6249485Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6249885Z _ test_model_prediction[gpt2_lite_lighteval|bigbench:movie_recommendation|3_acc_stderr] _
2024-11-29T07:43:37.6250209Z 
2024-11-29T07:43:37.6250918Z model_input = ('gpt2', 'lite', 'lighteval|bigbench:movie_recommendation|3', 'acc_stderr', functools.partial(<functools._lru_cache_wr...fled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.15275252316519466)
2024-11-29T07:43:37.6251861Z 
2024-11-29T07:43:37.6252026Z     @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.6252391Z     def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.6252796Z         """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.6253288Z         model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.6253912Z         prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.6254284Z >       assert reference == approx(
2024-11-29T07:43:37.6254528Z             prediction, rel=1e-4
2024-11-29T07:43:37.6254908Z         ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.6255567Z E       AssertionError: Model gpt2 on lite samples, for eval lighteval|bigbench:movie_recommendation|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6256205Z E       assert 0.15275252316519466 == 0.004428245629971239 ± 4.4e-07
2024-11-29T07:43:37.6256509Z E         comparison failed
2024-11-29T07:43:37.6256735Z E         Obtained: 0.15275252316519466
2024-11-29T07:43:37.6257047Z E         Expected: 0.004428245629971239 ± 4.4e-07
2024-11-29T07:43:37.6257227Z 
2024-11-29T07:43:37.6257326Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6257714Z _ test_model_prediction[gpt2_lite_harness|bigbench:movie_recommendation|3_acc_stderr] _
2024-11-29T07:43:37.6258039Z 
2024-11-29T07:43:37.6258734Z model_input = ('gpt2', 'lite', 'harness|bigbench:movie_recommendation|3', 'acc_stderr', functools.partial(<functools._lru_cache_wrap...fled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.16666666666666666)
2024-11-29T07:43:37.6259529Z 
2024-11-29T07:43:37.6259689Z     @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.6260049Z     def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.6260459Z         """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.6260948Z         model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.6261432Z         prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.6261809Z >       assert reference == approx(
2024-11-29T07:43:37.6262053Z             prediction, rel=1e-4
2024-11-29T07:43:37.6262434Z         ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.6263080Z E       AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:movie_recommendation|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6263697Z E       assert 0.16666666666666666 == 0.004907190695349086 ± 4.9e-07
2024-11-29T07:43:37.6264006Z E         comparison failed
2024-11-29T07:43:37.6264231Z E         Obtained: 0.16666666666666666
2024-11-29T07:43:37.6264542Z E         Expected: 0.004907190695349086 ± 4.9e-07
2024-11-29T07:43:37.6264722Z 
2024-11-29T07:43:37.6264821Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6265233Z _ test_model_prediction[gpt2_lite_harness|bigbench:movie_recommendation|3_acc_norm_stderr] _
2024-11-29T07:43:37.6265565Z 
2024-11-29T07:43:37.6266261Z model_input = ('gpt2', 'lite', 'harness|bigbench:movie_recommendation|3', 'acc_norm_stderr', functools.partial(<functools._lru_cache...fled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.15275252316519464)
2024-11-29T07:43:37.6267056Z 
2024-11-29T07:43:37.6267218Z     @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.6267590Z     def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.6267993Z         """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.6268481Z         model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.6269089Z         prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.6269469Z >       assert reference == approx(
2024-11-29T07:43:37.6269717Z             prediction, rel=1e-4
2024-11-29T07:43:37.6270092Z         ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.6270760Z E       AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:movie_recommendation|3, metric acc_norm_stderr incorrect
2024-11-29T07:43:37.6271508Z E       assert 0.15275252316519464 == 0.004703372376466875 ± 4.7e-07
2024-11-29T07:43:37.6271815Z E         comparison failed
2024-11-29T07:43:37.6272037Z E         Obtained: 0.15275252316519464
2024-11-29T07:43:37.6272342Z E         Expected: 0.004703372376466875 ± 4.7e-07
2024-11-29T07:43:37.6272530Z 
2024-11-29T07:43:37.6272623Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6272993Z __ test_model_prediction[gpt2_lite_lighteval|bigbench:navigate|3_acc_stderr] ___
2024-11-29T07:43:37.6273293Z 
2024-11-29T07:43:37.6274000Z model_input = ('gpt2', 'lite', 'lighteval|bigbench:navigate|3', 'acc_stderr', functools.partial(<functools._lru_cache_wrapper object...tracking_shuffled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.1633)
2024-11-29T07:43:37.6274805Z 
2024-11-29T07:43:37.6274961Z     @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.6275336Z     def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.6275740Z         """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.6276233Z         model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.6276717Z         prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.6277094Z >       assert reference == approx(
2024-11-29T07:43:37.6277570Z             prediction, rel=1e-4
2024-11-29T07:43:37.6277961Z         ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.6278586Z E       AssertionError: Model gpt2 on lite samples, for eval lighteval|bigbench:navigate|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6279140Z E       assert 0.1633 == 0.00471801862010947 ± 4.7e-07
2024-11-29T07:43:37.6279414Z E         comparison failed
2024-11-29T07:43:37.6279644Z E         Obtained: 0.1633
2024-11-29T07:43:37.6279923Z E         Expected: 0.00471801862010947 ± 4.7e-07
2024-11-29T07:43:37.6280108Z 
2024-11-29T07:43:37.6280201Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6280572Z ___ test_model_prediction[gpt2_lite_harness|bigbench:navigate|3_acc_stderr] ____
2024-11-29T07:43:37.6280872Z 
2024-11-29T07:43:37.6281571Z model_input = ('gpt2', 'lite', 'harness|bigbench:navigate|3', 'acc_stderr', functools.partial(<functools._lru_cache_wrapper object a...tracking_shuffled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.1633)
2024-11-29T07:43:37.6282354Z 
2024-11-29T07:43:37.6282509Z     @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.6282875Z     def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.6283280Z         """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.6283765Z         model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.6284256Z         prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.6284633Z >       assert reference == approx(
2024-11-29T07:43:37.6284719Z             prediction, rel=1e-4
2024-11-29T07:43:37.6284950Z         ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.6285245Z E       AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:navigate|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6285569Z E       assert 0.1633 == 0.00471801862010947 ± 4.7e-07
2024-11-29T07:43:37.6285658Z E         comparison failed
2024-11-29T07:43:37.6285744Z E         Obtained: 0.1633
2024-11-29T07:43:37.6285889Z E         Expected: 0.00471801862010947 ± 4.7e-07
2024-11-29T07:43:37.6285895Z 
2024-11-29T07:43:37.6285994Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6286199Z _ test_model_prediction[gpt2_lite_harness|bigbench:navigate|3_acc_norm_stderr] _
2024-11-29T07:43:37.6286337Z 
2024-11-29T07:43:37.6287035Z model_input = ('gpt2', 'lite', 'harness|bigbench:navigate|3', 'acc_norm_stderr', functools.partial(<functools._lru_cache_wrapper obj...tracking_shuffled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.1633)
2024-11-29T07:43:37.6287040Z 
2024-11-29T07:43:37.6287198Z     @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.6287328Z     def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.6287535Z         """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.6287745Z         model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.6287940Z         prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.6288033Z >       assert reference == approx(
2024-11-29T07:43:37.6288123Z             prediction, rel=1e-4
2024-11-29T07:43:37.6288350Z         ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.6288660Z E       AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:navigate|3, metric acc_norm_stderr incorrect
2024-11-29T07:43:37.6288826Z E       assert 0.1633 == 0.00471801862010947 ± 4.7e-07
2024-11-29T07:43:37.6288910Z E         comparison failed
2024-11-29T07:43:37.6288996Z E         Obtained: 0.1633
2024-11-29T07:43:37.6289137Z E         Expected: 0.00471801862010947 ± 4.7e-07
2024-11-29T07:43:37.6289148Z 
2024-11-29T07:43:37.6289248Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6289514Z _ test_model_prediction[gpt2_lite_lighteval|bigbench:reasoning_about_colored_objects|3_acc_stderr] _
2024-11-29T07:43:37.6289518Z 
2024-11-29T07:43:37.6290212Z model_input = ('gpt2', 'lite', 'lighteval|bigbench:reasoning_about_colored_objects|3', 'acc_stderr', functools.partial(<functools._l...fled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.13333333333333333)
2024-11-29T07:43:37.6290224Z 
2024-11-29T07:43:37.6290379Z     @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.6290506Z     def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.6290704Z         """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.6290910Z         model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.6291106Z         prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.6291200Z >       assert reference == approx(
2024-11-29T07:43:37.6291285Z             prediction, rel=1e-4
2024-11-29T07:43:37.6291517Z         ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.6291926Z E       AssertionError: Model gpt2 on lite samples, for eval lighteval|bigbench:reasoning_about_colored_objects|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6292138Z E       assert 0.13333333333333333 == 0.004000255247111385 ± 4.0e-07
2024-11-29T07:43:37.6292223Z E         comparison failed
2024-11-29T07:43:37.6292313Z E         Obtained: 0.13333333333333333
2024-11-29T07:43:37.6292455Z E         Expected: 0.004000255247111385 ± 4.0e-07
2024-11-29T07:43:37.6292460Z 
2024-11-29T07:43:37.6292562Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6292945Z _ test_model_prediction[gpt2_lite_harness|bigbench:reasoning_about_colored_objects|3_acc_stderr] _
2024-11-29T07:43:37.6292951Z 
2024-11-29T07:43:37.6293658Z model_input = ('gpt2', 'lite', 'harness|bigbench:reasoning_about_colored_objects|3', 'acc_stderr', functools.partial(<functools._lru...fled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.13333333333333333)
2024-11-29T07:43:37.6293663Z 
2024-11-29T07:43:37.6293824Z     @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.6294079Z     def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.6294279Z         """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.6294484Z         model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.6294681Z         prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.6294775Z >       assert reference == approx(
2024-11-29T07:43:37.6294861Z             prediction, rel=1e-4
2024-11-29T07:43:37.6295095Z         ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.6295465Z E       AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:reasoning_about_colored_objects|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6295673Z E       assert 0.13333333333333333 == 0.00405961457...4385 ± 4.1e-07
2024-11-29T07:43:37.6295764Z E         comparison failed
2024-11-29T07:43:37.6295855Z E         Obtained: 0.13333333333333333
2024-11-29T07:43:37.6295998Z E         Expected: 0.0040596145716644385 ± 4.1e-07
2024-11-29T07:43:37.6296004Z 
2024-11-29T07:43:37.6296102Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6296373Z _ test_model_prediction[gpt2_lite_harness|bigbench:reasoning_about_colored_objects|3_acc_norm_stderr] _
2024-11-29T07:43:37.6296378Z 
2024-11-29T07:43:37.6297082Z model_input = ('gpt2', 'lite', 'harness|bigbench:reasoning_about_colored_objects|3', 'acc_norm_stderr', functools.partial(<functools...ch:tracking_shuffled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.1)
2024-11-29T07:43:37.6297087Z 
2024-11-29T07:43:37.6297250Z     @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.6297372Z     def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.6297573Z         """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.6297778Z         model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.6297976Z         prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.6298063Z >       assert reference == approx(
2024-11-29T07:43:37.6298148Z             prediction, rel=1e-4
2024-11-29T07:43:37.6298374Z         ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.6298763Z E       AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:reasoning_about_colored_objects|3, metric acc_norm_stderr incorrect
2024-11-29T07:43:37.6298919Z E       assert 0.1 == 0.002971327782118411 ± 3.0e-07
2024-11-29T07:43:37.6299003Z E         comparison failed
2024-11-29T07:43:37.6299090Z E         Obtained: 0.1
2024-11-29T07:43:37.6299231Z E         Expected: 0.002971327782118411 ± 3.0e-07
2024-11-29T07:43:37.6299236Z 
2024-11-29T07:43:37.6299341Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6299543Z _ test_model_prediction[gpt2_lite_lighteval|bigbench:ruin_names|3_acc_stderr] __
2024-11-29T07:43:37.6299549Z 
2024-11-29T07:43:37.6300231Z model_input = ('gpt2', 'lite', 'lighteval|bigbench:ruin_names|3', 'acc_stderr', functools.partial(<functools._lru_cache_wrapper obje...fled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.15275252316519464)
2024-11-29T07:43:37.6300236Z 
2024-11-29T07:43:37.6300517Z     @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.6300643Z     def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.6300846Z         """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.6301047Z         model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.6301248Z         prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.6301451Z >       assert reference == approx(
2024-11-29T07:43:37.6301536Z             prediction, rel=1e-4
2024-11-29T07:43:37.6301763Z         ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.6302075Z E       AssertionError: Model gpt2 on lite samples, for eval lighteval|bigbench:ruin_names|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6302282Z E       assert 0.15275252316519464 == 0.00459225857...0545 ± 4.6e-07
2024-11-29T07:43:37.6302366Z E         comparison failed
2024-11-29T07:43:37.6302463Z E         Obtained: 0.15275252316519464
2024-11-29T07:43:37.6302607Z E         Expected: 0.0045922585770880545 ± 4.6e-07
2024-11-29T07:43:37.6302612Z 
2024-11-29T07:43:37.6302710Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6302913Z __ test_model_prediction[gpt2_lite_harness|bigbench:ruin_names|3_acc_stderr] ___
2024-11-29T07:43:37.6302918Z 
2024-11-29T07:43:37.6303604Z model_input = ('gpt2', 'lite', 'harness|bigbench:ruin_names|3', 'acc_stderr', functools.partial(<functools._lru_cache_wrapper object...fled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.13333333333333333)
2024-11-29T07:43:37.6303617Z 
2024-11-29T07:43:37.6303775Z     @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.6303895Z     def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.6304095Z         """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.6304299Z         model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.6304499Z         prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.6304586Z >       assert reference == approx(
2024-11-29T07:43:37.6304682Z             prediction, rel=1e-4
2024-11-29T07:43:37.6304903Z         ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.6305214Z E       AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:ruin_names|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6305414Z E       assert 0.13333333333333333 == 0.004037827888116828 ± 4.0e-07
2024-11-29T07:43:37.6305498Z E         comparison failed
2024-11-29T07:43:37.6305589Z E         Obtained: 0.13333333333333333
2024-11-29T07:43:37.6305730Z E         Expected: 0.004037827888116828 ± 4.0e-07
2024-11-29T07:43:37.6305741Z 
2024-11-29T07:43:37.6305834Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6306045Z _ test_model_prediction[gpt2_lite_harness|bigbench:ruin_names|3_acc_norm_stderr] _
2024-11-29T07:43:37.6306055Z 
2024-11-29T07:43:37.6306723Z model_input = ('gpt2', 'lite', 'harness|bigbench:ruin_names|3', 'acc_norm_stderr', functools.partial(<functools._lru_cache_wrapper o...fled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.13333333333333333)
2024-11-29T07:43:37.6306737Z 
2024-11-29T07:43:37.6306897Z     @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.6307019Z     def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.6307221Z         """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.6307421Z         model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.6307622Z         prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.6307831Z >       assert reference == approx(
2024-11-29T07:43:37.6307927Z             prediction, rel=1e-4
2024-11-29T07:43:37.6308148Z         ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.6308469Z E       AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:ruin_names|3, metric acc_norm_stderr incorrect
2024-11-29T07:43:37.6308669Z E       assert 0.13333333333333333 == 0.004037827888116828 ± 4.0e-07
2024-11-29T07:43:37.6308873Z E         comparison failed
2024-11-29T07:43:37.6308958Z E         Obtained: 0.13333333333333333
2024-11-29T07:43:37.6309103Z E         Expected: 0.004037827888116828 ± 4.0e-07
2024-11-29T07:43:37.6309113Z 
2024-11-29T07:43:37.6309207Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6309483Z _ test_model_prediction[gpt2_lite_lighteval|bigbench:salient_translation_error_detection|3_acc_stderr] _
2024-11-29T07:43:37.6309493Z 
2024-11-29T07:43:37.6310203Z model_input = ('gpt2', 'lite', 'lighteval|bigbench:salient_translation_error_detection|3', 'acc_stderr', functools.partial(<functool...tracking_shuffled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.1633)
2024-11-29T07:43:37.6310208Z 
2024-11-29T07:43:37.6310368Z     @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.6310492Z     def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.6310706Z         """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.6310905Z         model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.6311105Z         prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.6311193Z >       assert reference == approx(
2024-11-29T07:43:37.6311282Z             prediction, rel=1e-4
2024-11-29T07:43:37.6311510Z         ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.6311909Z E       AssertionError: Model gpt2 on lite samples, for eval lighteval|bigbench:salient_translation_error_detection|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6312068Z E       assert 0.1633 == 0.00497231172432741 ± 5.0e-07
2024-11-29T07:43:37.6312157Z E         comparison failed
2024-11-29T07:43:37.6312238Z E         Obtained: 0.1633
2024-11-29T07:43:37.6312384Z E         Expected: 0.00497231172432741 ± 5.0e-07
2024-11-29T07:43:37.6312395Z 
2024-11-29T07:43:37.6312488Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6312762Z _ test_model_prediction[gpt2_lite_harness|bigbench:salient_translation_error_detection|3_acc_stderr] _
2024-11-29T07:43:37.6312766Z 
2024-11-29T07:43:37.6313464Z model_input = ('gpt2', 'lite', 'harness|bigbench:salient_translation_error_detection|3', 'acc_stderr', functools.partial(<functools....ch:tracking_shuffled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.1)
2024-11-29T07:43:37.6313476Z 
2024-11-29T07:43:37.6313637Z     @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.6313759Z     def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.6313960Z         """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.6314161Z         model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.6314366Z         prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.6314453Z >       assert reference == approx(
2024-11-29T07:43:37.6314541Z             prediction, rel=1e-4
2024-11-29T07:43:37.6314761Z         ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.6315147Z E       AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:salient_translation_error_detection|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6315418Z E       assert 0.1 == 0.00304553201...4616 ± 3.0e-07
2024-11-29T07:43:37.6315514Z E         comparison failed
2024-11-29T07:43:37.6315593Z E         Obtained: 0.1
2024-11-29T07:43:37.6315743Z E         Expected: 0.0030455320167854616 ± 3.0e-07
2024-11-29T07:43:37.6315748Z 
2024-11-29T07:43:37.6315842Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6316135Z _ test_model_prediction[gpt2_lite_harness|bigbench:salient_translation_error_detection|3_acc_norm_stderr] _
2024-11-29T07:43:37.6316250Z 
2024-11-29T07:43:37.6316999Z model_input = ('gpt2', 'lite', 'harness|bigbench:salient_translation_error_detection|3', 'acc_norm_stderr', functools.partial(<funct...ch:tracking_shuffled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.1)
2024-11-29T07:43:37.6317004Z 
2024-11-29T07:43:37.6317167Z     @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.6317450Z     def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.6317709Z         """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.6317913Z         model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.6318115Z         prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.6318208Z >       assert reference == approx(
2024-11-29T07:43:37.6318297Z             prediction, rel=1e-4
2024-11-29T07:43:37.6318526Z         ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.6318927Z E       AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:salient_translation_error_detection|3, metric acc_norm_stderr incorrect
2024-11-29T07:43:37.6319091Z E       assert 0.1 == 0.00304553201...4616 ± 3.0e-07
2024-11-29T07:43:37.6319178Z E         comparison failed
2024-11-29T07:43:37.6319258Z E         Obtained: 0.1
2024-11-29T07:43:37.6319409Z E         Expected: 0.0030455320167854616 ± 3.0e-07
2024-11-29T07:43:37.6319414Z 
2024-11-29T07:43:37.6319509Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6319712Z ___ test_model_prediction[gpt2_lite_lighteval|bigbench:snarks|3_acc_stderr] ____
2024-11-29T07:43:37.6319716Z 
2024-11-29T07:43:37.6320412Z model_input = ('gpt2', 'lite', 'lighteval|bigbench:snarks|3', 'acc_stderr', functools.partial(<functools._lru_cache_wrapper object a...tracking_shuffled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.1633)
2024-11-29T07:43:37.6320423Z 
2024-11-29T07:43:37.6320580Z     @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.6320701Z     def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.6320899Z         """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.6321098Z         model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.6321303Z         prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.6321391Z >       assert reference == approx(
2024-11-29T07:43:37.6321481Z             prediction, rel=1e-4
2024-11-29T07:43:37.6321715Z         ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.6322020Z E       AssertionError: Model gpt2 on lite samples, for eval lighteval|bigbench:snarks|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6322180Z E       assert 0.1633 == 0.004897013451149668 ± 4.9e-07
2024-11-29T07:43:37.6322279Z E         comparison failed
2024-11-29T07:43:37.6322360Z E         Obtained: 0.1633
2024-11-29T07:43:37.6322506Z E         Expected: 0.004897013451149668 ± 4.9e-07
2024-11-29T07:43:37.6322511Z 
2024-11-29T07:43:37.6322603Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6322805Z ____ test_model_prediction[gpt2_lite_harness|bigbench:snarks|3_acc_stderr] _____
2024-11-29T07:43:37.6322810Z 
2024-11-29T07:43:37.6323667Z model_input = ('gpt2', 'lite', 'harness|bigbench:snarks|3', 'acc_stderr', functools.partial(<functools._lru_cache_wrapper object at ...tracking_shuffled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.1633)
2024-11-29T07:43:37.6323674Z 
2024-11-29T07:43:37.6323838Z     @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.6323957Z     def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.6324320Z         """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.6324521Z         model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.6324719Z         prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.6324806Z >       assert reference == approx(
2024-11-29T07:43:37.6324895Z             prediction, rel=1e-4
2024-11-29T07:43:37.6325128Z         ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.6325428Z E       AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:snarks|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6325589Z E       assert 0.1633 == 0.004897013451149668 ± 4.9e-07
2024-11-29T07:43:37.6325677Z E         comparison failed
2024-11-29T07:43:37.6325757Z E         Obtained: 0.1633
2024-11-29T07:43:37.6325910Z E         Expected: 0.004897013451149668 ± 4.9e-07
2024-11-29T07:43:37.6325915Z 
2024-11-29T07:43:37.6326008Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6326217Z __ test_model_prediction[gpt2_lite_harness|bigbench:snarks|3_acc_norm_stderr] __
2024-11-29T07:43:37.6326223Z 
2024-11-29T07:43:37.6326914Z model_input = ('gpt2', 'lite', 'harness|bigbench:snarks|3', 'acc_norm_stderr', functools.partial(<functools._lru_cache_wrapper objec...tracking_shuffled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.1633)
2024-11-29T07:43:37.6326928Z 
2024-11-29T07:43:37.6327087Z     @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.6327206Z     def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.6327407Z         """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.6327607Z         model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.6327816Z         prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.6327901Z >       assert reference == approx(
2024-11-29T07:43:37.6327990Z             prediction, rel=1e-4
2024-11-29T07:43:37.6328212Z         ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.6328527Z E       AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:snarks|3, metric acc_norm_stderr incorrect
2024-11-29T07:43:37.6328689Z E       assert 0.1633 == 0.004897013451149668 ± 4.9e-07
2024-11-29T07:43:37.6328780Z E         comparison failed
2024-11-29T07:43:37.6328860Z E         Obtained: 0.1633
2024-11-29T07:43:37.6329004Z E         Expected: 0.004897013451149668 ± 4.9e-07
2024-11-29T07:43:37.6329010Z 
2024-11-29T07:43:37.6329102Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6329337Z _ test_model_prediction[gpt2_lite_lighteval|bigbench:sports_understanding|3_acc_stderr] _
2024-11-29T07:43:37.6329347Z 
2024-11-29T07:43:37.6330043Z model_input = ('gpt2', 'lite', 'lighteval|bigbench:sports_understanding|3', 'acc_stderr', functools.partial(<functools._lru_cache_wr...tracking_shuffled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.1633)
2024-11-29T07:43:37.6330048Z 
2024-11-29T07:43:37.6330205Z     @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.6330323Z     def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.6330650Z         """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.6330855Z         model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.6331051Z         prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.6331137Z >       assert reference == approx(
2024-11-29T07:43:37.6331226Z             prediction, rel=1e-4
2024-11-29T07:43:37.6331558Z         ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.6331909Z E       AssertionError: Model gpt2 on lite samples, for eval lighteval|bigbench:sports_understanding|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6332183Z E       assert 0.1633 == 0.005037214858781963 ± 5.0e-07
2024-11-29T07:43:37.6332271Z E         comparison failed
2024-11-29T07:43:37.6332352Z E         Obtained: 0.1633
2024-11-29T07:43:37.6332500Z E         Expected: 0.005037214858781963 ± 5.0e-07
2024-11-29T07:43:37.6332505Z 
2024-11-29T07:43:37.6332608Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6332843Z _ test_model_prediction[gpt2_lite_harness|bigbench:sports_understanding|3_acc_stderr] _
2024-11-29T07:43:37.6332847Z 
2024-11-29T07:43:37.6333542Z model_input = ('gpt2', 'lite', 'harness|bigbench:sports_understanding|3', 'acc_stderr', functools.partial(<functools._lru_cache_wrap...tracking_shuffled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.1633)
2024-11-29T07:43:37.6333555Z 
2024-11-29T07:43:37.6333715Z     @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.6333834Z     def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.6334036Z         """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.6334234Z         model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.6334435Z         prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.6334522Z >       assert reference == approx(
2024-11-29T07:43:37.6334612Z             prediction, rel=1e-4
2024-11-29T07:43:37.6334834Z         ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.6335174Z E       AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:sports_understanding|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6335334Z E       assert 0.1633 == 0.0049194014382352 ± 4.9e-07
2024-11-29T07:43:37.6335426Z E         comparison failed
2024-11-29T07:43:37.6335505Z E         Obtained: 0.1633
2024-11-29T07:43:37.6335651Z E         Expected: 0.0049194014382352 ± 4.9e-07
2024-11-29T07:43:37.6335656Z 
2024-11-29T07:43:37.6335751Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6335995Z _ test_model_prediction[gpt2_lite_harness|bigbench:sports_understanding|3_acc_norm_stderr] _
2024-11-29T07:43:37.6336000Z 
2024-11-29T07:43:37.6336699Z model_input = ('gpt2', 'lite', 'harness|bigbench:sports_understanding|3', 'acc_norm_stderr', functools.partial(<functools._lru_cache...tracking_shuffled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.1633)
2024-11-29T07:43:37.6336705Z 
2024-11-29T07:43:37.6336865Z     @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.6336986Z     def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.6337192Z         """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.6337391Z         model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.6337589Z         prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.6337675Z >       assert reference == approx(
2024-11-29T07:43:37.6337765Z             prediction, rel=1e-4
2024-11-29T07:43:37.6338113Z         ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.6338493Z E       AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:sports_understanding|3, metric acc_norm_stderr incorrect
2024-11-29T07:43:37.6338654Z E       assert 0.1633 == 0.0049194014382352 ± 4.9e-07
2024-11-29T07:43:37.6338742Z E         comparison failed
2024-11-29T07:43:37.6338823Z E         Obtained: 0.1633
2024-11-29T07:43:37.6339081Z E         Expected: 0.0049194014382352 ± 4.9e-07
2024-11-29T07:43:37.6339086Z 
2024-11-29T07:43:37.6339181Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6339400Z _ test_model_prediction[gpt2_lite_harness|bigbench:temporal_sequences|3_acc_stderr] _
2024-11-29T07:43:37.6339405Z 
2024-11-29T07:43:37.6340097Z model_input = ('gpt2', 'lite', 'harness|bigbench:temporal_sequences|3', 'acc_stderr', functools.partial(<functools._lru_cache_wrappe...ch:tracking_shuffled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.1)
2024-11-29T07:43:37.6340110Z 
2024-11-29T07:43:37.6340272Z     @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.6340394Z     def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.6340599Z         """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.6340799Z         model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.6341007Z         prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.6341093Z >       assert reference == approx(
2024-11-29T07:43:37.6341184Z             prediction, rel=1e-4
2024-11-29T07:43:37.6341408Z         ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.6341746Z E       AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:temporal_sequences|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6341904Z E       assert 0.1 == 0.00294961187...1973 ± 2.9e-07
2024-11-29T07:43:37.6341993Z E         comparison failed
2024-11-29T07:43:37.6342072Z E         Obtained: 0.1
2024-11-29T07:43:37.6342221Z E         Expected: 0.0029496118745031973 ± 2.9e-07
2024-11-29T07:43:37.6342225Z 
2024-11-29T07:43:37.6342320Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6342560Z _ test_model_prediction[gpt2_lite_harness|bigbench:temporal_sequences|3_acc_norm_stderr] _
2024-11-29T07:43:37.6342571Z 
2024-11-29T07:43:37.6343261Z model_input = ('gpt2', 'lite', 'harness|bigbench:temporal_sequences|3', 'acc_norm_stderr', functools.partial(<functools._lru_cache_w...ch:tracking_shuffled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.1)
2024-11-29T07:43:37.6343266Z 
2024-11-29T07:43:37.6343430Z     @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.6343551Z     def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.6343759Z         """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.6343959Z         model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.6344158Z         prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.6344244Z >       assert reference == approx(
2024-11-29T07:43:37.6344335Z             prediction, rel=1e-4
2024-11-29T07:43:37.6344564Z         ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.6344913Z E       AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:temporal_sequences|3, metric acc_norm_stderr incorrect
2024-11-29T07:43:37.6345063Z E       assert 0.1 == 0.00294961187...1973 ± 2.9e-07
2024-11-29T07:43:37.6345152Z E         comparison failed
2024-11-29T07:43:37.6345229Z E         Obtained: 0.1
2024-11-29T07:43:37.6345379Z E         Expected: 0.0029496118745031973 ± 2.9e-07
2024-11-29T07:43:37.6345384Z 
2024-11-29T07:43:37.6345602Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6345905Z _ test_model_prediction[gpt2_lite_lighteval|bigbench:tracking_shuffled_objects_five_objects|3_acc_stderr] _
2024-11-29T07:43:37.6345910Z 
2024-11-29T07:43:37.6346593Z model_input = ('gpt2', 'lite', 'lighteval|bigbench:tracking_shuffled_objects_five_objects|3', 'acc_stderr', functools.partial(<funct...fled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.13333333333333333)
2024-11-29T07:43:37.6346709Z 
2024-11-29T07:43:37.6346874Z     @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.6346997Z     def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.6347198Z         """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.6347399Z         model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.6347604Z         prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.6347691Z >       assert reference == approx(
2024-11-29T07:43:37.6347781Z             prediction, rel=1e-4
2024-11-29T07:43:37.6348003Z         ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.6348410Z E       AssertionError: Model gpt2 on lite samples, for eval lighteval|bigbench:tracking_shuffled_objects_five_objects|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6348622Z E       assert 0.13333333333333333 == 0.004030304374777823 ± 4.0e-07
2024-11-29T07:43:37.6348710Z E         comparison failed
2024-11-29T07:43:37.6348797Z E         Obtained: 0.13333333333333333
2024-11-29T07:43:37.6348944Z E         Expected: 0.004030304374777823 ± 4.0e-07
2024-11-29T07:43:37.6348949Z 
2024-11-29T07:43:37.6349041Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6349327Z _ test_model_prediction[gpt2_lite_harness|bigbench:tracking_shuffled_objects_five_objects|3_acc_stderr] _
2024-11-29T07:43:37.6349338Z 
2024-11-29T07:43:37.6350016Z model_input = ('gpt2', 'lite', 'harness|bigbench:tracking_shuffled_objects_five_objects|3', 'acc_stderr', functools.partial(<functoo...fled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.13333333333333333)
2024-11-29T07:43:37.6350021Z 
2024-11-29T07:43:37.6350181Z     @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.6350309Z     def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.6350511Z         """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.6350712Z         model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.6350914Z         prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.6351000Z >       assert reference == approx(
2024-11-29T07:43:37.6351090Z             prediction, rel=1e-4
2024-11-29T07:43:37.6351316Z         ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.6351713Z E       AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:tracking_shuffled_objects_five_objects|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6351912Z E       assert 0.13333333333333333 == 0.003987436475939113 ± 4.0e-07
2024-11-29T07:43:37.6352006Z E         comparison failed
2024-11-29T07:43:37.6352093Z E         Obtained: 0.13333333333333333
2024-11-29T07:43:37.6352238Z E         Expected: 0.003987436475939113 ± 4.0e-07
2024-11-29T07:43:37.6352243Z 
2024-11-29T07:43:37.6352335Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6352640Z _ test_model_prediction[gpt2_lite_harness|bigbench:tracking_shuffled_objects_five_objects|3_acc_norm_stderr] _
2024-11-29T07:43:37.6352645Z 
2024-11-29T07:43:37.6353483Z model_input = ('gpt2', 'lite', 'harness|bigbench:tracking_shuffled_objects_five_objects|3', 'acc_norm_stderr', functools.partial(<fu...ch:tracking_shuffled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.1)
2024-11-29T07:43:37.6353495Z 
2024-11-29T07:43:37.6353653Z     @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.6353778Z     def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.6353977Z         """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.6354288Z         model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.6354487Z         prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.6354573Z >       assert reference == approx(
2024-11-29T07:43:37.6354665Z             prediction, rel=1e-4
2024-11-29T07:43:37.6354887Z         ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.6355305Z E       AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:tracking_shuffled_objects_five_objects|3, metric acc_norm_stderr incorrect
2024-11-29T07:43:37.6355460Z E       assert 0.1 == 0.00294083125...9783 ± 2.9e-07
2024-11-29T07:43:37.6355549Z E         comparison failed
2024-11-29T07:43:37.6355628Z E         Obtained: 0.1
2024-11-29T07:43:37.6355777Z E         Expected: 0.0029408312580779783 ± 2.9e-07
2024-11-29T07:43:37.6355781Z 
2024-11-29T07:43:37.6355883Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6356179Z _ test_model_prediction[gpt2_lite_lighteval|bigbench:tracking_shuffled_objects_seven_objects|3_acc_stderr] _
2024-11-29T07:43:37.6356184Z 
2024-11-29T07:43:37.6356865Z model_input = ('gpt2', 'lite', 'lighteval|bigbench:tracking_shuffled_objects_seven_objects|3', 'acc_stderr', functools.partial(<func...fled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.15275252316519464)
2024-11-29T07:43:37.6356875Z 
2024-11-29T07:43:37.6357036Z     @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.6357161Z     def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.6357556Z         """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.6357770Z         model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.6357964Z         prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.6358058Z >       assert reference == approx(
2024-11-29T07:43:37.6358149Z             prediction, rel=1e-4
2024-11-29T07:43:37.6358375Z         ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.6358793Z E       AssertionError: Model gpt2 on lite samples, for eval lighteval|bigbench:tracking_shuffled_objects_seven_objects|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6358997Z E       assert 0.15275252316519464 == 0.004588830718970504 ± 4.6e-07
2024-11-29T07:43:37.6359091Z E         comparison failed
2024-11-29T07:43:37.6359175Z E         Obtained: 0.15275252316519464
2024-11-29T07:43:37.6359325Z E         Expected: 0.004588830718970504 ± 4.6e-07
2024-11-29T07:43:37.6359330Z 
2024-11-29T07:43:37.6359428Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6359719Z _ test_model_prediction[gpt2_lite_harness|bigbench:tracking_shuffled_objects_seven_objects|3_acc_stderr] _
2024-11-29T07:43:37.6359730Z 
2024-11-29T07:43:37.6360424Z model_input = ('gpt2', 'lite', 'harness|bigbench:tracking_shuffled_objects_seven_objects|3', 'acc_stderr', functools.partial(<functo...ch:tracking_shuffled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.1)
2024-11-29T07:43:37.6360429Z 
2024-11-29T07:43:37.6360587Z     @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.6360713Z     def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.6361066Z         """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.6361276Z         model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.6361469Z         prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.6361561Z >       assert reference == approx(
2024-11-29T07:43:37.6361646Z             prediction, rel=1e-4
2024-11-29T07:43:37.6362037Z         ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.6362433Z E       AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:tracking_shuffled_objects_seven_objects|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6362592Z E       assert 0.1 == 0.00301340183...0794 ± 3.0e-07
2024-11-29T07:43:37.6362675Z E         comparison failed
2024-11-29T07:43:37.6362753Z E         Obtained: 0.1
2024-11-29T07:43:37.6362901Z E         Expected: 0.0030134018302560794 ± 3.0e-07
2024-11-29T07:43:37.6362913Z 
2024-11-29T07:43:37.6363005Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6363299Z _ test_model_prediction[gpt2_lite_lighteval|bigbench:tracking_shuffled_objects_three_objects|3_acc_stderr] _
2024-11-29T07:43:37.6363304Z 
2024-11-29T07:43:37.6364022Z model_input = ('gpt2', 'lite', 'lighteval|bigbench:tracking_shuffled_objects_three_objects|3', 'acc_stderr', functools.partial(<func...tracking_shuffled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.1633)
2024-11-29T07:43:37.6364034Z 
2024-11-29T07:43:37.6364189Z     @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.6364316Z     def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.6364512Z         """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.6364717Z         model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.6364913Z         prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.6365006Z >       assert reference == approx(
2024-11-29T07:43:37.6365091Z             prediction, rel=1e-4
2024-11-29T07:43:37.6365316Z         ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.6365722Z E       AssertionError: Model gpt2 on lite samples, for eval lighteval|bigbench:tracking_shuffled_objects_three_objects|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6365890Z E       assert 0.1633 == 0.00504952053...2955 ± 5.0e-07
2024-11-29T07:43:37.6365976Z E         comparison failed
2024-11-29T07:43:37.6366062Z E         Obtained: 0.1633
2024-11-29T07:43:37.6366203Z E         Expected: 0.0050495205374032955 ± 5.0e-07
2024-11-29T07:43:37.6366208Z 
2024-11-29T07:43:37.6366307Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6366593Z _ test_model_prediction[gpt2_lite_harness|bigbench:tracking_shuffled_objects_three_objects|3_acc_stderr] _
2024-11-29T07:43:37.6366604Z 
2024-11-29T07:43:37.6367291Z model_input = ('gpt2', 'lite', 'harness|bigbench:tracking_shuffled_objects_three_objects|3', 'acc_stderr', functools.partial(<functo...fled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.13333333333333333)
2024-11-29T07:43:37.6367295Z 
2024-11-29T07:43:37.6367449Z     @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.6367582Z     def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.6367780Z         """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.6367987Z         model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.6368181Z         prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.6368278Z >       assert reference == approx(
2024-11-29T07:43:37.6368371Z             prediction, rel=1e-4
2024-11-29T07:43:37.6368718Z         ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.6369115Z E       AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:tracking_shuffled_objects_three_objects|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6369321Z E       assert 0.13333333333333333 == 0.00393258505632432 ± 3.9e-07
2024-11-29T07:43:37.6369518Z E         comparison failed
2024-11-29T07:43:37.6369609Z E         Obtained: 0.13333333333333333
2024-11-29T07:43:37.6369753Z E         Expected: 0.00393258505632432 ± 3.9e-07
2024-11-29T07:43:37.6369759Z 
2024-11-29T07:43:37.6369857Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6370159Z _ test_model_prediction[gpt2_lite_harness|bigbench:tracking_shuffled_objects_three_objects|3_acc_norm_stderr] _
2024-11-29T07:43:37.6370164Z 
2024-11-29T07:43:37.6370876Z model_input = ('gpt2', 'lite', 'harness|bigbench:tracking_shuffled_objects_three_objects|3', 'acc_norm_stderr', functools.partial(<f...fled_objects_three_objects|3|0', 'harness|bigbench:tracking_shuffled_objects_three_objects|3|0')), 0.15275252316519466)
2024-11-29T07:43:37.6370882Z 
2024-11-29T07:43:37.6371039Z     @pytest.mark.parametrize("model_input", parameters, ids=ids)
2024-11-29T07:43:37.6371169Z     def test_model_prediction(model_input: ModelInput):
2024-11-29T07:43:37.6371367Z         """Evaluates a model on a full task - is parametrized using pytest_generate_test"""
2024-11-29T07:43:37.6371581Z         model_name, test_type, eval_name, metric, get_predictions, reference = model_input
2024-11-29T07:43:37.6371774Z         prediction = get_predictions()["results"][eval_name.replace("|", ":")][metric]
2024-11-29T07:43:37.6371868Z >       assert reference == approx(
2024-11-29T07:43:37.6371952Z             prediction, rel=1e-4
2024-11-29T07:43:37.6372178Z         ), f"Model {model_name} on {test_type} samples, for eval {eval_name}, metric {metric} incorrect"
2024-11-29T07:43:37.6372595Z E       AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:tracking_shuffled_objects_three_objects|3, metric acc_norm_stderr incorrect
2024-11-29T07:43:37.6372800Z E       assert 0.15275252316519466 == 0.004569520819574033 ± 4.6e-07
2024-11-29T07:43:37.6372884Z E         comparison failed
2024-11-29T07:43:37.6372973Z E         Obtained: 0.15275252316519466
2024-11-29T07:43:37.6373113Z E         Expected: 0.004569520819574033 ± 4.6e-07
2024-11-29T07:43:37.6373124Z 
2024-11-29T07:43:37.6373222Z tests/test_main.py:134: AssertionError
2024-11-29T07:43:37.6373351Z =========================== short test summary info ============================
2024-11-29T07:43:37.6374021Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_leaderboard|arc:challenge|25_acc_stderr] - AssertionError: Model gpt2 on lite samples, for eval leaderboard|arc:challenge|25, metric acc_stderr incorrect
2024-11-29T07:43:37.6374212Z assert 0.15275252316519466 == 0.004569520819574033 ± 4.6e-07
2024-11-29T07:43:37.6374300Z   comparison failed
2024-11-29T07:43:37.6374392Z   Obtained: 0.15275252316519466
2024-11-29T07:43:37.6374534Z   Expected: 0.004569520819574033 ± 4.6e-07
2024-11-29T07:43:37.6375208Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_leaderboard|truthfulqa:mc|0_truthfulqa_mc1_stderr] - AssertionError: Model gpt2 on lite samples, for eval leaderboard|truthfulqa:mc|0, metric truthfulqa_mc1_stderr incorrect
2024-11-29T07:43:37.6375400Z assert 0.15275252316519466 == 0.004619651629850591 ± 4.6e-07
2024-11-29T07:43:37.6375485Z   comparison failed
2024-11-29T07:43:37.6375573Z   Obtained: 0.15275252316519466
2024-11-29T07:43:37.6375705Z   Expected: 0.004619651629850591 ± 4.6e-07
2024-11-29T07:43:37.6376363Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_leaderboard|truthfulqa:mc|0_truthfulqa_mc2_stderr] - AssertionError: Model gpt2 on lite samples, for eval leaderboard|truthfulqa:mc|0, metric truthfulqa_mc2_stderr incorrect
2024-11-29T07:43:37.6376549Z assert 0.14105533101540416 == 0.004258753966872427 ± 4.3e-07
2024-11-29T07:43:37.6376759Z   comparison failed
2024-11-29T07:43:37.6376845Z   Obtained: 0.14105533101540416
2024-11-29T07:43:37.6376978Z   Expected: 0.004258753966872427 ± 4.3e-07
2024-11-29T07:43:37.6377560Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_leaderboard|hellaswag|10_acc_stderr] - AssertionError: Model gpt2 on lite samples, for eval leaderboard|hellaswag|10, metric acc_stderr incorrect
2024-11-29T07:43:37.6377879Z assert 0.16329931618554522 == 0.004968770338693327 ± 5.0e-07
2024-11-29T07:43:37.6377959Z   comparison failed
2024-11-29T07:43:37.6378038Z   Obtained: 0.16329931618554522
2024-11-29T07:43:37.6378174Z   Expected: 0.004968770338693327 ± 5.0e-07
2024-11-29T07:43:37.6378782Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_leaderboard|hellaswag|10_acc_norm_stderr] - AssertionError: Model gpt2 on lite samples, for eval leaderboard|hellaswag|10, metric acc_norm_stderr incorrect
2024-11-29T07:43:37.6378972Z assert 0.16329931618554522 == 0.004785693561320304 ± 4.8e-07
2024-11-29T07:43:37.6379058Z   comparison failed
2024-11-29T07:43:37.6379143Z   Obtained: 0.16329931618554522
2024-11-29T07:43:37.6379271Z   Expected: 0.004785693561320304 ± 4.8e-07
2024-11-29T07:43:37.6379915Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_leaderboard|mmlu:abstract_algebra|5_acc_stderr] - AssertionError: Model gpt2 on lite samples, for eval leaderboard|mmlu:abstract_algebra|5, metric acc_stderr incorrect
2024-11-29T07:43:37.6380103Z assert 0.16329931618554522 == 0.00497641685...3716 ± 5.0e-07
2024-11-29T07:43:37.6380192Z   comparison failed
2024-11-29T07:43:37.6380271Z   Obtained: 0.16329931618554522
2024-11-29T07:43:37.6380412Z   Expected: 0.0049764168560043716 ± 5.0e-07
2024-11-29T07:43:37.6381060Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_leaderboard|mmlu:college_chemistry|5_acc_stderr] - AssertionError: Model gpt2 on lite samples, for eval leaderboard|mmlu:college_chemistry|5, metric acc_stderr incorrect
2024-11-29T07:43:37.6381255Z assert 0.09999999999999999 == 0.002971327782118411 ± 3.0e-07
2024-11-29T07:43:37.6381334Z   comparison failed
2024-11-29T07:43:37.6381419Z   Obtained: 0.09999999999999999
2024-11-29T07:43:37.6381549Z   Expected: 0.002971327782118411 ± 3.0e-07
2024-11-29T07:43:37.6382200Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_leaderboard|mmlu:computer_security|5_acc_stderr] - AssertionError: Model gpt2 on lite samples, for eval leaderboard|mmlu:computer_security|5, metric acc_stderr incorrect
2024-11-29T07:43:37.6382386Z assert 0.09999999999999999 == 0.00301340183...0794 ± 3.0e-07
2024-11-29T07:43:37.6382469Z   comparison failed
2024-11-29T07:43:37.6382548Z   Obtained: 0.09999999999999999
2024-11-29T07:43:37.6382684Z   Expected: 0.0030134018302560794 ± 3.0e-07
2024-11-29T07:43:37.6383319Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_leaderboard|mmlu:us_foreign_policy|5_acc_stderr] - AssertionError: Model gpt2 on lite samples, for eval leaderboard|mmlu:us_foreign_policy|5, metric acc_stderr incorrect
2024-11-29T07:43:37.6383510Z assert 0.15275252316519466 == 0.004633704913049727 ± 4.6e-07
2024-11-29T07:43:37.6383588Z   comparison failed
2024-11-29T07:43:37.6383673Z   Obtained: 0.15275252316519466
2024-11-29T07:43:37.6383802Z   Expected: 0.004633704913049727 ± 4.6e-07
2024-11-29T07:43:37.6384380Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_helm|mmlu:abstract_algebra|5_em_stderr] - AssertionError: Model gpt2 on lite samples, for eval helm|mmlu:abstract_algebra|5, metric em_stderr incorrect
2024-11-29T07:43:37.6384568Z assert 0.16329931618554522 == 0.00497641685...3716 ± 5.0e-07
2024-11-29T07:43:37.6384652Z   comparison failed
2024-11-29T07:43:37.6384731Z   Obtained: 0.16329931618554522
2024-11-29T07:43:37.6384866Z   Expected: 0.0049764168560043716 ± 5.0e-07
2024-11-29T07:43:37.6385457Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_helm|mmlu:abstract_algebra|5_pqem_stderr] - AssertionError: Model gpt2 on lite samples, for eval helm|mmlu:abstract_algebra|5, metric pqem_stderr incorrect
2024-11-29T07:43:37.6385768Z assert 0.16329931618554522 == 0.00497641685...3716 ± 5.0e-07
2024-11-29T07:43:37.6385853Z   comparison failed
2024-11-29T07:43:37.6385940Z   Obtained: 0.16329931618554522
2024-11-29T07:43:37.6386075Z   Expected: 0.0049764168560043716 ± 5.0e-07
2024-11-29T07:43:37.6386663Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_helm|mmlu:college_chemistry|5_em_stderr] - AssertionError: Model gpt2 on lite samples, for eval helm|mmlu:college_chemistry|5, metric em_stderr incorrect
2024-11-29T07:43:37.6386960Z assert 0.15275252316519466 == 0.00457283509661358 ± 4.6e-07
2024-11-29T07:43:37.6387043Z   comparison failed
2024-11-29T07:43:37.6387122Z   Obtained: 0.15275252316519466
2024-11-29T07:43:37.6387256Z   Expected: 0.00457283509661358 ± 4.6e-07
2024-11-29T07:43:37.6387847Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_helm|mmlu:college_chemistry|5_pqem_stderr] - AssertionError: Model gpt2 on lite samples, for eval helm|mmlu:college_chemistry|5, metric pqem_stderr incorrect
2024-11-29T07:43:37.6388043Z assert 0.16329931618554522 == 0.004802280906184263 ± 4.8e-07
2024-11-29T07:43:37.6388121Z   comparison failed
2024-11-29T07:43:37.6388206Z   Obtained: 0.16329931618554522
2024-11-29T07:43:37.6388334Z   Expected: 0.004802280906184263 ± 4.8e-07
2024-11-29T07:43:37.6388931Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_helm|mmlu:computer_security|5_em_stderr] - AssertionError: Model gpt2 on lite samples, for eval helm|mmlu:computer_security|5, metric em_stderr incorrect
2024-11-29T07:43:37.6389119Z assert 0.09999999999999999 == 0.00301340183...0794 ± 3.0e-07
2024-11-29T07:43:37.6389202Z   comparison failed
2024-11-29T07:43:37.6389282Z   Obtained: 0.09999999999999999
2024-11-29T07:43:37.6389413Z   Expected: 0.0030134018302560794 ± 3.0e-07
2024-11-29T07:43:37.6390014Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_helm|mmlu:computer_security|5_pqem_stderr] - AssertionError: Model gpt2 on lite samples, for eval helm|mmlu:computer_security|5, metric pqem_stderr incorrect
2024-11-29T07:43:37.6390209Z assert 0.15275252316519464 == 0.004774519456032858 ± 4.8e-07
2024-11-29T07:43:37.6390290Z   comparison failed
2024-11-29T07:43:37.6390370Z   Obtained: 0.15275252316519464
2024-11-29T07:43:37.6390506Z   Expected: 0.004774519456032858 ± 4.8e-07
2024-11-29T07:43:37.6391078Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_helm|mmlu:us_foreign_policy|5_em_stderr] - AssertionError: Model gpt2 on lite samples, for eval helm|mmlu:us_foreign_policy|5, metric em_stderr incorrect
2024-11-29T07:43:37.6391269Z assert 0.15275252316519466 == 0.004633704913049727 ± 4.6e-07
2024-11-29T07:43:37.6391348Z   comparison failed
2024-11-29T07:43:37.6391434Z   Obtained: 0.15275252316519466
2024-11-29T07:43:37.6391565Z   Expected: 0.004633704913049727 ± 4.6e-07
2024-11-29T07:43:37.6392188Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_helm|mmlu:us_foreign_policy|5_pqem_stderr] - AssertionError: Model gpt2 on lite samples, for eval helm|mmlu:us_foreign_policy|5, metric pqem_stderr incorrect
2024-11-29T07:43:37.6392376Z assert 0.16329931618554522 == 0.004872014627084626 ± 4.9e-07
2024-11-29T07:43:37.6392463Z   comparison failed
2024-11-29T07:43:37.6392543Z   Obtained: 0.16329931618554522
2024-11-29T07:43:37.6392677Z   Expected: 0.004872014627084626 ± 4.9e-07
2024-11-29T07:43:37.6393206Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_lighteval|anli:r1|0_acc_stderr] - AssertionError: Model gpt2 on lite samples, for eval lighteval|anli:r1|0, metric acc_stderr incorrect
2024-11-29T07:43:37.6393398Z assert 0.16666666666666666 == 0.00514299138248941 ± 5.1e-07
2024-11-29T07:43:37.6393477Z   comparison failed
2024-11-29T07:43:37.6393562Z   Obtained: 0.16666666666666666
2024-11-29T07:43:37.6393691Z   Expected: 0.00514299138248941 ± 5.1e-07
2024-11-29T07:43:37.6394314Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_lighteval|blimp:adjunct_island|0_acc_stderr] - AssertionError: Model gpt2 on lite samples, for eval lighteval|blimp:adjunct_island|0, metric acc_stderr incorrect
2024-11-29T07:43:37.6394620Z assert 0.13333333333333333 == 0.003921139545506534 ± 3.9e-07
2024-11-29T07:43:37.6394706Z   comparison failed
2024-11-29T07:43:37.6394786Z   Obtained: 0.13333333333333333
2024-11-29T07:43:37.6394936Z   Expected: 0.003921139545506534 ± 3.9e-07
2024-11-29T07:43:37.6395556Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_lighteval|blimp:ellipsis_n_bar_1|0_acc_stderr] - AssertionError: Model gpt2 on lite samples, for eval lighteval|blimp:ellipsis_n_bar_1|0, metric acc_stderr incorrect
2024-11-29T07:43:37.6395862Z assert 0.15275252316519466 == 0.004709524351738684 ± 4.7e-07
2024-11-29T07:43:37.6395940Z   comparison failed
2024-11-29T07:43:37.6396024Z   Obtained: 0.15275252316519466
2024-11-29T07:43:37.6396154Z   Expected: 0.004709524351738684 ± 4.7e-07
2024-11-29T07:43:37.6396643Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_helm|boolq|5_em_stderr] - AssertionError: Model gpt2 on lite samples, for eval helm|boolq|5, metric em_stderr incorrect
2024-11-29T07:43:37.6396833Z assert 0.16329931618554522 == 0.004922599108396079 ± 4.9e-07
2024-11-29T07:43:37.6396915Z   comparison failed
2024-11-29T07:43:37.6396993Z   Obtained: 0.16329931618554522
2024-11-29T07:43:37.6397128Z   Expected: 0.004922599108396079 ± 4.9e-07
2024-11-29T07:43:37.6397841Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_helm|boolq|5_qem_stderr] - AssertionError: Model gpt2 on lite samples, for eval helm|boolq|5, metric qem_stderr incorrect
2024-11-29T07:43:37.6398053Z assert 0.16329931618554522 == 0.004922599108396079 ± 4.9e-07
2024-11-29T07:43:37.6398133Z   comparison failed
2024-11-29T07:43:37.6398217Z   Obtained: 0.16329931618554522
2024-11-29T07:43:37.6398346Z   Expected: 0.004922599108396079 ± 4.9e-07
2024-11-29T07:43:37.6398841Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_helm|boolq|5_pem_stderr] - AssertionError: Model gpt2 on lite samples, for eval helm|boolq|5, metric pem_stderr incorrect
2024-11-29T07:43:37.6399021Z assert 0.16329931618554522 == 0.004922599108396079 ± 4.9e-07
2024-11-29T07:43:37.6399112Z   comparison failed
2024-11-29T07:43:37.6399192Z   Obtained: 0.16329931618554522
2024-11-29T07:43:37.6399338Z   Expected: 0.004922599108396079 ± 4.9e-07
2024-11-29T07:43:37.6399834Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_helm|boolq|5_pqem_stderr] - AssertionError: Model gpt2 on lite samples, for eval helm|boolq|5, metric pqem_stderr incorrect
2024-11-29T07:43:37.6400019Z assert 0.16329931618554522 == 0.004922599108396079 ± 4.9e-07
2024-11-29T07:43:37.6400104Z   comparison failed
2024-11-29T07:43:37.6400183Z   Obtained: 0.16329931618554522
2024-11-29T07:43:37.6400316Z   Expected: 0.004922599108396079 ± 4.9e-07
2024-11-29T07:43:37.6400908Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_lighteval|agieval:aqua-rat|0_acc_stderr] - AssertionError: Model gpt2 on lite samples, for eval lighteval|agieval:aqua-rat|0, metric acc_stderr incorrect
2024-11-29T07:43:37.6401065Z assert 0.15275 == 0.004774519456032858 ± 4.8e-07
2024-11-29T07:43:37.6401143Z   comparison failed
2024-11-29T07:43:37.6401230Z   Obtained: 0.15275
2024-11-29T07:43:37.6401359Z   Expected: 0.004774519456032858 ± 4.8e-07
2024-11-29T07:43:37.6401990Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_lighteval|agieval:aqua-rat|0_acc_norm_stderr] - AssertionError: Model gpt2 on lite samples, for eval lighteval|agieval:aqua-rat|0, metric acc_norm_stderr incorrect
2024-11-29T07:43:37.6402137Z assert 0.15275 == 0.004774519456032858 ± 4.8e-07
2024-11-29T07:43:37.6402227Z   comparison failed
2024-11-29T07:43:37.6402302Z   Obtained: 0.15275
2024-11-29T07:43:37.6402436Z   Expected: 0.004774519456032858 ± 4.8e-07
2024-11-29T07:43:37.6403034Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_lighteval|agieval:logiqa-en|0_acc_stderr] - AssertionError: Model gpt2 on lite samples, for eval lighteval|agieval:logiqa-en|0, metric acc_stderr incorrect
2024-11-29T07:43:37.6403175Z assert 0.1 == 0.00309049205...1304 ± 3.1e-07
2024-11-29T07:43:37.6403253Z   comparison failed
2024-11-29T07:43:37.6403523Z   Obtained: 0.1
2024-11-29T07:43:37.6403663Z   Expected: 0.0030904920548581304 ± 3.1e-07
2024-11-29T07:43:37.6404293Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_lighteval|agieval:logiqa-en|0_acc_norm_stderr] - AssertionError: Model gpt2 on lite samples, for eval lighteval|agieval:logiqa-en|0, metric acc_norm_stderr incorrect
2024-11-29T07:43:37.6404440Z assert 0.15275 == 0.00457742206...5185 ± 4.6e-07
2024-11-29T07:43:37.6404663Z   comparison failed
2024-11-29T07:43:37.6404739Z   Obtained: 0.15275
2024-11-29T07:43:37.6404882Z   Expected: 0.0045774220684565185 ± 4.6e-07
2024-11-29T07:43:37.6405462Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_lighteval|agieval:lsat-ar|0_acc_stderr] - AssertionError: Model gpt2 on lite samples, for eval lighteval|agieval:lsat-ar|0, metric acc_stderr incorrect
2024-11-29T07:43:37.6405602Z assert 0.1 == 0.00301340183...0794 ± 3.0e-07
2024-11-29T07:43:37.6405683Z   comparison failed
2024-11-29T07:43:37.6405763Z   Obtained: 0.1
2024-11-29T07:43:37.6405903Z   Expected: 0.0030134018302560794 ± 3.0e-07
2024-11-29T07:43:37.6406551Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_lighteval|agieval:lsat-ar|0_acc_norm_stderr] - AssertionError: Model gpt2 on lite samples, for eval lighteval|agieval:lsat-ar|0, metric acc_norm_stderr incorrect
2024-11-29T07:43:37.6406686Z assert 0.1 == 0.00301340183...0794 ± 3.0e-07
2024-11-29T07:43:37.6406770Z   comparison failed
2024-11-29T07:43:37.6406855Z   Obtained: 0.1
2024-11-29T07:43:37.6406992Z   Expected: 0.0030134018302560794 ± 3.0e-07
2024-11-29T07:43:37.6407568Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_lighteval|agieval:lsat-lr|0_acc_stderr] - AssertionError: Model gpt2 on lite samples, for eval lighteval|agieval:lsat-lr|0, metric acc_stderr incorrect
2024-11-29T07:43:37.6407719Z assert 0.13333 == 0.004077368628777015 ± 4.1e-07
2024-11-29T07:43:37.6407797Z   comparison failed
2024-11-29T07:43:37.6407876Z   Obtained: 0.13333
2024-11-29T07:43:37.6408006Z   Expected: 0.004077368628777015 ± 4.1e-07
2024-11-29T07:43:37.6408624Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_lighteval|agieval:lsat-lr|0_acc_norm_stderr] - AssertionError: Model gpt2 on lite samples, for eval lighteval|agieval:lsat-lr|0, metric acc_norm_stderr incorrect
2024-11-29T07:43:37.6408770Z assert 0.13333 == 0.004077368628777015 ± 4.1e-07
2024-11-29T07:43:37.6408863Z   comparison failed
2024-11-29T07:43:37.6408938Z   Obtained: 0.13333
2024-11-29T07:43:37.6409081Z   Expected: 0.004077368628777015 ± 4.1e-07
2024-11-29T07:43:37.6409655Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_lighteval|agieval:lsat-rc|0_acc_stderr] - AssertionError: Model gpt2 on lite samples, for eval lighteval|agieval:lsat-rc|0, metric acc_stderr incorrect
2024-11-29T07:43:37.6409804Z assert 0.15275 == 0.004582352884486063 ± 4.6e-07
2024-11-29T07:43:37.6409883Z   comparison failed
2024-11-29T07:43:37.6409957Z   Obtained: 0.15275
2024-11-29T07:43:37.6410092Z   Expected: 0.004582352884486063 ± 4.6e-07
2024-11-29T07:43:37.6410702Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_lighteval|agieval:lsat-rc|0_acc_norm_stderr] - AssertionError: Model gpt2 on lite samples, for eval lighteval|agieval:lsat-rc|0, metric acc_norm_stderr incorrect
2024-11-29T07:43:37.6410852Z assert 0.13333 == 0.004080117321518739 ± 4.1e-07
2024-11-29T07:43:37.6410928Z   comparison failed
2024-11-29T07:43:37.6411009Z   Obtained: 0.13333
2024-11-29T07:43:37.6411137Z   Expected: 0.004080117321518739 ± 4.1e-07
2024-11-29T07:43:37.6411844Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_lighteval|agieval:sat-en-without-passage|0_acc_stderr] - AssertionError: Model gpt2 on lite samples, for eval lighteval|agieval:sat-en-without-passage|0, metric acc_stderr incorrect
2024-11-29T07:43:37.6411987Z assert 0.13333 == 0.00388625951...6192 ± 3.9e-07
2024-11-29T07:43:37.6412070Z   comparison failed
2024-11-29T07:43:37.6412145Z   Obtained: 0.13333
2024-11-29T07:43:37.6412281Z   Expected: 0.0038862595143676192 ± 3.9e-07
2024-11-29T07:43:37.6413126Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_lighteval|agieval:sat-en-without-passage|0_acc_norm_stderr] - AssertionError: Model gpt2 on lite samples, for eval lighteval|agieval:sat-en-without-passage|0, metric acc_norm_stderr incorrect
2024-11-29T07:43:37.6413281Z assert 0.15275 == 0.004538042951960014 ± 4.5e-07
2024-11-29T07:43:37.6413358Z   comparison failed
2024-11-29T07:43:37.6413438Z   Obtained: 0.15275
2024-11-29T07:43:37.6413704Z   Expected: 0.004538042951960014 ± 4.5e-07
2024-11-29T07:43:37.6414280Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_lighteval|agieval:sat-en|0_acc_stderr] - AssertionError: Model gpt2 on lite samples, for eval lighteval|agieval:sat-en|0, metric acc_stderr incorrect
2024-11-29T07:43:37.6414426Z assert 0.13333 == 0.00388625951...6192 ± 3.9e-07
2024-11-29T07:43:37.6414509Z   comparison failed
2024-11-29T07:43:37.6414585Z   Obtained: 0.13333
2024-11-29T07:43:37.6414723Z   Expected: 0.0038862595143676192 ± 3.9e-07
2024-11-29T07:43:37.6415338Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_lighteval|agieval:sat-en|0_acc_norm_stderr] - AssertionError: Model gpt2 on lite samples, for eval lighteval|agieval:sat-en|0, metric acc_norm_stderr incorrect
2024-11-29T07:43:37.6415490Z assert 0.15275 == 0.004538042951960014 ± 4.5e-07
2024-11-29T07:43:37.6415566Z   comparison failed
2024-11-29T07:43:37.6415652Z   Obtained: 0.15275
2024-11-29T07:43:37.6415783Z   Expected: 0.004538042951960014 ± 4.5e-07
2024-11-29T07:43:37.6416384Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_lighteval|agieval:sat-math|0_acc_stderr] - AssertionError: Model gpt2 on lite samples, for eval lighteval|agieval:sat-math|0, metric acc_stderr incorrect
2024-11-29T07:43:37.6416528Z assert 0.15275 == 0.004664521171326971 ± 4.7e-07
2024-11-29T07:43:37.6416610Z   comparison failed
2024-11-29T07:43:37.6416685Z   Obtained: 0.15275
2024-11-29T07:43:37.6416820Z   Expected: 0.004664521171326971 ± 4.7e-07
2024-11-29T07:43:37.6417442Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_lighteval|agieval:sat-math|0_acc_norm_stderr] - AssertionError: Model gpt2 on lite samples, for eval lighteval|agieval:sat-math|0, metric acc_norm_stderr incorrect
2024-11-29T07:43:37.6417581Z assert 0.1 == 0.00301340183...0794 ± 3.0e-07
2024-11-29T07:43:37.6417657Z   comparison failed
2024-11-29T07:43:37.6417737Z   Obtained: 0.1
2024-11-29T07:43:37.6417867Z   Expected: 0.0030134018302560794 ± 3.0e-07
2024-11-29T07:43:37.6418526Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_lighteval|bigbench:causal_judgment|3_acc_stderr] - AssertionError: Model gpt2 on lite samples, for eval lighteval|bigbench:causal_judgment|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6418718Z assert 0.16666666666666666 == 0.004861068811484776 ± 4.9e-07
2024-11-29T07:43:37.6418800Z   comparison failed
2024-11-29T07:43:37.6418880Z   Obtained: 0.16666666666666666
2024-11-29T07:43:37.6419015Z   Expected: 0.004861068811484776 ± 4.9e-07
2024-11-29T07:43:37.6419644Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_harness|bigbench:causal_judgment|3_acc_stderr] - AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:causal_judgment|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6419797Z assert 0.1633 == 0.004699965923246645 ± 4.7e-07
2024-11-29T07:43:37.6419874Z   comparison failed
2024-11-29T07:43:37.6419955Z   Obtained: 0.1633
2024-11-29T07:43:37.6420084Z   Expected: 0.004699965923246645 ± 4.7e-07
2024-11-29T07:43:37.6420754Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_harness|bigbench:causal_judgment|3_acc_norm_stderr] - AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:causal_judgment|3, metric acc_norm_stderr incorrect
2024-11-29T07:43:37.6420939Z assert 0.16666666666666666 == 0.004861068811484776 ± 4.9e-07
2024-11-29T07:43:37.6421022Z   comparison failed
2024-11-29T07:43:37.6421102Z   Obtained: 0.16666666666666666
2024-11-29T07:43:37.6421231Z   Expected: 0.004861068811484776 ± 4.9e-07
2024-11-29T07:43:37.6422015Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_harness|bigbench:date_understanding|3_acc_stderr] - AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:date_understanding|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6422202Z assert 0.13333333333333333 == 0.003987436475939113 ± 4.0e-07
2024-11-29T07:43:37.6422286Z   comparison failed
2024-11-29T07:43:37.6422366Z   Obtained: 0.13333333333333333
2024-11-29T07:43:37.6422500Z   Expected: 0.003987436475939113 ± 4.0e-07
2024-11-29T07:43:37.6423292Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_harness|bigbench:date_understanding|3_acc_norm_stderr] - AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:date_understanding|3, metric acc_norm_stderr incorrect
2024-11-29T07:43:37.6423484Z assert 0.13333333333333333 == 0.003987436475939113 ± 4.0e-07
2024-11-29T07:43:37.6423563Z   comparison failed
2024-11-29T07:43:37.6423646Z   Obtained: 0.13333333333333333
2024-11-29T07:43:37.6423775Z   Expected: 0.003987436475939113 ± 4.0e-07
2024-11-29T07:43:37.6424449Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_lighteval|bigbench:disambiguation_qa|3_acc_stderr] - AssertionError: Model gpt2 on lite samples, for eval lighteval|bigbench:disambiguation_qa|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6424629Z assert 0.15275252316519466 == 0.004650507199996266 ± 4.7e-07
2024-11-29T07:43:37.6424713Z   comparison failed
2024-11-29T07:43:37.6424792Z   Obtained: 0.15275252316519466
2024-11-29T07:43:37.6424933Z   Expected: 0.004650507199996266 ± 4.7e-07
2024-11-29T07:43:37.6425575Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_harness|bigbench:disambiguation_qa|3_acc_stderr] - AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:disambiguation_qa|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6425761Z assert 0.15275252316519466 == 0.004650507199996266 ± 4.7e-07
2024-11-29T07:43:37.6425840Z   comparison failed
2024-11-29T07:43:37.6425926Z   Obtained: 0.15275252316519466
2024-11-29T07:43:37.6426055Z   Expected: 0.004650507199996266 ± 4.7e-07
2024-11-29T07:43:37.6426738Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_harness|bigbench:disambiguation_qa|3_acc_norm_stderr] - AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:disambiguation_qa|3, metric acc_norm_stderr incorrect
2024-11-29T07:43:37.6426920Z assert 0.15275252316519464 == 0.004582439170218064 ± 4.6e-07
2024-11-29T07:43:37.6427005Z   comparison failed
2024-11-29T07:43:37.6427091Z   Obtained: 0.15275252316519464
2024-11-29T07:43:37.6427225Z   Expected: 0.004582439170218064 ± 4.6e-07
2024-11-29T07:43:37.6427880Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_lighteval|bigbench:geometric_shapes|3_acc_stderr] - AssertionError: Model gpt2 on lite samples, for eval lighteval|bigbench:geometric_shapes|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6428065Z assert 0.13333333333333333 == 0.004000255247111385 ± 4.0e-07
2024-11-29T07:43:37.6428144Z   comparison failed
2024-11-29T07:43:37.6428227Z   Obtained: 0.13333333333333333
2024-11-29T07:43:37.6428363Z   Expected: 0.004000255247111385 ± 4.0e-07
2024-11-29T07:43:37.6429040Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_harness|bigbench:geometric_shapes|3_acc_norm_stderr] - AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:geometric_shapes|3, metric acc_norm_stderr incorrect
2024-11-29T07:43:37.6429220Z assert 0.13333333333333333 == 0.004041744140305727 ± 4.0e-07
2024-11-29T07:43:37.6429309Z   comparison failed
2024-11-29T07:43:37.6429388Z   Obtained: 0.13333333333333333
2024-11-29T07:43:37.6429522Z   Expected: 0.004041744140305727 ± 4.0e-07
2024-11-29T07:43:37.6430267Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_lighteval|bigbench:logical_deduction_five_objects|3_acc_stderr] - AssertionError: Model gpt2 on lite samples, for eval lighteval|bigbench:logical_deduction_five_objects|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6430407Z assert 0.1 == 0.003030251214408201 ± 3.0e-07
2024-11-29T07:43:37.6430484Z   comparison failed
2024-11-29T07:43:37.6430690Z   Obtained: 0.1
2024-11-29T07:43:37.6430826Z   Expected: 0.003030251214408201 ± 3.0e-07
2024-11-29T07:43:37.6431557Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_harness|bigbench:logical_deduction_five_objects|3_acc_stderr] - AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:logical_deduction_five_objects|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6431737Z assert 0.15275252316519464 == 0.004381117916034022 ± 4.4e-07
2024-11-29T07:43:37.6431937Z   comparison failed
2024-11-29T07:43:37.6432018Z   Obtained: 0.15275252316519464
2024-11-29T07:43:37.6432154Z   Expected: 0.004381117916034022 ± 4.4e-07
2024-11-29T07:43:37.6432909Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_harness|bigbench:logical_deduction_five_objects|3_acc_norm_stderr] - AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:logical_deduction_five_objects|3, metric acc_norm_stderr incorrect
2024-11-29T07:43:37.6433103Z assert 0.15275252316519464 == 0.004480319549235682 ± 4.5e-07
2024-11-29T07:43:37.6433182Z   comparison failed
2024-11-29T07:43:37.6433267Z   Obtained: 0.15275252316519464
2024-11-29T07:43:37.6433396Z   Expected: 0.004480319549235682 ± 4.5e-07
2024-11-29T07:43:37.6434148Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_lighteval|bigbench:logical_deduction_seven_objects|3_acc_stderr] - AssertionError: Model gpt2 on lite samples, for eval lighteval|bigbench:logical_deduction_seven_objects|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6434339Z assert 0.13333333333333333 == 0.00393258505632432 ± 3.9e-07
2024-11-29T07:43:37.6434423Z   comparison failed
2024-11-29T07:43:37.6434502Z   Obtained: 0.13333333333333333
2024-11-29T07:43:37.6434634Z   Expected: 0.00393258505632432 ± 3.9e-07
2024-11-29T07:43:37.6435374Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_harness|bigbench:logical_deduction_seven_objects|3_acc_stderr] - AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:logical_deduction_seven_objects|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6435565Z assert 0.13333333333333333 == 0.004006756056224812 ± 4.0e-07
2024-11-29T07:43:37.6435646Z   comparison failed
2024-11-29T07:43:37.6435730Z   Obtained: 0.13333333333333333
2024-11-29T07:43:37.6435858Z   Expected: 0.004006756056224812 ± 4.0e-07
2024-11-29T07:43:37.6436626Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_harness|bigbench:logical_deduction_seven_objects|3_acc_norm_stderr] - AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:logical_deduction_seven_objects|3, metric acc_norm_stderr incorrect
2024-11-29T07:43:37.6436813Z assert 0.13333333333333333 == 0.004006756056224812 ± 4.0e-07
2024-11-29T07:43:37.6436897Z   comparison failed
2024-11-29T07:43:37.6436975Z   Obtained: 0.13333333333333333
2024-11-29T07:43:37.6437109Z   Expected: 0.004006756056224812 ± 4.0e-07
2024-11-29T07:43:37.6438097Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_lighteval|bigbench:logical_deduction_three_objects|3_acc_stderr] - AssertionError: Model gpt2 on lite samples, for eval lighteval|bigbench:logical_deduction_three_objects|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6438261Z assert 0.1633 == 0.00504952053...2955 ± 5.0e-07
2024-11-29T07:43:37.6438343Z   comparison failed
2024-11-29T07:43:37.6438425Z   Obtained: 0.1633
2024-11-29T07:43:37.6438557Z   Expected: 0.0050495205374032955 ± 5.0e-07
2024-11-29T07:43:37.6439295Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_harness|bigbench:logical_deduction_three_objects|3_acc_stderr] - AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:logical_deduction_three_objects|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6439483Z assert 0.13333333333333333 == 0.00393258505632432 ± 3.9e-07
2024-11-29T07:43:37.6439566Z   comparison failed
2024-11-29T07:43:37.6439644Z   Obtained: 0.13333333333333333
2024-11-29T07:43:37.6439771Z   Expected: 0.00393258505632432 ± 3.9e-07
2024-11-29T07:43:37.6440695Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_harness|bigbench:logical_deduction_three_objects|3_acc_norm_stderr] - AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:logical_deduction_three_objects|3, metric acc_norm_stderr incorrect
2024-11-29T07:43:37.6440890Z assert 0.15275252316519466 == 0.004569520819574033 ± 4.6e-07
2024-11-29T07:43:37.6440969Z   comparison failed
2024-11-29T07:43:37.6441048Z   Obtained: 0.15275252316519466
2024-11-29T07:43:37.6441328Z   Expected: 0.004569520819574033 ± 4.6e-07
2024-11-29T07:43:37.6442012Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_lighteval|bigbench:movie_recommendation|3_acc_stderr] - AssertionError: Model gpt2 on lite samples, for eval lighteval|bigbench:movie_recommendation|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6442199Z assert 0.15275252316519466 == 0.004428245629971239 ± 4.4e-07
2024-11-29T07:43:37.6442276Z   comparison failed
2024-11-29T07:43:37.6442361Z   Obtained: 0.15275252316519466
2024-11-29T07:43:37.6442489Z   Expected: 0.004428245629971239 ± 4.4e-07
2024-11-29T07:43:37.6443171Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_harness|bigbench:movie_recommendation|3_acc_stderr] - AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:movie_recommendation|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6443353Z assert 0.16666666666666666 == 0.004907190695349086 ± 4.9e-07
2024-11-29T07:43:37.6443437Z   comparison failed
2024-11-29T07:43:37.6443516Z   Obtained: 0.16666666666666666
2024-11-29T07:43:37.6443659Z   Expected: 0.004907190695349086 ± 4.9e-07
2024-11-29T07:43:37.6444350Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_harness|bigbench:movie_recommendation|3_acc_norm_stderr] - AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:movie_recommendation|3, metric acc_norm_stderr incorrect
2024-11-29T07:43:37.6444536Z assert 0.15275252316519464 == 0.004703372376466875 ± 4.7e-07
2024-11-29T07:43:37.6444615Z   comparison failed
2024-11-29T07:43:37.6444700Z   Obtained: 0.15275252316519464
2024-11-29T07:43:37.6444835Z   Expected: 0.004703372376466875 ± 4.7e-07
2024-11-29T07:43:37.6445445Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_lighteval|bigbench:navigate|3_acc_stderr] - AssertionError: Model gpt2 on lite samples, for eval lighteval|bigbench:navigate|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6445589Z assert 0.1633 == 0.00471801862010947 ± 4.7e-07
2024-11-29T07:43:37.6445673Z   comparison failed
2024-11-29T07:43:37.6445755Z   Obtained: 0.1633
2024-11-29T07:43:37.6445888Z   Expected: 0.00471801862010947 ± 4.7e-07
2024-11-29T07:43:37.6446467Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_harness|bigbench:navigate|3_acc_stderr] - AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:navigate|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6446614Z assert 0.1633 == 0.00471801862010947 ± 4.7e-07
2024-11-29T07:43:37.6446692Z   comparison failed
2024-11-29T07:43:37.6446775Z   Obtained: 0.1633
2024-11-29T07:43:37.6446901Z   Expected: 0.00471801862010947 ± 4.7e-07
2024-11-29T07:43:37.6447522Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_harness|bigbench:navigate|3_acc_norm_stderr] - AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:navigate|3, metric acc_norm_stderr incorrect
2024-11-29T07:43:37.6447664Z assert 0.1633 == 0.00471801862010947 ± 4.7e-07
2024-11-29T07:43:37.6447747Z   comparison failed
2024-11-29T07:43:37.6447822Z   Obtained: 0.1633
2024-11-29T07:43:37.6447960Z   Expected: 0.00471801862010947 ± 4.7e-07
2024-11-29T07:43:37.6448715Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_lighteval|bigbench:reasoning_about_colored_objects|3_acc_stderr] - AssertionError: Model gpt2 on lite samples, for eval lighteval|bigbench:reasoning_about_colored_objects|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6448901Z assert 0.13333333333333333 == 0.004000255247111385 ± 4.0e-07
2024-11-29T07:43:37.6448979Z   comparison failed
2024-11-29T07:43:37.6449064Z   Obtained: 0.13333333333333333
2024-11-29T07:43:37.6449338Z   Expected: 0.004000255247111385 ± 4.0e-07
2024-11-29T07:43:37.6450074Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_harness|bigbench:reasoning_about_colored_objects|3_acc_stderr] - AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:reasoning_about_colored_objects|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6450257Z assert 0.13333333333333333 == 0.00405961457...4385 ± 4.1e-07
2024-11-29T07:43:37.6450460Z   comparison failed
2024-11-29T07:43:37.6450541Z   Obtained: 0.13333333333333333
2024-11-29T07:43:37.6450681Z   Expected: 0.0040596145716644385 ± 4.1e-07
2024-11-29T07:43:37.6451436Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_harness|bigbench:reasoning_about_colored_objects|3_acc_norm_stderr] - AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:reasoning_about_colored_objects|3, metric acc_norm_stderr incorrect
2024-11-29T07:43:37.6451576Z assert 0.1 == 0.002971327782118411 ± 3.0e-07
2024-11-29T07:43:37.6451653Z   comparison failed
2024-11-29T07:43:37.6451744Z   Obtained: 0.1
2024-11-29T07:43:37.6451876Z   Expected: 0.002971327782118411 ± 3.0e-07
2024-11-29T07:43:37.6452488Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_lighteval|bigbench:ruin_names|3_acc_stderr] - AssertionError: Model gpt2 on lite samples, for eval lighteval|bigbench:ruin_names|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6452667Z assert 0.15275252316519464 == 0.00459225857...0545 ± 4.6e-07
2024-11-29T07:43:37.6452756Z   comparison failed
2024-11-29T07:43:37.6452836Z   Obtained: 0.15275252316519464
2024-11-29T07:43:37.6452971Z   Expected: 0.0045922585770880545 ± 4.6e-07
2024-11-29T07:43:37.6453558Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_harness|bigbench:ruin_names|3_acc_stderr] - AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:ruin_names|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6453742Z assert 0.13333333333333333 == 0.004037827888116828 ± 4.0e-07
2024-11-29T07:43:37.6453820Z   comparison failed
2024-11-29T07:43:37.6453912Z   Obtained: 0.13333333333333333
2024-11-29T07:43:37.6454041Z   Expected: 0.004037827888116828 ± 4.0e-07
2024-11-29T07:43:37.6454664Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_harness|bigbench:ruin_names|3_acc_norm_stderr] - AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:ruin_names|3, metric acc_norm_stderr incorrect
2024-11-29T07:43:37.6454858Z assert 0.13333333333333333 == 0.004037827888116828 ± 4.0e-07
2024-11-29T07:43:37.6454958Z   comparison failed
2024-11-29T07:43:37.6455038Z   Obtained: 0.13333333333333333
2024-11-29T07:43:37.6455166Z   Expected: 0.004037827888116828 ± 4.0e-07
2024-11-29T07:43:37.6455946Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_lighteval|bigbench:salient_translation_error_detection|3_acc_stderr] - AssertionError: Model gpt2 on lite samples, for eval lighteval|bigbench:salient_translation_error_detection|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6456092Z assert 0.1633 == 0.00497231172432741 ± 5.0e-07
2024-11-29T07:43:37.6456176Z   comparison failed
2024-11-29T07:43:37.6456253Z   Obtained: 0.1633
2024-11-29T07:43:37.6456386Z   Expected: 0.00497231172432741 ± 5.0e-07
2024-11-29T07:43:37.6457157Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_harness|bigbench:salient_translation_error_detection|3_acc_stderr] - AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:salient_translation_error_detection|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6457305Z assert 0.1 == 0.00304553201...4616 ± 3.0e-07
2024-11-29T07:43:37.6457383Z   comparison failed
2024-11-29T07:43:37.6457463Z   Obtained: 0.1
2024-11-29T07:43:37.6457594Z   Expected: 0.0030455320167854616 ± 3.0e-07
2024-11-29T07:43:37.6458387Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_harness|bigbench:salient_translation_error_detection|3_acc_norm_stderr] - AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:salient_translation_error_detection|3, metric acc_norm_stderr incorrect
2024-11-29T07:43:37.6458646Z assert 0.1 == 0.00304553201...4616 ± 3.0e-07
2024-11-29T07:43:37.6458743Z   comparison failed
2024-11-29T07:43:37.6458818Z   Obtained: 0.1
2024-11-29T07:43:37.6458958Z   Expected: 0.0030455320167854616 ± 3.0e-07
2024-11-29T07:43:37.6459546Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_lighteval|bigbench:snarks|3_acc_stderr] - AssertionError: Model gpt2 on lite samples, for eval lighteval|bigbench:snarks|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6459819Z assert 0.1633 == 0.004897013451149668 ± 4.9e-07
2024-11-29T07:43:37.6459898Z   comparison failed
2024-11-29T07:43:37.6459977Z   Obtained: 0.1633
2024-11-29T07:43:37.6460107Z   Expected: 0.004897013451149668 ± 4.9e-07
2024-11-29T07:43:37.6460680Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_harness|bigbench:snarks|3_acc_stderr] - AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:snarks|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6460824Z assert 0.1633 == 0.004897013451149668 ± 4.9e-07
2024-11-29T07:43:37.6460916Z   comparison failed
2024-11-29T07:43:37.6460993Z   Obtained: 0.1633
2024-11-29T07:43:37.6461129Z   Expected: 0.004897013451149668 ± 4.9e-07
2024-11-29T07:43:37.6461720Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_harness|bigbench:snarks|3_acc_norm_stderr] - AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:snarks|3, metric acc_norm_stderr incorrect
2024-11-29T07:43:37.6461881Z assert 0.1633 == 0.004897013451149668 ± 4.9e-07
2024-11-29T07:43:37.6461960Z   comparison failed
2024-11-29T07:43:37.6462040Z   Obtained: 0.1633
2024-11-29T07:43:37.6462168Z   Expected: 0.004897013451149668 ± 4.9e-07
2024-11-29T07:43:37.6462856Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_lighteval|bigbench:sports_understanding|3_acc_stderr] - AssertionError: Model gpt2 on lite samples, for eval lighteval|bigbench:sports_understanding|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6462999Z assert 0.1633 == 0.005037214858781963 ± 5.0e-07
2024-11-29T07:43:37.6463089Z   comparison failed
2024-11-29T07:43:37.6463164Z   Obtained: 0.1633
2024-11-29T07:43:37.6463298Z   Expected: 0.005037214858781963 ± 5.0e-07
2024-11-29T07:43:37.6463959Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_harness|bigbench:sports_understanding|3_acc_stderr] - AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:sports_understanding|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6464113Z assert 0.1633 == 0.0049194014382352 ± 4.9e-07
2024-11-29T07:43:37.6464191Z   comparison failed
2024-11-29T07:43:37.6464271Z   Obtained: 0.1633
2024-11-29T07:43:37.6464398Z   Expected: 0.0049194014382352 ± 4.9e-07
2024-11-29T07:43:37.6465099Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_harness|bigbench:sports_understanding|3_acc_norm_stderr] - AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:sports_understanding|3, metric acc_norm_stderr incorrect
2024-11-29T07:43:37.6465239Z assert 0.1633 == 0.0049194014382352 ± 4.9e-07
2024-11-29T07:43:37.6465328Z   comparison failed
2024-11-29T07:43:37.6465406Z   Obtained: 0.1633
2024-11-29T07:43:37.6465537Z   Expected: 0.0049194014382352 ± 4.9e-07
2024-11-29T07:43:37.6466189Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_harness|bigbench:temporal_sequences|3_acc_stderr] - AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:temporal_sequences|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6466328Z assert 0.1 == 0.00294961187...1973 ± 2.9e-07
2024-11-29T07:43:37.6466414Z   comparison failed
2024-11-29T07:43:37.6466495Z   Obtained: 0.1
2024-11-29T07:43:37.6466629Z   Expected: 0.0029496118745031973 ± 2.9e-07
2024-11-29T07:43:37.6467308Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_harness|bigbench:temporal_sequences|3_acc_norm_stderr] - AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:temporal_sequences|3, metric acc_norm_stderr incorrect
2024-11-29T07:43:37.6467442Z assert 0.1 == 0.00294961187...1973 ± 2.9e-07
2024-11-29T07:43:37.6467525Z   comparison failed
2024-11-29T07:43:37.6467727Z   Obtained: 0.1
2024-11-29T07:43:37.6467866Z   Expected: 0.0029496118745031973 ± 2.9e-07
2024-11-29T07:43:37.6468663Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_lighteval|bigbench:tracking_shuffled_objects_five_objects|3_acc_stderr] - AssertionError: Model gpt2 on lite samples, for eval lighteval|bigbench:tracking_shuffled_objects_five_objects|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6468973Z assert 0.13333333333333333 == 0.004030304374777823 ± 4.0e-07
2024-11-29T07:43:37.6469052Z   comparison failed
2024-11-29T07:43:37.6469132Z   Obtained: 0.13333333333333333
2024-11-29T07:43:37.6469270Z   Expected: 0.004030304374777823 ± 4.0e-07
2024-11-29T07:43:37.6470041Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_harness|bigbench:tracking_shuffled_objects_five_objects|3_acc_stderr] - AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:tracking_shuffled_objects_five_objects|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6470237Z assert 0.13333333333333333 == 0.003987436475939113 ± 4.0e-07
2024-11-29T07:43:37.6470316Z   comparison failed
2024-11-29T07:43:37.6470404Z   Obtained: 0.13333333333333333
2024-11-29T07:43:37.6470534Z   Expected: 0.003987436475939113 ± 4.0e-07
2024-11-29T07:43:37.6471346Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_harness|bigbench:tracking_shuffled_objects_five_objects|3_acc_norm_stderr] - AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:tracking_shuffled_objects_five_objects|3, metric acc_norm_stderr incorrect
2024-11-29T07:43:37.6471489Z assert 0.1 == 0.00294083125...9783 ± 2.9e-07
2024-11-29T07:43:37.6471576Z   comparison failed
2024-11-29T07:43:37.6471651Z   Obtained: 0.1
2024-11-29T07:43:37.6471791Z   Expected: 0.0029408312580779783 ± 2.9e-07
2024-11-29T07:43:37.6472593Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_lighteval|bigbench:tracking_shuffled_objects_seven_objects|3_acc_stderr] - AssertionError: Model gpt2 on lite samples, for eval lighteval|bigbench:tracking_shuffled_objects_seven_objects|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6472779Z assert 0.15275252316519464 == 0.004588830718970504 ± 4.6e-07
2024-11-29T07:43:37.6472856Z   comparison failed
2024-11-29T07:43:37.6472940Z   Obtained: 0.15275252316519464
2024-11-29T07:43:37.6473070Z   Expected: 0.004588830718970504 ± 4.6e-07
2024-11-29T07:43:37.6473854Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_harness|bigbench:tracking_shuffled_objects_seven_objects|3_acc_stderr] - AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:tracking_shuffled_objects_seven_objects|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6473996Z assert 0.1 == 0.00301340183...0794 ± 3.0e-07
2024-11-29T07:43:37.6474078Z   comparison failed
2024-11-29T07:43:37.6474153Z   Obtained: 0.1
2024-11-29T07:43:37.6474289Z   Expected: 0.0030134018302560794 ± 3.0e-07
2024-11-29T07:43:37.6475087Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_lighteval|bigbench:tracking_shuffled_objects_three_objects|3_acc_stderr] - AssertionError: Model gpt2 on lite samples, for eval lighteval|bigbench:tracking_shuffled_objects_three_objects|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6475260Z assert 0.1633 == 0.00504952053...2955 ± 5.0e-07
2024-11-29T07:43:37.6475338Z   comparison failed
2024-11-29T07:43:37.6475419Z   Obtained: 0.1633
2024-11-29T07:43:37.6475548Z   Expected: 0.0050495205374032955 ± 5.0e-07
2024-11-29T07:43:37.6476334Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_harness|bigbench:tracking_shuffled_objects_three_objects|3_acc_stderr] - AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:tracking_shuffled_objects_three_objects|3, metric acc_stderr incorrect
2024-11-29T07:43:37.6476515Z assert 0.13333333333333333 == 0.00393258505632432 ± 3.9e-07
2024-11-29T07:43:37.6476598Z   comparison failed
2024-11-29T07:43:37.6476678Z   Obtained: 0.13333333333333333
2024-11-29T07:43:37.6476812Z   Expected: 0.00393258505632432 ± 3.9e-07
2024-11-29T07:43:37.6478003Z FAILED tests/test_main.py::test_model_prediction[gpt2_lite_harness|bigbench:tracking_shuffled_objects_three_objects|3_acc_norm_stderr] - AssertionError: Model gpt2 on lite samples, for eval harness|bigbench:tracking_shuffled_objects_three_objects|3, metric acc_norm_stderr incorrect
2024-11-29T07:43:37.6478222Z assert 0.15275252316519466 == 0.004569520819574033 ± 4.6e-07
2024-11-29T07:43:37.6478303Z   comparison failed
2024-11-29T07:43:37.6478528Z   Obtained: 0.15275252316519466
2024-11-29T07:43:37.6478670Z   Expected: 0.004569520819574033 ± 4.6e-07
2024-11-29T07:43:37.6478847Z ====== 90 failed, 482 passed, 4 skipped, 4 warnings in 2109.48s (0:35:09) ======
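
Side note on the numbers above: every failing case shows an obtained stderr in the 0.10–0.16 range against a hardcoded expectation around 0.003–0.005, i.e. a roughly constant factor of ~33 across tasks, which looks more like a systematic difference (e.g. a different sample count behind the hardcoded values) than bootstrap noise. For reference, a minimal sketch of bootstrapping over cached per-sample scores (the general idea of this PR, not lighteval's actual implementation — the function name, resample count and example scores below are illustrative only) lands in the obtained range for a handful of 0/1 accuracy scores:

```python
# Minimal sketch, not lighteval's code: bootstrap a stderr from precomputed
# per-sample scores instead of recomputing the metric for every resample.
import random
import statistics


def bootstrap_stderr(sample_scores, n_resamples=1000, seed=1234):
    """Standard deviation of the means of bootstrap resamples of the scores."""
    rng = random.Random(seed)
    n = len(sample_scores)
    resample_means = [
        statistics.fmean(rng.choices(sample_scores, k=n))
        for _ in range(n_resamples)
    ]
    return statistics.stdev(resample_means)


# Hypothetical example: ten cached 0/1 accuracy scores, as in a "lite" run.
scores = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
print(round(bootstrap_stderr(scores), 4))  # ~0.14, same order as the obtained values
```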