haesleinhuepf / human-eval-bia

Benchmarking Large Language Models for Bio-Image Analysis Code Generation
MIT License

How to evaluate #17

Open tischi opened 5 months ago

tischi commented 5 months ago

Currently the adapted pass@k formula is applied.

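I wonder about the following alternative:

Try N times
Report:
C/N: fraction of passed tests
E/N: fraction of cases where the code did not execute to the end, because it threw an error
F/N: fraction of cases where the code executed, but the test did not pass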

How do you currently handle the E case?

In any case, I think it could be important and very interesting to explicitly separate and report E/N and F/N, because, arguably, F/N is worse than E/N.

tischi commented 5 months ago

@haesleinhuepf given your comments in the Google Doc, we could also consider reporting several metrics in the table, including pass@k, to serve different readerships.

haesleinhuepf commented 5 months ago

> Try N times
> Report:
> C/N: fraction of passed tests

We do have data for this. All prompts were executed 10 times and the responses are stored in the JSON files. In the paper we report the pass rate, i.e. the fraction of these 10 runs that produced functionally correct code. This is equal to pass@1.
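
For reference, a minimal sketch of why the pass rate over 10 runs equals pass@1. It uses the standard unbiased pass@k estimator from the original HumanEval benchmark, 1 - C(n-c, k)/C(n, k), which is not necessarily identical to the adapted formula used in this repository. The function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task: n samples drawn, c of them correct."""
    if n - c < k:
        # Fewer than k incorrect samples: every draw of k contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# With n = 10 samples of which c = 3 pass, pass@1 reduces to c / n.
rate = pass_at_k(n=10, c=3, k=1)
```

For k = 1 the estimator simplifies to 1 - (n - c)/n = c/n, which is exactly the fraction C/N discussed above.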

> E/N: fraction of cases where the code did not execute to the end, because it threw an error
> F/N: fraction of cases where the code executed, but the test did not pass

The data for this analysis is also available in the _results.jsonl files. Example from here:

{
  "task_id": "../test_cases/detect_edges.ipynb",
  "completion": "\nimport cv2\n\ndef detect_edges(image):\n    \"\"\"\n    Applies an edge-detection filter to an image using OpenCV's Canny edge detection method.\n    \"\"\"\n\n    # Convert the image to grayscale\n    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)\n\n    # Apply Canny edge detection\n    edges = cv2.Canny(gray, 50, 150)\n\n    return edges\n",
  "full_response": "```python\nimport cv2\n\ndef detect_edges(image):\n    \"\"\"\n    Applies an edge-detection filter to an image using OpenCV's Canny edge detection method.\n    \"\"\"\n\n    # Convert the image to grayscale\n    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)\n\n    # Apply Canny edge detection\n    edges = cv2.Canny(gray, 50, 150)\n\n    return edges\n```",
  "result": "failed: OpenCV(4.9.0) d:\\a\\opencv-python\\opencv-python\\opencv\\modules\\imgproc\\src\\color.simd_helpers.hpp:92: error: (-2:Unspecified error) in function '__cdecl cv::impl::`anonymous-namespace'::CvtHelper<struct cv::impl::`anonymous namespace'::Set<3,4,-1>,struct cv::impl::A0x59191d0d::Set<1,-1,-1>,struct cv::impl::A0x59191d0d::Set<0,2,5>,4>::CvtHelper(const class cv::_InputArray &,const class cv::_OutputArray &,int)'\n> Invalid number of channels in input image:\n>     'VScn::contains(scn)'\n> where\n>     'scn' is 1\n",
  "passed": false
}
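
A rough sketch of how the C/N, E/N and F/N fractions could be derived from records like the one above. It relies only on the `result` and `passed` fields shown in the example. How to tell an error (E) from a failed test (F) is an assumption here: a `result` of `failed:` with an empty message is treated as a failed assertion, and any other message (including a timeout) as an error. That heuristic should be checked against the evaluation harness before use. The names `classify` and `outcome_fractions` are hypothetical.

```python
import json
from collections import Counter


def classify(record: dict) -> str:
    """Return 'C' (passed), 'F' (ran, but test failed) or 'E' (raised an error)."""
    if record["passed"]:
        return "C"
    # ASSUMPTION: an empty message after "failed:" means a bare assertion failed;
    # any non-empty message is counted as an execution error.
    message = record["result"].removeprefix("failed:").strip()
    return "E" if message else "F"


def outcome_fractions(jsonl_lines) -> dict:
    """Compute C/N, E/N and F/N over an iterable of JSON lines."""
    counts = Counter(
        classify(json.loads(line)) for line in jsonl_lines if line.strip()
    )
    n = sum(counts.values())
    if n == 0:
        return {"C": 0.0, "E": 0.0, "F": 0.0}
    return {key: counts[key] / n for key in ("C", "E", "F")}
```

Under this heuristic the `detect_edges` record above would count as E, because its `result` carries an OpenCV error message.
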
tischi commented 5 months ago

OK, great, is there also already code that parses this JSON into the table that is shown in the paper?

I could try to modify the code to add more columns along the lines of what I suggested above, if you agree that this could be useful.

haesleinhuepf commented 5 months ago

> is there also already code that parses this JSON into the table that is shown in the paper?

Yes, under the table there is a link to the notebook that created the figure/table: https://github.com/haesleinhuepf/human-eval-bia/blob/main/demo/summarize_by_case.ipynb
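
As an illustration of the kind of per-case summary involved (this is a sketch, not the notebook's actual code), the pass rate per test case can be computed from the JSON lines using only the `task_id` and `passed` fields seen in the example record. The function name `pass_rate_by_task` is hypothetical.

```python
import json
from collections import defaultdict


def pass_rate_by_task(jsonl_lines) -> dict:
    """Map each task_id to the fraction of its samples that passed."""
    passed = defaultdict(int)
    total = defaultdict(int)
    for line in jsonl_lines:
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        total[record["task_id"]] += 1
        passed[record["task_id"]] += bool(record["passed"])
    return {task: passed[task] / total[task] for task in total}
```

In practice the lines would come from iterating over an open `_results.jsonl` file, and further columns such as E/N and F/N could be added in the same loop.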

tischi commented 5 months ago

I installed everything, but executing

> jupyter notebook

gives me:

Jupyter command `jupyter-notebook` not found.

Do you know what I may be doing wrong?

haesleinhuepf commented 5 months ago

Use jupyter lab instead