haesleinhuepf / human-eval-bia

Benchmarking Large Language Models for Bio-Image Analysis Code Generation
MIT License

grouping test cases and categorisation #112

Open pr4deepr opened 2 months ago

pr4deepr commented 2 months ago

Hi @haesleinhuepf, I was going through the preprint and one thought I had was grouping the test cases under categories. We get an overall view of how well LLMs perform, but lose the granularity on whether LLMs perform well on certain tasks and worse on others.

For example, with our test cases, perhaps the grouping could be something like:

etc.

It may give an idea of where we need more or less test cases as well. I remember you had a preprint on ontologies and standards for bioimage analysis. Perhaps that can be used as a reference.

Cheers Pradeep

haesleinhuepf commented 2 months ago

Hey @pr4deepr ,

great idea! This categorization is obviously a subjective task. We could automate it and make it more objective using an LLM, e.g. a text-classification LLM. Do you by chance have experience with this?

Cheers, Robert

pr4deepr commented 2 months ago

No, I do not. Would you put it through ChatGPT, for example, to get a first pass?

Here is what I got when I copied the readme from the main repo (which contains descriptions of the current test cases) into ChatGPT (GPT-4o) and used the following question:

> I have some python functions and each of them perform a specific operation in bioimage analysis. Classify them into categories based on their function and where they will fit in the image analysis pipeline.

1. Image Preprocessing: These functions focus on preparing the image by modifying its format, reducing noise, or performing other transformations to enhance analysis.

rgb_to_grey_image_transform, rotate_image_by_90_degrees, crop_quarter_image, transpose_image_axes, remove_noise_edge_preserving, subtract_background_tophat, mask_image, remove_labels_on_edges, remove_small_labels, expand_labels_without_overlap

2. Image Enhancement: Functions that enhance specific features of the image, such as sharpening edges or enhancing contrast.

detect_edges, convolve_images, deconvolve_image, remove_noise_edge_preserving

3. Segmentation: These functions segment images into meaningful regions or objects, usually through intensity-based methods.

apply_otsu_threshold_and_count_postiive_pixels, region_growing_segmentation, workflow_segmentation_counting, workflow_watershed_segmentation_correction_measurement, workflow_segmentation_measurement_summary, workflow_segment_measure_umap

4. Morphological Operations: These functions apply morphological transformations to binary images for cleaning or refining shapes.

binary_closing, binary_skeleton, label_binary_image_and_count_labels, label_sequentially, remove_labels_on_edges, remove_small_labels

5. Quantification and Measurement: Functions that quantify different properties of labeled regions or objects in the image.

convex_hull_measure_area, count_number_of_touching_neighbors, measure_aspect_ratio_of_regions, measure_intensity_of_labels, measure_intensity_over_time, measure_mean_image_intensity, measure_pixel_count_of_labels, measure_properties_of_regions, extract_surface_measure_area, measure_properties_of_regions, count_objects_over_time, count_overlapping_regions

6. Feature Extraction: Functions that extract geometrical or intensity features from the image.

fit_circle, radial_intensity_profile, convex_hull_measure_area, measure_aspect_ratio_of_regions

7. File I/O: Functions that handle loading, saving, or listing files.

list_image_files_in_folder, open_image_read_voxel_size, open_image_return_dimensions, open_nifti_image, open_zarr

8. Statistical Analysis: Functions for performing statistical analysis on datasets related to the images.

bland_altman, t_test, pair_wise_correlation_matrix, mean_std_column, mean_squared_error

9. Pipeline/Workflow Automation: These functions automate a sequence of steps, combining multiple image processing operations.

workflow_batch_process_folder_count_labels, workflow_batch_process_folder_measure_intensity, workflow_segment_measure_umap, workflow_segmentation_measurement_summary, workflow_watershed_segmentation_correction_measurement

10. Miscellaneous: Functions that don't fit directly into other categories but are useful for specific tasks.

return_hello_world

haesleinhuepf commented 2 months ago

Awesome! I couldn't have done better ☺️ Now what do we do with this? Shall we preserve it somewhere in the repository?

pr4deepr commented 2 months ago

I'm creating a branch with the new categories. The plan is to save this as a yaml file.

There are errors and repetitions above, so I need to clean it up as well.

https://github.com/pr4deepr/human-eval-bia/tree/function_categorize

haesleinhuepf commented 2 months ago

> repetitions above

Well, I think test-cases can be in multiple categories.

pr4deepr commented 2 months ago

Good point

pr4deepr commented 2 months ago

So, I've done an initial pass. [figure attached]

Interesting findings:

The categorisation of course is important and, if not done properly, can misrepresent the results.

The function categorisation can be found here: https://github.com/pr4deepr/human-eval-bia/blob/function_categorize/demo/create_function_category_yaml.ipynb which saves it as a yaml file.

I can create a separate notebook for the data processing and graphing, as it's currently here: https://github.com/pr4deepr/human-eval-bia/blob/function_categorize/demo/summarize_by_case.ipynb

pr4deepr commented 2 months ago

Happy to create a PR, but wasn't sure if it should be to main.

haesleinhuepf commented 2 months ago

Yes! I certainly need such a figure for talks, because showing the blue table for all test-cases doesn't fit on a slide. It could also be in the paper... Curious what @tischi says about this figure:

[figure attached]

What I'm a bit concerned about is the static list of categories in the other notebook. It could be a pain to maintain mid/long term. Would it be possible to put them in a dataframe, and add some code that warns if a test-case is in no category? Or, even better, code that uses GPT-4o to categorize test-cases that are in no category and then adds them to the dataframe?
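Something minimal like the sketch below might do for the warning part (the DataFrame layout, column names, and category assignments here are just assumptions for illustration):

```python
import warnings
import pandas as pd

# hypothetical table: one row per test case, 'categories' holds a list of tags
df = pd.DataFrame({
    "test_case": ["create_umap", "workflow_segment_measure_umap", "return_hello_world"],
    "categories": [["feature_extraction"], ["segmentation", "workflow_automation"], []],
})

# warn about every test case that has no category assigned
uncategorized = df[df["categories"].map(len) == 0]
for name in uncategorized["test_case"]:
    warnings.warn(f"Test case '{name}' is in no category")
```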

jkh1 commented 2 months ago

Regarding categorization, why not require some metadata tag to be present with each submitted test case? Maybe it's too late for this but we could add it to existing test cases as the numbers seem manageable. Doing it with an LLM would anyway require manual review of the outcome.

haesleinhuepf commented 2 months ago

As of yet, we don't have any infrastructure for handling metadata of test-cases. I was hoping to fully automate this, so that only minimal manual curation is necessary. In an earlier discussion, categorizing code by its complexity was also discussed. No matter how we do these things, I'd love to have a semi-automatic solution with minimal code/infrastructure to maintain.

pr4deepr commented 2 months ago

We can use the GPT-4o idea, but is there a way to set a seed or something similar to guarantee relatively similar responses? The categories change every time I ask... Or we just need to be really specific about the question we ask GPT.
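For reference, the OpenAI chat API has a `temperature` and a (beta) `seed` parameter that make responses more repeatable, though not fully deterministic; a rough sketch (model name and prompt wording are placeholders):

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

response = client.chat.completions.create(
    model="gpt-4o",       # placeholder model name
    temperature=0,        # reduce sampling randomness
    seed=42,              # best-effort reproducibility (beta feature)
    messages=[
        {"role": "system",
         "content": "Classify bioimage analysis functions into fixed categories."},
        {"role": "user",
         "content": "Classify the following functions: ..."},
    ],
)
print(response.choices[0].message.content)
```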

jkh1 commented 2 months ago

The tagging could be left to the author of the test case, given a choice of predefined categories. Then it should be a matter of reading the tags when compiling the results. If classification is automated with an LLM, the outcome is likely to change over time and with the LLM used. I think we would need a deterministic algorithm for this.

jkh1 commented 2 months ago

@pr4deepr Exactly what I thought likely :)

haesleinhuepf commented 2 months ago

Ok, I leave the decision about this to you guys. Whatever works :-)

pr4deepr commented 2 months ago

I'm happy with the solution from @jkh1, i.e., having a few tags and getting the author of new test cases to put those tags in their functions. We can have a few different tags for each category.

This could be a requirement when submitting a new test case.

For existing functions, perhaps @jkh1 and I could

Cheers Pradeep

haesleinhuepf commented 2 months ago

> put those tags in their functions

Can you give an example of how this could look?

pr4deepr commented 2 months ago

Either in the functions or in each notebook. I need to look at the code first. Will update it here.

pr4deepr commented 2 months ago

Upon looking at the code again, I think we'll want to minimize any modifications to existing test functions and to how yaml files are created for cases at this point.

I propose we have all the categorisation information in a yaml file with:

The categories can be:

  1. file_i_o
  2. image_preprocessing
  3. morphological_operations
  4. segmentation
  5. feature_extraction
  6. statistical_analysis
  7. workflow_automation
  8. other

example yaml file:

create_umap:
  - feature_extraction

workflow_segment_measure_umap:
  - segmentation
  - workflow_automation
  - feature_extraction
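As a rough illustration of how such a yaml could later be consumed when summarizing results per category (the file name, results table, and column names are hypothetical):

```python
import pandas as pd
import yaml

# load the {test_case: [category, ...]} mapping (file name is an assumption)
with open("function_categories.yaml") as f:
    categories = yaml.safe_load(f)

# hypothetical per-test-case results, e.g. fraction of passing LLM samples
results = pd.DataFrame({
    "test_case": list(categories),
    "pass_rate": [0.5] * len(categories),  # dummy values
})

# one row per (test case, category) pair, then average pass rate per category
results["category"] = results["test_case"].map(categories)
per_category = results.explode("category").groupby("category")["pass_rate"].mean()
print(per_category)
```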

I'm happy to go through existing test cases and create this yaml file.

When a test case PR is submitted, the yaml file will have to be modified to add the new function and category. The PR template will need to be modified.

If the need arises we can expand the categories, but I feel like this should cover it.

haesleinhuepf commented 2 months ago

Yes, great idea!

> The PR template will need to be modified.

We can also add some Python code which tests whether all test cases are in this yaml file, e.g. in create_cases.ipynb or as a GitHub workflow.
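A minimal sketch of such a check, assuming the test-case notebooks live in a test_cases/ folder and the yaml is called function_categories.yaml (both names are assumptions, not the repo's actual layout):

```python
from pathlib import Path
import yaml

def check_all_cases_categorized(cases_dir="test_cases",
                                yaml_path="function_categories.yaml"):
    """Fail loudly if a test-case notebook is missing from the category yaml."""
    with open(yaml_path) as f:
        categories = yaml.safe_load(f) or {}
    case_names = {p.stem for p in Path(cases_dir).glob("*.ipynb")}
    missing = sorted(case_names - set(categories))
    if missing:
        raise ValueError(f"Test cases missing from {yaml_path}: {missing}")

check_all_cases_categorized()
```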

jkh1 commented 2 months ago

Sounds good to me. My initial idea was to use notebook tags, but I realized this may be more complicated to get at.

pr4deepr commented 2 months ago

I've made the necessary changes with commit history here:

https://github.com/pr4deepr/human-eval-bia/commits/function_categorize/

If you are happy with this, I can open a PR. Let me know which branch you'd prefer.

haesleinhuepf commented 2 months ago

Awesome @pr4deepr , thanks for working on this! Yes, please send a PR!