Improve how we ask the LLMs to perform a task

tischi commented 4 months ago

Hi, apparently, this task is difficult for most LLMs.

However, using gpt-4o this works for me out of the box:

Q:
Hi, using skimage, please implement a python function that opens an image from a path and returns its dimensions.
^D

A:

  Sure! Below is a Python function using  skimage  that opens an image from a given path and returns its
  dimensions.

    from skimage import io

    def get_image_dimensions(image_path):
        # Read the image from the given path
        image = io.imread(image_path)
        # Get the dimensions of the image
        dimensions = image.shape
        return dimensions

Maybe the way we ask the LLMs to perform a task sometimes is "unfairly" vague?

jkh1 commented 4 months ago

Yes but the difficulty seems that different models react differently to different prompts. I think a way around this is to come up with 3-4 different ways of asking for the same task and maybe we can see if there's one way that more consistently produces the desired output. For example I found that asking for using a particular library sometimes helps but sometimes the model just invokes a non-existent function.

tischi commented 4 months ago

Here is what I got when asking less specific:

Q:
Hi, please implement a python function that opens an image from a path and returns its dimensions.
^D

A:

  Certainly! You can use the  Pillow  library in Python to open an image and get its dimensions. Here’s a
  function that does just that:

    from PIL import Image

    def get_image_dimensions(image_path):
        """
        Open an image from the given path and return its dimensions (width, height).

        Args:
        image_path (str): The path to the image file.

        Returns:
        tuple: A tuple containing the width and height of the image.
        """
        with Image.open(image_path) as img:
            width, height = img.size
        return width, height

I think the answer is totally sensible, but will fail our test, which I think is not fair.

tischi commented 4 months ago

I think for instance that it would be fair to be explicit about the input and output data types, because it cannot know this.

Here is a modified prompt that does not ask use skimage but should stil pass our test:

Q:
Write a python function that takes an image path (string) as an input, opens the image and returns the image dimensions (array).
^D

A:

  Certainly! To achieve this, you can use the  Pillow  library in Python, which is a fork of the Python
  Imaging Library (PIL). If you don't have it installed, you can install it using pip:

    pip install pillow

  Here's a function that takes an image path as input, opens the image, and returns its dimensions:

    from PIL import Image

    def get_image_dimensions(image_path):
        """
        Opens an image and returns its dimensions.

        Parameters:
            image_path (str): The path to the image file.

        Returns:
            list: A list containing the width and height of the image.
        """
        with Image.open(image_path) as img:
            width, height = img.size
        return [width, height]

    # Example usage:
    image_path = 'path/to/your/image.jpg'
    dimensions = get_image_dimensions(image_path)
    print(dimensions)

  This function uses the  Image.open  method from the  Pillow  library to open the image. It then retrieves the dimensions using the  size  attribute, which returns a tuple containing the width and height. Finally, it returns the dimensions as a list.

haesleinhuepf commented 4 months ago

Great discussion over here! I was also thinking in the context of #76: we might run some tests on the most failing tasks, to modify the prompt and see if all LLMs improve. Such a modification would be fair imo...

tischi commented 4 months ago

Along those lines: I don't think how we generated the prompts is best practice. They were written by different people without much instructions and without much review. Thus they could be very inconsistent in much they guy the LLMs towards the correct solution.

As mentioned above, I think we should be (have to be?) explicit about the exact input and output parameters: how many, which order, and which datatypes. Because if that's wrong the test will fail, is it?

What do you think?

ian-coccimiglio commented 4 months ago

Hi, apparently, this task is difficult for most LLMs.

However, using gpt-4o this works for me out of the box: [snip]

Maybe the way we ask the LLMs to perform a task sometimes is "unfairly" vague?

I agree - the prompt, inputs, and outputs need to be more rigorously defined. I'm coding some functionality to introspect the model results. It should make it a lot easier for us to be able pick out why models are failing/passing.

For example, see if you can spot what happened for the 'open_image_return_dimensions' task, haha. (model here is gemini-1.5)

haesleinhuepf commented 4 months ago

As mentioned above, I think we should be (have to be?) explicit about the exact input and output parameters: how many, which order, and which datatypes. Because if that's wrong the test will fail, is it?

What do you think?

As long this is short text, I agree. For example, we could ask "provide the dimensionality of an image in an array with two elements: width and height"

In the original HumanEval, they tested super long prompts, which I think are not practical in real-world scenarios. I would not write 30 lines of prompt to ask for a function that has 3 lines of code.

tischi commented 4 months ago

As long this is short text, I agree. For example, we could ask "provide the dimensionality of an image in an array with two elements: width and height"

How could we practically proceed with this? Divide up the current tasks by the number of "developers"="us" and assign "us" to fix this in the ones assigned to us?

This would still lead to a slightly inconsistent way of doing this, as we all might have slightly different styles, but maybe that's OK?

haesleinhuepf commented 4 months ago

Dividing efforts would be great. We can also minimize different style by peer-reviewing. Also, as guidance, that's what we wrote in the paper:

We kept the docstring intentionally brief and natural, because we intend to use LLMs to facilitate coding for bio-image analysts and this would better reflect a typical use-case.

tischi commented 4 months ago

for bio-image analysts

I think that means people that are reasonably educated in coding.

tischi commented 3 months ago

Hi @haesleinhuepf,

since there is a now a growing number of people contributing, maybe organising a zoom call with all interested to discuss whether we want some standardisation on how to specify input and output parameters in the prompt could be a good idea? What do you think?

(personally I am ooo until beginning of September)

haesleinhuepf commented 3 months ago

Great idea! I'm for standardization. Before scheduling a meeting, I propose:

We go through the list of prompts and
compare it to the visualization pass-rates and
identify the prompts 10 which are failing most often as discussed in #76 . We may see a pattern at this point.
Then, we can make a proposal how to modify these prompts (and potentially others), and
formulate a standard.

I'm happy to work on this, but not before August 26th. If anyone wants to dive into this before, go ahead and feel free to share your thoughts! :-)

haesleinhuepf / human-eval-bia

Improve how we ask the LLMs to perform a task #79