tischi opened this issue 4 months ago
Yes, but the difficulty seems to be that different models react differently to different prompts. I think a way around this is to come up with 3-4 different ways of asking for the same task, and then to see whether one of them more consistently produces the desired output. For example, I found that asking for a particular library sometimes helps, but sometimes the model then just invokes a non-existent function.
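To make that concrete, here is a sketch of what a small set of phrasings for the same task could look like (the wordings are taken from, or close to, the ones discussed in this thread; this is only an illustration, not an existing test case):

# Sketch: several phrasings of the same task, to check which one
# most consistently yields a solution that passes the test.
prompt_variants = [
    "Implement a python function that opens an image from a path and returns its dimensions.",
    "Write a python function that takes an image path (string) as an input, opens the image "
    "and returns the image dimensions (array).",
    "Write a python function that returns the dimensionality of an image in an array "
    "with two elements: width and height.",
    "Using scikit-image, write a python function that loads an image from a path "
    "and returns its width and height.",
]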
Here is what I got when asking less specifically:
Q:
Hi, please implement a python function that opens an image from a path and returns its dimensions.
^D
A:
Certainly! You can use the Pillow library in Python to open an image and get its dimensions. Here’s a function that does just that:
from PIL import Image

def get_image_dimensions(image_path):
    """
    Open an image from the given path and return its dimensions (width, height).

    Args:
        image_path (str): The path to the image file.

    Returns:
        tuple: A tuple containing the width and height of the image.
    """
    with Image.open(image_path) as img:
        width, height = img.size
        return width, height
I think the answer is totally sensible, but it will fail our test, which I think is not fair. For instance, I think it would be fair to be explicit about the input and output data types, because the model cannot know them.
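To illustrate why an otherwise sensible answer can fail, here is a minimal sketch, assuming the test compares the result against a list [width, height] (the actual benchmark test may well look different):

# Hypothetical sketch, not the real benchmark test:
# assume the reference solution returns a list [width, height].
expected = [640, 480]

# The Pillow-based answer above returns a tuple (width, height).
returned = (640, 480)

# In Python, a tuple never compares equal to a list, so a plain
# equality check fails even though the values are the same.
print(returned == expected)  # False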
Here is a modified prompt that does not ask to use skimage but should still pass our test:
Q:
Write a python function that takes an image path (string) as an input, opens the image and returns the image dimensions (array).
^D
A:
Certainly! To achieve this, you can use the Pillow library in Python, which is a fork of the Python Imaging Library (PIL). If you don't have it installed, you can install it using pip:
pip install pillow
Here's a function that takes an image path as input, opens the image, and returns its dimensions:
from PIL import Image

def get_image_dimensions(image_path):
    """
    Opens an image and returns its dimensions.

    Parameters:
        image_path (str): The path to the image file.

    Returns:
        list: A list containing the width and height of the image.
    """
    with Image.open(image_path) as img:
        width, height = img.size
        return [width, height]

# Example usage:
image_path = 'path/to/your/image.jpg'
dimensions = get_image_dimensions(image_path)
print(dimensions)
This function uses the Image.open method from the Pillow library to open the image. It then retrieves the dimensions using the size attribute, which returns a tuple containing the width and height. Finally, it returns the dimensions as a list.
Great discussion over here! I was also thinking about this in the context of #76: we could run some experiments on the tasks that fail most often, modify the prompt, and see whether all LLMs improve. Such a modification would be fair imo...
Along those lines: I don't think the way we generated the prompts follows best practice. They were written by different people, without much instruction and without much review. Thus they could be quite inconsistent in how much they guide the LLMs towards the correct solution.
As mentioned above, I think we should be (have to be?) explicit about the exact input and output parameters: how many, which order, and which datatypes. Because if that's wrong, the test will fail, won't it?
What do you think?
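As one possible way of being explicit (purely illustrative, not the current wording of any task; the [width, height] order is just an assumption for the example), the task description could pin down the number, order and types of the parameters:

# Illustrative only: a task stub whose docstring is explicit about inputs and outputs.
def open_image_return_dimensions(image_path: str) -> list:
    """
    Takes exactly one argument, image_path (str): the path to an image file.
    Opens the image and returns its dimensions as a list of two integers,
    in this order: [width, height].
    """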
Hi, apparently, this task is difficult for most LLMs.
However, using gpt-4o this works for me out of the box: [snip]
Maybe the way we ask the LLMs to perform a task sometimes is "unfairly" vague?
I agree - the prompt, inputs, and outputs need to be more rigorously defined. I'm coding some functionality to introspect the model results. It should make it a lot easier for us to pick out why models are failing/passing.
For example, see if you can spot what happened for the 'open_image_return_dimensions' task, haha. (model here is gemini-1.5)
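Just to sketch what such introspection could look like (the file name and columns here are made up, not the actual output format of the benchmark):

import pandas as pd

# Hypothetical results table with one row per (task, model) attempt
# and a boolean "passed" column; file name and columns are assumptions.
results = pd.read_csv("benchmark_results.csv")

# Pass rate per task and model, to spot tasks that fail across most models.
pass_rates = results.groupby(["task", "model"])["passed"].mean().unstack("model")
print(pass_rates.sort_index())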
> As mentioned above, I think we should be (have to be?) explicit about the exact input and output parameters: how many, which order, and which datatypes. Because if that's wrong, the test will fail, won't it?
> What do you think?
As long as this is short text, I agree. For example, we could ask "provide the dimensionality of an image in an array with two elements: width and height"
In the original HumanEval, they tested super long prompts, which I think are not practical in real-world scenarios. I would not write 30 lines of prompt to ask for a function that has 3 lines of code.
> As long as this is short text, I agree. For example, we could ask "provide the dimensionality of an image in an array with two elements: width and height"
How could we practically proceed with this? Divide the current tasks among us "developers" and have each of us fix the prompts in the tasks assigned to them?
This would still lead to a slightly inconsistent way of doing this, as we all might have slightly different styles, but maybe that's OK?
Dividing efforts would be great. We can also minimize style differences by peer-reviewing. Also, as guidance, this is what we wrote in the paper:
We kept the docstring intentionally brief and natural, because we intend to use LLMs to facilitate coding for bio-image analysts and this would better reflect a typical use-case.
Regarding "for bio-image analysts": I think that means people who are reasonably educated in coding.
Hi @haesleinhuepf,
since there is now a growing number of people contributing, maybe it would be a good idea to organise a Zoom call with everyone interested, to discuss whether we want some standardisation of how to specify input and output parameters in the prompts? What do you think?
(personally, I am OOO until the beginning of September)
Great idea! I'm for standardization. Before scheduling a meeting, I propose:
I'm happy to work on this, but not before August 26th. If anyone wants to dive into this before, go ahead and feel free to share your thoughts! :-)