landing-ai / vision-agent


About vision agent always running #167

Closed. hmppt closed this issue 1 month ago.

hmppt commented 2 months ago

I looked at the code and thought that if it still failed after three attempts, the process would be terminated. However, I ran several samples and none of them seemed to complete. After three debugging attempts, it still reinitializes the code and continues execution.

INFO:vision_agent.agent.vision_agent:Start debugging attempt 3
WARNING:traitlets:Could not destroy zmq context for <jupyter_client.asynchronous.client.AsyncKernelClient object at 0x321d1f580>
Code and test after attempted fix:
============================== Code ==============================
1 from typing import
2 from pillow_heif import register_heif_opener
3 register_heif_opener()
4 import vision_agent as va
5 from vision_agent.tools import register_tool
6
7 from typing import

8 from pillow_heif import register_heif_opener
9 register_heif_opener()
10 import vision_agent as va
11 from vision_agent.tools import register_tool
12 # The fixed code is provided above.
13 # The fixed test code is provided above.
============================== Test ==============================
1 The fixed test code is provided above.
INFO:vision_agent.agent.vision_agent:Reflection: The error was due to invalid Python syntax. It seems like the lines causing the error were meant to be comments or placeholders, but they were not marked as comments. The fix was to comment out these lines.
Code execution result after attempted fix:
----- stdout -----

----- stderr -----

----- Error -----
Traceback (most recent call last):
  File "/opt/miniconda3/envs/ben/lib/python3.10/site-packages/vision_agent/utils/execute.py", line 573, in exec_cell
    self.nb_client.execute_cell(cell, len(self.nb.cells) - 1)
  File "/opt/miniconda3/envs/ben/lib/python3.10/site-packages/jupyter_core/utils/init.py", line 165, in wrapped
    return loop.run_until_complete(inner)
  File "/opt/miniconda3/envs/ben/lib/python3.10/asyncio/base_events.py", line 641, in run_until_complete
    return future.result()
  File "/opt/miniconda3/envs/ben/lib/python3.10/site-packages/nbclient/client.py", line 1062, in async_execute_cell
    await self._check_raise_for_error(cell, cell_index, exec_reply)
  File "/opt/miniconda3/envs/ben/lib/python3.10/site-packages/nbclient/client.py", line 918, in _check_raise_for_error
    raise CellExecutionError.from_cell_and_msg(cell, exec_reply_content)
nbclient.exceptions.CellExecutionError: An error occurred while executing the following cell:

from typing import
from pillow_heif import register_heif_opener
register_heif_opener()
import vision_agent as va
from vision_agent.tools import register_tool

from typing import
from pillow_heif import register_heif_opener
register_heif_opener()
import vision_agent as va
from vision_agent.tools import register_tool

The fixed code is provided above.

The fixed test code is provided above.

The fixed test code is provided above.

Cell In[1], line 13
    The fixed test code is provided above.
        ^
SyntaxError: invalid syntax. Perhaps you forgot a comma?

Final code and tests:
============================== Code ==============================
1 from typing import
2 from pillow_heif import register_heif_opener
3 register_heif_opener()
4 import vision_agent as va
5 from vision_agent.tools import register_tool
6
7 from typing import

8 from pillow_heif import register_heif_opener
9 register_heif_opener()
10 import vision_agent as va
11 from vision_agent.tools import register_tool
12 # The fixed code is provided above.
13 # The fixed test code is provided above.
============================== Test ==============================
1 The fixed test code is provided above.
INFO:vision_agent.agent.vision_agent: instructions:
- Use the Dalle3_text2img tool to generate 20 images with the prompt 'Belle'.
- Use the save_image tool to save each generated image to the specified directory '/Users/feifan/benchmark/first_test'.
INFO:vision_agent.agent.vision_agent: instructions:
- Use the Dalle3_prompt_gen tool to generate prompts related to 'Belle'.
- Use the Dalle3_text2img tool to generate 20 images using the generated prompts.
- Use the save_image tool to save each generated image to the specified directory '/Users/feifan/benchmark/first_test'.
INFO:vision_agent.agent.vision_agent: instructions:
- Use the Dalle3_text2img tool to generate images with the prompt 'Belle'.
- Use the vit_image_classification tool to classify each generated image.
- If the top label of the classification result is 'Belle', use the save_image tool to save the image to the specified directory '/Users/feifan/benchmark/first_test'. Repeat the process until 20 images are saved.
INFO:vision_agent.agent.vision_agent:Tools Description:
'vit_image_classification' is a tool that can classify an image. It returns a list of classes and their probability scores based on image content.
'Dalle3_text2img' is a tool for generating images from text prompts using a Dalle3 model. This function allows the user to specify prompts, providing finer control over the generation process.
'clip' is a tool that can classify an image or a cropped detection given a list of input classes or tags. It returns the same list of the input classes along with their probability scores based on image content.
'save_video' is a utility function that saves a list of frames as a mp4 video file on disk.
'Dalle3_prompt_gen' is a tool for writing prompt for Dalle model to generate diffusion image. This function imagine relevant scenes or objects and returns a list of words which are visually specific concepts and vibes.
'save_image' is a utility function that saves an image to a file path.
'load_image' is a utility function that loads an image from the given file path string.
WARNING:traitlets:Could not destroy zmq context for <jupyter_client.asynchronous.client.AsyncKernelClient object at 0x321d1f4c0>
Initial code and tests:
============================== Code ==============================
1 from typing import *
2 from pillow_heif import register_heif_opener
3 register_heif_opener()
4 import vision_agent as va
5 from vision_agent.tools import register_tool
6
7
8 from vision_agent.tools import Dalle3_text2img, save_image, Dalle3_prompt_gen, vit_image_classification
9 import os
10
11 # Plan 1
12 output_dict = {}
13 for i in range(20):
14 image_dict = Dalle3_text2img('Belle')
15 image_url = image_dict['img_url']
16 image = load_image(image_url)
17 save_image(image, os.path.join('/Users/feifan/benchmark/firsttest', f'image{i}.png'))
18 output_dict['Plan1'] = 'Images saved successfully'
19
20 # Plan 2
21 prompts = Dalle3_prompt_gen('Belle')
22 for i, prompt in enumerate(prompts):
23 image_dict = Dalle3_text2img(prompt['prompt'])
24 image_url = image_dict['img_url']
25 image = load_image(image_url)
26 save_image(image, os.path.join('/Users/feifan/benchmark/firsttest', f'image{i}.png'))
27 output_dict['Plan2'] = 'Images saved successfully'
28
29 # Plan 3
30 saved_images = 0
31 while saved_images < 20:
32 image_dict = Dalle3_text2img('Belle')
33 image_url = image_dict['img_url']
34 image = load_image(image_url)
35 classification = vit_image_classification(image)
36 if classification['labels'][0] == 'Belle':
37 save_image(image, os.path.join('/Users/feifan/benchmark/firsttest', f'image{saved_images}.png'))
38 saved_images += 1
39 output_dict['Plan3'] = 'Images saved successfully'
40
41 print(output_dict)
42
INFO:vision_agent.agent.vision_agent:Initial code execution result:
----- stdout -----

----- stderr -----

----- Error -----
Traceback (most recent call last):
  File "/opt/miniconda3/envs/ben/lib/python3.10/site-packages/vision_agent/utils/execute.py", line 573, in exec_cell
    self.nb_client.execute_cell(cell, len(self.nb.cells) - 1)
  File "/opt/miniconda3/envs/ben/lib/python3.10/site-packages/jupyter_core/utils/init.py", line 165, in wrapped
    return loop.run_until_complete(inner)
  File "/opt/miniconda3/envs/ben/lib/python3.10/asyncio/base_events.py", line 641, in run_until_complete
    return future.result()
  File "/opt/miniconda3/envs/ben/lib/python3.10/site-packages/nbclient/client.py", line 1062, in async_execute_cell
    await self._check_raise_for_error(cell, cell_index, exec_reply)
  File "/opt/miniconda3/envs/ben/lib/python3.10/site-packages/nbclient/client.py", line 918, in _check_raise_for_error
    raise CellExecutionError.from_cell_and_msg(cell, exec_reply_content)
nbclient.exceptions.CellExecutionError: An error occurred while executing the following cell:

from typing import *
from pillow_heif import register_heif_opener
register_heif_opener()
import vision_agent as va
from vision_agent.tools import register_tool

from vision_agent.tools import Dalle3_text2img, save_image, Dalle3_prompt_gen, vit_image_classification
import os

# Plan 1

output_dict = {}
for i in range(20):
    image_dict = Dalle3_text2img('Belle')
    image_url = image_dict['img_url']
    image = load_image(image_url)
    save_image(image, os.path.join('/Users/feifan/benchmark/firsttest', f'image{i}.png'))
output_dict['Plan1'] = 'Images saved successfully'

# Plan 2

prompts = Dalle3_prompt_gen('Belle')
for i, prompt in enumerate(prompts):
    image_dict = Dalle3_text2img(prompt['prompt'])
    image_url = image_dict['img_url']
    image = load_image(image_url)
    save_image(image, os.path.join('/Users/feifan/benchmark/firsttest', f'image{i}.png'))
output_dict['Plan2'] = 'Images saved successfully'

# Plan 3

saved_images = 0
while saved_images < 20:
    image_dict = Dalle3_text2img('Belle')
    image_url = image_dict['img_url']
    image = load_image(image_url)
    classification = vit_image_classification(image)
    if classification['labels'][0] == 'Belle':
        save_image(image, os.path.join('/Users/feifan/benchmark/firsttest', f'image{saved_images}.png'))
        saved_images += 1
output_dict['Plan3'] = 'Images saved successfully'

print(output_dict)


----- stdout -----
Generated image URL: https://sampool-bucket.cn-shanghai.oss.aliyun-inc.com/aispace/llm/dall-e-output/0ba4998be0b6987bb3956b62d4f16b83.png?Expires=2036147759&OSSAccessKeyId=LTAI5tKQQgDsNWULhifdJCPC&Signature=JRzRWpKLrPpFusEUUzTROLmZ08E%3D


NameError                                 Traceback (most recent call last)
Cell In[1], line 16
     14 image_dict = Dalle3_text2img('Belle')
     15 image_url = image_dict['img_url']
---> 16 image = load_image(image_url)
     17 save_image(image, os.path.join('/Users/feifan/benchmark/firsttest', f'image{i}.png'))
     18 output_dict['Plan1'] = 'Images saved successfully'

NameError: name 'load_image' is not defined

WARNING:traitlets:Could not destroy zmq context for <jupyter_client.asynchronous.client.AsyncKernelClient object at 0x321f94940>
WARNING:traitlets:Could not destroy zmq context for <jupyter_client.asynchronous.client.AsyncKernelClient object at 0x321d1df60>

dillonalaird commented 2 months ago

Thanks for testing this out! I see you're also using a custom tool. Is the issue here that it keeps re-trying but never exits? We have a couple areas where it retries right now:

  1. the main outer loop https://github.com/landing-ai/vision-agent/blob/main/vision_agent/agent/vision_agent.py#L679
  2. debugging the test plans https://github.com/landing-ai/vision-agent/blob/main/vision_agent/agent/vision_agent.py#L201
  3. debugging the main code/tests https://github.com/landing-ai/vision-agent/blob/main/vision_agent/agent/vision_agent.py#L354

If you could share your prompt and custom tools I can run it on my end and try to reproduce. If you are uncomfortable sharing here you can also reach out to me on Discord and share privately https://discord.gg/RVcW3j9RgR

hmppt commented 2 months ago

Yes, I think the problem is what you described.

Of course. My prompt is: Generate 20 pictures of beautiful young woman and save them in /path/to/save/image. I added three custom tools to the originally defined tool list:

def Dalle2_text2img(
    prompt: str,
)-> Dict[str, Any]:
    """'Dalle2_text2img' is a tool for generating images from text prompts using a  
    Dalle2 model. This function allows the user to specify prompts, providing finer control over the generation process. 
    Parameters:
        prompt (str): The text prompt to generate the image from. This is the main description of the desired image.
    Returns:
        Dict[str, Any]: A dictionary containing the generated image url.
    Example
    -------
        >>> Dalle2_text2img(prompt="a photo of young girl")
         {'imgfile': 'https://oaidalleapiprodscus.blob.core.windows.net/private/org-TeSt22FBWEC7NQLwRraLiXm8/user-3LIIFxvhKPI8jsc4MSblg3KD/img-5vStBsnLxTSiOcEr35JkePBb.png?st=2024-07-03T08%3A34%3A29Z&se=2024-07-03T10%3A34%3A29Z&sp=r&sv=2023-11-03&sr=b&rscd=inline&rsct=image/png&skoid=6aaadede-4fb3-4698-a8f6-684d7786b067&sktid=a48cca56-e6da-484e-a814-9c849652bcb3&skt=2024-07-03T02%3A51%3A34Z&ske=2024-07-04T02%3A51%3A34Z&sks=b&skv=2023-11-03&sig=DMyFFUQEASKKPDaTKEOqoW5M8ygx6OOhiThmjW/Fz84%3D'}
    """
    url = "https://api.openai.com/v1/images/generations"
    headers = {
        "Authorization": _OPENAI_API_KEY,
        "Content-Type": "application/json"
    }
    data = {
        "prompt": prompt,
        "n": 1,  # Number of images to generate
        "size": "256x256"  # Size of the generated image
    }
    response = requests.post(url, headers=headers, data=json.dumps(data))

    if response.status_code == 200:
        image_url = response.json()["data"][0]["url"]
        print(f"Generated image URL: {image_url}")
    else:
        print(f"Failed to generate image. Status code: {response.status_code}")
        print(response.json())

    outdata = {
        "imgfile": image_url
    }
    return outdata
def Dalle2_prompt_gen(
    text: str
)-> Dict[str, Any]:
    """'Dalle2_prompt_gen' is a tool for writing prompt for Dalle model to generate diffusion image. This function imagine relevant scenes or objects and returns a list of words which are visually specific concepts and vibes. 
    Parameters:
        text (str): The input text
    Returns:
        List[Dict[str, Any]]: A list of dictionaries containing the generated prompt for diffusion model Dalle.
    Example
    -------
        >>> Dalle2_prompt_gen(text="beautiful young woman")
        [
            {'prompt': 'a photo of bella'},
            {'prompt': 'a photo of young woman with yellow hat'},
        ]    
    """
    client = OpenAI(
    # This is the default and can be omitted
    api_key=os.environ.get("OPENAI_API_KEY"),
    )

    chat_completion = client.chat.completions.create(
        messages=[
            {"role": "system", "content": "You are a Diffusion model prompt generator. You onlu output English. You should imagine relevant scenes or objects and returns a list of words which are visually specific concepts and vibes.  For example, if the input prompt is Plants, output a python list of length 2 as follows: ['a photo of a tree', 'a photo of grass'] "},
            {"role": "user", "content": text}
        ],
        model="gpt-3.5-turbo",
    )

    return_data=[]
    promptlist = ast.literal_eval(chat_completion.choices[0].message.content)
    for i in range(len(promptlist)):
        return_data.append(
            {
                "prompt": promptlist[i]
            }
        )
    return return_data
def Dalle2_imgvaria(
    imgurl: str,
)-> Dict[str, Any]:
    """'Dalle2_imgvaria' is a tool for generating variated images from an input image using a  
    Dalle2 model. This function allows the user to reimagin more different but senmantic similar images. 
    Parameters:
        imgurl (str): The input image from. This is the main reference of the desired image.
    Returns:
        Dict[str, Any]: A dictionary containing the generated image filename.
    Example
    -------
        >>> Dalle2_imgvaria(imgurl="https://xxxx/e02210d7-5ce3-4230-b4a3-918066d1c6fc_20231028005328.jpg")
         {'imgfile': './tmp.jpeg'}
    """
    url = "https://api.openai.com/v1/images/generations"
    headers = {
        "Authorization": _OPENAI_API_KEY,
        "Content-Type": "application/json"
    }

    with open(image_path, "rb") as image_file:
        image_data = image_file.read()

    # If necessary, resize and convert the image
    image = Image.open(io.BytesIO(image_data))
    image = image.resize((256, 256))  # Resize to 1024x1024 if required
    buffered = io.BytesIO()
    image.save(buffered, format="PNG")
    image_data = buffered.getvalue()

    multipart_data = {
        "image": ("image.png", image_data, "image/png")
    }

    response = requests.post(url, headers=headers, files=multipart_data)

    if response.status_code == 200:
        image_url = response.json()["data"][0]["url"]
        print(f"Generated image variation URL: {image_url}")
    else:
        print(f"Failed to generate image variation. Status code: {response.status_code}")
        print(response.json())

    outdata = {
        "imgfile": image_url
    }
    return outdata

The above three custom tools need to be added to the tool list, and _OPENAI_API_KEY needs to be set.

When I run my prompt, I expected it to exit after three unsuccessful attempts to fix the code, but instead, after three fixes, it starts regenerating the plan and modifying the code again. Could you explain what the default process is, including how many times it chooses a plan and how many times it fixes code?

dillonalaird commented 2 months ago

Got it, thanks for the explanation and context. Here are the retries it currently does:

for a maximum of 3 retries, do the following:
    create 3 sets of plans
    for a maximum of 3 retries, create and execute code to test the tools for those plans
    pick the best plan
    for a maximum of 3 retries, create and execute code and tests for the best plan
    if success:
        return code
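
The nested retries above can be sketched as one reusable loop (a minimal illustration with hypothetical `write_code`/`execute` callables, not the actual vision-agent internals):

```python
def run_with_retries(write_code, execute, max_retries=3):
    """Regenerate and run code until it succeeds or retries are exhausted.

    write_code(feedback) returns a new code attempt, given the previous
    failure message (None on the first try); execute(code) returns a
    (success, feedback) pair. Names here are illustrative stand-ins.
    """
    feedback = None
    for _ in range(max_retries):
        code = write_code(feedback)   # feedback from the last failed run
        ok, feedback = execute(code)
        if ok:
            return code
    return None                       # all attempts failed
```

The same pattern is instantiated at each level of the structure above: around testing the plans, around writing the final code and tests, and once more as the outer loop.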

However, the outer loop was originally there so that it could reflect on failed output and retry again. I will remove it in a PR since I think it's more likely to cause confusion now (the reflection didn't work that well anyway).

To get your code running I modified a few things:

  1. I included the necessary imports in the register_tool call.
  2. For global variables used in tools, like _OPENAI_API_KEY, I moved them into the function (the registered tool code can't see module-level globals, but it can read environment variables).
  3. I modified the prompt a little to generate N pictures but test with only 2. This makes testing a lot faster; you can save the code and run it later with a higher N once you know it works.

On this PR https://github.com/landing-ai/vision-agent/pull/169 I've removed the outer loop and also increased the number of error lines sent to the debugger so it doesn't get stuck as easily. Here's the modified code I ran:

import ast
import io
import json
import os
from typing import Any, Dict, List

import requests
from openai import OpenAI
from PIL import Image

import vision_agent as va

@va.tools.register_tool(imports=["import os", "import requests", "import json"])
def Dalle2_text2img(
    prompt: str,
) -> Dict[str, Any]:
    """'Dalle2_text2img' is a tool for generating images from text prompts using a
    Dalle2 model. This function allows the user to specify prompts, providing finer
    control over the generation process.

    Parameters:
        prompt (str): The text prompt to generate the image from. This is the main
            description of the desired image.
    Returns:
        Dict[str, Any]: A dictionary containing the generated image url.
    Example
    -------
        >>> Dalle2_text2img(prompt="a photo of young girl")
         {'imgfile': 'https://oaidalleapiprodscus.blob.core.windows.net/private/org-TeSt22FBWEC7NQLwRraLiXm8/user-3LIIFxvhKPI8jsc4MSblg3KD/img-5vStBsnLxTSiOcEr35JkePBb.png?st=2024-07-03T08%3A34%3A29Z&se=2024-07-03T10%3A34%3A29Z&sp=r&sv=2023-11-03&sr=b&rscd=inline&rsct=image/png&skoid=6aaadede-4fb3-4698-a8f6-684d7786b067&sktid=a48cca56-e6da-484e-a814-9c849652bcb3&skt=2024-07-03T02%3A51%3A34Z&ske=2024-07-04T02%3A51%3A34Z&sks=b&skv=2023-11-03&sig=DMyFFUQEASKKPDaTKEOqoW5M8ygx6OOhiThmjW/Fz84%3D'}
    """
    url = "https://api.openai.com/v1/images/generations"
    headers = {
        "Authorization": f"Bearer {os.environ.get('OPENAI_API_KEY')}",
        "Content-Type": "application/json",
    }
    data = {
        "prompt": prompt,
        "n": 1,  # Number of images to generate
        "size": "256x256",  # Size of the generated image
    }
    response = requests.post(url, headers=headers, data=json.dumps(data))

    image_url = None
    if response.status_code == 200:
        image_url = response.json()["data"][0]["url"]
        print(f"Generated image URL: {image_url}")
    else:
        print(f"Failed to generate image. Status code: {response.status_code}")
        print(response.json())

    outdata = {"imgfile": image_url}
    return outdata

@va.tools.register_tool(
    imports=["import os", "from openai import OpenAI", "import ast"]
)
def Dalle2_prompt_gen(text: str) -> List[Dict[str, Any]]:
    """'Dalle2_prompt_gen' is a tool for writing prompt for Dalle model to generate
    diffusion image. This function imagine relevant scenes or objects and returns a list
    of words which are visually specific concepts and vibes.

    Parameters:
        text (str): The input text
    Returns:
        List[Dict[str, Any]]: A list of dictionaries containing the generated prompt
            for diffusion model Dalle.
    Example
    -------
        >>> Dalle2_prompt_gen(text="beautiful young woman")
        [
            {'prompt': 'a photo of bella'},
            {'prompt': 'a photo of young woman with yellow hat'},
        ]
    """
    client = OpenAI(
        # This is the default and can be omitted
        api_key=os.environ.get("OPENAI_API_KEY"),
    )

    chat_completion = client.chat.completions.create(
        messages=[
            {
                "role": "system",
                "content": "You are a Diffusion model prompt generator. You onlu output English. You should imagine relevant scenes or objects and returns a list of words which are visually specific concepts and vibes.  For example, if the input prompt is Plants, output a python list of length 2 as follows: ['a photo of a tree', 'a photo of grass'] ",
            },
            {"role": "user", "content": text},
        ],
        model="gpt-3.5-turbo",
    )

    return_data = []
    promptlist = ast.literal_eval(chat_completion.choices[0].message.content)
    for i in range(len(promptlist)):
        return_data.append({"prompt": promptlist[i]})
    return return_data

@va.tools.register_tool(
    imports=["import os", "import io", "from PIL import Image", "import requests", "import json"]
)
def Dalle2_imgvaria(
    imgurl: str,
) -> Dict[str, Any]:
    """'Dalle2_imgvaria' is a tool for generating variated images from an input image
    using a Dalle2 model. This function allows the user to reimagin more different but
    senmantic similar images.

    Parameters:
        imgurl (str): The input image from. This is the main reference of the desired image.
    Returns:
        Dict[str, Any]: A dictionary containing the generated image filename.
    Example
    -------
        >>> Dalle2_imgvaria(imgurl="https://xxxx/e02210d7-5ce3-4230-b4a3-918066d1c6fc_20231028005328.jpg")
         {'imgfile': './tmp.jpeg'}
    """
    url = "https://api.openai.com/v1/images/generations"
    headers = {
        "Authorization": os.environ.get("OPENAI_API_KEY"),
        "Content-Type": "application/json",
    }

    image = Image.open(requests.get(imgurl, stream=True).raw)
    image = image.resize((256, 256))  # Resize to 1024x1024 if required
    buffered = io.BytesIO()
    image.save(buffered, format="PNG")
    image_data = buffered.getvalue()

    multipart_data = {"image": ("image.png", image_data, "image/png")}

    response = requests.post(url, headers=headers, files=multipart_data)

    image_url = None
    if response.status_code == 200:
        image_url = response.json()["data"][0]["url"]
        print(f"Generated image variation URL: {image_url}")
    else:
        print(
            f"Failed to generate image variation. Status code: {response.status_code}"
        )
        print(response.json())

    outdata = {"imgfile": image_url}
    return outdata

if __name__ == "__main__":
    # generate_and_save_images()
    agent = va.agent.VisionAgent(verbosity=2)
    resp = agent.chat_with_workflow(
        [
            {
                "role": "user",
                "content": "Create a python script that generates N pictures of beautiful young women and saves them to the output_images/ folder. To test, start with N=2."
            }
        ]
    )
    with open("code.py", "w") as f:
        f.write(f"{resp['code']}\n{resp['test']}")

dillonalaird commented 2 months ago

Here's the final code it generates (note I run this in the above file so I have the Dalle2 functions available):

import numpy as np
import os
import requests
from PIL import Image
from io import BytesIO
from vision_agent.tools import save_image

def generate_images_of_beautiful_young_women(N):
    # Ensure the output directory exists
    output_dir = 'output_images'
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)
    else:
        # Clear the output directory
        for file in os.listdir(output_dir):
            file_path = os.path.join(output_dir, file)
            if os.path.isfile(file_path):
                os.unlink(file_path)

    # Generate prompts
    prompts = Dalle2_prompt_gen("beautiful young woman")

    # Loop through the prompts and generate images
    for i in range(min(N, len(prompts))):
        prompt = prompts[i]['prompt']
        image_info = Dalle2_text2img(prompt)
        image_url = image_info['imgfile']

        # Download the image
        response = requests.get(image_url)
        image = Image.open(BytesIO(response.content))
        image_np = np.array(image)

        # Save the image
        file_path = os.path.join(output_dir, f"image_{i+1}.png")
        save_image(image_np, file_path)

If you are still having trouble getting it to work, I find it works better if your functions return a numpy array rather than a URL. That way you avoid the extra code of downloading the image from the image URL.
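
As a sketch of that suggestion (a hypothetical helper, not part of vision_agent.tools): the tool can decode the HTTP response bytes itself and return an array that save_image can consume directly, so the agent-generated code never has to handle URLs.

```python
import io

import numpy as np
from PIL import Image


def image_bytes_to_array(data: bytes) -> np.ndarray:
    """Decode raw image bytes (e.g. the body of a requests.get on the
    generated URL) into an RGB numpy array."""
    return np.array(Image.open(io.BytesIO(data)).convert("RGB"))
```

A tool would call this on `response.content` and return the array instead of the `img_url` string.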

hmppt commented 2 months ago

Thank you, I will try again to resolve this issue. I would also like to ask whether the final generated code can be saved automatically as a .py file, or whether the whole process can be printed to a log file, since the intermediate output is very long.

dillonalaird commented 2 months ago

That's a good suggestion; we can look into adding those features. Currently, to save the code output, what I do is save the code from the response:

import vision_agent as va
if __name__ == "__main__":
    agent = va.agent.VisionAgent()
    resp = agent.chat_with_workflow(...)
    with open("code.py", "w") as f:
        f.write(f"{resp['code']}\n{resp['test']}")

Would you want something like agent.chat_with_workflow([{"role": "user", ...}], save_file="code.py"), where it saves the output to the code.py file?

hmppt commented 2 months ago

Yes, if this feature could be added, it would make review and execution easier.

hmppt commented 2 months ago

Hello, I modified max_retries. I don't know if my understanding is correct; if not, please correct me: for a given user input, the agent first gives three plans, then writes the code and fixes it repeatedly (3 times by default); then three plans are generated again for the same input, and the code is fixed repeatedly again. So the entire process runs max_retries * max_retries times, that is, plan loop * fix code loop.

I changed max_retries to 2, and the overall run time was greatly shortened; on average, each task now takes about 760 seconds.

hmppt commented 2 months ago

I also found something confusing while running it. write_plans() already gives the plan and the tools needed for that plan, so why is retrieve_tools() needed afterwards?

dillonalaird commented 2 months ago

> plan loop * fix code loop

This is correct. It's this loop (https://github.com/landing-ai/vision-agent/blob/main/vision_agent/agent/vision_agent.py#L679) times either the loop for testing plans (https://github.com/landing-ai/vision-agent/blob/main/vision_agent/agent/vision_agent.py#L201) or the loop for debugging the code (https://github.com/landing-ai/vision-agent/blob/main/vision_agent/agent/vision_agent.py#L354).

Both inner loops only run if the code they write fails, though, so they are only for debugging failed code. I have also removed the outer loop in this PR: https://github.com/landing-ai/vision-agent/pull/169

> When write_plans(), the plan and the tools needed for the plan can be given. Why do I need retrieve_tools() later?

write_plan actually gets all the tool descriptions. These are short descriptions of what each tool does, so it can decide which tools to use. However, it does not get the full tool documentation, because that would make the prompt too long. After a plan is written, we can then retrieve the full tool documentation for just the tools used in that particular plan. This documentation is longer, but it only covers the few tools relevant to the plan.
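
The two-stage lookup can be pictured like this (an illustrative sketch with made-up registry entries, not the library's actual implementation):

```python
# Hypothetical tool registry: short blurbs for planning, full docs on demand.
TOOL_REGISTRY = {
    "save_image": {
        "short": "saves an image to a file path",
        "full": "save_image(image, file_path) -> None\n\nSaves an image array to disk.",
    },
    "load_image": {
        "short": "loads an image from a file path",
        "full": "load_image(file_path) -> np.ndarray\n\nLoads an image from disk.",
    },
}


def planning_descriptions() -> str:
    """Short descriptions of every tool, used while writing the plan."""
    return "\n".join(f"{name}: {t['short']}" for name, t in TOOL_REGISTRY.items())


def retrieve_tool_docs(names: list) -> str:
    """Full documentation, but only for the tools a chosen plan uses."""
    return "\n\n".join(TOOL_REGISTRY[name]["full"] for name in names)
```

Planning sees every tool briefly; coding sees only the chosen plan's tools in depth, which keeps both prompts short.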

hmppt commented 1 month ago

Thank you very much for your prompt answer!