nasa-petal / bidara-deep-chat

BIDARA is a GPT-4o chatbot instructed to help scientists and engineers understand, learn from, and emulate the strategies used by living things to create sustainable designs and technologies, following the Biomimicry Institute's step-by-step design process.
https://bit.ly/bidara-ai

`vision-only` does not describe image #79

Closed. bruffridge closed this issue 6 months ago.

bruffridge commented 6 months ago

I must be doing something wrong.

[screenshot attached]

jackitaliano commented 6 months ago

This was somewhat of a "design choice". The assistant has no way to automatically know the file type, and thus that a file is an image. Options:

  1. Tell it that the file is an image
  2. Function call to determine type of file

Downsides:

  1. The only way to tell it is with a user message, which must appear in the chat. This would look like an image with text below it saying "(user uploaded an image)". This did not seem ideal.
  2. A function call adds significant response latency, and cost, for every file uploaded, because the assistant has to make an additional call.

I could do either of these, and would prefer the function call, but neither is ideal.

The current solution relies on the user somehow mentioning that the file is an image. Based on my own testing, what you said should work almost all of the time because you mentioned "image". Not sure why it didn't in this case...
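Option 1 above could be sketched as a small client-side helper that appends a note about the file type to the user's message before it is added to the thread. The function name, the extension set, and the exact wording of the note are illustrative assumptions, not code from bidara-deep-chat:

```python
# Sketch of option 1: annotate the user's message so the assistant learns
# the file type. Names and wording here are illustrative assumptions.

IMAGE_EXTENSIONS = {"png", "jpg", "jpeg", "gif", "webp"}

def annotate_message(user_text, file_name):
    """Append a note about the uploaded file's type to the user message."""
    extension = file_name.rsplit(".", 1)[-1].lower() if "." in file_name else ""
    if extension in IMAGE_EXTENSIONS:
        return f"{user_text}\n(user uploaded an image)"
    if extension:
        return f"{user_text}\n(user uploaded a .{extension} file)"
    return user_text

print(annotate_message("What is this?", "photo.PNG"))
```

The downside is exactly the one described above: the note becomes part of the visible user message in the chat.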

gayatri-sharma commented 6 months ago

File extensions are independent of how a file was created, but when a file is uploaded to BIDARA, the name or path of the file can be accessed. To determine the file type from the file extension, one can use a determine_file_type() function.

Assuming you have a dictionary mapping the file extensions to types, here is the sample code:

import os

# Sample mapping of file extensions to types
extension_mappings = {"png": "Image", "jpg": "Image", "pdf": "Document", "csv": "Data"}

def determine_file_type(file_path):
    # Get the file extension, e.g. "photo.png" -> "png"
    _, file_extension = os.path.splitext(file_path)
    file_extension = file_extension.lower().strip(".")

    # Look up the extension in the dictionary
    return extension_mappings.get(file_extension, "Unknown")

# Function to handle a file upload in BIDARA
def handle_file_upload(uploaded_file_path):
    file_type = determine_file_type(uploaded_file_path)
    print(f"The uploaded file '{uploaded_file_path}' is of type: {file_type}")

Hoping I didn't miss any details and that this helps. Let me know.
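As an alternative sketch, Python's standard library can do the same lookup without a hand-maintained dictionary, via mimetypes.guess_type. This is not code from the repo, just a possible simplification; like the dictionary approach, it only inspects the extension, not the file contents:

```python
import mimetypes

def is_image(file_path):
    """Guess whether a file is an image from its extension alone."""
    mime_type, _ = mimetypes.guess_type(file_path)
    return mime_type is not None and mime_type.startswith("image/")

print(is_image("photo.png"))  # extension-based guess only; content is not inspected
```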

gayatri-sharma commented 6 months ago

If the image analysis fails even after fixing the file type, we could use the Mask R-CNN model that I previously used for object detection and segmentation in computer vision.

I found the COCO (Common Objects in Context) dataset. It is commonly used to train Mask R-CNN, and the model classifies objects by returning integer class IDs, each of which identifies one of the dataset's uniquely numbered classes. Here's an example of object detection code:

import os
import random

import skimage.io
from mrcnn import visualize  # Matterport Mask R-CNN; assumes `model`, `class_names`, and IMAGE_DIR are already set up

# Load a random image from the images folder
file_names = next(os.walk(IMAGE_DIR))[2]  # filenames (not directories) directly inside IMAGE_DIR
image = skimage.io.imread(os.path.join(IMAGE_DIR, random.choice(file_names)))

# Run detection; verbose=1 prints progress and debug information
results = model.detect([image], verbose=1)

# Visualize results; passing None instead of r['masks'] is intended to skip
# highlighting the object pixels in the image
r = results[0]
visualize.display_instances(image, r['rois'], None, r['class_ids'],
                            class_names, r['scores'])

jackitaliano commented 6 months ago

Thank you for the reply.

The issue isn't with the client knowing the file type, because we could determine it exactly as you said. The issue is with the assistant knowing the file type, as there's no way to pass information to the assistant without it being a user message.

The flow with deep-chat and assistants works something like this:

  1. User uploads file to client, let it be a png image.
  2. Client uploads the png to OpenAI assistant file storage, which returns a file handle "file-xyz123" (without type information)
    • Assistants are only able to work with files in assistant file storage, so this is the only way to upload files
  3. Client adds a message to the current thread with contents similar to: {..., "role": "ai or user here", "content": "message contents here", file_ids: [ "file-xyz123" ]}
    • The only information we can adjust here is content, which in this case is the user message. This must be treated as any other user message, and thus is part of the actual chat.
    • The list of files must contain valid file handles in the form "file-..." like in "file-xyz123" that are already uploaded to OpenAI.
  4. Client runs the assistant on this updated thread
    • At this step, the assistant may choose to call any of its available "functions", return a message, or both.
    • After the run is complete, it must return a message.
  5. The assistant sees a file is uploaded because the array "file_ids" is populated with one file id.
    • It has no other information on the file, only that a file_id was passed
  6. We'll assume the user said "Describe this image". Then the assistant can infer that the file was an image, and will choose to call the function "image_to_text".
    • If the user does not say this, there's no way for the assistant to infer because it has no other information to go off of.
    • This is similar to how it works for any other files uploaded. The user must tell it to do something with that file type, otherwise there's no way to infer.
  7. The assistant receives the response from its "image_to_text" function (a description of the image) and describes that back to the user with a message response.
  8. Client takes recent messages and adds them to the ui.
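The message body from step 3 can be sketched as a plain dict. The thread of steps is exactly as described above: the client can only control "content", while the file entry carries no type information. The helper name is illustrative, and "file-xyz123" is the placeholder handle from step 2, not a real ID:

```python
# Sketch of the step-3 message payload, following the file_ids-style message
# described above. The helper name and IDs are illustrative placeholders.

def build_thread_message(user_text, file_id):
    """Build the message body added to the thread in step 3."""
    return {
        "role": "user",
        "content": user_text,   # the only field the client can adjust
        "file_ids": [file_id],  # must be valid, already-uploaded handles
    }

message = build_thread_message("Describe this image", "file-xyz123")
print(sorted(message.keys()))
```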

As you can see, even if we have the file extension and thus the file type, we can't share this information with the assistant directly.

As mentioned previously, the only way to give the file type to the assistant without putting it in the user message (as far as I'm aware) is through a function call. That would remain true even with other image-recognition software, since the assistant itself still has to make that call.
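For concreteness, the function-call route would be declared to the assistant with a JSON-schema function definition along these lines. The name "get_file_type" and its parameters are assumptions following the general OpenAI function-calling format, not BIDARA's actual configuration:

```python
import json

# Illustrative function definition in the general OpenAI function-calling
# format; "get_file_type" and its schema are assumptions, not BIDARA's
# actual configuration.
get_file_type_tool = {
    "type": "function",
    "function": {
        "name": "get_file_type",
        "description": "Return the type (image, pdf, csv, ...) of an uploaded file.",
        "parameters": {
            "type": "object",
            "properties": {
                "file_id": {
                    "type": "string",
                    "description": "Handle of an uploaded file, e.g. file-xyz123.",
                },
            },
            "required": ["file_id"],
        },
    },
}

print(json.dumps(get_file_type_tool["function"]["name"]))
```

The client would answer each call with the type determined from the extension, which is where a helper like determine_file_type() fits in, at the cost of the extra round trip described above.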

Let me know if you have any ideas with this, or questions. We can also discuss when we meet 😊

jackitaliano commented 6 months ago

Pushed a commit with the mentioned function call.

As I said, there is a time hit involved with this. It does feel better that the assistant already knows what the file is, but at the same time it feels worse to wait that long. Maybe it'll feel better once streaming is properly implemented by deep-chat.

Though you could also argue that in some instances, like yours, it saves time, since otherwise the assistant doesn't know and you have to tell it again what kind of file it is.