hmarr / openai-chat-tokens

💬 Estimate the number of tokens an OpenAI chat completion will use
MIT License

Support for new formats of content #21

Open hellohejinyu opened 3 months ago

hellohejinyu commented 3 months ago
import OpenAI from "openai";

const openai = new OpenAI();

async function main() {
  const response = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: [
      {
        role: "user",
        content: [
          { type: "text", text: "What’s in this image?" },
          {
            type: "image_url",
            image_url: {
              "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
            },
          },
        ],
      },
    ],
  });
  console.log(response.choices[0]);
}
main();

The openai SDK supports passing images and text in the same message, but the token cost of an image depends on its size and on the processing (detail) mode. So I think we need to let callers supply those two extra parameters, the image size and the processing mode, so that the image's token count can be calculated.

(screenshot: the image token cost calculation from the OpenAI docs)
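
One note on where those two parameters could come from: the processing mode is already expressible in the request, since image_url takes an optional detail field ("low", "high", or "auto", defaulting to "auto"), but the image dimensions never appear in the request at all, so they would have to be passed to the estimator separately. A minimal sketch of the request side, with a placeholder URL:

// Sketch only: the same kind of user message as above, but with the `detail`
// field set explicitly. `detail` is part of the OpenAI API and controls the
// processing mode; when omitted it defaults to "auto". The pixel dimensions
// are never part of the request, which is why an estimator would need them
// supplied separately.
const userMessage = {
  role: "user",
  content: [
    { type: "text", text: "What's in this image?" },
    {
      type: "image_url",
      image_url: {
        url: "https://example.com/photo.jpg", // placeholder URL
        detail: "high", // processing mode: "low" | "high" | "auto"
      },
    },
  ],
} as const;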

sean-nicholas commented 3 months ago

I guess you could add a lib that extracts the size from the images, like https://www.npmjs.com/package/image-size. It should be pretty easy to fetch an image, or create a buffer from base64, and pipe that into image-size. But currently this won't work in Cloudflare Workers: https://github.com/image-size/image-size/issues/405
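
Roughly like this, as a sketch (the import shape of image-size differs between major versions, and as mentioned it currently won't run in Cloudflare Workers):

import sizeOf from "image-size"; // v1 default export; v2 exports { imageSize } instead

// Sketch: fetch the image (or build a Buffer from a base64 data URL) and let
// image-size read the dimensions from the header bytes.
async function getImageDimensions(
  url: string
): Promise<{ width: number; height: number }> {
  const response = await fetch(url);
  const buffer = Buffer.from(await response.arrayBuffer());
  const { width, height } = sizeOf(buffer);
  if (!width || !height) {
    throw new Error(`Could not determine dimensions for ${url}`);
  }
  return { width, height };
}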

I'm not quite sure if you can guess which detail level is chosen when you don't send it (i.e. when it's in auto mode), but from the documentation I would guess that if the image is smaller than 512px in both dimensions it will be low, otherwise high.
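
If that guess is right, the auto case could be approximated with something like this (an assumption based on the wording of the docs, not confirmed behaviour):

// Heuristic only: assume "auto" picks "low" when the image fits in a single
// 512px x 512px tile, and "high" otherwise. This is a guess, not documented.
function guessAutoDetail(width: number, height: number): "low" | "high" {
  return width <= 512 && height <= 512 ? "low" : "high";
}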

Funny that there are two descriptions of the costs in the official docs: the one that you posted, and this one: https://platform.openai.com/docs/guides/vision/low-or-high-fidelity-image-understanding. In that section they say the cost is 65 tokens per crop:

  • low will enable the "low res" mode. The model will receive a low-res 512px x 512px version of the image, and represent the image with a budget of 65 tokens. This allows the API to return faster responses and consume fewer input tokens for use cases that do not require high detail.
  • high will enable "high res" mode, which first allows the model to see the low res image and then creates detailed crops of input images as 512px squares based on the input image size. Each of the detailed crops uses twice the token budget (65 tokens) for a total of 129 tokens.

I'm not quite sure why two times 65 should be 129, but hey 🤷‍♂️😁

hellohejinyu commented 3 months ago
function calculateHighDetailTokens(width: number, height: number): number {
  // First, check if the image needs to be scaled to fit within the 2048 x 2048 size limit
  if (width > 2048 || height > 2048) {
    const aspectRatio = width / height;
    if (width > height) {
      width = 2048;
      height = Math.round(2048 / aspectRatio);
    } else {
      height = 2048;
      width = Math.round(2048 * aspectRatio);
    }
  }

  // Next, scale the image so that the shortest side is 768px
  const minSideLength = 768;
  const currentMinSide = Math.min(width, height);
  if (currentMinSide > minSideLength) {
    const scaleFactor = minSideLength / currentMinSide;
    width = Math.round(width * scaleFactor);
    height = Math.round(height * scaleFactor);
  }

  // Calculate how many 512px tiles the image is composed of
  const tilesWide = Math.ceil(width / 512);
  const tilesHigh = Math.ceil(height / 512);
  const totalTiles = tilesWide * tilesHigh;

  // The token cost for each tile is 170, with an additional 85 tokens added at the end
  const totalTokens = totalTiles * 170 + 85;

  return totalTokens;
}

// Example usage
console.log(calculateHighDetailTokens(1024, 1024)); // Should output 765
console.log(calculateHighDetailTokens(2048, 4096)); // Should output 1105

In our project we actually only use high mode, and the front end knows the image width and height when uploading images. So I asked GPT to write some code that calculates the token count in high mode. This temporarily solves the problem of counting tokens for image messages. 😂
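
As a stopgap for anyone else hitting this before image content parts are supported natively, the same idea can be bolted onto the library's existing promptTokensEstimate: estimate the text parts as plain string content and add the image cost from calculateHighDetailTokens above. A rough sketch, not an official API:

import { promptTokensEstimate } from "openai-chat-tokens";

// Stopgap sketch: estimate the text part with openai-chat-tokens as usual,
// then add the high-detail image cost computed from the known dimensions
// using the calculateHighDetailTokens function above. Approximate, not exact.
function estimateWithImage(
  question: string,
  imageWidth: number,
  imageHeight: number
): number {
  const textTokens = promptTokensEstimate({
    messages: [{ role: "user", content: question }],
  });
  return textTokens + calculateHighDetailTokens(imageWidth, imageHeight);
}

console.log(estimateWithImage("What's in this image?", 1024, 1024));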