Pocket-Sized Multimodal AI for content understanding and generation across multilingual texts, images, and 🔜 video, up to 5x faster than OpenAI CLIP and LLaVA 🖼️ & 🖋️
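JavaScript
Embed images:
import assert from "node:assert";

// ImageProcessor and ImageEncoder come from the UForm JavaScript package;
// configPath and modalityPaths are produced by the model-loading step.
const imageProcessor = new ImageProcessor(configPath);
await imageProcessor.init();
// Resize and normalize the image into the tensor layout the encoder expects.
const processedImages = await imageProcessor.process("path/to/image.png");
const imageEncoder = new ImageEncoder(modalityPaths.image_encoder, imageProcessor);
await imageEncoder.init();
// Encode the image; the embeddings tensor is expected to be 2D.
const imageOutput = await imageEncoder.encode(processedImages);
assert(imageOutput.embeddings.dims.length === 2, "Output should be 2D");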
Embed queries:
const textProcessor = new TextProcessor(configPath, tokenizerPath);
await textProcessor.init();
const processedTexts = await textProcessor.process("a small red panda in a zoo");
const textEncoder = new TextEncoder(modalityPaths.text_encoder, textProcessor);
await textEncoder.init();
const textOutput = await textEncoder.encode(processedTexts);
assert(textOutput.embeddings.dims.length === 2, "Output should be 2D");
await textEncoder.dispose();
Swift
Embed images:
import ImageIO // CGImageSource APIs
import UForm

let imageModel = try await ImageEncoder(modelName: "unum-cloud/uform3-image-text-english-small")
let imageURL = "https://github.com/ashvardanian/ashvardanian/blob/master/demos/bbq-on-beach.jpg?raw=true"
// Decode the image into a CGImage, failing early if any step returns nil.
// `Exception` stands in for any type that conforms to `Error`.
guard let url = URL(string: imageURL),
    let imageSource = CGImageSourceCreateWithURL(url as CFURL, nil),
    let cgImage = CGImageSourceCreateImageAtIndex(imageSource, 0, nil) else {
    throw Exception("Could not load image from URL: \(imageURL)")
}
// Encode the image, then unpack the embedding into a flat array of floats.
let imageEmbedding: Embedding = try imageModel.encode(cgImage)
let imageVector: [Float32] = imageEmbedding.asFloats()
Embed queries:
let textModel = try await TextEncoder(modelName: "unum-cloud/uform3-image-text-english-small")
let text = "A group of friends enjoy a barbecue on a sandy beach, with one person grilling over a large black grill, while the other sits nearby, laughing and enjoying the camaraderie."
let textEmbedding: Embedding = try textModel.encode(text)
let textVector: [Float32] = textEmbedding.asFloats()
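Python
Embed queries:
from uform import get_model, Modality

# Assumption: any checkpoint listed in this document can be used here; the small English model is only an example.
processors, models = get_model('unum-cloud/uform3-image-text-english-small')

text = 'a cityscape bathed in the warm glow of the sun, with varied architecture and a towering, snow-capped mountain rising majestically in the background'
model_text = models[Modality.TEXT_ENCODER]
processor_text = processors[Modality.TEXT_ENCODER]
# Tokenize the query, then encode it; return_features=True returns the intermediate features alongside the embedding.
text_data = processor_text(text)
text_features, text_embedding = model_text.encode(text_data, return_features=True)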
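Once a query and a set of images have been embedded, they are usually compared by cosine similarity. The sketch below illustrates that step with the standard library only. The helpers `cosine_similarity` and `rank_by_similarity` and the toy 3-dimensional vectors are hypothetical stand-ins, not part of the UForm API; real embeddings would first be flattened into lists of floats.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)

def rank_by_similarity(query, candidates):
    """Return candidate indices sorted from most to least similar to the query."""
    scores = [cosine_similarity(query, c) for c in candidates]
    return sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)

# Toy vectors standing in for one text embedding and three image embeddings.
text_vector = [1.0, 0.0, 1.0]
image_vectors = [[0.0, 1.0, 0.0], [2.0, 0.0, 2.0], [1.0, 1.0, 0.0]]
ranking = rank_by_similarity(text_vector, image_vectors)
# ranking == [1, 2, 0]: the second image points in the same direction as the query.
```

Because cosine similarity ignores vector length, the second toy image ranks first even though its magnitude differs from the query's.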
How many AI models can run on-device out of the box? UForm multimodal embeddings can 🥳
uform3-image-text-english-large
🆕 uform3-image-text-english-base
uform3-image-text-english-small
🆕 uform3-image-text-multilingual-base
JavaScript
Load the models and preprocessors for different modalities:
Embed images:
Embed queries:
Swift
Embed images:
Embed queries:
Python
Load model:
Embed images:
Embed queries: