Florence-2 is an advanced vision foundation model that uses a
prompt-based approach to handle a wide range of vision and
vision-language tasks. Florence-2 can interpret simple text prompts to
perform tasks like captioning, object detection, and segmentation. It
leverages our FLD-5B dataset, containing 5.4 billion annotations across
126 million images, to master multi-task learning. The model's
sequence-to-sequence architecture enables it to excel in both zero-shot
and fine-tuned settings, proving to be a competitive vision foundation
model.
Resources and Technical Documentation:
- [Florence-2 technical report](https://arxiv.org/abs/2311.06242)
- [Jupyter Notebook for inference and visualization of the Florence-2-large model](https://huggingface.co/microsoft/Florence-2-large/blob/main/sample_inference.ipynb)
Model | Model size | Model Description
-- | -- | --
Florence-2-base[[HF]](https://huggingface.co/microsoft/Florence-2-base) | 0.23B | Pretrained model with FLD-5B
Florence-2-large[[HF]](https://huggingface.co/microsoft/Florence-2-large) | 0.77B | Pretrained model with FLD-5B
Florence-2-base-ft[[HF]](https://huggingface.co/microsoft/Florence-2-base-ft) | 0.23B | Fine-tuned model on a collection of downstream tasks
Florence-2-large-ft[[HF]](https://huggingface.co/microsoft/Florence-2-large-ft) | 0.77B | Fine-tuned model on a collection of downstream tasks
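
The prompt-based workflow described above can be sketched roughly as follows. This is a minimal sketch, not the official usage: the checkpoint IDs come from the table, but the task tokens `<CAPTION>` and `<OD>`, the `trust_remote_code=True` loading path, and the `post_process_generation` helper are assumptions based on the public checkpoints and may differ with your `transformers` version. The names `CHECKPOINTS`, `TASK_PROMPTS`, `checkpoint_id`, and `run_task` are hypothetical helpers introduced only for illustration.

```python
from typing import Any, Dict

# Checkpoint IDs as listed in the model table.
CHECKPOINTS = {
    ("base", False): "microsoft/Florence-2-base",
    ("large", False): "microsoft/Florence-2-large",
    ("base", True): "microsoft/Florence-2-base-ft",
    ("large", True): "microsoft/Florence-2-large-ft",
}

# Assumed task prompt tokens; verify against the checkpoint you load.
TASK_PROMPTS = {
    "caption": "<CAPTION>",
    "object_detection": "<OD>",
}


def checkpoint_id(size: str = "base", finetuned: bool = False) -> str:
    """Map a (size, finetuned) choice to a Hugging Face checkpoint ID."""
    try:
        return CHECKPOINTS[(size, finetuned)]
    except KeyError:
        raise ValueError(f"unknown model size: {size!r}") from None


def run_task(image: Any, task: str, size: str = "base",
             finetuned: bool = False) -> Dict[str, Any]:
    """Run one prompted task on a PIL image and return the parsed result."""
    # Imported lazily so the helpers above work without transformers installed.
    from transformers import AutoModelForCausalLM, AutoProcessor

    model_id = checkpoint_id(size, finetuned)
    prompt = TASK_PROMPTS[task]

    # Assumption: the checkpoints ship custom modeling code, hence
    # trust_remote_code=True.
    model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
    processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

    # The task is selected purely by the text prompt passed with the image.
    inputs = processor(text=prompt, images=image, return_tensors="pt")
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=256,
    )
    text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]

    # Assumption: post_process_generation is provided by the checkpoint's
    # processor and converts the generated sequence into task-specific output.
    return processor.post_process_generation(
        text, task=prompt, image_size=(image.width, image.height)
    )
```

With Pillow installed, a call such as `run_task(Image.open("photo.jpg"), "object_detection", size="large")` would download the checkpoint on first use; `photo.jpg` is a placeholder file name. Switching tasks only changes the prompt string, which is the point of the sequence-to-sequence design: the same weights serve captioning and detection.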