microsoft / unilm

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
https://aka.ms/GeneralAI
MIT License


Hiring

We are hiring at all levels (including FTE researchers and interns)! If you are interested in working with us on Foundation Models (i.e., large-scale pre-trained models), General AI, NLP, MT, Speech, Document AI, and Multimodal AI, please send your resume to fuwei@microsoft.com.

Foundation Architecture

TorchScale - A Library of Foundation Architectures (repo)

Fundamental research to develop new architectures for foundation models and AI, focusing on modeling generality and capability, as well as training stability and efficiency. A minimal usage sketch follows the architecture list below.

Stability - DeepNet: scaling Transformers to 1,000 Layers and beyond

Generality - Foundation Transformers (Magneto): towards true general-purpose modeling across tasks and modalities (including language, vision, speech, and multimodal)

Capability - A Length-Extrapolatable Transformer

Efficiency & Transferability - X-MoE: scalable & finetunable sparse Mixture-of-Experts (MoE)

The Revolution of Model Architecture

BitNet: 1-bit Transformers for Large Language Models

RetNet (Retentive Network): A Successor to Transformer for Large Language Models

LongNet: Scaling Transformers to 1,000,000,000 Tokens
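
The architectures above are implemented in TorchScale. As a minimal sketch of how the library is used, the snippet below constructs a Transformer encoder, following the example in the TorchScale README; treat the exact configuration fields (e.g. the flags that switch on DeepNet- or Magneto-style variants) as something to confirm against the current repo.

```python
# pip install torchscale  -- a minimal sketch; check the TorchScale repo for the current API
from torchscale.architecture.config import EncoderConfig
from torchscale.architecture.encoder import Encoder

# Build a Transformer encoder with the library defaults.
# Variants listed above (DeepNet, Magneto, LEX, X-MoE, ...) are enabled
# through fields on the config object; see the repo for the exact flag names.
config = EncoderConfig(vocab_size=64000)
model = Encoder(config)
print(model)
```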

Foundation Models

The Evolution of (M)LLM (Multimodal LLM)

Kosmos-2.5: A Multimodal Literate Model

Kosmos-2: Grounding Multimodal Large Language Models to the World

Kosmos-1: A Multimodal Large Language Model (MLLM)

MetaLM: Language Models are General-Purpose Interfaces

The Big Convergence - Large-scale self-supervised pre-training across tasks (predictive and generative), languages (100+ languages), and modalities (language, image, audio, layout/format + language, vision + language, audio + language, etc.)

Language & Multilingual

UniLM: unified pre-training for language understanding and generation

InfoXLM/XLM-E: multilingual/cross-lingual pre-trained models for 100+ languages

DeltaLM/mT6: encoder-decoder pre-training for language generation and translation for 100+ languages

MiniLM: small and fast pre-trained models for language understanding and generation

AdaLM: domain, language, and task adaptation of pre-trained models

EdgeLM (NEW): small pre-trained models on edge/client devices

SimLM (NEW): large-scale pre-training for similarity matching

E5 (NEW): text embeddings by weakly-supervised contrastive pre-training (see the loading sketch after this list)

MiniLLM (NEW): Knowledge Distillation of Large Language Models
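
As a concrete example for the text-embedding entry above, the sketch below loads one of the released E5 checkpoints through Hugging Face transformers. The `intfloat/e5-base-v2` model id and the `query:`/`passage:` prefixes follow the E5 model card; the pooling shown is plain mean pooling over non-padding tokens.

```python
# pip install torch transformers  -- a minimal sketch following the E5 model card
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("intfloat/e5-base-v2")
model = AutoModel.from_pretrained("intfloat/e5-base-v2")

# E5 expects "query: " / "passage: " prefixes on the input texts.
texts = [
    "query: how are text embeddings pre-trained?",
    "passage: E5 embeddings are trained with contrastive pre-training on text pairs.",
]
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    last_hidden = model(**batch).last_hidden_state

# Mean-pool over non-padding tokens, then L2-normalize.
mask = batch["attention_mask"].unsqueeze(-1).float()
embeddings = F.normalize((last_hidden * mask).sum(1) / mask.sum(1), p=2, dim=1)
print((embeddings[0] @ embeddings[1]).item())  # cosine similarity of query and passage
```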

Vision

BEiT/BEiT-2: generative self-supervised pre-training for vision / BERT Pre-Training of Image Transformers

DiT: self-supervised pre-training for Document Image Transformers

TextDiffuser/TextDiffuser-2 (NEW): Diffusion Models as Text Painters
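
The BEiT checkpoints are also published on the Hugging Face Hub; the sketch below loads the ImageNet-fine-tuned `microsoft/beit-base-patch16-224` checkpoint for image classification. The model id is one of the released checkpoints, but check the model card for alternatives (e.g. the self-supervised pre-trained weights); the input file name is hypothetical.

```python
# pip install torch transformers pillow  -- a minimal sketch; see the BEiT model card for details
from PIL import Image
from transformers import BeitForImageClassification, BeitImageProcessor

processor = BeitImageProcessor.from_pretrained("microsoft/beit-base-patch16-224")
model = BeitForImageClassification.from_pretrained("microsoft/beit-base-patch16-224")

image = Image.open("example.jpg").convert("RGB")  # any local RGB image (hypothetical file)
inputs = processor(images=image, return_tensors="pt")
logits = model(**inputs).logits
print(model.config.id2label[logits.argmax(-1).item()])  # predicted ImageNet label
```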

Speech

WavLM: speech pre-training for full stack tasks

VALL-E: a neural codec language model for TTS
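
WavLM checkpoints can likewise be used as drop-in speech encoders for downstream tasks. The sketch below extracts frame-level representations with Hugging Face transformers; `microsoft/wavlm-base-plus` is one of the released checkpoints, and the zero waveform is a placeholder for real 16 kHz audio.

```python
# pip install torch transformers  -- a minimal sketch using a released WavLM checkpoint
import torch
from transformers import AutoFeatureExtractor, WavLMModel

extractor = AutoFeatureExtractor.from_pretrained("microsoft/wavlm-base-plus")
model = WavLMModel.from_pretrained("microsoft/wavlm-base-plus")

waveform = torch.zeros(16000)  # placeholder: 1 second of silence at 16 kHz
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    features = model(**inputs).last_hidden_state  # (batch, frames, hidden_size)
print(features.shape)
```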

Multimodal (X + Language)

LayoutLM/LayoutLMv2/LayoutLMv3: multimodal (text + layout/format + image) Document Foundation Models for Document AI (e.g. scanned documents and PDFs); a loading sketch follows at the end of this section

LayoutXLM: multimodal (text + layout/format + image) Document Foundation Model for multilingual Document AI

MarkupLM: markup language model pre-training for visually-rich document understanding

XDoc: unified pre-training for cross-format document understanding

UniSpeech: unified pre-training combining self-supervised and supervised learning for ASR

UniSpeech-SAT: universal speech representation learning with speaker-aware pre-training

SpeechT5: encoder-decoder pre-training for spoken language processing

SpeechLM: Enhanced Speech Pre-Training with Unpaired Textual Data

VLMo: Unified vision-language pre-training

VL-BEiT (NEW): generative vision-language pre-training - an evolution of BEiT to the multimodal setting

BEiT-3 (NEW): a general-purpose multimodal foundation model, and a major milestone of The Big Convergence of Large-scale Pre-training Across Tasks, Languages, and Modalities.
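
As a concrete example for the Document AI entry above (LayoutLMv3), the sketch below runs token classification over a page image, with the words and their 0-1000-normalized bounding boxes supplied by an external OCR step. The model id follows the published checkpoint; the file name, words, boxes, and label count are hypothetical, and the classification head is randomly initialized until fine-tuned.

```python
# pip install torch transformers pillow  -- a minimal sketch; words, boxes, and labels are hypothetical
from PIL import Image
from transformers import LayoutLMv3ForTokenClassification, LayoutLMv3Processor

# apply_ocr=False because we pass our own words and boxes below
processor = LayoutLMv3Processor.from_pretrained("microsoft/layoutlmv3-base", apply_ocr=False)
# num_labels is task-specific; this head is newly initialized and needs fine-tuning
model = LayoutLMv3ForTokenClassification.from_pretrained("microsoft/layoutlmv3-base", num_labels=7)

image = Image.open("page.png").convert("RGB")           # a scanned page (hypothetical file)
words = ["Invoice", "Total", "$120.00"]                 # words from your OCR engine
boxes = [[80, 40, 220, 70], [60, 500, 150, 530], [400, 500, 520, 530]]  # normalized to 0-1000
inputs = processor(image, words, boxes=boxes, return_tensors="pt")
logits = model(**inputs).logits                         # one prediction per token
print(logits.shape)
```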

Toolkits

s2s-ft: sequence-to-sequence fine-tuning toolkit

Aggressive Decoding (NEW): lossless and efficient sequence-to-sequence decoding algorithm

Applications

TrOCR: transformer-based OCR w/ pre-trained models

LayoutReader: pre-training of text and layout for reading order detection

XLM-T: multilingual NMT w/ pretrained cross-lingual encoders
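
To show how the TrOCR checkpoints above are typically used, here is a minimal sketch that transcribes a single text-line image with Hugging Face transformers. The `microsoft/trocr-base-handwritten` model id is one of the released checkpoints; the input file name is hypothetical.

```python
# pip install torch transformers pillow  -- a minimal sketch using a released TrOCR checkpoint
from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-handwritten")

image = Image.open("text_line.png").convert("RGB")  # a cropped single-line image (hypothetical file)
pixel_values = processor(images=image, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```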

Links

LLMOps (repo)

General technology for enabling AI capabilities w/ LLMs and MLLMs.

News

License

This project is licensed under the license found in the LICENSE file in the root directory of this source tree. Portions of the source code are based on the transformers project.

Microsoft Open Source Code of Conduct

Contact Information

For help or issues using the pre-trained models, please submit a GitHub issue.

For other communications, please contact Furu Wei (fuwei@microsoft.com).