
Multilingual Models #70

Open manisnesan opened 6 months ago

manisnesan commented 6 months ago

https://www.sarvam.ai/blog/announcing-openhathi-series - Bilingual LLMs frugally

The OpenHathi series of work at Sarvam AI aims to contribute to the ecosystem with open models and datasets to encourage innovation in Indian language AI. It is a partnership with our academic partners at AI4Bharat, who have contributed language resources and benchmarks. We encourage people to innovate on top of this release by building fine-tuned models for different use-cases. Sarvam AI will additionally release enterprise-grade models on its full-stack GenAI platform, which will launch soon.

Here are the notes from Sarvam's OpenHathi Series launch. For people unfamiliar: Sarvam is an Indian startup focused on training foundational LLMs for Indian languages. They launched the OpenHathi series of models yesterday. OpenHathi is an attempt to add support for a new language to an existing open model such as Llama 2 or Mistral.

  1. Here is a detailed technical blog on how the base model and the finetuned models were trained: https://www.sarvam.ai/blog/announcing-openhathi-series. The key advantages include:
     - Efficient tokenizer compared to the GPT and Llama tokenizers: nearly a 3x to 4x reduction in token count, hence extremely low inference cost! (See the tokenizer sketch after this list.)
     - Pretrained on translation instead of vanilla continual pretraining: the base model has been pretrained on two translation tasks instead of vanilla CPT.
     - Open weights for the base pretrained model: https://huggingface.co/sarvamai/OpenHathi-7B-Hi-v0.1-Base. This is a base model and not meant to be used as is; please first finetune it on the task(s) you are interested in.
     - Supervised finetuning: the base model has been supervised-finetuned on translation, toxicity classification, text simplification, write-in-English-then-in-Hindi, etc.
     - Finetuned models on Kissan and Koo datasets.
     - Trained on Romanised Hindi, not just pure Devanagari Hindi.
     - Proprietary finetuned models outperform GPT-3.5 and GPT-4 on a large number of tasks.
  2. Here is a demo of the proprietary finetuned model by Prof. Pratyush Kumar: https://www.youtube.com/watch?v=WKfVzJSDAd8
     - Generation with an efficient tokenizer
     - Think-in-English, answer-in-Hindi
     - Cross-lingual RAG
     - Simplified translation
     - Romanized text translation
     1. Dhenu 1.0, the first LLM for farmers, built in collaboration between Kissan AI and Sarvam AI: https://www.youtube.com/watch?v=Z-hXubdVTQ0. Dr. Pratik Desai demos its powerful capabilities.
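
To make the tokenizer-efficiency claim concrete, here is a minimal sketch (assuming the Hugging Face `transformers` library) that counts tokens for a sample Hindi sentence under both tokenizers. The OpenHathi model ID comes from the link above; the sample sentence and the choice of the Llama 2 checkpoint as the baseline are my own illustrative assumptions.

```python
# Minimal sketch: compare token counts on a Hindi sentence.
# Model IDs are from the links above; the sentence is an illustrative example.
from transformers import AutoTokenizer

hindi_text = "भारत एक विशाल देश है और यहाँ कई भाषाएँ बोली जाती हैं।"

openhathi_tok = AutoTokenizer.from_pretrained("sarvamai/OpenHathi-7B-Hi-v0.1-Base")
llama_tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")  # gated repo; requires HF access

n_openhathi = len(openhathi_tok.encode(hindi_text))
n_llama = len(llama_tok.encode(hindi_text))

print(f"OpenHathi tokens: {n_openhathi}")
print(f"Llama 2 tokens:   {n_llama}")
print(f"Reduction factor: {n_llama / n_openhathi:.1f}x")  # blog reports ~3x-4x on Hindi
```

Fewer tokens per Hindi sentence means fewer forward passes at generation time, which is where the inference-cost saving comes from.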

Amazing work, Prof. Pratyush Kumar and the Sarvam team. This is the ChatGPT moment for Bharat!

manisnesan commented 6 months ago

JaColBERT and Hard Negatives, Towards Better Japanese-First Embeddings for Retrieval: Early Technical Report

JaColBERT vastly outperforms all previous monolingual retrieval approaches and competes with the best multilingual methods, despite unfavourable evaluation settings (out-of-domain for JaColBERT vs. in-domain for the multilingual models). JaColBERT reaches an average Recall@10 of 0.813, noticeably ahead of the previous best-performing monolingual model (0.716) and only slightly behind multilingual-e5-base (0.820).
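
For readers unfamiliar with the metric quoted above, here is a minimal sketch of how Recall@10 is commonly computed: the fraction of queries for which at least one relevant document appears in the top 10 retrieved results. The toy data is purely illustrative and not from the JaColBERT evaluation.

```python
# Minimal sketch of Recall@k: fraction of queries with at least one
# relevant (gold) document among the top-k retrieved results.
def recall_at_k(retrieved: list[list[str]], gold: list[set[str]], k: int = 10) -> float:
    hits = sum(
        1
        for ranked, relevant in zip(retrieved, gold)
        if any(doc in relevant for doc in ranked[:k])
    )
    return hits / len(retrieved)

# Toy example: two queries, ranked doc IDs per query, gold relevant IDs.
retrieved = [["d3", "d7", "d1"], ["d9", "d2", "d5"]]
gold = [{"d1"}, {"d4"}]
print(recall_at_k(retrieved, gold, k=10))  # 0.5 (only the first query hits)
```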