Llama3.1 learns to Listen
# :strawberry: Ichigo: Local real-time voice AI (Formerly llama3-s).

Homebrewed early-fusion speech model

[!NOTE]
Update: September 30, 2024

  • We have rebranded from llama3-s to :strawberry: Ichigo.
  • Our custom-built early-fusion speech model now has a name and a voice.
  • It has improved multi-turn capabilities and can now refuse to process inaudible queries.

[!WARNING]
:strawberry: Ichigo is an open research experiment

  • Join us in the #research channel in Homebrew's Discord
  • We livestream training runs in #research-livestream

About

:strawberry: Ichigo is an open, ongoing research experiment to extend a text-based LLM with native "listening" ability. Think of it as an open-data, open-weight, on-device Siri.

It uses an early fusion technique inspired by Meta's Chameleon paper.
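
To make the early-fusion idea concrete, here is a minimal Python sketch. It is not the project's actual tokenizer. It assumes audio has already been quantized into integer codebook indices, and that each index maps to a new token ID appended after the text vocabulary, so text and sound share one token sequence. The vocabulary size, codebook size, marker tokens, and function names are all hypothetical.

```python
# Illustrative sketch of early fusion: discrete audio codes live in the same
# token sequence as text tokens. All sizes and names here are hypothetical,
# not Ichigo's real values.

TEXT_VOCAB_SIZE = 1000      # hypothetical size of the original text vocabulary
AUDIO_CODEBOOK_SIZE = 512   # hypothetical number of discrete audio codes

# Two marker tokens delimit the audio span, followed by one ID per audio code.
SOUND_START_ID = TEXT_VOCAB_SIZE
SOUND_END_ID = TEXT_VOCAB_SIZE + 1
AUDIO_OFFSET = TEXT_VOCAB_SIZE + 2
FUSED_VOCAB_SIZE = AUDIO_OFFSET + AUDIO_CODEBOOK_SIZE


def audio_codes_to_token_ids(codes):
    """Shift codebook indices into the extended vocabulary range."""
    for code in codes:
        if not 0 <= code < AUDIO_CODEBOOK_SIZE:
            raise ValueError(f"audio code out of range: {code}")
    return [AUDIO_OFFSET + code for code in codes]


def build_fused_sequence(prompt_ids, audio_codes):
    """Concatenate text tokens and a delimited audio span into one sequence."""
    audio_ids = audio_codes_to_token_ids(audio_codes)
    return list(prompt_ids) + [SOUND_START_ID] + audio_ids + [SOUND_END_ID]


if __name__ == "__main__":
    print(build_fused_sequence(prompt_ids=[5, 17, 42], audio_codes=[0, 3, 511]))
```

Because the audio IDs sit above the text vocabulary, a model trained this way only needs a larger embedding table. The rest of the language model is unchanged, which is the appeal of this style of fusion.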

We build and train in public.

Progress

Join Us

:strawberry: Ichigo is an open research project. We're looking for collaborators, and will likely move towards crowdsourcing speech datasets in the future.

Quickstart with Google Colab

Check out this notebook to try our latest model:

Open In Colab

Synthetic Generation

For detailed information on synthetic generation, please refer to the Synthetic Generation Guide.

Organize the input/output directory

  1. First, clone the repo from GitHub:

    git clone --recurse-submodules https://github.com/homebrewltd/llama3-s.git
  2. The folder structure is as follows:

    Ichigo
    ├── HF_Trainer                               # HF training code (deprecated)
    ├── synthetic_data                           # Synthetic data generation pipeline
    │   ├── configs                              # Audio pipeline configs
    │   │   ├── audio_to_audio                   # Parler audio (.wav) to semantic tokens
    │   │   ├── synthetic_generation_config      # TTS semantic tokens
    ├── scripts                                  # Setup scripts for Runpod
    ├── torchtune                                # Submodule: our fork of fsdp with checkpointing
    ├── model_zoo                                # Model checkpoints
    │   ├── LLM
    │   │   ├── Meta-Llama-3-8B-Instruct
    │   │   ├── Meta-Llama-3-70B-Instruct
    ├── demo                                     # Selfhost this demo (vllm)
    ├── inference                                # Google Colab
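
After cloning, a quick sanity check of the layout can catch a missed `--recurse-submodules`. The sketch below is a hypothetical helper, not part of the repo. The top-level directory names come from the listing above. It assumes that `model_zoo` may not exist in a fresh clone because it holds downloaded checkpoints, so a "missing" report for it is not necessarily an error.

```python
from pathlib import Path

# Top-level directories taken from the folder listing above.
EXPECTED_DIRS = [
    "HF_Trainer",
    "synthetic_data",
    "scripts",
    "torchtune",
    "model_zoo",   # assumption: may need to be created before downloading models
    "demo",
    "inference",
]


def find_missing_dirs(repo_root):
    """Return the expected top-level directories that are absent."""
    root = Path(repo_root)
    return [name for name in EXPECTED_DIRS if not (root / name).is_dir()]


def submodule_is_populated(repo_root, name="torchtune"):
    """An empty submodule directory usually means submodules were not fetched."""
    path = Path(repo_root) / name
    return path.is_dir() and any(path.iterdir())


if __name__ == "__main__":
    print("missing:", find_missing_dirs("."))
    print("torchtune populated:", submodule_is_populated("."))
```

If the submodule directory turns out to be empty, `git submodule update --init --recursive` fetches it without re-cloning.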

Training with HF Trainer

  1. Install Dependencies
    python -m venv hf_trainer
    chmod +x scripts/install.sh
    ./scripts/install.sh

    Restart your shell now, then run:

    chmod +x scripts/setup.sh
    ./scripts/setup.sh
    source myenv/bin/activate
  2. Log in to Hugging Face
    huggingface-cli login --token=<token>
  3. Training
    export CUTLASS_PATH="cutlass"
    export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
    accelerate launch --config_file ./accelerate_config.yaml train.py 

Training with Torchtune

  1. Install Package

    python -m venv torchtune
    pip install torch torchvision tensorboard
    cd ./torchtune
    pip install -e .

    You can also download the model using tune:

    tune download homebrewltd/llama3.1-s-whispervq-init --hf-token <token>  --output-dir ../model_zoo/llama3.1-s-whispervq-init --ignore-patterns "original/consolidated*"

    Set up the dataset by changing the HF dataset path and the model name in the following YAML file:

    nano torchtune/recipes/configs/jan-llama3-s/8B_full.yaml
  2. Multi-GPU training (1-8 GPUs supported)

    tune run --nproc_per_node 4 full_finetune_fsdp2 --config recipes/configs/jan-llama3-1-s/8B_full.yaml
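
When launching runs from a script, it can help to build the `tune run` command programmatically and reject GPU counts outside the supported range before anything starts. The sketch below is a hypothetical wrapper, not part of the repo. It reuses the recipe name and config path from the command above and assumes it is run from inside the `torchtune` directory.

```python
import shlex


def build_tune_command(num_gpus, config="recipes/configs/jan-llama3-1-s/8B_full.yaml"):
    """Build the argument list for a multi-GPU full fine-tune.

    The 1-8 range mirrors the limit stated in this README.
    """
    if not 1 <= num_gpus <= 8:
        raise ValueError(f"num_gpus must be between 1 and 8, got {num_gpus}")
    return [
        "tune", "run",
        "--nproc_per_node", str(num_gpus),
        "full_finetune_fsdp2",
        "--config", config,
    ]


if __name__ == "__main__":
    # Print the command instead of executing it, so it can be reviewed first.
    print(shlex.join(build_tune_command(4)))
```

Passing the returned list to `subprocess.run(..., check=True)` would start the run. Printing it first makes it easy to confirm the config path matches the file you edited.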

Demo

Gradio Web UI

We offer code for users to create a web UI demo. Please follow the instructions below:

python -m venv demo
source demo/bin/activate
# First install all required packages
pip install --no-cache-dir -r ./demo/requirements.txt

Then run the command below to launch the Gradio demo locally. For quantized inference, you can add the use-4bit or use-8bit option:

python -m demo.app --host 0.0.0.0 --port 7860 --max-seq-len 1024 
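
Model loading can take a while, so the web UI may not respond immediately after the command returns control to the log output. The sketch below is a hypothetical helper, not part of the repo. It polls a TCP port until something accepts connections, and assumes the demo is reachable on the local machine at the port passed via `--port`.

```python
import socket
import time


def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Return True once host:port accepts a TCP connection, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means a server is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

For example, `wait_for_port("127.0.0.1", 7860)` should return `True` once the demo started with the command above is listening. The server binds to `0.0.0.0`, so the loopback address is enough for a local check.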

You can also host the demo using vLLM for faster inference, but it does not support streaming output:

python -m demo.app_vllm

Alternatively, you can easily try our demo on HuggingFace 🤗

References

@misc{chameleonteam2024chameleonmixedmodalearlyfusionfoundation,
      title={Chameleon: Mixed-Modal Early-Fusion Foundation Models}, 
      author={Chameleon Team},
      year={2024},
      eprint={2405.09818},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      journal={arXiv preprint}
}

@misc{zhang2024adamminiusefewerlearning,
      title={Adam-mini: Use Fewer Learning Rates To Gain More}, 
      author={Yushun Zhang and Congliang Chen and Ziniu Li and Tian Ding and Chenwei Wu and Yinyu Ye and Zhi-Quan Luo and Ruoyu Sun},
      year={2024},
      eprint={2406.16793},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      journal={arXiv preprint}
}

@misc{defossez2022highfi,
      title={High Fidelity Neural Audio Compression},
      author={Alexandre Défossez and Jade Copet and Gabriel Synnaeve and Yossi Adi},
      year={2022},
      eprint={2210.13438},
      archivePrefix={arXiv},
      journal={arXiv preprint}
}

@misc{WhisperSpeech,
      title={WhisperSpeech: An Open Source Text-to-Speech System Built by Inverting Whisper}, 
      author={Collabora and LAION},
      year={2024},
      url={https://github.com/collabora/WhisperSpeech},
      note={GitHub repository}
}

Acknowledgement