homebrewltd / ichigo

Llama3.1 learns to Listen

eng: 8th Oct ichigo.homebrew.ltd Demo #68

Open dan-homebrew opened 1 week ago

dan-homebrew commented 1 week ago

Goal

Questions

Tasklist

Scope

Features

Design

Image

Image

dan-homebrew commented 1 week ago

@louis-jan @urmauur @namchuai @nguyenhoangthuan99 I would like to explore doing the Ichigo demo as a fork of Jan that is run server-side:

Why?

Direction

Scope

For features, see the OG post - in this post, I focus on changes to Jan

Unsure

Out-of-scope

Everything we do for this demo should contribute towards Jan being able to support Voice Mode, though. This should not be a separate repo.

cc @0xSage @tikikun @bachvudinh

0xSage commented 1 week ago

Comment moved to https://github.com/homebrewltd/internal/issues/36

tikikun commented 1 week ago

@louis-jan @urmauur @nguyenhoangthuan99 reference codebase https://github.com/tikikun/public_demo_llama3s

louis-jan commented 1 week ago

Ichigo as an additional orphaned repository

Separation of Concerns

  1. Will we remove any irrelevant parts that could cause side effects? E.g. CI Pipelines that push release artifacts to Jan's S3.

    To me, it should only add more features if it's intended for a short-lived repository. Otherwise, it would make syncing them up or merging them back a nightmare.

  2. It's a server-side demo - going back to the previous Jan web server demo. How does the current architecture work?

    Conversational extension is the only one involved for now. You can disable the others, as the demo works with remote endpoints.

  3. Should it be an extension DB or filesystem?

    To me, multi-user support works better with a relational DB. The filesystem would take more effort. E.g. An auto-generated endpoints DB system would be great. Audio files will be stored in a single location (is that a bad idea?)

  4. How does the content rating system work?

    It would be great to be part of the message object. It is a part of the Message Update endpoint, so no need to introduce a new one?
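To make points 3 and 4 concrete, here is a minimal sketch, assuming a relational store (SQLite here, purely for illustration) in which the rating is just another column on the message row and is written through the same update path as any other field. The table layout, the column names (`user_id`, `thread_id`, `audio_path`, `rating`), and the helper functions are all hypothetical; none of them come from Jan's actual extension API.

```python
import sqlite3

# Hypothetical set of columns the update path is allowed to touch.
UPDATABLE_FIELDS = {"content", "audio_path", "rating"}


def open_db(path=":memory:"):
    """Open the database and create the (assumed) messages table."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS messages (
            id         INTEGER PRIMARY KEY,
            user_id    TEXT NOT NULL,   -- multi-user support: every row has an owner
            thread_id  TEXT NOT NULL,
            content    TEXT NOT NULL,
            audio_path TEXT,            -- audio stays on disk, only the path is stored
            rating     INTEGER          -- NULL until the user rates the message
        )
        """
    )
    return conn


def create_message(conn, user_id, thread_id, content, audio_path=None):
    """Insert a message and return its id."""
    cur = conn.execute(
        "INSERT INTO messages (user_id, thread_id, content, audio_path) "
        "VALUES (?, ?, ?, ?)",
        (user_id, thread_id, content, audio_path),
    )
    conn.commit()
    return cur.lastrowid


def update_message(conn, message_id, **fields):
    """Generic message update; a rating is just one more field."""
    unknown = set(fields) - UPDATABLE_FIELDS
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    if not fields:
        return
    # Column names come from the allow-list above, values are bound parameters.
    assignments = ", ".join(f"{name} = ?" for name in fields)
    conn.execute(
        f"UPDATE messages SET {assignments} WHERE id = ?",
        (*fields.values(), message_id),
    )
    conn.commit()


def get_message(conn, message_id):
    """Return the message as a dict, or None if it does not exist."""
    columns = ("id", "user_id", "thread_id", "content", "audio_path", "rating")
    row = conn.execute(
        f"SELECT {', '.join(columns)} FROM messages WHERE id = ?",
        (message_id,),
    ).fetchone()
    return dict(zip(columns, row)) if row else None
```

With this shape, rating a reply is `update_message(conn, message_id, rating=1)`, so no dedicated rating endpoint is needed.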

Ichigo
image

Draw.io

dan-homebrew commented 1 week ago

@louis-jan I realize I may have miscommunicated by asking for Ichigo to be a fork of Jan.

I would like to clarify my position: Jan should support Ichigo as a model

Direction

Scope

I would like to use Ichigo to drive improvements at Jan and Cortex:

Jan

Cortex

Or:

dan-homebrew commented 1 week ago

Decision: Keep ichigo.homebrew.ltd demo separate

Structure

Reason

Assignees

Key Tasks

0xSage commented 1 week ago

@nguyenhoangthuan99, I'm excited to see the progress you're making on the backend for the demo! I have a few questions to help me understand your design choices:

  1. When you say "Fish is faster," could you give me a rough estimate of the speedup we can expect? I'd love to understand the trade-offs with quality degradation.
  2. I noticed that Ichigo uses Whisper semantic tokens, but Fish is incompatible with a future version of Ichigo that outputs semantic tokens directly. Can you help me understand why this isn't a concern for you?
  3. I was a bit concerned by the demo this morning - the performance of Fish seemed a bit stilted. Are there any plans to improve this aspect, or is it not a priority?

tikikun commented 1 week ago

  1. We can get tangible numbers on this front, but the idea is to get the demo out ASAP.
  2. This shouldn't be an issue unless we want to go to a full sems-to-sems model; right now it's still sems-to-text. Currently it's just replacing a sound model with text input, which is trivial. (This is much, much easier than answering how to do sems-to-sems.)
  3. Chunking needs some optimization. If you chunk the data input to WhisperSpeech, it will still have a similar issue. We need to add some logic around chunk size.
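As an illustration of what "logic around chunk size" could look like, here is a minimal sketch that splits text at sentence boundaries and packs sentences into chunks of at most `max_chars` characters before they are handed to a TTS model. The function name `chunk_text` and the `max_chars` default are hypothetical, and this is not the chunking used in the demo; the right size would have to be tuned against the actual model.

```python
import re


def chunk_text(text, max_chars=200):
    """Split text into chunks of at most max_chars, preferring sentence boundaries.

    A sentence longer than max_chars is split on word boundaries instead.
    A single word longer than max_chars is kept whole.
    """
    # Split after ., ! or ? followed by whitespace, keeping the punctuation.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            # The sentence still fits in the chunk being built.
            current = candidate
            continue
        if current:
            # Close the current chunk before starting a new one.
            chunks.append(current)
            current = ""
        if len(sentence) <= max_chars:
            current = sentence
            continue
        # Oversized sentence: fall back to packing word by word.
        for word in sentence.split():
            candidate = f"{current} {word}".strip()
            if len(candidate) <= max_chars or not current:
                current = candidate
            else:
                chunks.append(current)
                current = word
    if current:
        chunks.append(current)
    return chunks
```

Cutting at sentence boundaries keeps each synthesized segment prosodically complete, which is one plausible way to reduce the stilted output mentioned above. Smaller chunks lower the time to first audio, but they introduce more seams between segments.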

nguyenhoangthuan99 commented 1 week ago

I just finished testing both WhisperSpeech and fish-speech, and here are the results.

I tested with this prompt:

In the realm of advanced technology, the evolution of artificial intelligence stands as a 
monumental achievement. This dynamic field, constantly pushing the boundaries of what 
machines can do, has seen rapid growth and innovation. From deciphering complex data 
patterns to driving cars autonomously, AI's applications are vast and diverse.
| | WhisperSpeech | fish-speech |
|---|---|---|
| VRAM | 9 GB | 2 GB |
| Time spent | 22 s | 4 s |
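For the earlier question about a rough speedup estimate, the ratios implied by these figures can be worked out directly. The snippet below is only arithmetic on the numbers reported above: they come from a single prompt, and the hardware and number of runs are not stated, so treat the result as indicative rather than as a benchmark.

```python
# Figures copied from the comparison above (one prompt, one reported run).
results = {
    "WhisperSpeech": {"vram_gb": 9, "time_s": 22},
    "fish-speech": {"vram_gb": 2, "time_s": 4},
}


def ratio(metric):
    """How many times larger the WhisperSpeech value is than the fish-speech value."""
    return results["WhisperSpeech"][metric] / results["fish-speech"][metric]


speedup = ratio("time_s")      # 22 / 4 = 5.5
vram_ratio = ratio("vram_gb")  # 9 / 2 = 4.5

print(f"fish-speech is {speedup:.1f}x faster and uses {vram_ratio:.1f}x less VRAM")
```

On this one prompt, fish-speech comes out about 5.5x faster with about 4.5x less VRAM. The quality trade-off is not captured by these numbers.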

tikikun commented 4 days ago

https://github.com/user-attachments/assets/b4f5bf1f-4f5f-42ed-8bc4-01ac7e356ff5