dan-homebrew commented 1 week ago

Goal

8th October demo at our event
Intended as a server-side demo: learnings will then be applied in https://github.com/janhq/jan/issues/3488 (Sprint 22-23)

Questions

What data can we collect from a Research Demo that can help us improve our model?
Can we show latency, token speed, etc?
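One way to get latency and token speed is to time the token stream as it is consumed. The sketch below is only an illustration: `StreamStats` and `measure_stream` are hypothetical names, and it assumes the backend exposes the reply as an iterable of token strings, which may not match how the real demo streams output.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, Tuple


@dataclass
class StreamStats:
    """Timing summary for one streamed reply (hypothetical structure)."""
    time_to_first_token_s: Optional[float]  # None if no token was produced
    total_time_s: float
    token_count: int

    @property
    def tokens_per_second(self) -> float:
        # Guard against division by zero for empty or instantaneous streams.
        if self.total_time_s <= 0:
            return 0.0
        return self.token_count / self.total_time_s


def measure_stream(
    tokens: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[List[str], StreamStats]:
    """Consume a token stream and record latency and throughput."""
    start = clock()
    first_token_at: Optional[float] = None
    collected: List[str] = []
    for token in tokens:
        if first_token_at is None:
            # Latency as the user perceives it: time until the first token.
            first_token_at = clock() - start
        collected.append(token)
    total = clock() - start
    return collected, StreamStats(first_token_at, total, len(collected))


if __name__ == "__main__":
    reply, stats = measure_stream(iter(["Hello", ",", " world"]))
    print(stats.token_count, stats.time_to_first_token_s, stats.tokens_per_second)
```

The `clock` parameter is injectable so the measurement can be tested with fixed timestamps; in the demo the numbers could be shown in the UI next to each reply.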

Tasklist

[ ] #70
[ ] Architecture for ichigo.homebrew.ltd @louis-jan @nguyenhoangthuan99 (use this thread)
[ ] epic: Infra for Ichigo and Fish Speech
[ ] epic: Solve for Desktop vs. Mobile microphone or sound capture issue
[ ] epic: Data Collection for thumbs-up, thumbs-down data
[ ] epic: Continuous Voice Detection on client-side
[ ] epic: User can choose voice
[ ] epic: Stats for Latency, token speed etc

Scope

Features

Be very clear with users that this is a Research Demo, and data will be used
ChatUI with transcription
Voice detection on client-side? (fallback: Press-to-talk)
TTS for replies (plus: voice selector): Fish Speech, etc
Data will be used for Research Purposes
Collect Thumbs up/Thumbs down for poor transcription or reply -> Saved and persisted to DB
Each Session should be unique for a user (by cookie?)
Function Calling?

Design

dan-homebrew commented 1 week ago

@louis-jan @urmauur @namchuai @nguyenhoangthuan99 I would like to explore doing the Ichigo demo as a fork of Jan that is run server-side:

Why?

I would like to focus our efforts on improving Jan, rather than supporting an additional orphaned repo
Opportunity: implement a swappable persistence layer (e.g. DB vs. filesystem)
Opportunity: implement "multi-user" (i.e. you don't see others' convos when hosted server side)
Opportunity: implement feedback loops (i.e. thumbs up/thumbs down), which helps us generate datasets

Direction

Ichigo Demo will be a fork of Jan that is run server-side (with hidden UI)
In the future, this will be merged into Jan as "Voice Mode" (look at ChatGPT Mobile for an example)

Scope

For features, see the original post - in this post, I focus on changes to Jan

Jan should have a "Voice Mode" UI, which we will use for the Ichigo demo
TBD: Ichigo demo is Jan with a "thumbs down" or "thumbs up" on threads > messages, to collect an RL-type dataset
We should be able to persist Threads, Messages, and associated audio files
We should have clear APIs for these - e.g. for Thumbs up, Thumbs down, Audio files

Unsure

TBD: Ichigo demo is Jan with a 2nd persistence extension (DB, vs. filesystem). (Dan's note: is this actually necessary?)
TBD: Ichigo demo is Jan with "multi-user" in DB (i.e. use session_id), to ensure Threads are associated with a Session
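A minimal sketch of what "multi-user via session_id" could look like with a relational DB, using Python's built-in `sqlite3`. Everything here is an assumption for illustration: the table layout, the column names, and the helper functions are hypothetical and are not Jan's actual persistence extension. The session ID stands in for the per-user cookie value mentioned under Features.

```python
import secrets
import sqlite3
from typing import List, Optional

# Hypothetical schema: every thread belongs to exactly one session.
SCHEMA = """
CREATE TABLE threads (
    id INTEGER PRIMARY KEY,
    session_id TEXT NOT NULL
);
CREATE TABLE messages (
    id INTEGER PRIMARY KEY,
    thread_id INTEGER NOT NULL REFERENCES threads(id),
    role TEXT NOT NULL,
    content TEXT NOT NULL,
    audio_path TEXT
);
"""


def new_session_id() -> str:
    """Random, unguessable ID that could be stored in a cookie."""
    return secrets.token_urlsafe(16)


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def create_thread(conn: sqlite3.Connection, session_id: str) -> int:
    cur = conn.execute("INSERT INTO threads (session_id) VALUES (?)", (session_id,))
    conn.commit()
    return cur.lastrowid


def list_threads(conn: sqlite3.Connection, session_id: str) -> List[int]:
    """Return only the threads owned by this session."""
    rows = conn.execute(
        "SELECT id FROM threads WHERE session_id = ? ORDER BY id", (session_id,)
    ).fetchall()
    return [row[0] for row in rows]


def add_message(
    conn: sqlite3.Connection,
    session_id: str,
    thread_id: int,
    role: str,
    content: str,
    audio_path: Optional[str] = None,
) -> int:
    """Insert a message, refusing writes to another session's thread."""
    owner = conn.execute(
        "SELECT 1 FROM threads WHERE id = ? AND session_id = ?",
        (thread_id, session_id),
    ).fetchone()
    if owner is None:
        raise PermissionError("thread does not belong to this session")
    cur = conn.execute(
        "INSERT INTO messages (thread_id, role, content, audio_path) "
        "VALUES (?, ?, ?, ?)",
        (thread_id, role, content, audio_path),
    )
    conn.commit()
    return cur.lastrowid
```

Scoping every query by `session_id` is what keeps one visitor from seeing another visitor's conversations when the app is hosted server-side.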

Out-of-scope

Ichigo demo should not need to show past conversations (i.e. refresh page = new thread)
Jan Voice Mode - this is out of scope for Sprint 22 and Sprint 23, as I think Jan has a lot of cross-platform edge cases and it is not feasible for us to rush this to market

Everything we do for this demo should contribute towards Jan being able to support Voice Mode, though - this should not be a separate repo.

cc @0xSage @tikikun @bachvudinh

0xSage commented 1 week ago

Comment moved to https://github.com/homebrewltd/internal/issues/36

tikikun commented 1 week ago

@louis-jan @urmauur @nguyenhoangthuan99 reference codebase https://github.com/tikikun/public_demo_llama3s

louis-jan commented 1 week ago

Ichigo as an additional orphaned repository

Separation of Concerns

Will we remove any irrelevant parts that could cause side effects? E.g. CI pipelines that push release artifacts to Jan's S3.

To me, it should only add more features if it's intended to be a short-lived repository. Otherwise, syncing them up or merging them back would be a nightmare.

It's a server-side demo - going back to the previous Jan web server demo. How does the current architecture work?

The Conversational extension is the only one involved for now. You can disable the others, as the demo works with remote endpoints.

Should it be an extension DB or filesystem?

To me, multi-user support works better with a relational DB. The filesystem would take more effort. E.g. a DB system with auto-generated endpoints would be great. Audio files will be stored in a single location (is that a bad idea?)

How does the content rating system work?

It would be great for the rating to be part of the message object. It is part of the Message Update endpoint, so there is no need to introduce a new one?
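The idea of carrying the rating on the message object and reusing the update path could look roughly like the sketch below. It is an assumption-laden illustration: `Message`, `update_message`, the `"up"`/`"down"` values, and the in-memory `store` are all hypothetical and do not reflect Jan's real message schema or endpoint.

```python
from dataclasses import dataclass, replace
from typing import Dict, Optional

VALID_RATINGS = {"up", "down"}          # hypothetical rating values
UPDATABLE_FIELDS = {"content", "rating"}  # fields the update path may change


@dataclass(frozen=True)
class Message:
    id: str
    thread_id: str
    content: str
    rating: Optional[str] = None  # None means the user has not rated it


def update_message(store: Dict[str, Message], message_id: str, **changes) -> Message:
    """Apply a partial update; the rating travels with any other field."""
    unknown = set(changes) - UPDATABLE_FIELDS
    if unknown:
        raise ValueError(f"cannot update fields: {sorted(unknown)}")
    rating = changes.get("rating")
    if rating is not None and rating not in VALID_RATINGS:
        raise ValueError(f"invalid rating: {rating!r}")
    updated = replace(store[message_id], **changes)
    store[message_id] = updated
    return updated


if __name__ == "__main__":
    store = {"m1": Message(id="m1", thread_id="t1", content="Hi there")}
    print(update_message(store, "m1", rating="down"))
```

With this shape, a thumbs-down is just a message update with `rating="down"`, so no separate feedback endpoint is needed; clearing a rating is an update with `rating=None`.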

Ichigo

Draw.io

dan-homebrew commented 1 week ago

@louis-jan I realize I may have miscommunicated by asking for Ichigo to be a fork of Jan.

I would like to clarify my position: Jan should support Ichigo as a model

Direction

Jan should support Ichigo as a model
Jan (web, containerized) should be used to run https://ichigo.homebrew.ltd
We can have a special UI route, to hide Jan's desktop UI and instead have a simple UI for Ichigo

Scope

I would like to use Ichigo to drive improvements at Jan and Cortex:

Jan

Jan should support DB option for persistence (in addition to the FS we already have)
Jan should be able to have Threads that are linked to Users or Sessions (i.e. enable multi-user Jan in the future)

Cortex

Cortex should support /audio/completions
Cortex should support /audio/speech

Or:

Jan supports vLLM
Jan supports /audio/encoding (through WhisperSpeech)

dan-homebrew commented 1 week ago

Decision: Keep ichigo.homebrew.ltd demo separate

Structure

Python backend (vLLM + WhisperSpeech)
Simple UI
No persistence (outside of saving voice snippets)
No "feedback"
No scaling

Reason

Jan and Cortex need more time
Need Python strategy for Jan + Cortex

Assignees

@urmauur UI
@nguyenhoangthuan99 backend
@hiento09 infra
Research

Key Tasks

Debugging Chrome microphone issues (majority of demos will be on web)

0xSage commented 1 week ago

@nguyenhoangthuan99, I'm excited to see the progress you're making on the backend for the demo! I have a few questions to help me understand your design choices:

When you say "Fish is faster," could you give me a rough estimate of the speedup we can expect? I'd love to understand the trade-offs with quality degradation.
I noticed that Ichigo uses Whisper semantic tokens, but Fish is incompatible with a future version of Ichigo that outputs semantic tokens directly. Can you help me understand why this isn't a concern for you?
I was a bit concerned by the demo this morning - the performance of Fish seemed a bit stilted. Are there any plans to improve this aspect, or is it not a priority?

tikikun commented 1 week ago

We can get tangible numbers on this front, but the idea is to get the demo out ASAP.
This shouldn't be an issue unless we want to go to a full sems-to-sems model; right now it's still sems-to-text. Currently it's just replacing a sound model that takes text input, which is trivial. (This is much, much easier than answering how to do sems-to-sems.)
Chunking needs some optimization; if you chunk the input to WhisperSpeech, it will still have a similar issue. We need to add some logic around chunk size.
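As a rough illustration of "logic around chunk size", the sketch below splits reply text on sentence boundaries and packs sentences into chunks under a character budget before they are handed to a TTS model. This is an assumption about what the chunking could look like, not what the demo does: `chunk_text` and the `max_chars` default are hypothetical, and the right budget would have to be tuned per model.

```python
import re
from typing import List


def chunk_text(text: str, max_chars: int = 200) -> List[str]:
    """Group whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept whole rather than cut
    mid-sentence, since mid-sentence cuts tend to produce stilted audio.
    """
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars or not current:
            # Still within budget, or the chunk is empty and must take the sentence.
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    for chunk in chunk_text("One. Two. Three.", max_chars=9):
        print(repr(chunk))
```

Smaller chunks reduce the time to first audio but give the TTS model less context per call, so the budget is a latency versus prosody trade-off.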

nguyenhoangthuan99 commented 1 week ago

I just finished testing both WhisperSpeech and Fish Speech, and here are the results

I tested with this prompt

In the realm of advanced technology, the evolution of artificial intelligence stands as a 
monumental achievement. This dynamic field, constantly pushing the boundaries of what 
machines can do, has seen rapid growth and innovation. From deciphering complex data 
patterns to driving cars autonomously, AI's applications are vast and diverse.

Metric	WhisperSpeech	Fish Speech
VRAM	9 GB	2 GB
Time spent	22 s	4 s
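For the earlier question about a rough speedup estimate, the ratios follow directly from the two rows above. This is plain arithmetic on a single prompt and a single run, so it should be read as indicative only; the variable names are just for illustration.

```python
# Figures copied from the table above (one prompt, one run).
whisper_speech = {"vram_gb": 9, "time_s": 22}
fish_speech = {"vram_gb": 2, "time_s": 4}

# How many times faster Fish Speech was on this prompt.
speedup = whisper_speech["time_s"] / fish_speech["time_s"]

# How many times less VRAM Fish Speech used.
vram_ratio = whisper_speech["vram_gb"] / fish_speech["vram_gb"]

print(f"Speedup: {speedup:.1f}x")        # Speedup: 5.5x
print(f"VRAM ratio: {vram_ratio:.1f}x")  # VRAM ratio: 4.5x
```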

tikikun commented 4 days ago

https://github.com/user-attachments/assets/b4f5bf1f-4f5f-42ed-8bc4-01ac7e356ff5

homebrewltd / ichigo

eng: 8th Oct ichigo.homebrew.ltd Demo #68
