hahuyhoang411 opened this issue 6 days ago
Structure:
- Title
- Abstract
1. Introduction
   1. Multimodal models
   2. Early fusion
   3. Latency
2. Related Work
3. Model Architecture
   1. Architecture: TypeD
   2. Tokenization: WhisperVQ
4. Pre-training
   1. Data Source: Multilingual (7 languages)
   2. Training Technique: Stabilizing training
   3. Training Stages and Hyper-parameters
5. Post-training
   1. Instruction data
      1. Data Format
   2. Data Mixture: Tackling catastrophic forgetting and recovering knowledge
   3. Training Stages and Hyper-parameters
6. Evaluation
   1. Text benchmarks
   2. Audio benchmarks
7. Conclusion

Appendix:
1. Inference
2. Failed experiments
Key points:
1. State the key points
2. Takeaways?
Task 1: Paper reference table. Description: gather related papers for reference: https://www.notion.so/jan-ai/748e90f9a29a4a49b49cc07ebf4bc03a?v=d245885fa1c24492837cd7b6439709e6
Task 2: Introduction: https://www.notion.so/jan-ai/Ichigo-Paper-0f9b351e9bfe4517be816bbf4c4d6cbd
Goal
Release an academic paper describing our effort to train the sound modality.
Description
We are publishing this paper to claim our results and secure a better position in the research community. We are one of the very first teams to build a sound model using Tokenized Early Fusion.