jungwoo-ha / WeeklyArxivTalk

[Zoom & Facebook Live] Weekly AI Arxiv Season 2

[20230312] Weekly AI ArXiv Banter Season 2 - Episode 9 #75

Open scene-the-ella opened 1 year ago

jungwoo-ha commented 1 year ago

News

ArXiv

Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models

Scaling up GANs for Text-to-Image Synthesis

PaLM-E: An Embodied Multimodal Language Model

gyunggyung commented 1 year ago

A quick taste of some light news

  1. News: recent papers showing GPT-4-like capabilities
  2. NEXT AI (Yann André LeCun argues that simply scaling up models the way LLMs do will not produce AGI. He says AGI requires models that resemble the human brain: components that hold knowledge about the world, produce instinctive reactions, carry out deliberate reasoning before responding, and a part that regulates all of these. That said, I don't agree with his claims 100%!)

LLaMA

Sharing the latest news

LLM on a MacBook

https://github.com/gyunggyung/KoChatLLaMA.cpp
https://www.facebook.com/groups/1272877526915876/permalink/1277329939803968/

llama.cpp: Inference of Facebook's LLaMA model in pure C/C++


Description

The main goal is to run the model using 4-bit quantization on a MacBook.
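For intuition on what 4-bit quantization means here: weights are stored in small blocks, each holding 4-bit integers plus one shared floating-point scale. Below is a minimal numpy sketch of a q4_0-style scheme; the block size, rounding, and storage layout of the actual ggml format differ in detail, and all names are illustrative.

import numpy as np

def quantize_q4_0(weights, block_size=32):
    """Quantize a flat float array to 4-bit ints with one scale per block."""
    blocks = weights.reshape(-1, block_size)
    # pick a per-block scale so the largest-magnitude weight lands near the 4-bit range limit
    amax = np.abs(blocks).max(axis=1, keepdims=True)
    scale = amax / 7.0 + 1e-12
    q = np.clip(np.round(blocks / scale), -8, 7).astype(np.int8)  # 16 levels -> 4 bits
    return q, scale

def dequantize_q4_0(q, scale):
    return (q.astype(np.float32) * scale).reshape(-1)

w = np.random.randn(4096).astype(np.float32)
q, s = quantize_q4_0(w)
print("max abs error:", np.abs(w - dequantize_q4_0(q, s)).max())

Each 32-weight block then costs 32 * 4 bits plus one scale instead of 32 * 16 bits, which is roughly what shrinks the FP16 7B checkpoint to the ~4 GB file loaded in the run below.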


Here is a typical run using LLaMA-7B:

make -j && ./main -m ./models/7B/ggml-model-q4_0.bin -p "Building a website can be done in 10 simple steps:" -t 8 -n 512
I llama.cpp build info:
I UNAME_S:  Darwin
I UNAME_P:  arm
I UNAME_M:  arm64
I CFLAGS:   -I.              -O3 -DNDEBUG -std=c11   -fPIC -pthread -DGGML_USE_ACCELERATE
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -pthread
I LDFLAGS:   -framework Accelerate
I CC:       Apple clang version 14.0.0 (clang-1400.0.29.202)
I CXX:      Apple clang version 14.0.0 (clang-1400.0.29.202)

make: Nothing to be done for `default'.
main: seed = 1678486056
llama_model_load: loading model from './models/7B/ggml-model-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx   = 512
llama_model_load: n_embd  = 4096
llama_model_load: n_mult  = 256
llama_model_load: n_head  = 32
llama_model_load: n_layer = 32
llama_model_load: n_rot   = 128
llama_model_load: f16     = 2
llama_model_load: n_ff    = 11008
llama_model_load: ggml ctx size = 4529.34 MB
llama_model_load: memory_size =   512.00 MB, n_mem = 16384
llama_model_load: .................................... done
llama_model_load: model size =  4017.27 MB / num tensors = 291

main: prompt: 'Building a website can be done in 10 simple steps:'
main: number of tokens in prompt = 15
     1 -> ''
  8893 -> 'Build'
   292 -> 'ing'
   263 -> ' a'
  4700 -> ' website'
   508 -> ' can'
   367 -> ' be'
  2309 -> ' done'
   297 -> ' in'
 29871 -> ' '
 29896 -> '1'
 29900 -> '0'
  2560 -> ' simple'
  6576 -> ' steps'
 29901 -> ':'

sampling parameters: temp = 0.800000, top_k = 40, top_p = 0.950000

Building a website can be done in 10 simple steps:
1) Select a domain name and web hosting plan
2) Complete a sitemap
3) List your products
4) Write product descriptions
5) Create a user account
6) Build the template
7) Start building the website
8) Advertise the website
9) Provide email support
10) Submit the website to search engines
A website is a collection of web pages that are formatted with HTML. HTML is the code that defines what the website looks like and how it behaves.
The HTML code is formatted into a template or a format. Once this is done, it is displayed on the user's browser.
The web pages are stored in a web server. The web server is also called a host. When the website is accessed, it is retrieved from the server and displayed on the user's computer.
A website is known as a website when it is hosted. This means that it is displayed on a host. The host is usually a web server.
A website can be displayed on different browsers. The browsers are basically the software that renders the website on the user's screen.
A website can also be viewed on different devices such as desktops, tablets and smartphones.
Hence, to have a website displayed on a browser, the website must be hosted.
A domain name is an address of a website. It is the name of the website.
The website is known as a website when it is hosted. This means that it is displayed on a host. The host is usually a web server.
A website can be displayed on different browsers. The browsers are basically the software that renders the website on the user’s screen.
A website can also be viewed on different devices such as desktops, tablets and smartphones. Hence, to have a website displayed on a browser, the website must be hosted.
A domain name is an address of a website. It is the name of the website.
A website is an address of a website. It is a collection of web pages that are formatted with HTML. HTML is the code that defines what the website looks like and how it behaves.
The HTML code is formatted into a template or a format. Once this is done, it is displayed on the user’s browser.
A website is known as a website when it is hosted

main: mem per token = 14434244 bytes
main:     load time =  1332.48 ms
main:   sample time =  1081.40 ms
main:  predict time = 31378.77 ms / 61.41 ms per token
main:    total time = 34036.74 ms

And here is another demo of running both LLaMA-7B and whisper.cpp on a single M1 Pro MacBook:

Usage

Here are the steps for the LLaMA-7B model:

# build this repo
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make

# obtain the original LLaMA model weights and place them in ./models
ls ./models
65B 30B 13B 7B tokenizer_checklist.chk tokenizer.model

# install Python dependencies
python3 -m pip install torch numpy sentencepiece

# convert the 7B model to ggml FP16 format
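# (the trailing 1 selects 16-bit float output; 0 would keep float32)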
python3 convert-pth-to-ggml.py models/7B/ 1

# quantize the model to 4-bits
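# (the trailing 2 selects the q4_0 quantization type)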
./quantize ./models/7B/ggml-model-f16.bin ./models/7B/ggml-model-q4_0.bin 2

# run the inference
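# (-m: model path, -t: number of threads, -n: number of tokens to predict)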
./main -m ./models/7B/ggml-model-q4_0.bin -t 8 -n 128

Limitations of LLaMA and directions for improvement

Coming soon.

Google USM

Our encoder covers 300+ languages through pre-training. We demonstrate the effectiveness of the pre-trained encoder by fine-tuning on multilingual speech data from YouTube Captions. The supervised YouTube data covers 73 languages, with on average less than 3,000 hours of data per language. Despite this limited supervised data, the model achieves an average word error rate (WER; lower is better) of under 30% across the 73 languages, a milestone we had never reached before. For en-US, USM has a 6% relatively lower WER than the current internal state-of-the-art model. Finally, we compare against the recently released large model Whisper (large-v2), which was trained on more than 400k hours of labeled data. For this comparison we use only the 18 languages that Whisper can decode successfully with lower than 40% WER. On these 18 languages, our model has on average a 32.7% relatively lower WER than Whisper.

-- USM supports all 73 languages in the YouTube Captions' Test Set and outperforms Whisper on the languages it can support with lower than 40% WER. Lower WER is better.
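For context on the metric: WER counts the word-level substitutions, deletions, and insertions needed to turn a hypothesis into the reference transcript, divided by the number of reference words. A minimal sketch (illustrative only, not the evaluation code used in the paper):

def wer(reference, hypothesis):
    """Word error rate = (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # word-level Levenshtein distance via dynamic programming
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("building a website can be done", "building the website can be done"))  # 1 substitution in 6 words ~= 0.17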

Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages

Yu Zhang, Wei Han, James Qin, Yongqiang Wang, Ankur Bapna, Zhehuai Chen, Nanxin Chen, Bo Li, Vera Axelrod, Gary Wang, Zhong Meng, Ke Hu, Andrew Rosenberg, Rohit Prabhavalkar, Daniel S. Park, Parisa Haghani, Jason Riesa, Ginger Perng, Hagen Soltau, Trevor Strohman, Bhuvana Ramabhadran, Tara Sainath, Pedro Moreno, Chung-Cheng Chiu, Johan Schalkwyk, Françoise Beaufays, Yonghui Wu

Abstract: We introduce the Universal Speech Model (USM), a single large model that performs automatic speech recognition (ASR) across 100+ languages. This is achieved by pre-training the encoder of the model on a large unlabeled multilingual dataset of 12 million (M) hours spanning over 300 languages, and fine-tuning on a smaller labeled dataset. We use multilingual pre-training with random-projection quantization and speech-text modality matching to achieve state-of-the-art performance on downstream multilingual ASR and speech-to-text translation tasks. We also demonstrate that despite using a labeled training set 1/7-th the size of that used for the Whisper model, our model exhibits comparable or better performance on both in-domain and out-of-domain speech recognition tasks across many languages.

(arXivGPT "default" prompt is used)

The paper "Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages" describes a single large model, the Universal Speech Model (USM), that performs automatic speech recognition (ASR) across 100+ languages by pre-training the encoder of the model on a large unlabeled multilingual dataset of 12 million hours spanning over 300 languages and fine-tuning on a smaller labeled dataset.

Key insights and lessons learned from the paper include:

- Multilingual pre-training with random-projection quantization (sketched below) and speech-text modality matching can achieve state-of-the-art performance on downstream multilingual ASR and speech-to-text translation tasks.
- USM exhibits comparable or better performance on both in-domain and out-of-domain speech recognition tasks across many languages compared to the Whisper model, despite using a labeled training set 1/7th the size of Whisper's training set.
- USM significantly reduces model complexity and inference latency compared to traditional approaches that require multiple language-specific models.
- The paper highlights the importance of a large, diverse multilingual dataset for pre-training and fine-tuning the model, as well as the effectiveness of random-projection quantization and speech-text modality matching.
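The random-projection quantization mentioned in the first insight (the BEST-RQ idea) builds self-supervised targets by projecting speech frames with a frozen random matrix and snapping them to the nearest entry of a frozen random codebook. A rough numpy sketch, with illustrative dimensions rather than the actual USM configuration:

import numpy as np

rng = np.random.default_rng(0)

# Frozen, randomly initialized projection and codebook; neither is ever trained.
feat_dim, proj_dim, codebook_size = 80, 16, 8192
projection = rng.standard_normal((feat_dim, proj_dim))
codebook = rng.standard_normal((codebook_size, proj_dim))
codebook /= np.linalg.norm(codebook, axis=1, keepdims=True)

def rpq_targets(frames):
    """Map speech frames (T, feat_dim) to discrete ids (T,) used as BERT-style prediction targets."""
    z = frames @ projection                        # random projection
    z /= np.linalg.norm(z, axis=1, keepdims=True)  # l2-normalize before the nearest-neighbor lookup
    return np.argmax(z @ codebook.T, axis=1)       # nearest codeword id per frame

frames = rng.standard_normal((100, feat_dim))      # stand-in for log-mel features
print(rpq_targets(frames)[:10])

The encoder is then trained to predict these ids for masked frames, so no learned quantizer is needed during pre-training.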

Three questions to ask the authors:

- How does USM compare to other large-scale multilingual speech recognition models, such as Facebook's wav2vec and wav2vec 2.0 models?
- Have you explored using USM for other speech-related tasks, such as speaker identification or emotion recognition?
- Can USM be extended to handle low-resource languages with limited labeled training data, and if so, what techniques might be effective?

Three suggestions for related topics or future research directions:

- Investigate the transfer learning capabilities of USM for other natural language processing tasks, such as text classification or named entity recognition.
- Explore the impact of additional pre-training tasks on USM's performance, such as masked language modeling or sequence-to-sequence translation.
- Investigate the effectiveness of USM for speech recognition in noisy or adverse acoustic environments.

Relevant references:

- Baevski, A., & Auli, M. (2020). wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations. arXiv preprint arXiv:2006.11477.
- Conneau, A., Khandelwal, K., Goyal, N., Chaudhary, V., Wenzek, G., Guzmán, F., ... & Joulin, A. (2021). Unsupervised Cross-lingual Representation Learning at Scale. arXiv preprint arXiv:2012.15761.
- Ghoshal, A., & Swietojanski, P. (2017). Multi-lingual training of convolutional neural networks for low-resource speech recognition. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 5220-5224). IEEE.
- Hu, B., Chen, Y., Zhang, W., Han, W., & Wu, Y. (2020). Exploring large-scale pretraining for speech recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (pp. 2386-2392).
- Khurana, U., Mahajan, M., Dhingra, B., Carlini, N., & Liu, Y. (2021). Multilingual speech recognition: A survey of recent advances. arXiv preprint arXiv:2103.03247.


MuAViC

Need to verify which is better. API request submitted. Google for most cases plus multilingual, Meta for the outliers?

nick-jhlee commented 1 year ago

(Back after a long while..)

Upcoming Conferences/Deadlines

Papers (emphasis on diffusion models)


jwlee-neubla commented 1 year ago

News

Papers


(ref) (masked autoencoder)

veritas9872 commented 1 year ago

Hyena Hierarchy: Towards Larger Convolutional Language Models

Blog: https://hazyresearch.stanford.edu/blog/2023-03-07-hyena
ArXiv: https://arxiv.org/abs/2302.10866
GitHub: https://github.com/HazyResearch/safari
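Hyena's core operator replaces attention with a recurrence of element-wise gating and long convolutions whose filters are parameterized implicitly (by a small network over positions) and evaluated in O(L log L) with the FFT. A toy numpy sketch of that recurrence; the projections and filters here are random stand-ins for what the real model learns:

import numpy as np

def fft_causal_conv(u, k):
    """Convolve signals u (d, L) with filters k (d, L) via FFT, zero-padded so there is no circular wrap-around."""
    L = u.shape[-1]
    n = 2 * L
    y = np.fft.irfft(np.fft.rfft(u, n=n) * np.fft.rfft(k, n=n), n=n)
    return y[..., :L]

def hyena_operator(v, gates, filters):
    """Order-N Hyena recurrence: z <- x_n * (k_n conv z), starting from z = v."""
    z = v
    for x_n, k_n in zip(gates, filters):
        z = x_n * fft_causal_conv(z, k_n)
    return z

d, L, order = 8, 1024, 2
rng = np.random.default_rng(0)
v = rng.standard_normal((d, L))                                     # value projection of the input
gates = [rng.standard_normal((d, L)) for _ in range(order)]         # x_1..x_N projections
filters = [rng.standard_normal((d, L)) / L for _ in range(order)]   # implicit long filters
print(hyena_operator(v, gates, filters).shape)                      # (8, 1024)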
