jungwoo-ha / WeeklyArxivTalk

[Zoom & Facebook Live] Weekly AI Arxiv 시즌2
973 stars 41 forks

[20230305] Weekly AI ArXiv 만담 시즌2 - 8회차 #74

Open scene-the-ella opened 1 year ago

jungwoo-ha commented 1 year ago

News

2022 2️⃣ ColabFold: making protein folding accessible to all -> (From multiple institutions, 1162 citations) An open-source and efficient protein folding model.

3️⃣ Hierarchical Text-Conditional Image Generation with CLIP Latents -> (From OpenAI, 718 citations) DALL·E 2, complex prompted image generation that left most in awe.

4️⃣ A ConvNet for the 2020s -> (From Meta and UC Berkeley, 690 citations) A successful modernization of CNNs at a time of boom for Transformers in Computer Vision.

5️⃣ PaLM: Scaling Language Modeling with Pathways -> (From Google, 452 citations) Google's mammoth 540B Large Language Model, a new MLOps infrastructure, and how it performs.

2021 1️⃣ Highly accurate protein structure prediction with AlphaFold -> (From DeepMind, 8965 citations) AlphaFold, a breakthrough in protein structure prediction using Deep Learning.

2️⃣ Swin Transformer: Hierarchical Vision Transformer using Shifted Windows -> (From Microsoft, 4810 citations) A robust variant of Transformers for Vision.

3️⃣ Learning Transferable Visual Models From Natural Language Supervision -> (From OpenAI, 3204 citations) CLIP, using image-text pairs at scale to learn joint image-text representations in a self-supervised fashion (a minimal sketch of its contrastive objective follows this list).

4️⃣ On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? -> (From U. Washington, Black in AI, The Aether, 1266 citations) Famous position paper very critical of the trend of ever-growing language models, highlighting their limitations and dangers.

5️⃣ Emerging Properties in Self-Supervised Vision Transformers -> (From Meta, 1219 citations) DINO, showing how self-supervision on images led to the emergence of some sort of proto-object segmentation in Transformers.

2020 1️⃣ An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale -> (From Google, 11914 citations) The first work showing how a plain Transformer could do great in Computer Vision.

2️⃣ Language Models are Few-Shot Learners -> (From OpenAI, 8070 citations) GPT-3; this paper needs no further explanation at this stage.

3️⃣ YOLOv4: Optimal Speed and Accuracy of Object Detection -> (From Academia Sinica, Taiwan, 8014 citations) Robust and fast object detection sells like hotcakes.

4️⃣ Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer -> (From Google, 5906 citations) A rigorous study of transfer learning with Transformers, resulting in the famous T5.

5️⃣ Bootstrap your own latent: A new approach to self-supervised Learning -> (From DeepMind and Imperial College, 2873 citations) Showing that negatives are not even necessary for representation learning.
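
As referenced in the CLIP entry above, the core of that paper is a symmetric contrastive (InfoNCE) objective over matched image-text pairs. Below is a minimal sketch of that loss in PyTorch; the encoders are omitted and random tensors stand in for their outputs, and the fixed temperature is a simplification (CLIP learns it as a parameter).

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of matched image-text pairs.

    image_emb, text_emb: (batch, dim) embeddings from the two encoders
    (the encoders themselves are omitted in this sketch).
    """
    # L2-normalize so the dot product is a cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity matrix; diagonal entries are the true pairs.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions: image -> text and text -> image.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# Toy usage with random embeddings standing in for encoder outputs.
img = torch.randn(8, 512)
txt = torch.randn(8, 512)
print(clip_contrastive_loss(img, txt))
```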

ArXiv

gyunggyung commented 1 year ago

LLaMA news.

Here are some more recent updates on LLaMA, which Facebook shared recently.

  1. LLaMA-7B: the checkpoint has been released. I still don't know how it was done.
  2. llama-up-data: hunkim built a chatbot on top of LLaMA. Since the model is small, though, the quality is so-so....
  3. llama-int8: thanks to careful quantization, it can now run even on a 3090 or 4090 (see the toy quantization sketch after this list). LLaMA INT8 Inference guide
  4. LLaMA question: I asked something I was curious about, but nobody has answered yet....
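
As a rough illustration of why the INT8 quantization in item 3 lets LLaMA fit on a 3090/4090, here is a toy absmax weight-quantization sketch in PyTorch. This is my own illustration of the memory saving, not the scheme llama-int8 actually implements.

```python
import torch

def quantize_int8(weight: torch.Tensor):
    """Per-row absmax quantization of a weight matrix to int8.

    Returns the int8 tensor plus the per-row scales needed to dequantize.
    A toy illustration of the ~4x memory saving vs. fp32 (2x vs. fp16).
    """
    scale = weight.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
    q = torch.clamp((weight / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def dequantize_int8(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Recover an approximate floating-point weight for use in matmuls.
    return q.float() * scale

w = torch.randn(4096, 4096)            # one fp32 weight matrix: ~64 MB
q, scale = quantize_int8(w)            # int8 weights: ~16 MB (+ scales)
err = (w - dequantize_int8(q, scale)).abs().max()
print(q.dtype, err)                    # torch.int8, small rounding error
```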

Image-related models.

  1. Beating OpenAI CLIP with 100x less data and compute: it shows strong performance with 100x less data. It seems general-purpose enough that I expect it to see wide use later on. It even handles Korean well. If you have questions about it, I can pass them along on your behalf.
  2. AI Generated Images Are Getting Too Real | Asmongold Reacts: image generation now looks genuinely natural. LoRA in particular is something I need to study (see the sketch after this list). Sharing a few related results.
  3. AI Art is getting too good! Can YOU Tell the Difference?: try to spot which ones were drawn by AI!

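Since item 2 mentions LoRA, here is a minimal sketch of the core idea: a frozen linear layer plus a trainable low-rank update. The class name and hyperparameters are illustrative only, not taken from any particular Stable Diffusion implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen nn.Linear plus a trainable low-rank update (the LoRA idea).

    Only A and B are trained; the pretrained weight stays fixed, so the
    adapter adds roughly 2 * rank * dim parameters instead of dim**2.
    """
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False            # freeze pretrained weights
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scaling = alpha / rank            # standard LoRA scaling

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen path + low-rank path; B starts at zero, so the adapted
        # layer initially behaves exactly like the original one.
        return self.base(x) + (x @ self.lora_a.t() @ self.lora_b.t()) * self.scaling

layer = LoRALinear(nn.Linear(768, 768), rank=8)
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))  # 2 * 8 * 768
```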
veritas9872 commented 1 year ago

High-resolution image reconstruction with latent diffusion models from human brain activity bioRxiv: https://www.biorxiv.org/content/10.1101/2022.11.18.517004 Website: https://sites.google.com/view/stablediffusion-with-brain/


Sharing a paper that has been a big topic on Twitter since yesterday. The study shows that when an L2-regularized linear model (???) is trained on fMRI brain signals to match Stable Diffusion's image and text latent encodings, images similar to the ones shown to the subject can be reconstructed.

Each subject needs thousands of images, and a given model will presumably only work for one subject, and probably only for one scanner. Still, showing that reconstruction is possible from brain signals without any deep-learning training, using only a pretrained model plus a fitted linear model, should be very influential. That said, I would only trust it once the results have been reproduced.
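
Below is a minimal sketch of the mapping described above: an L2-regularized (ridge) linear regression from flattened fMRI features to diffusion-model latents, with random arrays standing in for the real data. Shapes and variable names are my own placeholders (much smaller than the real ones), and only one of the paper's mappings (image latent or text latent) is sketched.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Toy stand-ins for the real data: flattened fMRI voxel responses per
# trial and the corresponding Stable Diffusion latent vectors.  Real
# dimensions are far larger; these are shrunk so the sketch runs quickly.
rng = np.random.default_rng(0)
n_trials, n_voxels, latent_dim = 1000, 2000, 512
fmri = rng.standard_normal((n_trials, n_voxels)).astype(np.float32)
latents = rng.standard_normal((n_trials, latent_dim)).astype(np.float32)

# The "L2-regularized linear model" from the description: one ridge
# regression mapping brain activity to the latent space.
model = Ridge(alpha=100.0)
model.fit(fmri[:800], latents[:800])

# Predicted latents for held-out trials would then be passed through the
# frozen, pretrained Stable Diffusion decoder to reconstruct images.
pred = model.predict(fmri[800:])
print(pred.shape)  # (200, 512)
```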

Dropout Reduces Underfitting ArXiv: https://arxiv.org/abs/2303.01500 GitHub: https://github.com/facebookresearch/dropout


Consistency Models ArXiv: https://arxiv.org/abs/2303.01469


Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages ArXiv: https://arxiv.org/abs/2303.01037

Full Stack Optimization of Transformer Inference: a Survey ArXiv: https://arxiv.org/abs/2302.14017

Sharing a well-organized survey paper on the hardware and software optimizations, and the open issues, involved in Transformer inference.

ghlee3401 commented 1 year ago

Arxiv

jwlee-neubla commented 1 year ago