[20220619] Weekly AI ArXiv Talk - Episode 55

News

Conferences
- Interspeech 2022: Acceptance notifications are out. Congratulations to everyone, and see you in Songdo!
- CVPR 2022: Finally starts today!!! Please stop by the booths! (6.19 ~ 24)
- NAVER's presentation schedule (17 presentations): https://naver-career.gitbook.io/en/teams/clova-cic/events/clova-ai-lab-cvpr-2022
- ACM FAccT 2022: 6.21~24: https://facctconference.org/2022/
- HyperscaleFAccT CRAFT (6.21): https://naver-career.gitbook.io/en/teams/clova-cic/events/hyperscalefacct-facct-2022
- Tutorial (6.22): Shortcut Learning in Machine Learning: Challenges, Analysis, Solutions (In-person)
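HyperCLOVA is finally open to the public for research use (apply first, think later)
- Eligible: public institutions, research institutes, universities (including graduate schools)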

ArXiv

Disentangling visual and written concepts in CLIP
- Written text and visual concepts are entangled in CLIP; this work studies whether they can be disentangled (from MIT)
- Disentangles them by training separately for "learn to spell" and "forget to spell"
- Compares text-controlled image generation results and image2text retrieval performance using each feature
OmniMAE: Single Model Masked Pretraining on Images and Videos
- From Meta AI: follows Omnivore (CVPR 2022 oral) and uses MAE for joint self-supervised learning on images and videos (MAE + VideoMAE)
- A plain MAE pipeline with no special architectural changes: spatial (+temporal) patches -> mask most of them -> encoder -> decoder -> pixel reconstruction
- Joint training is reported to outperform training on each modality separately.
- Pretraining data: ImageNet-1k and SSv2; downstream: iNaturalist-2018, Places365, K400, EPIC-Kitchens, etc.
- The masking ablations should also be useful for other video and image research
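The patch -> masking step described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the patch sizes, the 90% / 95% mask ratios, and the helper names `num_patches` and `random_mask` are assumptions chosen for the example, and treating an image as a short clip is just one way to share a single patchifier.

```python
import random

def num_patches(frames, height, width, patch_t=2, patch_hw=16):
    """Number of spatio-temporal patches for an input of shape (frames, height, width)."""
    return (frames // patch_t) * (height // patch_hw) * (width // patch_hw)

def random_mask(n_patches, mask_ratio, seed=0):
    """Split patch indices into (visible, masked) sets, MAE style."""
    rng = random.Random(seed)
    order = list(range(n_patches))
    rng.shuffle(order)
    n_keep = int(round(n_patches * (1.0 - mask_ratio)))
    visible = sorted(order[:n_keep])   # only these patches are fed to the encoder
    masked = sorted(order[n_keep:])    # the decoder reconstructs pixels for these
    return visible, masked

# Assumption: an image is handled as a clip of patch_t identical frames,
# so images and videos go through the same patchifier.
image_patches = num_patches(frames=2, height=224, width=224)
video_patches = num_patches(frames=16, height=224, width=224)

img_visible, img_masked = random_mask(image_patches, mask_ratio=0.90)
vid_visible, vid_masked = random_mask(video_patches, mask_ratio=0.95)

print(image_patches, len(img_visible))
print(video_patches, len(vid_visible))
```

With a high mask ratio the encoder only sees a small fraction of the patches, which is why the masking ablations matter so much for compute.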
Beyond Supervised vs. Unsupervised: Representative Benchmarking and Analysis of Image Representation Learning
- With so many SSL methods coming out, proposes a comparison protocol and two new evaluation metrics for assessing their performance properly (CVPR 2022)
- https://mgwillia.github.io/exploring-unsupervised/
BYOL-Explore: Exploration by Bootstrapped Prediction
- Worth a look if you work on RL~~
Characteristics of Harmful Text: Towards Rigorous Benchmarking of Language Models
- For those who research and serve language models.
Language Models are General-Purpose Interfaces
- An LM from MSR trained jointly on image + text + multilingual data (described as a semi-causal LM)
- Available at https://github.com/microsoft/unilm (only a placeholder for now)

Arxiv (Audio and Speech Processing)

A preview of INTERSPEECH 2022
- Hierarchical and Multi-Scale Variational Autoencoder for Diverse and Natural Non-Autoregressive Text-to-Speech
  - NCSOFT / Jae-Sung Bae, Jinhyeok Yang, Tae-Jun Bak, Young-Sun Joo
  - arXiv talk: https://github.com/jungwoo-ha/WeeklyArxivTalk/issues/48#issuecomment-1100804328
- Cross-Speaker Emotion Transfer for Low-Resource Text-to-Speech Using Non-Parallel Voice Conversion with Pitch-Shift Data Augmentation
  - LINE, NAVER / Ryo Terashima (Line), Ryuichi Yamamoto (Line), Eunwoo Song (NAVER CLOVA), Yuma Shirahata (Line), Hyunwook Yoon (NAVER CLOVA), Jae-Min Kim (NAVER CLOVA), Kentaro Tachibana (Line)
  - arXiv talk: https://github.com/jungwoo-ha/WeeklyArxivTalk/issues/49#issuecomment-1107469641
- Adversarial Multi-Task Learning for Disentangling Timbre and Pitch in Singing Voice Synthesis
  - NCSOFT / Tae-Woo Kim, Min-Su Kang, Gyeong-Hoon Lee
  - Waiting for the arXiv version, haha
Automatic Prosody Annotation with Pre-Trained Text-Speech Model
- INTERSPEECH2022 / Tencent AI Lab, Peking Univ. / Prosody annotation
- Code: https://github.com/Daisyqk/Automatic-Prosody-Annotation
- Sample URL: https://daisyqk.github.io/Automatic-Prosody-Annotation_w/ (if you can hear the prosody differences, you must be Chinese!)
- Problem: prosody modeling plays an important role in the naturalness of TTS, but has the following issues
  1. Prior work shows that explicit hierarchical prosodic boundary annotation gives better Mandarin TTS than using no annotation at all, but manual annotation is time-consuming and expensive
  2. Manual annotation is good, but annotations can be ambiguous or inconsistent across annotators
- Structure of Mandarin speech and the TTS training data collection pipeline
  - Mandarin speech is divided into five levels: Character (CC), Lexicon Word (LW), Prosodic Word (PW), Prosodic Phrase (PPH), Intonational Phrase (IPH) (what if Korean prosody were structured the same way?)
- Method: the paper proposes an automatic prosody annotator that takes audio and text as input
  - The Audio Encoder consists of a PPG (phonetic posteriorgram) extractor and a character-based encoder
    - PPG extractor: a phoneme-based PPG model that indicates which phoneme each frame corresponds to
    - Character-based encoder: uses a CNN- and Conformer-based architecture to extract local and global context
    - This is needed because there are cases where the phoneme sequence is identical but the character sequence differs
      - "大学生物，必修课" and "大学生，务必修课"
  - Text Encoder: uses a pre-trained Chinese BERT model
  - Multi-modal fusion decoder: predicts by taking cross-attention between the text representation and the audio representation
- Dataset and TTS model
  - Text Encoder : 300GB news corpus
  - Pre-trained Audio encoder : 10k hour Wenet Speech dataset
  - DurIAN TTS & HiFi-GAN vocoder (how were the prosodic boundaries used??)
- Evaluation
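  - BERT: text encoder only / human: human annotation (7 annotators) / CNN-Char: CNN-based audio encoder / Conformer-Char: Conformer-based audio encoder / Conformer-PPG: PPG-based audio encoder
  - The results can be read in several ways, but overall using audio together with text beats text alone, and models 7 and 9 in particular do better than the human annotators
  - In the A/B test, automatic annotation was preferred slightly more often, at 51%.
  - MOS was also slightly higher.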
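The cross-attention step in the multi-modal fusion decoder can be illustrated with a plain scaled dot-product attention. This is only a sketch under assumptions: the notes do not say which side provides the queries, so here the text representation queries the audio representation, and the toy 2-dimensional vectors and the function name `cross_attention` are made up for the example.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query attends over all keys."""
    d = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        # Weighted sum of value vectors, one output vector per query.
        out = [sum(wj * v[c] for wj, v in zip(w, values)) for c in range(len(values[0]))]
        outputs.append(out)
        weights.append(w)
    return outputs, weights

# Toy inputs (assumed): 2 characters from the text encoder, 3 frames from the audio encoder.
text_repr = [[1.0, 0.0], [0.0, 1.0]]
audio_repr = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]

fused, attn = cross_attention(text_repr, audio_repr, audio_repr)
print(attn[0])   # how strongly character 0 attends to each audio frame
```

Each character ends up with an audio-aware vector, which a classifier head could then map to a prosodic boundary label.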
Interesting research
- GoodBye WaveNet - A Language Model for Raw Audio with Context of 1/2 Million Samples
  - Stanford Univ. / Technical report / Ongoing Work
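  - Haven't plenty of MelGAN-style vocoders already been reported to beat WaveNet?? -> the paper first compares against methods evaluated by negative log-likelihood, focusing on how well the model predicts the next step
  - Proposes an architecture that autoregressively models audio waveforms with a large context of more than 500,000 samples
  - The key point is the large context! (Really? Couldn't you just chunk the audio??) -> the data is a piano dataset, so seeing a large context matters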
- To Dereverb Or Not to Dereverb? Perceptual Studies On Real-Time Dereverberation Targets
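  - A paper pointing out that choosing the training target appropriately matters in supervised speech enhancement
  - Dereverberation does make speech sound better, but unfortunately the proposed method did not differ much from the other methods...
  - Recommended reading for anyone interested in dereverberation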
- Multi-instrument Music Synthesis with Spectrogram Diffusion
  - Google Research / Music synthesis
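  - Sample URL: https://storage.googleapis.com/music-synthesis-with-spectrogram-diffusion/index.html
  - Proposes a neural synthesizer that generates audio in real time from MIDI sequences with arbitrary combinations of instruments
  - Training on a dataset with transcriptions for a wide range of instruments enables MIDI note-level control of a specific instrument
  - A T5 (T5.1.1)-based encoder-decoder generates mel-spectrograms from MIDI, and a customized MelGAN produces the waveform
  - The decoder is instantiated as either an autoregressive model or a DDPM in the experiments
  - "small" is a small diffusion model, "base" is a larger diffusion model, and "context" is the mel-spectrogram
  - The experiments cut note sequences into short segments; a collaboration with the GoodBye WaveNet paper might work well??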
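For the GoodBye WaveNet item above, the next-step negative log-likelihood comparison can be sketched as follows. This is a toy illustration, not the report's evaluation code: the 4-level quantization, the hand-written probability tables, and the function name `mean_nll` are assumptions made for the example.

```python
import math

def mean_nll(predicted_probs, targets):
    """Average negative log-likelihood (in nats) assigned to the true next sample."""
    total = 0.0
    for probs, target in zip(predicted_probs, targets):
        total += -math.log(probs[target])
    return total / len(targets)

# Assumed toy setup: audio quantized to 4 levels, 3 prediction steps.
targets = [2, 0, 3]

# A model with no information spreads probability uniformly over the levels.
uniform_model = [[0.25, 0.25, 0.25, 0.25] for _ in targets]

# A better model puts most of its mass on the sample that actually comes next.
confident_model = [
    [0.05, 0.05, 0.85, 0.05],
    [0.70, 0.10, 0.10, 0.10],
    [0.10, 0.10, 0.20, 0.60],
]

nll_uniform = mean_nll(uniform_model, targets)
nll_confident = mean_nll(confident_model, targets)
print(nll_uniform, nll_confident)   # lower is better
```

A lower value means the model predicts the next sample better, which is the axis of comparison the report emphasizes rather than perceptual vocoder quality.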

Understanding the Generalization Benefit of Normalization Layers: Sharpness Reduction

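This paper gives a mathematical analysis, along with an interpretation of experimental results, of how applying normalization layers such as BatchNorm and LayerNorm together with weight decay affects neural network training. Normalization makes the network scale-invariant with respect to its weights, but the authors argue that weight decay still affects the magnitude of the gradient and therefore contributes to training stability.

In addition, to define sharpness more systematically, they use the Hessian of the weight matrix projected onto the unit sphere as the reference, in order to address the norm magnification issue caused by normalization.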
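The scale-invariance argument can be checked numerically on a toy model. The sketch below is an illustration under assumptions, not the paper's setup: a single linear unit followed by batch normalization without epsilon, a squared-error loss, made-up data, and finite-difference gradients. It shows that scaling the weights by c leaves the loss unchanged while the gradient shrinks by 1/c, which is why weight decay (which shrinks the weight norm) still changes the effective step size.

```python
import math

def batch_normalize(z):
    """Normalize pre-activations over the batch (no epsilon, so invariance is exact)."""
    mean = sum(z) / len(z)
    std = math.sqrt(sum((v - mean) ** 2 for v in z) / len(z))
    return [(v - mean) / std for v in z]

def loss(w, xs, ys):
    """Squared error after a linear unit followed by batch normalization."""
    z = [sum(wi * xi for wi, xi in zip(w, x)) for x in xs]
    z_hat = batch_normalize(z)
    return sum((a - b) ** 2 for a, b in zip(z_hat, ys)) / len(ys)

def numerical_grad(w, xs, ys, h=1e-6):
    """Central finite-difference gradient of the loss with respect to w."""
    grad = []
    for i in range(len(w)):
        up, down = list(w), list(w)
        up[i] += h
        down[i] -= h
        grad.append((loss(up, xs, ys) - loss(down, xs, ys)) / (2 * h))
    return grad

# Made-up data for the illustration.
xs = [[1.0, 2.0], [0.5, -1.0], [-1.5, 0.3], [2.0, 1.0]]
ys = [1.0, -1.0, 0.5, -0.5]
w = [0.7, -0.2]
c = 3.0
w_scaled = [c * wi for wi in w]

g = numerical_grad(w, xs, ys)
g_scaled = numerical_grad(w_scaled, xs, ys)

print(loss(w, xs, ys), loss(w_scaled, xs, ys))   # same loss
print(g, [c * v for v in g_scaled])              # gradient scales as 1/c
print(sum(wi * gi for wi, gi in zip(w, g)))      # gradient is orthogonal to w
```

Because the gradient is orthogonal to the weights, plain gradient steps never decrease the weight norm in this setting; weight decay is what pulls the norm back and keeps the effective learning rate from decaying.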

Speak Like a Dog: Human to Non-human creature Voice Conversion

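Arxiv: https://arxiv.org/abs/2206.04780
GitHub: https://github.com/suzuki256/dog-dataset

A study has been released on converting human voices so that they take on the characteristics of non-human sounds. Animation and fantasy films require many rounds of adaptation to produce the voices of beasts and monsters, and the Speak Like a Dog model should make it easier to generate non-human voices than existing methods.

The paper applies StarGAN to mel-spectrogram images to convert between human speech and dog sounds. It also discusses task-specific issues, such as accounting for differences in duration between human speech and dog sounds.

Generated audio samples are shared below for reference. They are in Japanese and the quality is not very high, but I think this is a research area with very high market potential.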

Samples: https://drive.google.com/drive/folders/1aQ5o0Ond50nbAvZsp_me4b97j8VtLYbz

News
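- The Google LaMDA controversy
  - LaMDA (Language Model for Dialogue Applications) is a 137B-parameter model developed by Google for conversational services
  - Blake Lemoine, a researcher at Google Responsible AI, claimed that the conversational model LaMDA has cognition and sentience
  - His interview with The Washington Post on the matter
  - The claim that LaMDA has some kind of human-like (?) cognitive ability rests on LaMDA's conversational ability
  - Opinions opposing the claim that LLMs have cognitive abilities: Professor Andrew Ng's newsletter, Melanie Mitchell's press interview
  - The gist: with no scientific measure of AI cognition, the claim that an LLM like LaMDA has cognitive abilities cannot be accepted
  - In the end, Google placed Lemoine on leave for breaching confidentiality
  - This set off a back-and-forth over claims that the company avoids even testing whether LaMDA has cognitive abilities, and that Google sidelines AI ethics issues and the people who work on them..
- Meta AI's organizational restructuring
  - Building with AI across all of Meta
  - Responsible AI team merged into the Social Impact team
  - AI for Product team merged with the product engineering team
  - AI for AR team merged into the XR team at Reality Labs
  - FAIR becomes a key pillar of Reality Labs Research
  - The context: a reorganization away from a central research organization that spent the past few years on sharply focused AI research, toward a structure oriented around applying AI well in products
  - It is always interesting to think about which organizational structure lets a company do strong AI research while also applying it to products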

Impact of Artificial Intelligence Assistance on Chest CT Interpretation Times: A Prospective Randomized Study

https://www.ajronline.org/doi/10.2214/AJR.22.27598 https://www.eurekalert.org/news-releases/955757 https://m.medicaltimes.com/News/NewsView.html?ID=1147887

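Evaluates how an automated AI platform integrated into the clinical workflow for chest CT interpretation affects radiologists' interpretation times in a real clinical setting

The effect of using an AI reading-support system is verified through a single-center prospective study

Subjects: 390 patients (204 women, 186 men; mean age 62.8 years) who underwent outpatient chest CT at the Medical University of South Carolina (MUSC) from January 19 to 28, 2021

Conclusion: thoracic radiologists' chest CT interpretation time decreased by 22.1% when using the AI support platform (about 1 hour saved per day)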

AI/ML Trustworthiness characteristics matrix

https://github.com/hollobit/WG3_TCM

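A matrix being developed by the ISO/IEC JTC1/SC 42/WG3 roadmapping AHG

Classification work has begun to map the relationships between the standards developed (or under development) in SC42 and trustworthiness characteristics

This is a first version using GitHub; we will keep thinking about how to present the relationships and relevance clearly and intuitively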

News

ICML Workshops results out! (almost)
Amii AI-Week 2022
- https://www.ai-week.ca/
- DeepMind Edmonton tour (feat. Rich Sutton, Patrick Pilarski, etc.), networking, academic keynotes, etc.
- Talks I attended: Artificial Intelligence and Control (Martin Riedmiller), Policy and Heuristic-Guided Tree Search (Levi Lelis), Principles of Reinforcement Learning (Dale Schuurmans & Jincheng Mei)
- By my standards, a "major" place for fundamental RL in North America
- "Eyes on the Prize" (Keynote by Prof. Richard Sutton)
  - "reward hypothesis"
- It was bigger than I expected, and personally I really liked having an event where people can gather like this (the networking was especially good...)
- P.S. Rockies

Papers

Asynchronous SGD Beats Minibatch SGD Under Arbitrary Delays
- DI ENS, Ecole normale supérieure, Université PSL, CNRS, INRIA
- much better guarantees for the same asynchronous SGD algorithm regardless of the delays in the gradients
- depending instead just on the number of parallel devices used to implement the algorithm.
- "delays do not matter"
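To make the setting concrete, asynchronous SGD with stale gradients can be simulated on a one-dimensional quadratic. This is a toy simulation under assumptions, not the paper's algorithm analysis: the objective 0.5 * x^2, the random bounded delays, the noise level, and the function name `async_sgd` are all chosen for illustration.

```python
import random

def async_sgd(x0, lr, steps, max_delay, noise_std=0.1, seed=0):
    """Run SGD on f(x) = 0.5 * x**2 where each gradient is computed at a stale iterate."""
    rng = random.Random(seed)
    history = [x0]  # history[t] is the iterate after t updates
    for t in range(steps):
        # A worker hands back a gradient evaluated `delay` updates ago.
        delay = rng.randint(0, min(max_delay, t))
        stale_x = history[t - delay]
        grad = stale_x + rng.gauss(0.0, noise_std)  # f'(x) = x, plus stochastic noise
        history.append(history[t] - lr * grad)
    return history

history = async_sgd(x0=5.0, lr=0.05, steps=500, max_delay=3)
print(history[0], history[-1])
```

In this toy run the iterate still approaches the minimizer at 0 despite the stale gradients; the paper's contribution is a guarantee of this kind that depends on the number of parallel devices rather than on the size of the delays.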
Heavy-Tail Phenomenon in Decentralized SGD
- Rutgers Business School, Alibaba, Florida State University, DI ENS, Ecole normale supérieure, Université PSL, CNRS, INRIA
- heavy tails can arise in many machine learning problems using gradient-based methods (see e.g. The Heavy-Tail Phenomenon of SGD)
- what about decentralized setting?
- tail-index can be estimated as a function of the step-size, batch-size, and the topological properties of the network of the computational nodes
Prioritized Training on Points that are Learnable, Worth Learning, and Not Yet Learnt (ICML 2022)
- Oxford University (OTAML, Computer Science, Statistics), Cohere, University of Toronto
- Reducible Holdout Loss Selection (RHO-LOSS): a simple but principled technique which selects approximately those points for training that most reduce the model’s generalization loss.
- On the large web-scraped image dataset Clothing-1M, RHO-LOSS trains in 18x fewer steps and reaches 2% higher final accuracy than uniform data shuffling.
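The selection rule can be sketched as ranking candidate points by reducible holdout loss, that is, the current training loss minus the loss of a model trained only on holdout data. The snippet below is a minimal sketch under assumptions: the loss values are made up, and `rho_loss_select` is a hypothetical helper name, not the authors' API.

```python
def rho_loss_select(train_losses, irreducible_losses, k):
    """Pick the k candidates with the largest reducible holdout loss."""
    reducible = [t - i for t, i in zip(train_losses, irreducible_losses)]
    ranked = sorted(range(len(reducible)), key=lambda j: reducible[j], reverse=True)
    return ranked[:k], reducible

# Made-up per-example losses for a candidate batch of 4 points.
train_losses = [2.0, 2.5, 0.1, 1.5]          # loss under the model being trained
irreducible_losses = [0.2, 2.4, 0.05, 0.3]   # loss under a model trained on holdout data

selected, reducible = rho_loss_select(train_losses, irreducible_losses, k=2)
print(selected)    # indices chosen for the next gradient step
print(reducible)
```

Point 1 has a high loss under both models, so it looks noisy or unlearnable and is skipped; point 2 is already learnt; points 0 and 3 are learnable, worth learning, and not yet learnt.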
Wide Bayesian neural networks have a simple weight posterior: theory and accelerated sampling (ICML 2022)
- Google Research, University of Cambridge
- repriorisation: a data-dependent reparameterisation which transforms a BNN posterior to a distribution whose KL divergence to the BNN prior vanishes as layer widths grow.
- Because it acts directly on the parameter space, and it's very simple, analytically
- MCMC algorithm that mixes much faster (>50x speedup!)
A Deep Dive into Dataset Imbalance and Bias in Face Identification
- University of Maryland, Johns Hopkins University, New York University
- https://twitter.com/micahgoldblum/status/1535999363700367362?s=20&t=Q0kElqEVp_BPzDP1zPph3w