scene-the-ella opened this issue 1 year ago
How should AI systems behave, and who should decide? by OpenAI (16 Feb 2023)
MarioGPT
Noam Chomsky also commented on ChatGPT
The Bitterest of Lessons: The Role of Data and Optimization in Emergence (by Sergey Levine)
Benchmarking Large Language Models for News Summarization (https://arxiv.org/abs/2301.13848, 31 Jan 2023, not read in detail)
Measures the summarization performance of recently popular LLMs with human evaluation. (Couldn't read it closely.)
Instruction tuning matters more than model size.
Existing evaluations are limited by low-quality reference summaries, so the authors recruited freelance writers to produce high-quality summaries for comparison.
Conclusion: Instruct GPT-3 summaries are on par with the freelance writers' summaries! (Though the model copies heavily from the source text rather than rewriting it.)
A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity (https://arxiv.org/abs/2302.04023, 8 Feb 2023, not read in detail)
A systematic evaluation of ChatGPT's capabilities. (Couldn't read it closely.)
21 datasets covering 8 different common NLP application tasks
In the zero-shot setting, ChatGPT outperforms other LLMs on most tasks, and on some tasks it even beats fine-tuned models.
It can also produce multimodal content from text prompts (by leveraging code generation).
64.33% average across 10 different reasoning tasks: not great. (Better at deduction than induction. Why?)
It still suffers from hallucination.
Its multi-turn interactive feature yields performance gains.
Since it can do code generation, it can pull off things like this too?
Is ChatGPT a General-Purpose Natural Language Processing Task Solver? (https://arxiv.org/abs/2302.06476, 15 Feb 2023, not read in detail)
The Capacity for Moral Self-Correction in Large Language Models (https://arxiv.org/abs/2302.07459, 15 Feb 2023)
Languages are Rewards: Chain of Hindsight Finetuning using Human Feedback by Pieter Abbeel (https://arxiv.org/abs/2302.02676, 13 Feb 2023)
The Wisdom of Hindsight Makes Language Models Better Instruction Followers by Pieter Abbeel (https://arxiv.org/abs/2302.05206, 10 Feb 2023)
RLHF is impressive, but RL training is too complex.
Treat the instruction alignment problem as a goal-reaching problem in decision making.
Relabel the instruction in hindsight from the model's output and fine-tune on the resulting output (a reward-free approach).
Two-stage Reinforcement Learning
Instruction Relabeling Strategy
If the generated answer is correct, relabel the instruction as "Generate a correct answer to this problem";
if the generated answer is wrong, relabel it as "Generate a wrong answer to this problem". (A minimal sketch of this relabeling appears right after this list.)
Contrastive Instruction Following
Entropy Regularization
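A minimal sketch of the hindsight relabeling idea above, assuming a simple correctness check outside the model; the prompt format and training details here are illustrative, not the paper's exact setup.

```python
CORRECT_INSTRUCTION = "Generate a correct answer to this problem"
WRONG_INSTRUCTION = "Generate a wrong answer to this problem"

def relabel(problem: str, model_output: str, is_correct: bool) -> dict:
    """Relabel the instruction in hindsight based on the output's correctness."""
    instruction = CORRECT_INSTRUCTION if is_correct else WRONG_INSTRUCTION
    # The (relabeled instruction, problem, output) triple becomes an ordinary
    # supervised fine-tuning example; no reward model or RL update is needed.
    return {"prompt": f"{instruction}\n{problem}", "completion": model_output}

# Even an output judged wrong becomes usable training data,
# just under the "wrong answer" instruction.
sample = relabel("What is 2 + 3?", "6", is_correct=False)
print(sample)
```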
REPLUG: Retrieval-Augmented Black-Box Language Models (https://arxiv.org/abs/2301.12652, 1 Feb 2023)
Grounding Large Language Models in Interactive Environments with Online Reinforcement Learning (https://arxiv.org/abs/2302.02662, 6 Feb 2023, not read in detail)
45 Ways to Find the Purpose of Life (삶의 목적을 찾는 45가지 방법), Yes24: http://www.yes24.com/Product/Goods/117506954
The first book written with ChatGPT and Papago has been published. The illustrations were also generated with Shutterstock AI.
Symbolic Discovery of Optimization Algorithms ArXiv: https://arxiv.org/abs/2302.06675 GitHub: https://github.com/google/automl/tree/master/lion
Google proposed a new optimizer algorithm (Lion) discovered through AutoML.
A fun paper; you may have seen it already, but sharing it anyway.
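The linked repo has the reference implementation; below is just a rough NumPy sketch of the Lion update rule as I understand it (sign of an interpolated momentum plus decoupled weight decay). The hyperparameter values are illustrative defaults, not the paper's recommended settings.

```python
import numpy as np

def lion_step(param, grad, momentum, lr=1e-4, beta1=0.9, beta2=0.99, weight_decay=0.0):
    # Update direction is only the sign of an interpolation between momentum and gradient.
    update = np.sign(beta1 * momentum + (1 - beta1) * grad)
    param = param - lr * (update + weight_decay * param)
    # Momentum is updated with a separate interpolation coefficient.
    momentum = beta2 * momentum + (1 - beta2) * grad
    return param, momentum

# Toy usage on a single parameter vector.
p, m = np.ones(3), np.zeros(3)
p, m = lion_step(p, np.array([0.5, -0.2, 0.0]), m)
```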
@channel This Thursday, which is today: looking for people to join a study group on ChatGPT and smaller models beyond it. The time will be settled among whoever joins; 2 PM, 5 PM, and 11 PM are the candidates for now. I have other commitments at 4 PM and 6:30 PM.
InstructGPT: Training language models to follow instructions with human feedback. Paper link: https://cdn.openai.com/papers/Training_language_models_to_follow_instructions_with_human_feedback.pdf
Blog post: https://towardsdatascience.com/the-new-version-of-gpt-3-is-much-much-better-53ac95f21cfb
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, Jackson Kernion, Tom Conerly, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, Scott Johnston, Shauna Kravec, Liane Lovitt, Neel Nanda, Catherine Olsson, Dario Amodei, Tom Brown, Jack Clark, Sam McCandlish, Chris Olah, Ben Mann, Jared Kaplan
We apply preference modeling and reinforcement learning from human feedback (RLHF) to finetune language models to act as helpful and harmless assistants. We find this alignment training improves performance on almost all NLP evaluations, and is fully compatible with training for specialized skills such as python coding and summarization. We explore an iterated online mode of training, where preference models and RL policies are updated on a weekly cadence with fresh human feedback data, efficiently improving our datasets and models. Finally, we investigate the robustness of RLHF training, and identify a roughly linear relation between the RL reward and the square root of the KL divergence between the policy and its initialization. Alongside our main results, we perform peripheral analyses on calibration, competing objectives, and the use of OOD detection, compare our models with human writers, and provide samples from our models using prompts appearing in recent related work. https://arxiv.org/abs/2204.05862
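To make the last finding concrete: the abstract reports that the RL reward grows roughly linearly in the square root of the KL divergence from the initial policy. A tiny illustrative helper; the slope and intercept are hypothetical placeholders, not numbers from the paper.

```python
import numpy as np

def approx_rl_reward(kl_divergence, slope=1.0, intercept=0.0):
    # Reward ~ slope * sqrt(KL(policy || init)) + intercept, per the reported trend.
    return slope * np.sqrt(kl_divergence) + intercept

# Quadrupling the KL budget roughly doubles the reward gain over the initialization.
print(approx_rl_reward(np.array([1.0, 4.0, 16.0])))
```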
Multimodal Chain-of-Thought Reasoning in Language Models
Zhuosheng Zhang, Aston Zhang, Mu Li, Hai Zhao, George Karypis, Alex Smola
Large language models (LLMs) have shown impressive performance on complex reasoning by leveraging chain-of-thought (CoT) prompting to generate intermediate reasoning chains as the rationale to infer the answer. However, existing CoT studies have focused on the language modality. We propose Multimodal-CoT that incorporates language (text) and vision (images) modalities into a two-stage framework that separates rationale generation and answer inference. In this way, answer inference can leverage better generated rationales that are based on multimodal information. With Multimodal-CoT, our model under 1 billion parameters outperforms the previous state-of-the-art LLM (GPT-3.5) by 16 percentage points (75.17% -> 91.68% accuracy) on the ScienceQA benchmark and even surpasses human performance. Code is publicly available at this https URL. https://arxiv.org/abs/2302.00923
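A minimal sketch of the two-stage framework described in the abstract, with placeholder callables standing in for the fine-tuned rationale and answer models; their names and signatures are assumptions, not the paper's API.

```python
def multimodal_cot(question, image_features, generate_rationale, generate_answer):
    # Stage 1: generate a rationale grounded in both text and vision features.
    rationale = generate_rationale(question, image_features)
    # Stage 2: infer the answer conditioned on the generated rationale.
    return generate_answer(f"{question}\nRationale: {rationale}", image_features)

# Toy usage with stand-in callables instead of real models.
answer = multimodal_cot(
    "Which object is attracted to a magnet?",
    image_features=None,
    generate_rationale=lambda q, v: "Iron objects are attracted to magnets.",
    generate_answer=lambda q, v: "the iron nail",
)
print(answer)
```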
BioGPT: Generative Pre-trained Transformer for Biomedical Text Generation and Mining, by Renqian Luo, Liai Sun, Yingce Xia, Tao Qin, Sheng Zhang, Hoifung Poon and Tie-Yan Liu.
News!
• BioGPT-Large model with 1.5B parameters is coming, currently available on PubMedQA task with SOTA performance of 81% accuracy. See Question Answering on PubMedQA for evaluation
We enhance auto-regressive language models by conditioning on document chunks retrieved from a large corpus, based on local similarity with preceding tokens. With a 2 trillion token database, our Retrieval-Enhanced Transformer (Retro) obtains comparable performance to GPT-3 and Jurassic-1 on the Pile, despite using 25× fewer parameters. After fine-tuning, Retro performance translates to downstream knowledge-intensive tasks such as question answering. Retro combines a frozen Bert retriever, a differentiable encoder and a chunked cross-attention mechanism to predict tokens based on an order of magnitude more data than what is typically consumed during training. We typically train Retro from scratch, yet can also rapidly Retrofit pre-trained transformers with retrieval and still achieve good performance. Our work opens up new avenues for improving language models through explicit memory at unprecedented scale.
Study: https://lnkd.in/g2Xn4462
Code: https://lnkd.in/gJ7BxVJJ
Datasets: https://lnkd.in/gdp6NF9k
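For intuition, here is a rough sketch of the retrieval step described in the abstract above: each input chunk queries a chunk database by embedding similarity, and the returned neighbours are what the decoder's chunked cross-attention would attend to. The toy `embed` function stands in for the frozen Bert retriever and is purely illustrative.

```python
import numpy as np

def retrieve_neighbours(input_chunks, database_chunks, embed, k=2):
    # Pre-compute embeddings for every chunk in the database.
    db_vecs = np.stack([embed(c) for c in database_chunks])
    neighbours = []
    for chunk in input_chunks:
        q = embed(chunk)
        # Cosine similarity between the query chunk and all database chunks.
        sims = db_vecs @ q / (np.linalg.norm(db_vecs, axis=1) * np.linalg.norm(q) + 1e-9)
        top = np.argsort(-sims)[:k]
        # These neighbours would feed the chunked cross-attention layers (omitted here).
        neighbours.append([database_chunks[i] for i in top])
    return neighbours

# Toy bag-of-characters embedding, just to make the sketch runnable.
embed = lambda text: np.bincount([ord(c) % 64 for c in text.lower()], minlength=64).astype(float)
print(retrieve_neighbours(["the cat sat"], ["a cat sat on a mat", "stock prices fell", "dogs and cats"], embed))
```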