Closed rubynguyen2505 closed 1 year ago
Here are the notes I’ve gathered on my parts
Privacy
Cybersecurity
Potential for Risky Emergent Behaviors
Interactions with other systems
Economic Impacts
Acceleration
Overreliance
Motivation: “GPT-4’s capabilities and limitations create significant and novel safety challenges… the risks we foresee around bias, disinformation, over-reliance, privacy, cybersecurity, proliferation, and more” (2)
Limitations/Challenge:
Intervention/Solution:
GPT-4 System Card (41-43) Purpose:
GPT-4 Intro
“... a large multimodal model capable of processing image and text inputs and producing text outputs.”
Performance
Often outscores the majority of human test takers when evaluated on a variety of academic and professional exams
Shows strong performance in other languages as well
Challenge
Deep learning infrastructure & optimization methods need to behave predictably across scales
GPT-4:
Hallucinates, limited context window
Does not learn from past experiences (RL)
Societal impacts
Important to study challenges
Possible risks:
Bias
Disinformation
Privacy
Cybersecurity
Scope and limitations
Pre-trained using publicly available data and data from 3rd party providers
Focuses on capabilities, limitations, safety properties
Does not focus on: architecture, hardware, training compute, training method, etc
Predictable scaling
Created infrastructure and optimization methods with predictable behavior
Needed since it is not feasible to do extensive model-specific tuning on GPT-4 (training runs are very large)
Predict GPT-4 performance by training smaller models
Loss Prediction
Predicting final loss w/ high accuracy:
Fit a scaling law w/ irreducible loss term:
L(C) = aC^(b) + c
Fit using models trained with at most 10,000x less compute than GPT-4
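A minimal sketch of how such a fit could work. All parameters and (compute, loss) pairs below are invented for illustration; in practice the irreducible term c would be fit jointly with a and b rather than assumed known:

```python
import numpy as np

# made-up "true" parameters, purely for illustration
a_true, b_true, c_irreducible = 5.0, -0.15, 1.8

def scaling_law(C, a, b, c):
    # L(C) = a * C**b + c, where c is the irreducible loss term
    return a * C**b + c

# hypothetical (compute, final loss) pairs from small training runs
C = np.array([1e3, 1e4, 1e5, 1e6, 1e7])
L = scaling_law(C, a_true, b_true, c_irreducible)

# with the irreducible term subtracted, log(L - c) is linear in log(C),
# so an ordinary least-squares line recovers the exponent b and scale a
b_fit, log_a_fit = np.polyfit(np.log(C), np.log(L - c_irreducible), 1)
a_fit = np.exp(log_a_fit)

# extrapolate to a run with 10,000x more compute than the largest fit point
predicted_loss = scaling_law(1e11, a_fit, b_fit, c_irreducible)
```

The point of the sketch is the workflow (fit on cheap small runs, extrapolate to the big run before launching it), not the specific numbers.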
Metric of capability:
Pass rate on HumanEval dataset: “...measures the ability to synthesize Python functions of varying complexity.”
Power law relationship for individual problem in HumanEval:
−E_P[log(pass_rate(C))] = α·C^(−k)
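A small sketch of what this power law lets you do: evaluate the metric cheaply at small compute, then extrapolate. The values of α and k here are invented, not the paper's actual fit:

```python
import math

# assumed fit parameters, invented for illustration only
alpha, k = 3.0, 0.2

def neg_mean_log_pass_rate(C):
    # -E_P[log(pass_rate(C))] = alpha * C**(-k)
    return alpha * C ** (-k)

def implied_pass_rate(C):
    # geometric-mean pass rate implied by the power law
    return math.exp(-neg_mean_log_pass_rate(C))

# the metric shrinks (and implied pass rate grows) as compute increases
small, large = 1e6, 1e9
print(implied_pass_rate(small), implied_pass_rate(large))
```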
Inverse Scaling Prize: “proposes several tasks for which model performance decreases as a function of scale”
GPT-4 reverses this trend, which makes such capabilities harder to predict
Capabilities
Tested using academic and professional exams. Questions that appeared in the model’s training data were removed to make testing more accurate. Exams included free-response and multiple-choice questions, some with images as part of the questions.
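A toy sketch of one way such a contamination check could work (the substring-matching heuristic, the corpus, and the exam questions are all invented for illustration; this is not OpenAI's actual procedure):

```python
# hypothetical training corpus, lowercased documents
training_corpus = [
    "the quick brown fox jumps over the lazy dog",
    "what is the derivative of x squared",
]

def is_contaminated(question, corpus, n=20):
    # flag a question if any length-n character span appears verbatim
    # in a training document (a crude overlap heuristic)
    q = question.lower()
    spans = [q[i:i + n] for i in range(max(1, len(q) - n + 1))]
    return any(span in doc for doc in corpus for span in spans)

exam = [
    "What is the derivative of x squared?",  # overlaps the corpus
    "Name the capital of France.",           # clean
]
clean_exam = [q for q in exam if not is_contaminated(q, training_corpus)]
```

Filtering the exam this way keeps only questions the model could not have memorized verbatim.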
Exhibits human-level performance; this stems from the pre-training process, not from RLHF
Outperforms state-of-the-art (SOTA) systems
Other languages: GPT-4 outperforms the English-language performance of GPT-3.5 and other existing LMs for the majority of tested languages
Substantial improvements in following user intent
70.2% of prompt responses were preferred over GPT-3.5’s
Visual inputs
Exhibits capabilities similar to those with text-only inputs
Note: might remove intro
GPT-4 Observed Safety Challenges We just discussed the capabilities of GPT-4, which can be applied to many aspects of our lives, from browsing to voice assistants, and which has the potential for huge societal impact. The following slides cover the observed safety challenges of GPT-4.
Evaluation Approach Before we dive into the challenges and risks, let’s discuss the evaluation process.
Part One OpenAI began by hiring outside experts to provide input on and test the GPT-4 models. This testing included stress testing, boundary testing, and red teaming. Red teaming is a structured attempt to find flaws and vulnerabilities in a strategy, organization, or technical system, typically carried out by dedicated "red teams" that try to mimic an attacker's mindset and methods.
Part Two Categorization - assessing the likelihood that a language model produces content falling into categories such as hate speech, self-harm information, and unlawful advice. Testing - these evaluations compare models on safety-related criteria, automating and speeding up evaluation of model checkpoints during training. They focused on topics designated as high risk and on content the models were intended to minimize.
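The categorize-then-compare loop above can be sketched roughly as follows. The categories, keyword lists, and checkpoint outputs are invented placeholders, not OpenAI's actual taxonomy or classifiers:

```python
# hypothetical high-risk categories with toy keyword triggers
RISK_KEYWORDS = {
    "hate_speech": ["hateful slur"],
    "self_harm": ["how to hurt yourself"],
    "unlawful_advice": ["how to pick a lock illegally"],
}

def categorize(output):
    # return every risk category whose trigger appears in the output
    text = output.lower()
    return [cat for cat, kws in RISK_KEYWORDS.items()
            if any(kw in text for kw in kws)]

def flag_rate(outputs):
    # fraction of outputs flagged in at least one risk category
    flagged = sum(1 for o in outputs if categorize(o))
    return flagged / len(outputs)

# mock outputs from two model checkpoints on the same prompts:
# a lower flag rate on the later checkpoint suggests safety training helped
ckpt_early = ["Here is how to hurt yourself ...", "The capital is Paris."]
ckpt_late = ["I can't help with that.", "The capital is Paris."]
```

Real evaluations would use trained classifiers or human review rather than keyword matching, but the structure (fixed categories, same prompts, compare flag rates across checkpoints) is the point.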
Hallucinations GPT-4 can "hallucinate," that is, "create material that is illogical or untrue in respect to particular sources." Using data from earlier models like ChatGPT, GPT-4 was trained to reduce its tendency to hallucinate. Based on these evaluations, GPT-4-launch performs 29 percent better at avoiding closed-domain hallucinations and 19 percent better at avoiding open-domain hallucinations.
Harmful Content Language models can be prompted to generate various kinds of harmful content, including: advice or encouragement for self-harm behaviors; graphic material such as inappropriate or violent content; harassing, demeaning, and hateful content; content useful for planning attacks or violence; and instructions for finding illegal content.
Proliferation of Conventional and Unconventional Weapons The model still has gaps in this capability area. Generations frequently produced solutions that were unworkable, too imprecise to be useful, or prone to factual mistakes that could obstruct or otherwise delay a threat actor. Longer responses were also more likely to be inaccurate. Although inaccurate generations often gave off a convincing impression, they ultimately suffered from the same issues discussed in the section on hallucinations.
Hi all, here are the assigned pages of the technical paper that each of us needs to cover. Further readings can be done (look at the Appendix part) to help better our understanding of the topic.
Jocelyn: pg. 1 - 9
Tram: pg. 10 - 14, 41 - 43
Dimpal: pg. 44 - 53 (up to Privacy part)
Ruby: pg. 53 - 60
Jose: pg. 61 - 70