Closed rubynguyen2505 closed 1 year ago
Here are the notes I’ve gathered on my parts
Privacy
Cybersecurity
Potential for Risky Emergent Behaviors
Interactions with other systems
Economic Impacts
Acceleration
Overreliance
Motivation: “GPT-4’s capabilities and limitations create significant and novel safety challenges… the risks we foresee around bias, disinformation, over-reliance, privacy, cybersecurity, proliferation, and more” (2)
Limitations/Challenge:
Intervention/Solution:
GPT-4 System Card (41-43) Purpose:
GPT-4 Intro
“... a large multimodal model capable of processing image and text inputs and producing text outputs.”
Performance
Often outscores the majority of human test takers when evaluated on a variety of academic and professional exams
Shows strong performance in other languages as well
Challenge
Deep learning infrastructure & optimization methods need to behave predictably across scales
GPT-4:
Hallucinates, limited context window
Does not learn from past experiences (RL)
Societal impacts
Important to study challenges
Possible risks:
Bias
Disinformation
Privacy
Cybersecurity
Scope and limitations
Pre-trained using publicly available data and data from 3rd party providers
Focuses on capabilities, limitations, safety properties
Does not focus on: architecture, hardware, training compute, training method, etc
Predictable scaling
Created infrastructure and optimization methods with predictable behavior
Needed since it is not feasible to do extensive model-specific tuning on GPT-4 (training runs are very large)
Predict GPT-4 performance by training smaller models
Loss Prediction
Predicting final loss w/ high accuracy:
Fit a scaling law w/ irreducible loss term:
L(C) = aC^(b) + c
Fit using models trained with at most 10,000x less compute than GPT-4
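A minimal sketch of how such a fit could work. All parameters and (compute, loss) pairs below are invented for illustration; in practice the irreducible term c would be fit jointly with a and b rather than assumed known:

```python
import numpy as np

# made-up "true" parameters, purely for illustration
a_true, b_true, c_irreducible = 5.0, -0.15, 1.8

def scaling_law(C, a, b, c):
    # L(C) = a * C**b + c, where c is the irreducible loss term
    return a * C**b + c

# hypothetical (compute, final loss) pairs from small training runs
C = np.array([1e3, 1e4, 1e5, 1e6, 1e7])
L = scaling_law(C, a_true, b_true, c_irreducible)

# with the irreducible term subtracted, log(L - c) is linear in log(C),
# so an ordinary least-squares line recovers the exponent b and scale a
b_fit, log_a_fit = np.polyfit(np.log(C), np.log(L - c_irreducible), 1)
a_fit = np.exp(log_a_fit)

# extrapolate to a run with 10,000x more compute than the largest fit point
predicted_loss = scaling_law(1e11, a_fit, b_fit, c_irreducible)
```

The point of the sketch is the workflow (fit on cheap small runs, extrapolate to the big run before launching it), not the specific numbers.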
Metric of capability:
Pass rate on HumanEval dataset: “...measures the ability to synthesize Python functions of varying complexity.”
Power law relationship for individual problem in HumanEval:
−E_P[log(pass_rate(C))] = α·C^(−k)
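A small sketch of what this power law lets you do: evaluate the metric cheaply at small compute, then extrapolate. The values of α and k here are invented, not the paper's actual fit:

```python
import math

# assumed fit parameters, invented for illustration only
alpha, k = 3.0, 0.2

def neg_mean_log_pass_rate(C):
    # -E_P[log(pass_rate(C))] = alpha * C**(-k)
    return alpha * C ** (-k)

def implied_pass_rate(C):
    # geometric-mean pass rate implied by the power law
    return math.exp(-neg_mean_log_pass_rate(C))

# the metric shrinks (and implied pass rate grows) as compute increases
small, large = 1e6, 1e9
print(implied_pass_rate(small), implied_pass_rate(large))
```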
Inverse Scaling Prize: “proposes several tasks for which model performance decreases as a function of scale”
GPT-4 reverses this trend, which makes such capabilities harder to predict
Capabilities
Tested using academic and professional exams. Questions that appeared in the model’s training data were removed to make testing more accurate. Exams included free-response and multiple-choice questions, some with images as part of the questions.
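A toy sketch of one way such a contamination check could work (the substring-matching heuristic, the corpus, and the exam questions are all invented for illustration; this is not OpenAI's actual procedure):

```python
# hypothetical training corpus, lowercased documents
training_corpus = [
    "the quick brown fox jumps over the lazy dog",
    "what is the derivative of x squared",
]

def is_contaminated(question, corpus, n=20):
    # flag a question if any length-n character span appears verbatim
    # in a training document (a crude overlap heuristic)
    q = question.lower()
    spans = [q[i:i + n] for i in range(max(1, len(q) - n + 1))]
    return any(span in doc for doc in corpus for span in spans)

exam = [
    "What is the derivative of x squared?",  # overlaps the corpus
    "Name the capital of France.",           # clean
]
clean_exam = [q for q in exam if not is_contaminated(q, training_corpus)]
```

Filtering the exam this way keeps only questions the model could not have memorized verbatim.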
Exhibits human-level performance; this stems from the pre-training process, not from RLHF
Outperforms state-of-the-art (SOTA) systems
Other languages: GPT-4 outperforms the English-language performance of GPT-3.5 and other existing LMs for the majority of tested languages
Substantial improvements in following user intent
70.2% of prompt responses were preferred over GPT-3.5’s
Visual inputs
Exhibits capabilities similar to those with text-only inputs
Note: might remove intro
GPT-4 Observed Safety Challenges We just discussed the capabilities of GPT-4, which can be applied to many aspects of our lives, from browsing to voice assistants, and which has the potential for huge societal impact. The following slides cover the observed safety challenges of GPT-4.
Evaluation Approach Before we dive into the challenges and risks, let’s discuss the evaluation process.
Part One OpenAI began by hiring outside experts to provide input on and test the GPT-4 models. This testing included stress testing, boundary testing, and red teaming. Red teaming is a structured attempt to find flaws and vulnerabilities in a strategy, organization, or technical system, typically carried out by dedicated "red teams" that try to mimic an attacker's mindset and methods.
Part Two Categorization - assessing the likelihood that a language model produces content falling into categories such as hate speech, self-harm information, and unlawful advice. Testing - these evaluations compare models on safety-related criteria, automating and speeding up evaluation of model checkpoints during training. They focused on topics designated as high risk and on content the models were intended to minimize.
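The categorize-then-compare loop above can be sketched roughly as follows. The categories, keyword lists, and checkpoint outputs are invented placeholders, not OpenAI's actual taxonomy or classifiers:

```python
# hypothetical high-risk categories with toy keyword triggers
RISK_KEYWORDS = {
    "hate_speech": ["hateful slur"],
    "self_harm": ["how to hurt yourself"],
    "unlawful_advice": ["how to pick a lock illegally"],
}

def categorize(output):
    # return every risk category whose trigger appears in the output
    text = output.lower()
    return [cat for cat, kws in RISK_KEYWORDS.items()
            if any(kw in text for kw in kws)]

def flag_rate(outputs):
    # fraction of outputs flagged in at least one risk category
    flagged = sum(1 for o in outputs if categorize(o))
    return flagged / len(outputs)

# mock outputs from two model checkpoints on the same prompts:
# a lower flag rate on the later checkpoint suggests safety training helped
ckpt_early = ["Here is how to hurt yourself ...", "The capital is Paris."]
ckpt_late = ["I can't help with that.", "The capital is Paris."]
```

Real evaluations would use trained classifiers or human review rather than keyword matching, but the structure (fixed categories, same prompts, compare flag rates across checkpoints) is the point.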
Hallucinations GPT-4 can "hallucinate," that is, "create material that is illogical or untrue in respect to particular sources." Using data from earlier models like ChatGPT, GPT-4 was trained to reduce its tendency to hallucinate. Based on these evaluations, GPT-4-launch performs 29 percent better at avoiding closed-domain hallucinations and 19 percent better at avoiding open-domain hallucinations.
Harmful Content Language models can be prompted to generate various kinds of harmful content, including: advice or encouragement for self-harm behaviors; graphic material such as inappropriate or violent content; harassing, demeaning, and hateful content; content useful for planning attacks or violence; and instructions for finding illegal content.
Proliferation of Conventional and Unconventional Weapons The model still has gaps in this capability area. Generations frequently produced solutions that were unworkable, too imprecise to be useful, or prone to factual mistakes that could obstruct or otherwise delay a threat actor. Longer responses were also more likely to be inaccurate. Although inaccurate generations often gave off a convincing impression, they ultimately suffered from the same issues discussed in the section on hallucinations.
Hi all, here are the assigned pages of the technical paper that each of us needs to cover. Further readings can be done (look at the Appendix part) to help better our understanding of the topic.
Jocelyn: pg. 1 - 9
Tram: pg. 10 - 14, 41 - 43
Dimpal: pg. 44 - 53 (up to Privacy part)
Ruby: pg. 53 - 60
Jose: pg. 61 - 70