[ ] Defining AGI: Exploring Six Key Principles for an Operational Definition

Defining AGI: Exploring Six Key Principles for an Operational Definition

Snippet

"3 Defining AGI: Six Principles

Reflecting on these nine example formulations of AGI (or AGI-adjacent concepts), we identify properties and commonalities that we feel contribute to a clear, operationalizable definition of AGI. We argue that any definition of AGI should meet the following six criteria:

Focus on Capabilities, not Processes. The majority of definitions focus on what an AGI can accomplish, not on the mechanism by which it accomplishes tasks. This is important for identifying characteristics that are not necessarily a prerequisite for achieving AGI (but may nonetheless be interesting research topics). This focus on capabilities allows us to exclude the following from our requirements for AGI:
- Achieving AGI does not imply that systems think or understand in a human-like way (since this focuses on processes, not capabilities)
- Achieving AGI does not imply that systems possess qualities such as consciousness (subjective awareness) (Butlin et al., 2023) or sentience (the ability to have feelings) (since these qualities not only have a process focus, but are not currently measurable by agreed-upon scientific methods)
Focus on Generality and Performance. All of the above definitions emphasize generality to varying degrees, but some exclude performance criteria. We argue that both generality and performance are key components of AGI. In the next section we introduce a leveled taxonomy that considers the interplay between these dimensions.
Focus on Cognitive and Metacognitive Tasks. Whether to require robotic embodiment (Roy et al., 2021) as a criterion for AGI is a matter of some debate. Most definitions focus on cognitive tasks, by which we mean non-physical tasks. Despite recent advances in robotics (Brohan et al., 2023), physical capabilities for AI systems seem to be lagging behind non-physical capabilities. It is possible that embodiment in the physical world is necessary for building the world knowledge to be successful on some cognitive tasks (Shanahan, 2010), or at least may be one path to success on some classes of cognitive tasks; if that turns out to be true then embodiment may be critical to some paths toward AGI. We suggest that the ability to perform physical tasks increases a system’s generality, but should not be considered a necessary prerequisite to achieving AGI. On the other hand, metacognitive capabilities (such as the ability to learn new tasks or the ability to know when to ask for clarification or assistance from a human) are key prerequisites for systems to achieve generality.
Focus on Potential, not Deployment. Demonstrating that a system can perform a requisite set of tasks at a given level of performance should be sufficient for declaring the system to be an AGI; deployment of such a system in the open world should not be inherent in the definition of AGI. For instance, defining AGI in terms of reaching a certain level of labor substitution would require real-world deployment, whereas defining AGI in terms of being capable of substituting for labor would focus on potential. Requiring deployment as a condition of measuring AGI introduces non-technical hurdles such as legal and social considerations, as well as potential ethical and safety concerns.
Focus on Ecological Validity. Tasks that can be used to benchmark progress toward AGI are critical to operationalizing any proposed definition. While we discuss this further in the “Testing for AGI” section, we emphasize here the importance of choosing tasks that align with real-world (i.e., ecologically valid) tasks that people value (construing “value” broadly, not only as economic value but also social value, artistic value, etc.). This may mean eschewing traditional AI metrics that are easy to automate or quantify (Raji et al., 2021) but may not capture the skills that people would value in an AGI.
Focus on the Path to AGI, not a Single Endpoint. Much as the adoption of a standard set of Levels of Driving Automation (SAE International, 2021) allowed for clear discussions of policy and progress relating to autonomous vehicles, we posit there is value in defining “Levels of AGI.” As we discuss in subsequent sections, we intend for each level of AGI to be associated with a clear set of metrics/benchmarks, as well as identified risks introduced at each level, and resultant changes to the Human-AI Interaction paradigm (Morris et al., 2023). This level-based approach to defining AGI supports the coexistence of many prominent formulations – for example, Aguera y Arcas & Norvig’s definition (Agüera y Arcas and Norvig, 2023) would fall into the “Emerging AGI” category of our ontology, while OpenAI’s threshold of labor replacement (OpenAI, 2018) better matches “Virtuoso AGI.” Our “Competent AGI” level is probably the best catch-all for many existing definitions of AGI (e.g., the Legg (Legg, 2008), Shanahan (Shanahan, 2015), and Suleyman (Mustafa Suleyman and Michael Bhaskar, 2023) formulations). In the next section, we introduce a level-based ontology of AGI.

4 Levels of AGI

Performance (rows) x Generality (columns)

Level	Narrow	General
Level 0: No AI	Narrow Non-AI (calculator software; compiler)	General Non-AI (human-in-the-loop computing, e.g., Amazon Mechanical Turk)
Level 1: Emerging	Emerging Narrow AI (GOFAI (Boden, 2014); simple rule-based systems, e.g., SHRDLU (Winograd, 1971))	Emerging AGI (ChatGPT (OpenAI, 2023), Bard (Anil et al., 2023), Llama 2 (Touvron et al., 2023), Gemini (Pichai and Hassabis, 2023))
Level 2: Competent	Competent Narrow AI (toxicity detectors such as Jigsaw (Das et al., 2022); Smart Speakers such as Siri, Alexa, or Google Assistant; VQA systems such as PaLI (Chen et al., 2023); Watson (IBM, ); SOTA LLMs for a subset of tasks (e.g., short essay writing, simple coding))	Competent AGI (not yet achieved)
Level 3: Expert	Expert Narrow AI (spelling & grammar checkers such as Grammarly (Grammarly, 2023); generative image models such as Imagen (Saharia et al., 2022) or Dall-E 2 (Ramesh et al., 2022))	Expert AGI (not yet achieved)
Level 4: Virtuoso	Virtuoso Narrow AI (Deep Blue (Campbell et al., 2002), AlphaGo (Silver et al., 2016, 2017))	Virtuoso AGI (not yet achieved)
Level 5: Superhuman	Superhuman Narrow AI (AlphaFold (Jumper et al., 2021; Varadi et al., 2021), AlphaZero (Silver et al., 2018), StockFish (Stockfish, 2023))	Artificial Superintelligence (ASI) (not yet achieved)

Table 1: A leveled, matrixed approach toward classifying systems on the path to AGI based on depth (performance) and breadth (generality) of capabilities. Example systems in each cell are approximations based on current descriptions in the literature or experiences interacting with deployed systems. Unambiguous classification of AI systems will require a standardized benchmark of tasks, as we discuss in the Testing for AGI section. Note that general systems that broadly perform at a level N may be able to perform a narrow subset of tasks at higher levels. The "Competent AGI" level, which has not been achieved by any public systems at the time of writing, best corresponds to many prior conceptions of AGI, and may precipitate rapid social change once achieved.

In accordance with Principle 2 ("Focus on Generality and Performance") and Principle 6 ("Focus on the Path to AGI, not a Single Endpoint"), in Table 1 we introduce a matrixed leveling system that focuses on performance and generality as the two dimensions that are core to AGI:

Performance refers to the depth of an AI system’s capabilities, i.e., how it compares to human-level performance for a given task. Note that for all performance levels above “Emerging,” percentiles are in reference to a sample of adults who possess the relevant skill (e.g., “Competent” or higher performance on a task such as English writing ability would only be measured against the set of adults who are literate and fluent in English).
Generality refers to the breadth of an AI system’s capabilities, i.e., the range of tasks for which an AI system reaches a target performance threshold.

This taxonomy specifies the minimum performance over most tasks needed to achieve a given rating – e.g., a Competent AGI must have performance at least at the 50th percentile for skilled adult humans on most cognitive tasks, but may have Expert, Virtuoso, or even Superhuman performance on a subset of tasks. As an example of how individual systems may straddle different points in our taxonomy, we posit that as of this writing in September 2023, frontier language models (e.g., ChatGPT (OpenAI, 2023), Bard (Anil et al., 2023), Llama2 (Touvron et al., 2023), etc.) exhibit “Competent” performance levels for some tasks (e.g., short essay writing, simple coding), but are still at “Emerging” performance levels for most tasks (e.g., mathematical abilities, tasks involving factuality). Overall, current frontier language models would therefore be considered a Level 1 General AI (“Emerging AGI”) until the performance level increases for a broader set of tasks (at which point the Level 2 General AI, “Competent AGI,” criteria would be met). We suggest that documentation for frontier AI models, such as model cards (Mitchell et al., 2019), should detail this mixture of performance levels. This will help end-users, policymakers, and other stakeholders come to a shared, nuanced understanding of the likely uneven performance of systems progressing along the path to AGI.

The order in which stronger skills in specific cognitive areas are acquired may have serious implications for AI safety (e.g., acquiring strong knowledge of chemical engineering before acquiring strong ethical reasoning skills may be a dangerous combination). Note also that the rate of progression between levels of performance and/or generality may be nonlinear. Acquiring the capability to learn new skills may particularly accelerate progress toward the next level.

While this taxonomy rates systems according to their performance, systems that are capable of achieving a certain level of performance (e.g., against a given benchmark) may not match this level in practice when deployed. For instance, user interface limitations may reduce deployed performance. Consider the example of DALLE-2 (Ramesh et al., 2022), which we estimate as a Level 3 Narrow AI (“Expert Narrow AI”) in our taxonomy. We estimate the “Expert” level of performance since DALLE-2 produces images of higher quality than most people are able to draw; however, the system has failure modes (e.g., drawing hands with incorrect numbers of digits, rendering nonsensical or illegible text) that prevent it from achieving a “Virtuoso” performance designation. While theoretically an “Expert” level system, in practice the system may only be “Competent,” because prompting interfaces are too complex for most end-users to elicit optimal performance (as evidenced by user studies (Zamfirescu-Pereira et al., 2023) and by the existence of marketplaces (e.g., PromptBase ) in which skilled prompt engineers sell prompts). This observation emphasizes the importance of designing ecologically valid benchmarks (that would measure deployed rather than idealized performance) as well as the importance of considering how human-AI interaction paradigms interact with the notion of AGI (a topic we return to in the “Capabilities vs. Autonomy” Section).

The highest level in our matrix in terms of combined performance and generality is ASI (Artificial Superintelligence). We define "Superhuman" performance as outperforming 100% of humans. For instance, we posit that AlphaFold (Jumper et al., 2021; Varadi et al., 2021) is a Level 5 Narrow AI ("Superhuman Narrow AI") since it performs a single task (predicting a protein’s 3D structure from an amino acid sequence) above the level of the world’s top scientists. This definition means that Level 5 General AI ("ASI") systems will be able to do a wide range of tasks at a level that no human can match. Additionally, this framing also implies that Superhuman systems may be able to perform an even broader generality of tasks than lower levels of AGI, since the ability to execute tasks that qualitatively differ from existing human skills would by definition outperform all humans (who fundamentally cannot do such tasks). For example, non-human skills that an ASI might have could include capabilities such as neural interfaces (perhaps through mechanisms such as analyzing brain signals to decode thoughts (Tang et al., 2023; Bellier et al., 2023)), oracular abilities (perhaps through mechanisms such as analyzing large volumes of data to make high-quality predictions (Schoenegger and Park, 2023)), or the ability to communicate with animals (perhaps by mechanisms such as analyzing patterns in their vocalizations, brain waves, or body language (Goldwasser et al., 2023; Andreas et al., 2022)).

5 Testing for AGI

Two of our six proposed principles for defining AGI (Principle 2: Generality and Performance; Principle 6: Focus on the Path to AGI) influenced our choice of a matrixed, leveled ontology for facilitating nuanced discussions of the breadth and depth of AI capabilities. Our remaining four principles (Principle 1: Capabilities, not Processes; Principle 3: Cognitive and Metacognitive Tasks; Principle 4: Potential, not Deployment; and Principle 5: Ecological Validity) relate to the issue of measurement.

While our performance dimension specifies one aspect of measurement (e.g., percentile ranges for task performance relative to particular subsets of people), our generality dimension leaves open important questions: What is the set of tasks that constitute the generality criteria? What proportion of such tasks must an AI system master to achieve a given level of generality in our schema? Are there some tasks that must always be performed to meet the criteria for certain generality levels, such as metacognitive tasks?

Operationalizing an AGI definition requires answering these questions, as well as developing specific diverse and challenging tasks. Because of the immense complexity of this process, as well as the importance of including a wide range of perspectives (including cross-organizational and multi-disciplinary viewpoints), we do not propose a benchmark in this paper. Instead, we work to clarify the ontology a benchmark should attempt to measure. We also discuss properties an AGI benchmark should possess.

Our intent is that an AGI benchmark would include a broad suite of cognitive and metacognitive tasks (per Principle 3), measuring diverse properties including (but not limited to) linguistic intelligence, mathematical and logical reasoning (Webb et al., 2023), spatial reasoning, interpersonal and intra-personal social intelligences, the ability to learn new skills (Chollet, 2019), and creativity. A benchmark might include tests covering psychometric categories proposed by theories of intelligence from psychology, neuroscience, cognitive science, and education; however, such “traditional” tests must first be evaluated for suitability for benchmarking computing systems, since many may lack ecological and construct validity in this context (Serapio-García et al., 2023).

One open question for benchmarking performance is whether to allow the use of tools, including potentially AI-powered tools, as an aid to human performance. This choice may ultimately be task dependent and should account for ecological validity in benchmark choice (per Principle 5). For example, in determining whether a self-driving car is sufficiently safe, benchmarking against a person driving without the benefit of any modern AI-assisted safety tools would not be the most informative comparison; since the relevant counterfactual involves some driver-assistance technology, we may prefer a comparison to that baseline.

While an AGI benchmark might draw from some existing AI benchmarks (Lynch, 2023) (e.g., HELM (Liang et al., 2023), BIG-bench (Srivastava et al., 2023)), we also envision the inclusion of open-ended and/or interactive tasks that might require qualitative evaluation (Papakyriakopoulos et al., 2021; Yang et al., 2023; Bubeck et al., 2023). We suspect that these latter classes of complex, open-ended tasks, though difficult to benchmark, will have better ecological validity than traditional AI metrics, or than adapted traditional measures of human intelligence.

It is impossible to enumerate the full set of tasks achievable by a sufficiently general intelligence. As such, an AGI benchmark should be a living benchmark. Such a benchmark should therefore include a framework for generating and agreeing upon new tasks.

Determining that something is not an AGI at a given level simply requires identifying several tasks that people can typically do but the system cannot adequately perform. Systems that pass the majority of the envisioned AGI benchmark at a particular performance level ("Emerging," "Competent," etc.), including new tasks added by the testers, can be assumed to have the associated level of generality for practical purposes (i.e., though in theory there could still be a test the AGI would fail, at some point unprobed failures are so specialized or atypical as to be practically irrelevant).

Developing an AGI benchmark will be a challenging and iterative process. It is nonetheless a valuable north-star goal for the AI research community. Measurement of complex concepts may be imperfect, but the act of measurement helps us crisply define our goals and provides an indicator of progress.

Read the full paper on arXiv

Suggested labels

{'label-name': 'AGI-Progress', 'label-description': 'Levels and criteria for evaluating progress towards achieving Artificial General Intelligence (AGI).', 'confidence': 68.27}

irthomasthomas / undecidability