smcgregor commented 3 years ago

This issue is for mapping the CSET taxonomy to the AIID user interface and database. After adopting a design, we will open one or more issues that decompose it into a fairly large number of implementation steps, ordered by time to delivery and by importance for supporting general taxonomies. These issues blaze the trail for other taxonomies to follow, so solid documentation of the development process is important.

This is CSET's annotation guide appendix motivating the design. We also have CSET's spreadsheet collecting many of the classifications, but I want to concentrate on this document first.

Let's start with a discussion of the database model.

New Table: Taxa

We will create a new collection, "taxa", containing the taxa schemas. Each schema will be the programmatic enforcer of the types and values applied in tags on the incident reports. Supported types include the native types of MongoDB. Step one of the design is thus processing the annotation guide into a schema for a taxa table record. For example,

{
"namespace": "CSET",
"description": "This should be a markdown document that will be rendered to HTML by the Gatsby build process",
"fields": {
  "field name 1": {"type": "short_string", "default": "", "mongo_type": "typeval", "short description":"This goes in tooltips", "long description":"This goes in the documentation page"},
  "annotator": {"type": "string", "short description":"CSET researcher applying taxonomy", "long description":"The CSET taxonomy is presently a closed taxonomy, meaning only persons affiliated with CSET are permitted to act as editors within the CSET namespace. The person indicated by the annotator field is the one who applied the taxonomy to the incident."},
  "...": {},
},
}

Within this collection, each named field has the following attributes,

"type": {"text", "select", "date", "year", "enum", "bool"}
"default": "" # the values that sit in the form as a starting value when applying a taxonomy
"mongo_type": "" # MongoDB datatype
"short description": "This goes in tooltips"
"long description":"This goes in the documentation page"
"permitted_values": [] # optional, only for select and enum types

The description field should get picked up by a component rendering a collection of pages describing each taxonomy (i.e., something similar to the CSET document linked above).

New Table: Classifications

A table for recording taxonomy values for each of the incidents.

{
"1": {
 "CSET": {"annotator":"name here", "...":"..."}
},
"2": {
}
}

This will be rendered on incident citation pages and consumed or set by various applications built into the AIID.

Process from here

I think the development process from here looks like,

Discuss the above database schema and adopt it if appropriate.
Create a component for rendering the entries of the taxa collection to pages within the Gatsby gitbook.
Add an admin-only tagging component to the citation pages
Add the tags to the Algolia index
Add the tags to the facets on the Discover application
Make tags CSV importable from CSET
Verify the integrity of the tags within the CSET namespace
Write a guide for developing new taxonomies
Consider developing a user interface for programmatically defining the taxonomies

smcgregor commented 3 years ago

Complete Proposal for taxa collection

This is the proposed schema for each document within the taxa collection. This is built for extensibility rather than hierarchical structure, but hierarchical structure could potentially be added on later.

Example Doc

{
"namespace": "Organization Name Here",
"weight": 50,
"description": "Description of the taxonomy in Markdown here",
"field_list": [
  {FIELD_DESCRIPTION},
  {...},
],
}

Top Level

namespace: this determines how things are presented to users as facets within the Algolia index and determines the named path to the taxonomy's detail page.
weight: this determines the priority of displaying the taxonomy when multiple taxonomies are entered in the system.
description: this is a markdown description of the taxonomy that is presented on the taxonomy detail page
field_list: a list detailing the classifications within the taxonomy
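
As a rough illustration of how the top-level keys could be consumed, the Python sketch below orders taxonomy documents by `weight` and builds a facet label from the namespace. Whether a larger weight means earlier display is not stated in this proposal, so the direction is exposed as a flag. The helper names `order_taxonomies` and `facet_label` and the second namespace are hypothetical; the `NAMESPACE:short_name` pattern follows the `CSET:intent` example given for `short_name`.

```python
from typing import Any


def order_taxonomies(taxa: list[dict[str, Any]], heaviest_first: bool = True) -> list[str]:
    """Return namespaces in display order.

    Assumption: the sort direction is not specified in the proposal,
    so it is a parameter rather than hard-coded.
    """
    ranked = sorted(taxa, key=lambda doc: doc["weight"], reverse=heaviest_first)
    return [doc["namespace"] for doc in ranked]


def facet_label(namespace: str, short_name: str) -> str:
    """Build a facet label such as CSET:intent."""
    return f"{namespace}:{short_name}"


taxa = [
    {"namespace": "CSET", "weight": 50, "description": "", "field_list": []},
    # Hypothetical second taxonomy, present only to show the ordering.
    {"namespace": "Example", "weight": 10, "description": "", "field_list": []},
]

print(order_taxonomies(taxa))                        # ['CSET', 'Example']
print(order_taxonomies(taxa, heaviest_first=False))  # ['Example', 'CSET']
print(facet_label("CSET", "intent"))                 # CSET:intent
```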

Field List Descriptions

This schema is designed around the CSET taxonomy, but it is built so that additional namespaces can be defined and extend this schema in the future.

"short_name": "" # the display name for the field when it is presented as a facet. (e.g., "intent" would be presented as CSET:intent)
"long_name": "" # the display name for the field as presented to users.
"short_description": "This goes in tooltips and other short descriptions" # Must be defined
"long_description":"This goes in the documentation page for the taxonomy"
"display_type": "string" # values are in {"string", "text", "multi", "date", "enum", "bool", "list", "location"}
"mongo_type": "" # MongoDB datatype see "alias" here for acceptable values.
"default": "" # the values that sit in the form as a starting value when applying a taxonomy. Default: nil
"placeholder": "" # the placeholder text shown in the form when classifying according to this taxonomy. Default: nil
"permitted_values": [] # optional, only used with multi and enum display types
"weight": 0 # Determines presentation order of the classification
"instant_facet": false # Determines whether the taxonomy item will be exported to the Algolia instant search index
"required": false # when true, every classification in the namespace must include this field; when false, the namespace may have classifications without it

Additional notes

The display_type values carry these additional notes on how the mongo_type will be displayed.
- string: short text of approximately 140 characters
- text: textual input potentially of arbitrary length
- multi: multiple selectable short values
- list: a sequence of short strings not selected from a defined set
- enum: a single selection from a list of values
- date: a timestamp that resolves to a specific day. The time of day (the number of seconds into the day) is dropped
- bool: generally a checkbox
- location: the string represents a named place which is geocoded as latitude and longitude values
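
To make the `date` note concrete, here is a minimal Python sketch that checks a `display_type` against the set listed above and truncates a timestamp to its day. Treating UTC midnight as the day boundary is an assumption, and the helper names are hypothetical.

```python
from datetime import datetime, timezone

# The display types enumerated in the field description above.
DISPLAY_TYPES = {"string", "text", "multi", "date", "enum", "bool", "list", "location"}


def check_display_type(display_type: str) -> None:
    """Raise ValueError for a display_type outside the documented set."""
    if display_type not in DISPLAY_TYPES:
        raise ValueError(f"unknown display_type: {display_type!r}")


def truncate_to_day(stamp: datetime) -> datetime:
    """Drop the time of day, keeping only the calendar day.

    Assumption: the day is taken in UTC.
    """
    utc = stamp.astimezone(timezone.utc)
    return utc.replace(hour=0, minute=0, second=0, microsecond=0)


stamp = datetime(2018, 12, 5, 12, 0, 0, 301000, tzinfo=timezone.utc)
check_display_type("date")  # passes silently
print(truncate_to_day(stamp).isoformat())  # 2018-12-05T00:00:00+00:00
```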

Initial Document to be placed into database

Below is a JSON document I placed into the database containing the initial set of taxa. I still need to transcribe 14 additional fields into the schema, but I will not begin their transcription until I get the import flow and rendering done for the initial set below.

{
  "namespace": "CSET",
  "weight": 50,
  "description": "Georgetown Center on Security and Emerging Technology. A taxonomy classifying AI incidents according to their organizational, technological, and impacted population factors. This taxonomy is being imported from a laborious coding set that will be detailed here at a later date.",
  "field_list": [
    {
      "short_name": "Annotator",
      "long_name": "Person responsible for the annotations",
      "short_description": "This is the researcher that is responsible for applying the classifications of the CSET taxonomy.",
      "long_description": "The CSET taxonomy assigns individual researchers to each incident as the primary parties responsible for classifying the incident according to the taxonomy. This is the person responsible for the incident.",
      "display_type": "enum",
      "mongo_type": "string",
      "default": null,
      "placeholder": "Select name here",
      "permitted_values": [
        "Zach Arnold",
        "Thomas Giallella",
        "Dahlia Peterson",
        "Charlie Wang",
        "Srishti Khemka",
        "Devon Colmer",
        "Other"],
      "weight": 0,
      "instant_facet": true,
      "required": false
    },
    {
      "short_name": "Annotation Status",
      "long_name": "Where in the annotation process is this incident?",
      "short_description": "What is the quality assurance status of the CSET classifications for this incident?",
      "long_description": "The CSET taxonomy has a quality assurance funnel that all classified incidents move through. This ",
      "display_type": "enum",
      "mongo_type": "string",
      "default": null,
      "placeholder": "Select process status here",
      "permitted_values": [
        "1. Annotation in progress",
        "2. Initial annotation complete",
        "3. In peer review",
        "4. Peer review complete",
        "5. In quality control",
        "6. Complete and final"
      ],
      "weight": 0,
      "instant_facet": true,
      "required": false
    },
    {
      "short_name": "Reviewer",
      "long_name": "Person responsible for reviewing annotations",
      "short_description": "This is the researcher that is responsible for ensuring the quality of the classifications applied to this incident.",
      "long_description": "The CSET taxonomy assigns individual researchers to each incident as the primary parties responsible for classifying the incident according to the taxonomy. This is the person responsible for assuring the integrity of annotator's classifications.",
      "display_type": "enum",
      "mongo_type": "string",
      "default": null,
      "placeholder": "Select name here",
      "permitted_values": [
        "Zach Arnold",
        "Thomas Giallella",
        "Dahlia Peterson",
        "Charlie Wang",
        "Srishti Khemka",
        "Devon Colmer",
        "Other"],
      "weight": 0,
      "instant_facet": true,
      "required": false
    },
    {
      "short_name": "Quality Control",
      "long_name": "Is selected for quality control?",
      "short_description": "Has someone flagged a potential issue with this incident's classifications?",
      "long_description": "The peer review process sometimes uncovers issues with the classifications that have been applied by the annotator. This field serves as a flag when there is a need for additional thought and input on the classifications applied",
      "display_type": "bool",
      "mongo_type": "bool",
      "default": false,
      "placeholder": false,
      "permitted_values": null,
      "weight": 10,
      "instant_facet": true,
      "required": false
    },
    {
      "short_name": "Full Description",
      "long_name": "Full description of the incident",
      "short_description": "A long summary of what transpired in the incident as determined by the annotator",
      "long_description": "The AI Incident database does not provide normative descriptions of incidents, but it does provide the ability for these descriptions to be included within taxonomies. This particular description is written by the annotator and can be of arbitrary length.",
      "display_type": "string",
      "mongo_type": "string",
      "default": null,
      "placeholder": "Describe the incident here",
      "permitted_values": null,
      "weight": 50,
      "instant_facet": false,
      "required": false
    },
    {
      "short_name": "Short Description",
      "long_name": "Short description of the incident",
      "short_description": "A short summary of what transpired in the incident as determined by the annotator",
      "long_description": "The AI Incident database does not provide normative descriptions of incidents, but it does provide the ability for these descriptions to be included within taxonomies. This particular description is written by the annotator and is expected to be fairly short.",
      "display_type": "string",
      "mongo_type": "string",
      "default": null,
      "placeholder": "Describe the incident here",
      "permitted_values": null,
      "weight": 60,
      "instant_facet": false,
      "required": false
    },
    {
      "short_name": "Beginning Date",
      "long_name": "Beginning Date",
      "short_description": "The date the incident first began.",
      "long_description": "This is the date where the incident first occurred in the real world and is generally associated with the year or day on which the harm took place.",
      "display_type": "date",
      "mongo_type": "date",
      "default": null,
      "placeholder": null,
      "permitted_values": null,
      "weight": 40,
      "instant_facet": true,
      "required": false
    },
    {
      "short_name": "Ending Date",
      "long_name": "Ending Date",
      "short_description": "The date the incident ended.",
      "long_description": "This is the date where the incident last occurred or finally ended in the real world and is generally associated with the year or day when the harm ended or ceased compounding.",
      "display_type": "date",
      "mongo_type": "date",
      "default": null,
      "placeholder": null,
      "permitted_values": null,
      "weight": 39,
      "instant_facet": true,
      "required": false
    },
    {
      "short_name": "Location",
      "long_name": "Location",
      "short_description": "Where the incident took place",
      "long_description": "Where in the world did the incident take place geographically?",
      "display_type": "location",
      "mongo_type": "string",
      "default": "global",
      "placeholder": "Input a named place as it could be found in Google maps",
      "permitted_values": null,
      "weight": 55,
      "instant_facet": true,
      "required": false
    },
    {
      "short_name": "Near miss",
      "long_name": "Harm nearly missed?",
      "short_description": "Was a harm only nearly averted?",
      "long_description": "The CSET taxonomy assigns individual researchers to each incident as the primary parties responsible for classifying the incident according to the taxonomy. This is the person responsible for assuring the integrity of annotator's classifications.",
      "display_type": "enum",
      "mongo_type": "string",
      "default": "Harm caused",
      "placeholder": null,
      "permitted_values": [
        "Unclear/unknown",
        "Near miss",
        "Harm caused"],
      "weight": 35,
      "instant_facet": true,
      "required": false
    },
    {
      "short_name": "Named entities",
      "long_name": "Named entities",
      "short_description": "These are the organizations and people related to the incident.",
      "long_description": "Organizations and people can both be related to the incident and are typically mentioned in incident reports.",
      "display_type": "list",
      "mongo_type": "array",
      "default": null,
      "placeholder": null,
      "permitted_values": null,
      "weight": 30,
      "instant_facet": true,
      "required": false
    },
    {
      "short_name": "Technology purveyor",
      "long_name": "Organization or person responsible for the technology",
      "short_description": "Who is responsible for the relevant tools or systems?",
      "long_description": "Who is responsible for the relevant tools or systems most related to the AI Incident?",
      "display_type": "list",
      "mongo_type": "array",
      "default": null,
      "placeholder": null,
      "permitted_values": null,
      "weight": 38,
      "instant_facet": true,
      "required": false
    },
    {
      "short_name": "Intent",
      "long_name": "Probable level of intent",
      "short_description": "Was the incident an accident, intentional, or is the intent unclear?",
      "long_description": "Here CSET researchers attempt to assign potential motives behind the incident.",
      "display_type": "enum",
      "mongo_type": "string",
      "default": "Accident",
      "placeholder": "Accident",
      "permitted_values": [
        "Accident",
        "Deliberate or expected",
        "Unclear"],
      "weight": 39,
      "instant_facet": true,
      "required": false
    },
    {
      "short_name": "Severity",
      "long_name": "Overall severity of harm",
      "short_description": "How bad is the harm for the most affected person or organization?",
      "long_description": "As judged by CSET researchers, what is the maximum degree of harm experienced by a single person or organization? If the incident takes place billions of times, but no individual experiences severe harm, then it will still be treated as lower severity.",
      "display_type": "enum",
      "mongo_type": "string",
      "default": null,
      "placeholder": null,
      "permitted_values": [
        "Negligible",
        "Minor",
        "Moderate",
        "Severe",
        "Critical",
        "Unclear/unknown"],
      "weight": 39,
      "instant_facet": true,
      "required": false
    },
    {
      "short_name": "Harm type",
      "long_name": "Harm type",
      "short_description": "What type of harm was caused?",
      "long_description": "What is the type of harm realized in the real world by an individual or organization?",
      "display_type": "multi",
      "mongo_type": "array",
      "default": null,
      "placeholder": null,
      "permitted_values": [
        "Harm to physical health/safety",
        "Psychological harm",
        "Financial harm",
        "Harm to physical property",
        "Harm to intangible property",
        "Harm to social or political systems",
        "Harm to civil liberties",
        "Other"],
      "weight": 39,
      "instant_facet": true,
      "required": false
    },
    {
      "short_name": "Lives lost",
      "long_name": "Human lives lost",
      "short_description": "Were human lives lost as a result of the incident?",
      "long_description": "Were human lives lost as a direct result of the incident?",
      "display_type": "bool",
      "mongo_type": "bool",
      "default": null,
      "placeholder": null,
      "permitted_values": null,
      "weight": 25,
      "instant_facet": true,
      "required": false
    },
    {
      "short_name": "Harm distribution basis",
      "long_name": "Uneven distribution of harms basis",
      "short_description": "Were the harms realized by specific populations?",
      "long_description": "Often harms are distributed in the world according to some attribute of the affected population. This field provides a collection of population attributes that can scope the harms",
      "display_type": "multi",
      "mongo_type": "array",
      "default": null,
      "placeholder": null,
      "permitted_values": [
        "Race",
        "Religion",
        "National origin or immigrant status",
        "Geography",
        "Age",
        "Sex",
        "Sexual orientation or gender identity",
        "Familial status or pregnancy",
        "Disability",
        "Veteran status",
        "Genetic information",
        "Financial means",
        "Ideology",
        "Other"],
      "weight": 39,
      "instant_facet": true,
      "required": false
    },
    {
      "short_name": "Infrastructure sectors",
      "long_name": "Infrastructure sectors affected",
      "short_description": "Which critical infrastructure sectors were affected, if any?",
      "long_description": "AI incidents often involve critical elements of local infrastructure and are selected here accordingly",
      "display_type": "multi",
      "mongo_type": "array",
      "default": null,
      "placeholder": null,
      "permitted_values": [
        "Chemical",
        "Commercial facilities",
        "Communications",
        "Critical manufacturing",
        "Dams",
        "Defense-industrial base",
        "Emergency services",
        "Energy",
        "Financial services",
        "Food and agriculture",
        "Government facilities",
        "Healthcare and public health",
        "Information technology",
        "Nuclear",
        "Transportation",
        "Water and wastewater"],
      "weight": 39,
      "instant_facet": true,
      "required": false
    }
    ]
}
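
A hand-written taxa document of this size is easy to get subtly wrong, so a small lint pass before import may be worthwhile. The Python sketch below is one possible check, not part of the AIID codebase: it flags repeated `short_name` values, missing `short_description` values (marked "Must be defined" above), and `multi` or `enum` fields without `permitted_values`. The function name `lint_taxonomy` and the sample data are hypothetical.

```python
from collections import Counter
from typing import Any

# Display types that select from a fixed set of values.
CHOICE_TYPES = {"multi", "enum"}


def lint_taxonomy(taxonomy: dict[str, Any]) -> list[str]:
    """Return human-readable problems found in a taxa document."""
    problems = []
    fields = taxonomy.get("field_list", [])

    # Assumption: short_name identifies a field, so repeats would collide.
    counts = Counter(f.get("short_name") for f in fields)
    for name, n in counts.items():
        if n > 1:
            problems.append(f"short_name {name!r} appears {n} times")

    for f in fields:
        name = f.get("short_name")
        if not f.get("short_description"):
            problems.append(f"{name!r} is missing short_description")
        if f.get("display_type") in CHOICE_TYPES and not f.get("permitted_values"):
            problems.append(f"{name!r} needs permitted_values")
    return problems


# Deliberately flawed sample, trimmed to the keys the checks read.
sample = {
    "namespace": "CSET",
    "field_list": [
        {"short_name": "Intent", "short_description": "x", "display_type": "enum",
         "permitted_values": ["Accident", "Deliberate or expected", "Unclear"]},
        {"short_name": "Intent", "short_description": "x", "display_type": "enum",
         "permitted_values": ["Accident", "Deliberate or expected", "Unclear"]},
        {"short_name": "Severity", "short_description": "", "display_type": "enum",
         "permitted_values": None},
    ],
}

for problem in lint_taxonomy(sample):
    print(problem)
```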

smcgregor commented 3 years ago

I inserted the following two documents into the classifications collection. The format has changed from the initial one presented above so that everything is not sitting in a single document.

Doc 1:

{
    "incident_id": 1,
    "namespace": "CSET",
    "classifications": {
      "Annotator":"Zach Arnold",
      "Annotation Status": "6. Complete and final",
      "Reviewer": "Devon Colmer",
      "Quality Control": false,
      "Full Description": "The content filtering system for YouTube's children's entertainment app, which incorporated algorithmic filters and human reviewers, failed to screen out inappropriate material, exposing an unknown number of children to videos that included sex, drugs, violence, profanity, and conspiracy theories. Many of the videos, which apparently numbered in the thousands, closely resembled popular children's cartoons such as Peppa Pig, but included disturbing or age-inappropriate content. Additional filters provided by YouTube, such as a 'restricted mode' filter, failed to block all of these videos, and YouTube's recommendation algorithm recommended them to child viewers, increasing the harm. The problem was reported as early as 2015 and was ongoing through 2018.",
      "Short Description": "YouTube’s content filtering and recommendation algorithms exposed children to disturbing and inappropriate videos.",
      "Beginning Date": { "$date": "2015-01-01T12:00:00.301Z" },
      "Ending Date": { "$date": "2018-01-01T12:00:00.301Z" },
      "Location": "global",
      "Near miss": "Unclear/unknown",
      "Named entities": ["YouTube", "Google", "YouTube Kids"],
      "Technology purveyor": ["YouTube", "Google", "YouTube Kids"],
      "Intent": "Accident",
      "Severity": "Moderate",
      "Harm type": "Psychological harm",
      "Lives lost": false,
      "Harm distribution basis": ["Age"],
      "Infrastructure sectors": []
}
}

Doc 2:

{
    "incident_id": 2,
    "namespace": "CSET",
    "classifications": {
      "Annotator":"Zach Arnold",
      "Annotation Status": "6. Complete and final",
      "Reviewer": "Devon Colmer",
      "Quality Control": false,
      "Full Description": "On December 5, 2018, a robot punctured a can of bear spray in Amazon's fulfillment center in Robbinsville, New Jersey. Amazon's spokesman stated that 'an automated machine punctured a 9-oz can of bear repellent.' The punctured can released capsaicin, an irritant, into the air. Several dozen workers were exposed to the fumes, causing symptoms including trouble breathing and a burning sensation in the eyes and throat. 24 workers were hospitalized, and one was sent to intensive care and intubated.",
      "Short Description": "Twenty-four Amazon workers in New Jersey were hospitalized after a robot punctured a can of bear repellent spray in a warehouse.",
      "Beginning Date": { "$date": "2018-12-05T12:00:00.301Z" },
      "Ending Date": { "$date": "2018-12-05T12:00:00.301Z" },
      "Location": "Robbinsville, NJ",
      "Near miss": "Harm caused",
      "Named entities": ["Amazon"],
      "Technology purveyor": ["Amazon"],
      "Intent": "Accident",
      "Severity": "Moderate",
      "Harm type": ["Harm to physical health/safety", "Harm to physical property"],
      "Lives lost": false,
      "Harm distribution basis": [],
      "Infrastructure sectors": []
}
}
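
Since the taxa document is meant to enforce types and values, a classifications document can be checked against it before insertion. The following Python sketch shows one way to do that, under the assumption that keys inside `classifications` match each field's `short_name`; `validate_classification` is a hypothetical helper, and the sample taxonomy is trimmed to three fields for brevity.

```python
from typing import Any


def validate_classification(doc: dict[str, Any], taxonomy: dict[str, Any]) -> list[str]:
    """Compare one classifications document against its taxa document."""
    errors = []
    if doc.get("namespace") != taxonomy.get("namespace"):
        errors.append("namespace mismatch")
    fields = {f["short_name"]: f for f in taxonomy["field_list"]}
    values = doc.get("classifications", {})

    for name in values:
        if name not in fields:
            errors.append(f"{name!r} is not defined in the taxonomy")

    for name, spec in fields.items():
        if name not in values:
            if spec.get("required"):
                errors.append(f"{name!r} is required")
            continue
        value, kind = values[name], spec["display_type"]
        allowed = spec.get("permitted_values") or []
        if kind == "bool" and not isinstance(value, bool):
            errors.append(f"{name!r} should be a boolean")
        elif kind == "enum" and value not in allowed:
            errors.append(f"{name!r} has unpermitted value {value!r}")
        elif kind in ("multi", "list") and not isinstance(value, list):
            errors.append(f"{name!r} should be an array")
        elif kind == "multi":
            for item in value:
                if item not in allowed:
                    errors.append(f"{name!r} has unpermitted value {item!r}")
    return errors


taxonomy = {
    "namespace": "CSET",
    "field_list": [
        {"short_name": "Intent", "display_type": "enum", "required": False,
         "permitted_values": ["Accident", "Deliberate or expected", "Unclear"]},
        {"short_name": "Harm type", "display_type": "multi", "required": False,
         "permitted_values": ["Psychological harm", "Financial harm"]},
        {"short_name": "Lives lost", "display_type": "bool", "required": False,
         "permitted_values": None},
    ],
}
doc = {
    "incident_id": 1,
    "namespace": "CSET",
    "classifications": {
        "Intent": "Accident",
        "Harm type": "Psychological harm",  # deliberately malformed: string, not array
        "Lives lost": False,
    },
}

print(validate_classification(doc, taxonomy))  # ["'Harm type' should be an array"]
```

A check like this could also serve the "verify the integrity of the tags" step once CSV import exists.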

smcgregor commented 3 years ago

@alexmcode I updated the old document's namespace from CSET to CSETv1 in case you are depending on it. I then added the following document which is a complete digestion of the CSET taxonomy.

{
  "namespace": "CSET",
  "weight": 50,
  "description": "Georgetown Center on Security and Emerging Technology. A taxonomy classifying AI incidents according to their organizational, technological, and impacted population factors. This taxonomy is being imported from a laborious coding set that will be detailed here at a later date.",
  "field_list": [
    {
      "short_name": "Annotator",
      "long_name": "Person responsible for the annotations",
      "short_description": "This is the researcher that is responsible for applying the classifications of the CSET taxonomy.",
      "long_description": "The CSET taxonomy assigns individual researchers to each incident as the primary parties responsible for classifying the incident according to the taxonomy. This is the person responsible for the incident.",
      "display_type": "enum",
      "mongo_type": "string",
      "default": null,
      "placeholder": "Select name here",
      "permitted_values": [
        "Zach Arnold",
        "Thomas Giallella",
        "Dahlia Peterson",
        "Charlie Wang",
        "Srishti Khemka",
        "Devon Colmer",
        "Other"],
      "weight": 0,
      "instant_facet": true,
      "required": false
    },
    {
      "short_name": "Annotation Status",
      "long_name": "Where in the annotation process is this incident?",
      "short_description": "What is the quality assurance status of the CSET classifications for this incident?",
      "long_description": "The CSET taxonomy has a quality assurance funnel that all classified incidents move through. This ",
      "display_type": "enum",
      "mongo_type": "string",
      "default": null,
      "placeholder": "Select process status here",
      "permitted_values": [
        "1. Annotation in progress",
        "2. Initial annotation complete",
        "3. In peer review",
        "4. Peer review complete",
        "5. In quality control",
        "6. Complete and final"
      ],
      "weight": 0,
      "instant_facet": true,
      "required": false
    },
    {
      "short_name": "Reviewer",
      "long_name": "Person responsible for reviewing annotations",
      "short_description": "This is the researcher that is responsible for ensuring the quality of the classifications applied to this incident.",
      "long_description": "The CSET taxonomy assigns individual researchers to each incident as the primary parties responsible for classifying the incident according to the taxonomy. This is the person responsible for assuring the integrity of annotator's classifications.",
      "display_type": "enum",
      "mongo_type": "string",
      "default": null,
      "placeholder": "Select name here",
      "permitted_values": [
        "Zach Arnold",
        "Thomas Giallella",
        "Dahlia Peterson",
        "Charlie Wang",
        "Srishti Khemka",
        "Devon Colmer",
        "Other"],
      "weight": 0,
      "instant_facet": true,
      "required": false
    },
    {
      "short_name": "Quality Control",
      "long_name": "Is selected for quality control?",
      "short_description": "Has someone flagged a potential issue with this incident's classifications?",
      "long_description": "The peer review process sometimes uncovers issues with the classifications that have been applied by the annotator. This field serves as a flag when there is a need for additional thought and input on the classifications applied",
      "display_type": "bool",
      "mongo_type": "bool",
      "default": false,
      "placeholder": false,
      "permitted_values": null,
      "weight": 10,
      "instant_facet": true,
      "required": false
    },
    {
      "short_name": "Full Description",
      "long_name": "Full description of the incident",
      "short_description": "A long summary of what transpired in the incident as determined by the annotator",
      "long_description": "The AI Incident database does not provide normative descriptions of incidents, but it does provide the ability for these descriptions to be included within taxonomies. This particular description is written by the annotator and can be of arbitrary length.",
      "display_type": "string",
      "mongo_type": "string",
      "default": null,
      "placeholder": "Describe the incident here",
      "permitted_values": null,
      "weight": 50,
      "instant_facet": false,
      "required": false
    },
    {
      "short_name": "Short Description",
      "long_name": "Short description of the incident",
      "short_description": "A short summary of what transpired in the incident as determined by the annotator",
      "long_description": "The AI Incident database does not provide normative descriptions of incidents, but it does provide the ability for these descriptions to be included within taxonomies. This particular description is written by the annotator and is expected to be fairly short.",
      "display_type": "string",
      "mongo_type": "string",
      "default": null,
      "placeholder": "Describe the incident here",
      "permitted_values": null,
      "weight": 60,
      "instant_facet": false,
      "required": false
    },
    {
      "short_name": "Beginning Date",
      "long_name": "Beginning Date",
      "short_description": "The date the incident first began.",
      "long_description": "This is the date where the incident first occurred in the real world and is generally associated with the year or day on which the harm took place.",
      "display_type": "date",
      "mongo_type": "date",
      "default": null,
      "placeholder": null,
      "permitted_values": null,
      "weight": 40,
      "instant_facet": true,
      "required": false
    },
    {
      "short_name": "Ending Date",
      "long_name": "Ending Date",
      "short_description": "The date the incident ended.",
      "long_description": "This is the date where the incident last occurred or finally ended in the real world and is generally associated with the year or day when the harm ended or ceased compounding.",
      "display_type": "date",
      "mongo_type": "date",
      "default": null,
      "placeholder": null,
      "permitted_values": null,
      "weight": 39,
      "instant_facet": true,
      "required": false
    },
    {
      "short_name": "Location",
      "long_name": "Location",
      "short_description": "Where the incident took place",
      "long_description": "Where in the world did the incident take place geographically?",
      "display_type": "location",
      "mongo_type": "string",
      "default": "global",
      "placeholder": "Input a named place as it could be found in Google maps",
      "permitted_values": null,
      "weight": 55,
      "instant_facet": true,
      "required": false
    },
    {
      "short_name": "Near miss",
      "long_name": "Harm nearly missed?",
      "short_description": "Was a harm only nearly averted?",
      "long_description": "The CSET taxonomy assigns individual researchers to each incident as the primary parties responsible for classifying the incident according to the taxonomy. This is the person responsible for assuring the integrity of annotator's classifications.",
      "display_type": "enum",
      "mongo_type": "string",
      "default": "Harm caused",
      "placeholder": null,
      "permitted_values": [
        "Unclear/unknown",
        "Near miss",
        "Harm caused"],
      "weight": 35,
      "instant_facet": true,
      "required": false
    },
    {
      "short_name": "Named entities",
      "long_name": "Named entities",
      "short_description": "These are the organizations and people related to the incident.",
      "long_description": "Organizations and people can both be related to the incident and are typically mentioned in incident reports.",
      "display_type": "list",
      "mongo_type": "array",
      "default": null,
      "placeholder": null,
      "permitted_values": null,
      "weight": 30,
      "instant_facet": true,
      "required": false
    },
    {
      "short_name": "Technology purveyor",
      "long_name": "Organization or person responsible for the technology",
      "short_description": "Who is responsible for the relevant tools or systems?",
      "long_description": "Who is responsible for the relevant tools or systems most related to the AI Incident?",
      "display_type": "list",
      "mongo_type": "array",
      "default": null,
      "placeholder": null,
      "permitted_values": null,
      "weight": 38,
      "instant_facet": true,
      "required": false
    },
    {
      "short_name": "Intent",
      "long_name": "Probable level of intent",
      "short_description": "Was the incident an accident, intentional, or is the intent unclear?",
      "long_description": "Here CSET researchers attempt to assign potential motives behind the incident.",
      "display_type": "enum",
      "mongo_type": "string",
      "default": "Accident",
      "placeholder": "Accident",
      "permitted_values": [
        "Accident",
        "Deliberate or expected",
        "Unclear"],
      "weight": 39,
      "instant_facet": true,
      "required": false
    },
    {
      "short_name": "Severity",
      "long_name": "Overall severity of harm",
      "short_description": "How bad is the harm for the most effected person or organization?",
      "long_description": "As judged by CSET researchers, what is the maximum degree of harm experienced by a single person or organization? If the incident takes place billions of times, but no individual experiences severe harm, then it will still be treated as lower severity.",
      "display_type": "enum",
      "mongo_type": "string",
      "default": null,
      "placeholder": null,
      "permitted_values": [
        "Negligible",
        "Minor",
        "Moderate",
        "Severe",
        "Critical",
        "Unclear/unknown"],
      "weight": 39,
      "instant_facet": true,
      "required": false
    },
    {
      "short_name": "Harm type",
      "long_name": "Harm type",
      "short_description": "What type of harm was caused?",
      "long_description": "What is the type of harm realized in the real world by an individual or organization?",
      "display_type": "multi",
      "mongo_type": "array",
      "default": null,
      "placeholder": null,
      "permitted_values": [
        "Harm to physical health/safety",
        "Psychological harm",
        "Financial harm",
        "Harm to physical property",
        "Harm to intangible property",
        "Harm to social or political systems",
        "Harm to civil liberties",
        "Other"],
      "weight": 39,
      "instant_facet": true,
      "required": false
    },
    {
      "short_name": "Lives lost",
      "long_name": "Human lives lost",
      "short_description": "Were human lives lost as a result of the incident?",
      "long_description": "Were human lives lost as a direct result of the incident?",
      "display_type": "bool",
      "mongo_type": "bool",
      "default": null,
      "placeholder": null,
      "permitted_values": null,
      "weight": 25,
      "instant_facet": true,
      "required": false
    },
    {
      "short_name": "Harm distribution basis",
      "long_name": "Uneven distribution of harms basis",
      "short_description": "Where the harms realized by specific populations?",
      "long_description": "Often harms are distributed in the world according to some attribute of the affected population. This field provides a collection of population attributes that can scope the harms",
      "display_type": "multi",
      "mongo_type": "array",
      "default": null,
      "placeholder": null,
      "permitted_values": [
        "Race",
        "Religion",
        "National origin or immigrant status",
        "Geography",
        "Age",
        "Sex",
        "Sexual orientation or gender identity",
        "Familial status or pregnancy",
        "Disability",
        "Veteran status",
        "Genetic information",
        "Financial means",
        "Ideology",
        "Other"],
      "weight": 39,
      "instant_facet": true,
      "required": false
    },
    {
      "short_name": "Infrastructure sectors",
      "long_name": "Infrastructure sectors affected",
      "short_description": "Which critical infrastructure sectors were affected, if any?",
      "long_description": "AI incidents often involve critical elements of local infrastructure and are selected here accordingly",
      "display_type": "multi",
      "mongo_type": "array",
      "default": null,
      "placeholder": null,
      "permitted_values": [
        "Chemical",
        "Commercial facilities",
        "Communications",
        "Critical manufacturing",
        "Dams",
        "Defense-industrial base",
        "Emergency services",
        "Energy",
        "Financial services",
        "Food and agriculture",
        "Government facilities",
        "Healthcare and public health",
        "Information technology",
        "Nuclear",
        "Transportation",
        "Water and wastewater"],
      "weight": 39,
      "instant_facet": true,
      "required": false
    },
    {
      "short_name": "Financial cost",
      "long_name": "Total Financial Cost",
      "short_description": "The direct monetary cost incurred from the incident.",
      "long_description": "For incidents where there is a known and directly attributable loss event, this gives the total.",
      "display_type": "string",
      "mongo_type": "string",
      "default": null,
      "placeholder": null,
      "permitted_values": null,
      "weight": 3,
      "instant_facet": false,
      "required": false
    },
    {
      "short_name": "Laws implicated",
      "long_name": "Laws covering the incident",
      "short_description": "What laws are associated with the incident?",
      "long_description": "Are there laws making the incident a criminal act or otherwise prohibiting, allowing, or regulating aspects of the incident?",
      "display_type": "list",
      "mongo_type": "array",
      "default": null,
      "placeholder": null,
      "permitted_values": null,
      "weight": 51,
      "instant_facet": false,
      "required": false
    },
    {
      "short_name": "AI System Description",
      "long_name": "Description of AI System Involved",
      "short_description": "A textual description of the intelligent system implicated in the accident",
      "long_description": "A brief description (no more than a few sentences) of each of the AI systems involved in the accident. Indicate the system's intended function, the context in which it was deployed, and any available details about the algorithms, hardware, and training data involved in the system.",
      "display_type": "string",
      "mongo_type": "string",
      "default": null,
      "placeholder": "Describe the AI system here",
      "permitted_values": null,
      "weight": 45,
      "instant_facet": false,
      "required": false
    },
    {
      "short_name": "Data Inputs",
      "long_name": "Description of the data inputs to the AI systems",
      "short_description": "Many intelligent systems receive inputs (data) and produce system actions or decisions. This field captures what is being processed by the system.",
      "long_description": "A brief description (no more than a few sentences) of the data that the AI system(s) used or were trained on.",
      "display_type": "list",
      "mongo_type": "array",
      "default": null,
      "placeholder": "Describe the AI system here",
      "permitted_values": null,
      "weight": 45,
      "instant_facet": false,
      "required": false
    },
    {
      "short_name": "System developer",
      "long_name": "System developer",
      "short_description": "The entities that created the AI system.",
      "long_description": "The entities that created the AI system.",
      "display_type": "list",
      "mongo_type": "array",
      "default": null,
      "placeholder": null,
      "permitted_values": null,
      "weight": 30,
      "instant_facet": true,
      "required": false
    },
    {
      "short_name": "Sector of deployment",
      "long_name": "Sector of deployment",
      "short_description": "The primary economic sector in which the AI system(s) involved in the accident were operating.",
      "long_description": "The primary economic sector in which the AI system(s) involved in the accident were operating. These are from the ISIC system. You can also refer to the official ISIC definitions for more details.",
      "display_type": "multi",
      "mongo_type": "array",
      "default": null,
      "placeholder": null,
      "permitted_values": [
        "Manufacturing",
        "Electricity, gas, steam and air conditioning supply",
        "Water supply",
        "Construction",
        "Wholesale and retail trade",
        "Transportation and storage",
        "Accommodation and food service activities",
        "Information and communication",
        "Financial and insurance activities",
        "Real estate activities",
        "Professional, scientific and technical activities",
        "Administrative and support service activities",
        "Public administration and defence",
        "Education",
        "Human health and social work activities",
        "Arts, entertainment and recreation",
        "Other service activities",
        "Activities of households as employers",
        "Activities of extraterritorial organizations and bodies"],
      "weight": 39,
      "instant_facet": true,
      "required": false
    },
    {
      "short_name": "Public sector deployment",
      "long_name": "Public sector deployment",
      "short_description": "Are public services provided by a government entity involved?",
      "long_description": "Write “Yes” if the AI system(s) involved in the accident were being used by the public sector or for the administration of public goods (for example, public transportation). Write “No” if the system(s) were being used in the private sector or for commercial purposes (for example, a ride-sharing company), on the other. Note that “public sector” means something much narrower than “used by the public.” An autonomous car driven by a private citizen doesn’t count; an autonomous subway car does. Consumer products used by the public, like smartphones and Alexa, don’t count.",
      "display_type": "bool",
      "mongo_type": "bool",
      "default": false,
      "placeholder": false,
      "permitted_values": null,
      "weight": 10,
      "instant_facet": true,
      "required": false
    },
    {
      "short_name": "Nature of end user",
      "long_name": "Nature of end user",
      "short_description": "If there was an end user, were they experts or amateurs?",
      "long_description": "If users with special training or technical expertise were the ones meant to benefit from the AI system(s)’ operation, select “Expert.” If the AI systems were primarily meant to benefit the general public or untrained users, select “Amateur.”",
      "display_type": "enum",
      "mongo_type": "string",
      "default": "Accident",
      "placeholder": "Accident",
      "permitted_values": [
        "Expert",
        "Amateurs"],
      "weight": 53,
      "instant_facet": true,
      "required": false
    },
    {
      "short_name": "Level of autonomy",
      "long_name": "Level of autonomy",
      "short_description": "The degree of autonomy exercised by the AI system(s)",
      "long_description": "Indicate the degree of autonomy exercised by the AI system(s). Autonomy refers to the degree to which the system functions independently from human intervention. High: There is no human involved in the system action execution. Medium: The system generates a decision and a human oversees the resulting action. Low: The system generates decision-support output and a human makes a decision and executes an action. If there isn’t enough evidence to reasonably select one of these, select Unclear/unknown.",
      "display_type": "enum",
      "mongo_type": "string",
      "default": "Accident",
      "placeholder": "Accident",
      "permitted_values": [
        "High",
        "Medium",
        "Low"],
      "weight": 55,
      "instant_facet": true,
      "required": false
    },
    {
      "short_name": "Relevant AI functions",
      "long_name": "Relevant AI functions",
      "short_description": "What types of activities are being performed by the intelligent system.",
      "long_description": "Indicate which of the following high-level functions the AI system(s) was/were intended to perform. Perception: Sensing and understanding the environment. Cognition: Making decisions. Action: Carrying out decisions through physical or digital means.",
      "display_type": "multi",
      "mongo_type": "array",
      "default": null,
      "placeholder": null,
      "permitted_values": [
        "Perception",
        "Cognition",
        "Action",
        "Unclear"],
      "weight": 39,
      "instant_facet": true,
      "required": false
    },
    {
      "short_name": "AI techniques",
      "long_name": "AI tools and techniques used",
      "short_description": "Hardware and software involved in the AI system.",
      "long_description": "The terms describing the hardware and software involved in the AI system(s), according to the available evidence. Examples: supervised learning, unsupervised learning, reinforcement learning, GAN, open-source, PyTorch.",
      "display_type": "list",
      "mongo_type": "array",
      "default": null,
      "placeholder": null,
      "permitted_values": null,
      "weight": 30,
      "instant_facet": true,
      "required": false
    },
    {
      "short_name": "AI applications",
      "long_name": "AI functions and applications used",
      "short_description": "Details on the tasks performed by the intelligent system.",
      "long_description": "Terms describing the AI systems' functions and applications, according to the available evidence. Examples: recommendation engine, decision support, synthetic media, content generation, facial recognition, image recognition, speech recognition, biometrics, NLP, chatbot, risk scoring, resource optimization, personalization, forecasting",
      "display_type": "list",
      "mongo_type": "array",
      "default": null,
      "placeholder": null,
      "permitted_values": null,
      "weight": 55,
      "instant_facet": true,
      "required": false
    },
    {
      "short_name": "Physical system",
      "long_name": "Sector of deployment",
      "short_description": "Into what type of physical system was the AI integrated, if any?",
      "long_description": "If the AI system(s) was embedded into or otherwise tightly associated with a particular type of hardware, indicate which type.",
      "display_type": "multi",
      "mongo_type": "array",
      "default": null,
      "placeholder": null,
      "permitted_values": [
        "Consumer device",
        "Industrial process system",
        "Weapons system",
        "Vehicle/mobile robot",
        "Software only",
        "Unknown/unclear"],
      "weight": 39,
      "instant_facet": true,
      "required": false
    },
    {
      "short_name": "Problem Nature",
      "long_name": "Causitive factors within AI system",
      "short_description": "What was the nature of the problem(s) with the AI that led to the accident?",
      "long_description": "If the AI system(s) was embedded into or otherwise tightly associated with a particular type of hardware, indicate which type.",
      "display_type": "multi",
      "mongo_type": "array",
      "default": null,
      "placeholder": null,
      "permitted_values": [
        "Specification",
        "Robustness",
        "Assurance",
        "Unknown/unclear"],
      "weight": 56,
      "instant_facet": true,
      "required": false
    },
    {
      "short_name": "Notes",
      "long_name": "Annotator notes",
      "short_description": "Anything that should be noted about the incident annotation",
      "long_description": "Writing notes is not a substitute for exercising judgment! In other words, if you’re uncertain about how best to fill out a particular field, don’t just skip it and leave a note instead. Fill out every field to the best of your ability, making judgment calls consistent with these instructions as needed. If uncertainties remain, it’s OK (but not required) to note them here, but you still need to make the calls.",
      "display_type": "text",
      "mongo_type": "string",
      "default": null,
      "placeholder": null,
      "permitted_values": null,
      "weight": 2,
      "instant_facet": false,
      "required": false
    }
    ]
}
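
As a rough illustration of how a field list like the one above could act as the enforcer of types and values, here is a minimal sketch in JavaScript. The names `matchesMongoType`, `validateClassification`, and `fieldList` are hypothetical and not part of the AIID codebase, and the sketch assumes dates arrive as `Date` objects.

```javascript
// Hypothetical helper, not part of the AIID codebase.
// Returns true when a value is compatible with the field's mongo_type.
const matchesMongoType = (value, mongoType) => {
  switch (mongoType) {
    case 'string':
      return typeof value === 'string';
    case 'bool':
      return typeof value === 'boolean';
    case 'array':
      return Array.isArray(value);
    case 'date':
      // Assumption: dates are passed around as valid Date objects.
      return value instanceof Date && !Number.isNaN(value.getTime());
    default:
      return true; // types not listed here are not checked in this sketch
  }
};

// Checks one classification object against a list of field definitions
// and returns an array of human-readable error strings (empty when valid).
const validateClassification = (fieldList, classification) => {
  const errors = [];
  for (const field of fieldList) {
    const value = classification[field.short_name];
    if (value === undefined || value === null) {
      if (field.required) errors.push(`${field.short_name}: missing required value`);
      continue;
    }
    if (!matchesMongoType(value, field.mongo_type)) {
      errors.push(`${field.short_name}: expected ${field.mongo_type}`);
      continue;
    }
    if (Array.isArray(field.permitted_values)) {
      // "multi" fields hold arrays, "enum" fields hold a single string.
      const values = Array.isArray(value) ? value : [value];
      for (const v of values) {
        if (!field.permitted_values.includes(v)) {
          errors.push(`${field.short_name}: "${v}" is not a permitted value`);
        }
      }
    }
  }
  return errors;
};

// Two field definitions trimmed down from the document above.
const fieldList = [
  { short_name: 'Intent', mongo_type: 'string', required: false,
    permitted_values: ['Accident', 'Deliberate or expected', 'Unclear'] },
  { short_name: 'Lives lost', mongo_type: 'bool', required: false, permitted_values: null },
];

console.log(validateClassification(fieldList, { Intent: 'Accident', 'Lives lost': false })); // []
console.log(validateClassification(fieldList, { Intent: 'Oops', 'Lives lost': 'Yes' }));
```

Running the same membership check over each field's own `default` would also flag defaults that are not among that field's `permitted_values`.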

smcgregor commented 3 years ago

@alexmcode I don't believe there is a great correspondence between the parsing script field names and the names of the fields I have in this JSON document. I don't think you should spend time reconciling the CSV->JSON script field names with the names in the taxa collection yet. Let's get the first version of the unvalidated classifications collection up first.

alexmcode commented 3 years ago

@smcgregor I'm OK with that. Let me know what my next step should be.

smcgregor commented 3 years ago

@alexmcode I made everything consistent in capitalization and uploaded everything to the database. I have the changes I made to the migrate script below. This should be enough to get rolling on dropping taxonomy data into all the citation pages via the builds!

Current classifications collection: classifications.json.log
Current taxa collection: taxa.json.log

+++ b/site/gatsby-site/src/scripts/migrateTaxaToClassification.js
@@ -1,6 +1,8 @@
 const csv = require('csvtojson');
+const fs = require('fs');

-const csvFilePath = 'rawClassifications.csv';
+const csvFilePath = 'MYPATHHERE/rawClassifications.csv';
+const outFilePath = 'MYPATHHERE/classifications.json';

 // i = 5 is the first row
 // i < 5 is for headers and subheaders
@@ -118,32 +120,32 @@ const getClassification = (r) => {
     'Beginning Date': convertStringToDate(r.field8),
     'Ending Date': convertStringToDate(r.field9),
     Location: r.field10,
-    'Near miss': r.field11,
-    'Named entities': r.field12.split('; '),
-    'Technology purveyor': getTechPurveyorArray(r),
+    'Near Miss': r.field11,
+    'Named Entities': r.field12.split('; '),
+    'Technology Purveyor': getTechPurveyorArray(r),
     Intent: r.field19,
     Severity: r.field20,
     // Has also value Unclear/unknown
-    'Lives lost': r.field31 === 'Yes' ? true : false,
-    'Harm distribution basis': getArrayForSubfields(r, harmBasisFields),
+    'Lives Lost': r.field31 === 'Yes' ? true : false,
+    'Harm Distribution Basis': getArrayForSubfields(r, harmBasisFields),
     // missing from classification collection
-    'Harm type': getArrayForSubfields(r, harmTypesFields),
-    'Infrastructure sectors': getArrayForSubfields(r, infraSectorsFields),
+    'Harm Type': getArrayForSubfields(r, harmTypesFields),
+    'Infrastructure Sectors': getArrayForSubfields(r, infraSectorsFields),
     // not yet in taxonomy fields
-    'Total finacial cost': r.field65,
-    'Laws implicated': r.field66,
-    'Description AI': r.field67,
-    'Description data inputs': r.field68,
-    'System developer': r.field69,
-    'Sector of deployment': r.field70,
-    'Public sector deployment': r.field71,
-    'Nature of end user': r.field72,
-    'Level of autonomy': r.field73,
-    'AI functions': getArrayForSubfields(r, aiFunctionFields),
-    'Tools and techniques': r.field80,
-    'Functions and applications': r.field81,
-    'Physical system integrated': getArrayForSubfields(r, sysIntegratedFields),
-    'Proplem nature': getArrayForSubfields(r, problemNatureFields),
+    'Financial Cost': r.field65,
+    'Laws Implicated': r.field66,
+    'AI System Description': r.field67,
+    'Data Inputs': r.field68,
+    'System Developer': r.field69,
+    'Sector of Deployment': r.field70,
+    'Public Sector Deployment': r.field71,
+    'Nature of End User': r.field72,
+    'Level of Autonomy': r.field73,
+    'Relevant AI functions': getArrayForSubfields(r, aiFunctionFields),
+    'AI techniques': r.field80,
+    'AI Applications': r.field81,
+    'Physical System': getArrayForSubfields(r, sysIntegratedFields),
+    'Problem Nature': getArrayForSubfields(r, problemNatureFields),
     Notes: r.field96,
   };
 };
@@ -160,7 +162,7 @@ const main = () => {

       noHeadersJsonObj.forEach((r) => {
         nodes.push({
-          incident_id: r.field1,
+          incident_id: parseInt(r.field1, 10),
           namespace: 'CSET',
           classifications: getClassification(r),
         });
@@ -168,6 +170,7 @@ const main = () => {

       console.log('========================');
       console.log(nodes);
+      fs.writeFileSync(outFilePath, JSON.stringify(nodes, null, 4));
     });
 };
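
Since the renamed keys are meant to line up with the field names in the taxa collection, a quick check like the following could catch stragglers before upload. This is only a sketch: `findUnknownKeys` is a hypothetical helper, the sample names are illustrative, and it assumes a case-insensitive comparison against `short_name` is the matching rule wanted.

```javascript
// Hypothetical check, not part of the migrate script: report classification
// keys that have no matching taxa short_name, ignoring capitalization.
const findUnknownKeys = (shortNames, classifications) => {
  const known = new Set(shortNames.map((name) => name.toLowerCase()));
  return Object.keys(classifications).filter((key) => !known.has(key.toLowerCase()));
};

// Illustrative sample data; the misspelled key is deliberate.
const shortNames = ['Near miss', 'Financial cost', 'Problem Nature'];
const classifications = {
  'Near Miss': 'Harm caused',
  'Finacial Cost': '',
  'Problem Nature': [],
};

console.log(findUnknownKeys(shortNames, classifications)); // [ 'Finacial Cost' ]
```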

alexmcode commented 3 years ago

I've applied your changes above and also resolved the absolute path for the raw file, so the only requirement is to have the CSV in the src/scripts folder.

smcgregor commented 3 years ago

merged to production #166

responsible-ai-collaborative / aiid

Georgetown CSET Taxonomy #92
