project-open-data / project-open-data.github.io

Open Data Policy — Managing Information as an Asset
https://project-open-data.cio.gov/

Provide documentation and examples on the use of redactions in JSON #446

Closed philipashlock closed 9 years ago

philipashlock commented 9 years ago

The general guidance on redactions for federal agencies is as follows, but we need to provide examples of what this looks like as JSON.

| Redaction text | Brief FOIA exemption description |
| --- | --- |
| [[REDACTED-EX B3]] | Specifically exempted from disclosure by statute (other than FOIA), provided that such statute (A) requires that the matters be withheld from the public in such a manner as to leave no discretion on the issue, or (B) establishes particular criteria for withholding or refers to particular types of matters to be withheld. |
| [[REDACTED-EX B4]] | Trade secrets and commercial or financial information obtained from a person and privileged or confidential. |
| [[REDACTED-EX B5]] | Inter-agency or intra-agency memorandums or letters which would not be available by law to a party other than an agency in litigation with the agency. |
| [[REDACTED-EX B6]] | Personnel and medical files and similar files the disclosure of which would constitute a clearly unwarranted invasion of personal privacy. |

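
As a rough illustration of how these markers might be handled in code, the sketch below maps the exemption codes listed above to short labels and finds markers in a string. The names `FOIA_EXEMPTIONS`, `REDACTION_RE`, and `find_redactions` are hypothetical, the labels are paraphrases rather than official wording, and the pattern assumes markers are spelled exactly as `[[REDACTED-EX Bn]]`.

```python
import re

# Short paraphrased labels for the exemption codes above (not official wording).
FOIA_EXEMPTIONS = {
    "B3": "Exempted from disclosure by another statute",
    "B4": "Trade secrets and confidential commercial or financial information",
    "B5": "Privileged inter-agency or intra-agency memorandums or letters",
    "B6": "Personnel, medical, and similar files affecting personal privacy",
}

# Assumption: markers always look like [[REDACTED-EX B4]].
REDACTION_RE = re.compile(r"\[\[REDACTED-EX (B\d)\]\]")


def find_redactions(text):
    """Return the exemption codes of all redaction markers found in text."""
    return REDACTION_RE.findall(text)


codes = find_redactions("U.S. Widget Statistics for [[REDACTED-EX B4]]")
print(codes)  # ['B4']
```
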
philipashlock commented 9 years ago

Implementation of this has started in https://github.com/GSA/project-open-data-dashboard/issues/83, so the example below (adapted from the existing example) would pass schema validation.

Note that the particular exemption reason denoted by "B3" in the example used [[REDACTED-EX B3]] might not make sense in some of the places it's used. More generally, the places where the redactions are used in this example might not make sense given the descriptions used in the other fields. The example here is intended only to demonstrate what the redaction text would look like in the JSON syntax.

It's worth considering whether some fields might never need to be redacted, e.g., accessLevel, identifier, isPartOf, bureauCode, and programCode. With a traditional redacted paper document, I imagine the page numbers are never redacted, even if the full page is. Similarly, it seems necessary to retain the identifier even if everything else is redacted, so that you can at least distinguish between different redacted records.

{
    "@context": "https://project-open-data.cio.gov/v1.1/schema/catalog.jsonld",
    "@id": "http://www.agency.gov/data.json",
    "@type": "dcat:Catalog",    
    "conformsTo": "https://project-open-data.cio.gov/v1.1/schema", 
    "describedBy": "https://project-open-data.cio.gov/v1.1/schema/catalog.json",
    "dataset": [
        {
            "@type": "dcat:Dataset",
            "accessLevel": "non-public", 
            "accrualPeriodicity": "R/P1Y", 
            "bureauCode": [
                "018:10"
            ],
            "conformsTo": "http://www.agency.gov/widget-taxonomy/",
            "contactPoint": {
                "@type": "vcard:Contact",
                "fn": "Jane Doe", 
                "hasEmail": "mailto:jane.doe@agency.gov"
            }, 
            "describedBy": "http://www.agency.gov/datasets/widgets-dictionary.html", 
            "dataQuality": true, 
            "description": "This dataset provides national statistics on the production of widgets for [[REDACTED-EX B4]]", 
            "distribution": [
                {
                    "@type": "dcat:Distribution",
                    "description": "[[REDACTED-EX B4]] widgets data as a CSV file", 
                    "downloadURL": "[[REDACTED-EX B4]]", 
                    "format": "CSV", 
                    "mediaType": "text/csv", 
                    "title": "[[REDACTED-EX B4]]-widgets.csv"
                }
            ], 
            "identifier": "https://metadata.agency.gov/10.7927/H4PZ56R2", 
            "issued": "2011-11-22", 
            "keyword": [
                "widget", 
                "manufacturing", 
                "factory"
            ], 
            "landingPage": "http://agency.gov/widgets/data", 
            "language": [
                "en-US"
            ], 
            "license": null, 
            "modified": "2011-11-19T12:00:00Z", 
            "primaryITInvestmentUII": "021-006227212", 
            "programCode": [
                "018:001"
            ], 
            "publisher": {
                "@type": "org:Organization",
                "name": "Widget Services", 
                "subOrganizationOf": {
                    "@type": "org:Organization",
                    "name": "Office of Widget Statistics"                    
                }
            }, 
            "references": [
                "https://agency.gov/docs/widgets-1.html", 
                "https://agency.gov/docs/widgets-2.html"
            ], 
            "rights": "This dataset cannot be made public because it includes trade secrets and commercial or financial information obtained from a person and is privileged or confidential.", 
            "spatial": "United States", 
            "systemOfRecords": "http://www.agency.gov/widgets/sorn/", 
            "temporal": "2009-09-01T12:00:00Z/2010-05-31T12:00:00Z", 
            "theme": [
                "manufacturing"
            ], 
            "title": "U.S. Widget Statistics for [[REDACTED-EX B4]]"
        }
    ]
}
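
A minimal sketch of how a catalog like the one above could be scanned for redaction markers, and how redactions in fields that might never need them could be flagged. The `KEEP_OPEN` set reflects only the suggestion made earlier in this comment, not settled policy, and the helper names `redacted_paths` and `top_level_field` are hypothetical. The inline `dataset` is a trimmed copy of the example.

```python
import re

REDACTION_RE = re.compile(r"\[\[REDACTED-EX B\d\]\]")

# Fields suggested above as possibly never needing redaction (a proposal only).
KEEP_OPEN = {"accessLevel", "identifier", "isPartOf", "bureauCode", "programCode"}


def redacted_paths(node, path=""):
    """Yield the path of every string value that contains a redaction marker."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from redacted_paths(value, f"{path}.{key}" if path else key)
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from redacted_paths(value, f"{path}[{index}]")
    elif isinstance(node, str) and REDACTION_RE.search(node):
        yield path


def top_level_field(path):
    """Return the dataset-level field name at the start of a path."""
    return re.split(r"[.\[]", path, maxsplit=1)[0]


dataset = {
    "identifier": "https://metadata.agency.gov/10.7927/H4PZ56R2",
    "title": "U.S. Widget Statistics for [[REDACTED-EX B4]]",
    "distribution": [{"downloadURL": "[[REDACTED-EX B4]]", "format": "CSV"}],
}

paths = list(redacted_paths(dataset))
violations = [p for p in paths if top_level_field(p) in KEEP_OPEN]
print(paths)       # ['title', 'distribution[0].downloadURL']
print(violations)  # []
```
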
rebeccawilliams commented 9 years ago

I think all fields should be redacted with a presumption of openness. This is in line with the federal FOIA policy. An example that reflects this would be useful too.

rebeccawilliams commented 9 years ago

Including the presumption of openness language (above) and DOT's PDL as a best practice would be good additions to this guidance as well.

jlberryhill commented 9 years ago

Thanks, guys, and great example @philipashlock. I'd also note that certain parts of a field can be redacted rather than the whole field, if only certain words are subject to a FOIA exemption. I agree with @rebeccawilliams on the presumption of openness. I think agencies should not redact entire metadata records, and there may be some fields that would never make sense to redact.
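
To make the partial-redaction idea concrete, here is a minimal sketch that replaces only the exempt phrases inside a field and leaves the rest of the text intact. The function `redact_terms` and the phrase "Acme Corp" are hypothetical and used purely for illustration; deciding which words are actually exempt remains a case-by-case determination.

```python
def redact_terms(text, exempt_terms, exemption="B4"):
    """Replace each exempt phrase in text with a redaction marker."""
    marker = f"[[REDACTED-EX {exemption}]]"
    for term in exempt_terms:
        text = text.replace(term, marker)
    return text


# "Acme Corp" is a made-up phrase standing in for exempt content.
print(redact_terms("Acme Corp widgets data as a CSV file", ["Acme Corp"]))
# [[REDACTED-EX B4]] widgets data as a CSV file
```
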

bpushed commented 9 years ago

Greetings all -- As a foreign assistance agency, USAID is exempt from releasing data per the seven principled exceptions outlined in OMB 12-01 (see Attachment 1, page 4).

When we issued our open data policy, this is the guidance we provided to our staff for justifying exemptions. Our FOIA office agrees that these do not conflict with the FOIA act, but I wanted to flag this issue so that we can adopt an approach that keeps both documents in mind. Thanks.

konklone commented 9 years ago

@bpushed I believe that still means USAID needs to express those exemptions in the form of redacted JSON, with individualized determinations for each field and catalog entry.

bpushed commented 9 years ago

Thanks. That is essentially our plan. For the Sunlight Foundation FOIA request, we were asked specifically to use FOIA exemptions but would plan to revert to OMB 12-01 moving forward.

bbrotsos commented 9 years ago

We plan to apply redactions (if any) only to the PDL and leave the EDI with the full description. This will become increasingly difficult to manage without additional metadata tags to automate generating the PDL versus the EDI. However, adding metadata tags for redaction would sacrifice the simplicity of the POD Schema.

Is there an equivalent way to put inline tags on text in a JSON field, as in XML? For example:

    "description": "<Redacted type='exb4'>Non Public Title</Redacted> widgets data as a CSV file"

The only equivalent way I can think of to do this in JSON is:

    "description_redacted": "[[REDACTED-EX B4]] Non Public Title widgets data as a CSV file",
    "description": "Non Public Title widgets data as a CSV file"

This would needlessly complicate the schema. Could agencies submit both the PDL and the EDI redacted?
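
One way the inline-tag idea could work without adding fields to the schema is to keep a single tagged source string internally and derive both outputs from it. The sketch below assumes a hypothetical tag syntax, `<Redacted type='exb4'>…</Redacted>`, and hypothetical helpers `to_pdl` and `to_edi`; it is not part of the POD Schema or any existing tool.

```python
import re

# Hypothetical inline tag syntax: <Redacted type='exb4'>text</Redacted>
TAG_RE = re.compile(r"<Redacted type='ex(b\d)'>(.*?)</Redacted>")


def to_pdl(text):
    """Replace each tagged span with its redaction marker (public listing)."""
    return TAG_RE.sub(lambda m: f"[[REDACTED-EX {m.group(1).upper()}]]", text)


def to_edi(text):
    """Drop the tags but keep the enclosed text (internal inventory)."""
    return TAG_RE.sub(lambda m: m.group(2), text)


source = "<Redacted type='exb4'>Non Public Title</Redacted> widgets data as a CSV file"
print(to_pdl(source))  # [[REDACTED-EX B4]] widgets data as a CSV file
print(to_edi(source))  # Non Public Title widgets data as a CSV file
```
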

rebeccawilliams commented 9 years ago

@bbrotsos Following up on this thread -- PDLs @ /data.json should include non-public datasets including any required redactions. If redactions are present, an unredacted copy must also be submitted to OMB Max.

I think that was clear, but wanted to record that in this issue. Closing this issue as guidance is live: https://project-open-data.cio.gov/redactions/

New issues or pull requests to clarify that guidance are encouraged though.

philipashlock commented 9 years ago

@bbrotsos For what it's worth, this is what we're going to try for inventory.data.gov - https://github.com/GSA/enterprise-data-inventory/issues/182#issuecomment-128514823