project-open-data / project-open-data.github.io

Open Data Policy — Managing Information as an Asset
https://project-open-data.cio.gov/

May 31, 2015 IDC Guidance: Non-Public Datasets in PDL #462

Closed jlberryhill closed 9 years ago

jlberryhill commented 9 years ago

OMB's Integrated Data Collection (IDC) guidance to agencies states:

OMB greatly appreciates agencies’ cooperation in responding to OMB’s guidance to publish their EDI, with FOIA exemption-related redactions as necessary, as a downloadable dataset in their PDL. We believe this further makes the U.S. the most transparent government in the world. To further enhance this transparency, effective May 31, agencies are required to do the following:

ianjkalin commented 9 years ago

The Commerce Department is working every day to improve its centralized data inventory. Certain technological issues are creating major barriers to building the PDL as described here. We are glad to be working with GSA on these issues, but cannot guarantee that all "non-public" data will meet these new requirements. I'd be happy to explain this situation in greater detail.

jlberryhill commented 9 years ago

Thanks, @ianjkalin. I understand Commerce's unique challenges in including all of its EDI datasets in its PDL, as well as including all of the agency's geospatial data in both the EDI and PDL. Is there anything the data.gov team or I can do to assist?

bethbeck commented 9 years ago

We're working hard to continually add open data assets to our inventory -- which is the whole point of open data. Our agency has non-public data and will continue to have non-public data. I don't see the point of a non-public data list that is required to be public. We have our hands full with the task of finding and freeing all our treasure troves of data that are already available to the public on websites and in publications. We're spending a good deal of time over the next several years putting all our research data in one place -- which will also need to be pulled into our PDL. Identifying and indexing our open data is our priority.

From our perspective, it makes no sense to continue to add additional non-public datasets to the EDI which will have to go through the rigorous redaction process that our FOIA office requires. They are still reviewing the files from the last go-round. If the EDI is now determined to be the PDL, I think it should just go away. You want an inventory of open data for the public listing. That's where we should be spending scarce agency resources.

In my opinion, the EDI goes away if it's required to be public. It simply becomes the PDL. And we should no longer need to provide additional datasets for an inventory of non-public data that must nonetheless be made public -- which means it has to be redacted through a lengthy and costly FOIA process every quarter. A list of redacted files offers the public little value relative to what it costs agencies to provide it.

We will do our best to meet the quarterly requirements, of course. We're good soldiers. But our lives would be simpler if open data for PDL is our focus -- and I believe it's the best value for the tax dollars to benefit the public we serve. In my opinion....

jlberryhill commented 9 years ago

Thanks for your comments, @bethbeck. We appreciate you continuing to identify public datasets. However, maintaining a comprehensive inventory of an agency's data assets, as required by M-13-13, is important to fully understand what data exists within the agency so data can be opened to the public unless it is deemed to be not releasable.

Most agencies went through the process of coordinating with their FOIA office last quarter to determine whether any redactions were necessary to comply with the EDI FOIA. Since the past datasets have been reviewed, agencies would just want to check in with these offices if they believe any metadata for newly added non-public datasets are subject to a FOIA exemption. In the majority of agencies, no redactions have been necessary.

We believe that giving the public insight into the federal government's data assets is important, even for data that are marked as non-public. This allows the public to know what they don't know, and allows them to potentially request that agencies reconsider the release of datasets marked as non-public. In adding the non-public datasets to the PDL, the PDL essentially becomes a public version of the EDI. We will only ask agencies to send separate EDIs to us if their PDL contains any redactions. The EDI won't go away entirely, as OMB needs to maintain copies of the inventories without redactions.

Thanks for your patience on this, and for your consistently good performance on the Dashboard, @bethbeck.

bethbeck commented 9 years ago

I totally get it. We agree the public should have insight into how the government works. We release everything we can as public data.

I'm not sure we're going to get support within our agency to add any new non-public datasets to the EDI, which will have to be redacted through the formal FOIA process.

I'm curious how other agencies are coping with this. Maybe we're the only ones with unhappy FOIA and legal folks.

ianjkalin commented 9 years ago

@jlberryhill Yes. There are several ways that we can coordinate on a technical solution. I have already started to share some ideas with @philipashlock @lynnovermann @alanswx and @rebeccawilliams. Should we schedule a time to discuss as well?

rebeccawilliams commented 9 years ago

@bethbeck @ianjkalin, it'd be helpful to hear specific pain points on including Non-Public Datasets in your PDLs in this thread, so we can work on incorporating additional guidance that addresses them. Broader policy changes like eliminating or de-prioritizing EDIs, or the issues that arise in creating a truly comprehensive data inventory for large data-centric agencies, are separate (albeit tangential) issues.

Currently, since they are public, you can view how EDI redaction has been addressed by other agencies: https://catalog.data.gov/dataset?q=enterprise+data+inventory

In terms of the FOIA process, would case studies like https://project-open-data.cio.gov/licensing-resources/ be helpful?

In terms of technical hurdles to necessary EDI redactions specifically, would guidance on converting your data.json to CSV to share with a less tech-savvy FOIA office be helpful?
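As a rough illustration of the data.json-to-CSV conversion mentioned above, here is a minimal Python sketch. It is not official guidance. It assumes the catalog either wraps its entries in a top-level `dataset` array or is a bare list of entries, and the column selection in `FIELDS` is a hypothetical choice of fields a FOIA reviewer might want to see.

```python
import csv
import io
import json

# Hypothetical column selection for a FOIA reviewer; adjust as needed.
FIELDS = ["identifier", "title", "description", "accessLevel", "rights"]


def catalog_to_csv(catalog_json: str) -> str:
    """Flatten a data.json catalog into CSV text, one row per dataset."""
    catalog = json.loads(catalog_json)
    # Assumption: entries live under a "dataset" key, or the file is a bare list.
    datasets = catalog["dataset"] if isinstance(catalog, dict) else catalog

    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for entry in datasets:
        # Missing fields become empty cells so every row has the same columns.
        writer.writerow({name: entry.get(name, "") for name in FIELDS})
    return out.getvalue()
```

Calling `catalog_to_csv(open("data.json").read())` and saving the result gives a file that opens in any spreadsheet program. Nested fields such as contact points or distributions are left out here; they would need to be flattened separately.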

ianjkalin commented 9 years ago

@rebeccawilliams It's interesting to hear that the comprehensiveness of the EDI and PDL is "tangential". As far as I know, our published EDI (which is probably <1% of the Department's total, actual inventory) has no redactions. So, does that mean we can really submit a PDL immediately and satisfy the requirements stated here?

rebeccawilliams commented 9 years ago

@ianjkalin Tangential because efforts to make government data accessible, discoverable, and usable by the public are happening in parallel to the gathering process.

Correct. Since there are currently no redactions in the Department of Commerce Public EDI, this satisfies the new requirements of including Non-Public Datasets in the PDL and no additional EDI is required. However, as the Department of Commerce adds new metadata to the EDI (moving that 1% needle :arrow_up:), if you find that any of the metadata needs redaction (imagine metadata where for some reason the title or description is sensitive information), then a copy of the unredacted EDI will need to be sent to OMB as well. And the reason the field was redacted should be included in the rights field of the PDL, per the guidance here: https://project-open-data.cio.gov/redactions/

I am happy to have you on board to work on this! And as you explore new gathering methods and tools, and large-scale metadata management, please add them as issues or resources here on Project Open Data, so large agencies and small can learn from them.
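The redaction handling described in this comment can be sketched roughly as follows. This is only an illustration: the `PLACEHOLDER` string and both helper names are hypothetical, and the exact marker format an agency should use is defined by the redactions guidance linked above, not by this snippet.

```python
import copy

# Hypothetical marker; the official redactions guidance defines the real format.
PLACEHOLDER = "[REDACTED]"


def redact_entry(entry: dict, fields: list, reason: str) -> dict:
    """Return a public copy of an EDI entry with sensitive fields masked.

    The reason for the redaction is recorded in the entry's rights field.
    """
    public = copy.deepcopy(entry)  # leave the unredacted EDI entry untouched
    for name in fields:
        if name in public:
            public[name] = PLACEHOLDER
    public["rights"] = reason
    return public


def needs_unredacted_copy(pdl_datasets: list) -> bool:
    """True if any PDL entry has a masked field, so OMB also needs the full EDI."""
    return any(
        PLACEHOLDER in str(value)
        for entry in pdl_datasets
        for value in entry.values()
    )
```

Keeping the original entry intact means the same source inventory can produce both files: the redacted PDL for publication and the unredacted EDI for OMB.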

ianjkalin commented 9 years ago

@rebeccawilliams Oh! This is great to learn. Thanks for the guidance! As for moving the needle: absolutely agree. That work continues every day, and we will share the good/new processes and tools as they start to have impact.

MRumsey commented 9 years ago

Wanted to chime in with some perspective from outside government.

First off, it's great to see the original guidance in this thread. I believe that these indexes are most useful if they include the largest amount of information possible. I am hopeful that, in the long-run, including "non-public" information will help reduce FOIA loads through more targeted and specific requests, allow for better public understanding of data release decisions, and ultimately ensure that more government data is being put to good use by the public.

That said, I understand that this won't necessarily be a quick process, especially for the more data-heavy agencies. Time and resource constraints are real; it would be foolish to argue otherwise. I am less interested in having perfect data indexes today, and more interested in -- as @rebeccawilliams and @ianjkalin have already pointed out -- moving the needle on a regular basis and ensuring that a functional process for managing government information as an asset -- both internally and by/for the public -- is in place.

I would argue that a little extra work ensuring that agency data assets -- of all access levels -- are properly indexed and managed as they are identified will save a lot of time in the long run. Proactively engaging FOIA offices and making these choices now may seem like a heavy lift, but it should save time in the long run when future release decisions are being considered.

If building a robust EDI/PDL is made part of the data identification process now, it will make it easier to manage agency data as an asset moving forward -- both internally and publicly -- as new data is created and discovered.

Thanks for all your hard work on this @jlberryhill @bethbeck @ianjkalin @rebeccawilliams!

jlberryhill commented 9 years ago

Great discussion here. @ianjkalin -- I did not realize Commerce had a public version of your EDI up at http://www.commerce.gov/public-edi.json. I don't see this as a dataset on data.gov, so we haven't marked off Commerce as having been responsive to our earlier instructions to post a redacted (if necessary) EDI as a downloadable dataset on data.gov. Any way to get this file on data.gov, or perhaps it's there and you could direct me to it? (cc @paulOMB)

ianjkalin commented 9 years ago

@jlberryhill Yes and integrating with Data.gov. Clean file not uploaded yet. Working with @philipashlock and our Chief Enterprise Architect this week on clearing up a few duplicates.

rebeccawilliams commented 9 years ago

Going forward, all IDC requirements will be posted as Issues to gather guidance and best practices for executing them. The above requirement remains an IDC requirement, and you can discuss it and updated requirements at: August 31, 2015 IDC Guidance.