project-open-data / project-open-data.github.io

Open Data Policy — Managing Information as an Asset
https://project-open-data.cio.gov/

Quality over Quantity #109

Closed: skybristol closed this issue 10 years ago

skybristol commented 11 years ago

@seanherron commented in #33 that "...we also have a deadline of November 9th for agencies to publish a ton of data..." The Implementation Guide says that we should...

"Conduct a zero-based review effort of all existing data. Give this effort a very short timeframe and the very specific goal of producing a simple list of all data assets within the agency. Stop at the due date rather than stopping at the 100 percent marker, which is very difficult to reach in a single pass. Repeat at regular intervals."

So, are we publishing a "ton of data," or are we working to produce a more usable resource of accessible and usable data assets?

@gbinal commented in #105 that "...we all know that NOAA and USGS are the two 900 lbs. gorillas when it comes to number of entries." Back when we did Geospatial One-Stop, in the earliest days of Data.gov, and now apparently in the current instantiation of what's much the same concept, we've had times where there are tons of records in whatever catalog is being aggregated from all this work. But how useful have those resources been to date? What kinds of use cases have we enabled from discovery to access to use leading to beneficial outcomes for society? How did the big aggregations of tons of records do something in and of themselves that no other pathway to the actual underlying data could accomplish?

Don't get me wrong - I fully believe in what we are doing. I do believe that we should end up with every data asset that the taxpayers (including me) have funded over all these years easily discoverable and accessible through aggregations like Data.gov and lots of other creative applications arising from this effort. We should put them out there and let smart technology and technologists come up with creative ways to narrow in on what's useful and set aside what isn't appropriate for a particular inquiry.

However, on the data provider end, we have to make some choices about what "zero-based review effort" means. At least from our perspective in the USGS, we are trying to do something sustainable that will make "Repeat at regular intervals" mean, "as soon as data are released." We are committed to scientific integrity, validity, and excellence at every level of our organization. So, is it better for us to continue pushing every single metadata record that we can get to line up with the high-level and relatively scant POD implementation of DCAT JSON? Or should we put some amount of energy into trying to make sure that the records we do put into one or more (ref. #105) data.json files have viable and accessible distribution links, appropriate contact information, references to deeper metadata and associated scholarly publications, and the few other details such that when someone finds them they will have a pretty good shot at being able to put the data to use?
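To make that concrete, here is a rough sketch of the kind of entry I mean: a single data.json record where the handful of fields that actually enable reuse are filled in. The field names loosely follow the Project Open Data metadata schema, but every value below (titles, URLs, names, identifiers) is invented purely for illustration and doesn't come from any real catalog.

```json
{
  "title": "Example Daily Streamflow Observations (illustrative only)",
  "description": "Daily streamflow measurements for a set of gaging stations, with links to the full metadata record and the associated publication.",
  "keyword": ["streamflow", "hydrology", "example"],
  "modified": "2013-10-01",
  "publisher": "U.S. Geological Survey",
  "contactPoint": "Jane Example (hypothetical contact)",
  "mbox": "jane.example@example.gov",
  "identifier": "example-streamflow-001",
  "accessLevel": "public",
  "distribution": [
    {
      "accessURL": "https://example.gov/data/streamflow-daily.csv",
      "format": "text/csv"
    }
  ],
  "references": [
    "https://example.gov/metadata/streamflow-daily-fgdc.xml",
    "https://example.gov/pubs/streamflow-analysis-report"
  ]
}
```

A record like this, with a working distribution link, a reachable contact, and pointers to the deeper metadata and publications, is what I'd count on the "quality" side of the ledger.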

It's a balance between quality and quantity, for sure. My opinion is that the balance could be characterized as a ratio of about 3:1, quality over quantity. But that's my opinion. What's yours?

Just for fun, I posted a little survey (on my own time, not as any official part of Project Open Data, and not in any way affiliated with the USGS): http://www.surveymonkey.com/s/JGWV3NK (I'll post the survey results somewhere and reference them in a comment.) I'm interested in comments here and input to the survey. If you have anecdotes or artifacts that speak to the success (or failure) of big, comprehensive cataloging efforts, past or current, I'd love to hear about them here.

cew821 commented 11 years ago

I am torn on this question, but it's a good one to discuss.

On the one hand, throwing every record into a flat JSON file creates a huge search and discovery problem. This is especially true given how many awesome, CKAN-friendly geospatial resources we have. For a simple example, take a look at the search results for "average income by state" on http://catalog.data.gov. Of the 1100+ results returned, 1038 (and most of the top results) are geospatial shape files, covering various geographies and years.

I'm guessing most folks making this query are looking for the state census survey data (the number one result on Google). I'm guessing very few users are looking for shape files.

The counterargument is that, yes, search and discovery is hard, but the only way to solve that problem is to put all our data out there, tagged with a solid common metadata format, and let the public/industry create portals to our data for various audiences and use cases.

I tend to think that for both machines and humans, it would be more useful to get a record called "US Geospatial Census Atlas" (illustrative title) that links to an awesome standards-based geospatial datastore full of great cuts of census data, instead of getting 1000 individual shape files from various geospatial repositories as a response to a query. For this reason, I think the hybrid approach proposed by @ddnebert could make sense as a way to find a good balance between quantity and quality.

ddnebert commented 11 years ago

In fact, we have 25 records for Census TIGER data, one for each geographic theme, which individually link to sub-collections with thousands of geographic and time variants for real data hounds to go through. The other records came in through 'raw' data and confuse the story because they don't respect the granularity rules. By supporting the parent-child relationship between these collections and their members, only the parent records are returned in an initial search; children can then be searched in a second step, directly in the index. This is very helpful for managing the appropriate granularity of the data. We have prepared metadata recommendations for the geospatial community to use when publishing to CKAN to address such problems: http://www.geoplatform.gov/sites/default/files/document_library/MetadataPractices07-2013_Linked_0.pdf
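To sketch the parent-child idea in record form (the field names below, especially the parent reference, are illustrative only; the linked metadata practices document describes the actual conventions we recommend):

```json
[
  {
    "identifier": "tiger-roads-collection",
    "title": "Census TIGER Roads (illustrative parent record)",
    "description": "One parent record per geographic theme; this is what an initial search returns.",
    "accessLevel": "public"
  },
  {
    "identifier": "tiger-roads-2013-example-county",
    "title": "TIGER Roads 2013, Example County (illustrative child record)",
    "description": "One of thousands of geographic and time variants, surfaced only when a user drills into the parent collection.",
    "isPartOf": "tiger-roads-collection",
    "accessLevel": "public"
  }
]
```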

cew821 commented 11 years ago

@ddnebert :+1: that is awesome, wish that was more widely adopted.

MarinaNitze commented 11 years ago

I was holding off on responding to this comment until the official new implementation guidance came out. I would be curious to hear your reaction after reading it -- hopefully it will provide clarity on where to start and what the ongoing expectations are.

mhogeweg commented 11 years ago

The question of quantity vs. quality is a tough one. As @skybristol indicated, we ended Geospatial One-Stop with almost 1,000,000 items (files, services, apps, etc.). It sort of turns the one catalog into a Google of its own, where you now have a haystack.

The approach of referencing collections, as mentioned by @ddnebert, results in people doing a search on data.gov and then, after several clicks, ending up on another website with a different search UI and having to re-do their search. Plus, the next time they might just go to Census/NOAA/USGS directly, as that's where they found the data they wanted.

Whether the data found is useful is not something the choice between a single catalog and a federated approach solves; it depends on what the searcher is looking for. The data was created by the government for some purpose and is made available.

$0.02

ddnebert commented 11 years ago

Having the ability to tag content with its preferred or authoritative (or merely most popular) nature would be a good addition to the metadata and indexing. Then we could apply that in ranking results or in showcasing certain items. I think we need both the full inventory and the featured-assets capability.
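Something as small as an extra flag or tag on the record could carry that signal into the index. The field name below is purely hypothetical (nothing like it exists in the schema today); it just illustrates the kind of hint a catalog could use to boost or showcase certain items:

```json
{
  "identifier": "example-dataset-001",
  "title": "Example dataset (illustrative only)",
  "keyword": ["geospatial", "example"],
  "authoritative": true
}
```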

jpmckinney commented 11 years ago

I appreciate the discussion, but just for clarity, is there a proposed change to http://project-open-data.github.io/ here?

mhogeweg commented 11 years ago

@jpmckinney It seems to me that Project Open Data is a great place to discuss these larger architectural questions that affect participants in open government, even if they don't immediately result in a pull request. Yes/no?

jpmckinney commented 11 years ago

Yes, I was just seeking clarity on what sort of discussion this is. Thanks for clarifying.

mhogeweg commented 11 years ago

We're trying to figure out how to channel our pathological geodata collection and sharing desires...

MarinaNitze commented 10 years ago

I'm going to close this thread, but please feel free to re-open or start a new one with specific questions and comments. I think the implementation guidance publication addressed the original inquiry. Thanks all!