project-open-data / project-open-data.github.io

Open Data Policy — Managing Information as an Asset
https://project-open-data.cio.gov/

Single-file /data catalog not good--optional alternative suggested #105

Open jeffdlb opened 11 years ago

jeffdlb commented 11 years ago

Current guidance is that each agency's "/data" inventory must be a single list in a file containing multiple lines of JavaScript Object Notation (JSON) summary metadata per dataset, even if our agency has tens of thousands of datasets distributed across multiple facilities and servers. I believe the single list will pose problems of inventory creation, maintenance, and usability. I enumerate my concerns below, but first I propose a specific solution.

PROPOSAL:

I recommend the single-list approach be made optional. Specifically, I suggest that the top-level JSON file be permitted to include either a list of datasets or a list of child nodes. Each node would at minimum have 'title' and 'accessURL' elements from your JSON schema (http://project-open-data.github.io/schema/), an agreed-upon value of 'format' such as "inventory_node" to indicate the destination is not a data file, and optionally some useful elements (e.g., person, mbox, modified, accessLevel, etc) describing that node. Each node could likewise include either a list of datasets or a list of children.
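To make the idea concrete, here is a rough sketch of what a top-level /data file might look like under this proposal. It is only illustrative: the URLs and names are made up, the top level is shown as a plain JSON array of entries, and "inventory_node" is just the example format value suggested above, not an agreed-upon term.

    [
      {
        "title": "Facility A data inventory (child node, not a data file)",
        "accessURL": "https://facility-a.example.gov/data.json",
        "format": "inventory_node",
        "person": "Jane Smith",
        "mbox": "jane.smith@facility-a.example.gov",
        "modified": "2013-08-01",
        "accessLevel": "public"
      },
      {
        "title": "Agency-wide budget summary (ordinary dataset entry)",
        "accessURL": "https://www.example.gov/data/budget-2013.csv",
        "format": "text/csv",
        "accessLevel": "public",
        "modified": "2013-07-15"
      }
    ]

A harvester that understands the convention recurses into the first entry; one that does not still sees a valid record with a title and a URL.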

CONCERNS REGARDING THE SINGLE-LIST APPROACH:

(1) We should not build these inventories only to support data.gov. We want to leverage this for other efforts internal to our agencies, for PARR, and to support other external portals such as the Global Earth Observing System of Systems (GEOSS) or the Global Change Master Directory (GCMD). A distributed organization will be more useful for them (even if data.gov itself could handle a single long unsorted list.)

(2) The inventory will need to be compiled from many different sources, including multiple web-accessible folders (WAFs) of geospatial metadata, existing catalog servers, or other databases or lists. Each type of input will need some code to produce the required subset of the list, and then some software will need to merge everything into a giant list. Failure at any step in the process may cause the inventory to be out of date, incomplete, or broken entirely. A distributed organization will more easily allow most of the inventory to be correct or complete even if one sub-part is not, and will require much less code for fault-handling.

(3) Some of our data changes very frequently, on timescales of minutes or hours, while other data are only modified yearly or less frequently. A distributed organization will more easily allow partial updates and the addition (or removal) of new collections of data without having to regenerate the entire list.

(4) The inventory is supposed to include both our scientific observations and "business" data, and both public and non-public data. That alone suggests a top-level division into (for example) /data/science, /data/business, and /data/internal. The latter may need to be on a separate machine with different access control.

(5) It would be easier to create usable parallel versions of the inventory in formats other than JSON (e.g., HTML with schema.org tags) if the organization were distributed.

(6) I understand that the data.gov harvester has successfully parsed very long JSON files. However, recursive traversal of a web-based directory-tree-like structure would be trivial for data.gov to implement, would be more scalable, and would solve many problems for the agencies and the users. data.gov's own harvesting could even be helped if the last-modified date on each node is checked to determine whether you can skip it.
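For illustration, a harvester along these lines might look roughly like the following Python sketch. It assumes the hypothetical node layout sketched above (a JSON array per file, with "inventory_node" entries pointing to child files) and uses each node's 'modified' date for the skip logic; none of this is part of the published schema.

    import json
    import urllib.request

    def harvest(url, seen=None, last_harvest_date=None):
        """Recursively walk a tree of inventory files, yielding dataset entries."""
        seen = seen if seen is not None else set()
        if url in seen:                      # guard against cycles between nodes
            return
        seen.add(url)
        with urllib.request.urlopen(url) as resp:
            entries = json.load(resp)        # assumes each file is a JSON array of entries
        for entry in entries:
            if entry.get("format") == "inventory_node":
                # Skip whole branches that have not changed since the last harvest.
                if last_harvest_date and entry.get("modified", "9999") <= last_harvest_date:
                    continue
                yield from harvest(entry["accessURL"], seen, last_harvest_date)
            else:
                yield entry                  # an ordinary dataset record

    # Example: collect everything reachable from an agency's top-level inventory.
    # datasets = list(harvest("https://www.example.gov/data.json", last_harvest_date="2013-07-01"))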

waldoj commented 11 years ago

@jeffdlb, is it fair to say that the Sitemap index standard is a good model here?

mhogeweg commented 11 years ago

@jeffdlb @waldoj I'd say so, and I suggested as much before in issue #27, where I proposed pagination like that seen in OpenSearch providers, or breaking things up into smaller files as is done with sitemaps. I think Jeff's point about supporting not just Data.gov's needs but also thinking about how agencies support other initiatives deserves consideration here.

benbalter commented 11 years ago

I'd argue that the schema was written with developers in mind, not the ease of agency adoption (or data.gov). When the two are in conflict, we should err on the side of those we want to encourage to use the data, not those whose job it is to publish or organize the data. Going into multiple formats may make things easier for agencies, but it does so at the average developer's expense.

This is a great case for practicality over purity. A single .data.json file that's too large to easily manipulate would be a great problem to have. It would mean agencies are indexing data and exposing it to the public, but as far as I can tell, looking through the open issues here, that problem remains theoretical and limited to government. Given the ease of adopting options like #27, I'd argue for a wait and see approach. Let the users' needs drive the product, not the publishers.

Practically, if we allow sub-data files, there are two implications:

  1. It becomes a lot more complex for developers (or data.gov, or whatever) to crawl. The HTTP call is going to be the most expensive part of the transaction. Allowing sub-files doubles that transaction cost, not to mention the complexity of the crawler.
  2. If I were a government agency, I'd take advantage of it. I'd make a single data.json file in the root, and then have each bureau/office make their own data.json file, so there'd be no change whatsoever from the status quo where data is siloed.

Forcing agencies to make a single .data.json file is an exercise that helps the agency centralize their data index, to absorb the complexity of their own system on behalf of citizens, and begins the process of becoming more customer-centric when it comes to data.

waldoj commented 11 years ago

Each type of input will need some code to produce the required subset of the list, and then some software will need to merge everything into a giant list. Failure at any step in the process may cause the inventory to be out of date, incomplete, or broken entirely. A distributed organization will more easily allow most of the inventory to be correct or complete even if one sub-part is not, and will require much less code for fault-handling.

But doesn't the process that you're describing still necessitate that "some software will need to merge everything into a giant list"? It seems that complexity is being pushed out to the clients, rather than being resolved within the federal agency.

ddnebert commented 11 years ago

A subtlety in this proposal, and one I opened as a suggestion in GitHub two months ago, is that we allow the data.json file to contain entries for existing standards-based catalogs or APIs. Each such collection - and we manage many in the geospatial domain - could be simply marked up, one entry per collection, within the agency's json file with the breadcrumbs to perform the query and/or indexing. CKAN already has built-in harvesting capability on these established protocols, as well as json, so the integration challenge would be minor. It would produce the same, if not better, results as a serialized subset of metadata in feeds, since it all ends up in the searchable index. The benefits of this approach are many:

  1. Geospatial metadata are robust and support not only discovery but fitness-for-use information that end-users need to know before access. The json file does not contain sufficient metadata for end-user advice.
  2. The hybrid solution supports standards and gains access to well over 90% of existing government data sets in the catalog.data.gov index. Proposing an alternative, non-standard solution does not provide new content or augment counts of data.
  3. Data conversion burdens on the agencies are negligible. This is an elegant, least-effort solution as support for these protocols and formats is built into the CKAN index software already.
  4. The hybrid solution supports traversing and indexing homogeneous data series (i.e. imagery and similar inventory catalogs) in a two-phase search, a feature not present in the json solution.
  5. The catalog solution supports detection of resource changes (add, update) not supported in the json feed. Full traversal/re-indexing of complete agency json files is currently required, whereas change detection is already supported in the protocols and harvest.

I propose the following text changes to the Implementation document, implementation-guide.md, along with a modification of the harvest routine to recognize the catalog resource type within the feed:

A) Minimum Required for Compliance

Produce a single catalog or list of data managed in a single table, workspace, or other relevant location. Describe each dataset or existing metadata catalog according to the common core metadata.

This listing can be maintained in a Data Management System (DMS) such as the open-source CKAN platform; a single spreadsheet, with each metadata field as its own column; or a DMS of your choosing. A description of each agency metadata catalog, such as CKAN, can be placed in the agency json file as a single entry. This entry will describe the resource type of "catalog" and the access URL to be used in harvest by data.gov.

Metadata for geographic or geospatial information is often collected using the FGDC Content Standard for Digital Geospatial Metadata or ISO 19115/19139 and represented as XML, providing content that maps to common core metadata. These collections are exposed using the Open Geospatial Consortium Catalog Service for the Web interface (CSW 2.0.2) or as a read-enabled HTTP directory known as a Web Accessible Folder (WAF). In lieu of posting individual entries for each geospatial dataset in the json file, a single json entry should be prepared for each geospatial metadata collection (WAF) or service (CSW) as a "Harvest Source" enabling harvest of the collections by catalog.data.gov. Individual geospatial metadata entries for datasets, applications, or services should not be duplicated in the agency json feed.
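As an illustration of the proposed text, a single catalog-reference entry inside an agency json file might look something like the sketch below. Titles and URLs are placeholders, and the "catalog" format value simply echoes the resource type named in the proposal above; the exact marker a harvester would key on is what remains to be agreed.

    {
      "title": "Example Agency geospatial metadata catalog (CSW 2.0.2 harvest source)",
      "description": "Full FGDC/ISO 19139 metadata for the agency's geospatial holdings; harvested via CSW rather than listed dataset-by-dataset here.",
      "accessURL": "https://geo.example.gov/csw?service=CSW&version=2.0.2&request=GetCapabilities",
      "format": "catalog",
      "accessLevel": "public",
      "modified": "2013-08-01"
    }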

mhogeweg commented 11 years ago

Thanks @ddnebert for pointing out that agencies have FGDC/ISO metadata. I had posted a mapping of those specs to DCAT (the mapping is not trivial, given the absence of 1:N elements in DCAT, different date notations, and different interpretations of fields, to name a few issues) and submitted that as pull request #74. I would like to see your thoughts on that mapping.

A second point is the focus on getting things to work for CKAN. I understand that Data.gov uses CKAN, but would it not be better if a solution is designed that works across the government regardless of technology? That is what the geospatial domain has been working on for many years and what was done to promote an open ecosystem of suppliers and consumers of data. To me that also relates to @jeffdlb's point regarding programs like GEOSS (which you play a key role in) and GCMD, not to mention Eye on Earth, UNEP Live and various other global initiatives focused on open data sharing.

jeffdlb commented 11 years ago

@waldoj - I don't think breaking up the single list into a linked set of lists pushes complexity to the users. The end goal of the inventory is not just to have a list, it is to populate data.gov or other portals or commercial search engines. In all cases, entries in each list (whether there is one list per agency or many) will be going into a database of some type. That database will be updated one entry at a time by reading through the lists.

jeffdlb commented 11 years ago

@benbalter - If every bureau's sub-list is linked from the master list, then I believe the data would be less "siloed" than currently. At present any bureau-level inventories are not standardized, whereas this effort would standardize and link them.

cew821 commented 11 years ago

@ddnebert I like this suggestion. In fact, some geospatial energy data has already been posted to data.gov using this approach (CKAN harvesting of a CSW endpoint). See below for details of this example:

The Department of Energy is helping create a datastore for geothermal data called the National Geothermal Data Store. One of the nodes of this system is the State Geothermal Database. This data store conforms to the ISO 19139 geospatial metadata standard discussed above, and uses the Catalog Service for the Web (CSW) standard to provide interoperability. Here's an example CSW endpoint to the same data store linked to above.

Because the CKAN harvester knows how to inter-operate with the CSW, all of the datasets (including the geospatial data) were able to be added to data.gov just by pointing the harvester at the CSW endpoint. End result? A search for 'borehole temperatures' on data.gov yields 59 datasets, with intact metadata, available in Esri REST API, WMS, WFS, and ZIP file formats.

For agencies that have existing data portals that can be easily harvested by CKAN, pointing the harvester to the data store itself could make a lot of sense.

I should note that not ALL of the Department's data will be easily harvested in this way. For legacy or custom data portals that did not use standards, or are otherwise not up to date, creating the data.json file is a good forcing function. But for systems that have this capability, why not use it?

waldoj commented 11 years ago

CKAN's CSW syndication functionality is great, but this initiative is not about expanding CKAN support within the federal government. The goal is to "produce a single catalog or list of data managed in a single table, workspace, or other relevant location". Putting some data behind a different protocol runs counter to this goal (the data then ceases to be "in a single table"), and erects a significant hurdle to anybody who wants to syndicate that data (requiring parsing 2+ catalog formats).

Perhaps some tools could be built to solve this problem via both design patterns? That is, the need to syndicate data can be addressed via harvesting from CSW (for example), precisely as proposed here. Simultaneously, some tools can be built to extract that catalog data and convert it into the data.json format, alongside the existing tools to aid in this transition. I'd be happy to scrub in to help create such a converter!

MarionRoyal commented 11 years ago

Very good discussion. If I might add my couple of pennies...

History

The original DataGov metadata template created back in 2009 was done rather quickly, and at that time DCAT really hadn't taken hold and schema.org wasn't even an idea. The success of the original template was demonstrated by the fact that we only had one major revision (congratulations to the multi-agency team who developed it.) Based on Dublin Core, it clearly had the basic metadata (and some fields that were never used) needed to express the basic concept of a stored chunk of data. If we could have found a simple standard to use instead of building our own, we would have done so. To be fair, though, we had some administrative requirements such as "Does this Dataset meet your agency's compliance with the Data Quality Act?" which would not likely be found in any other standard. It was well understood that the DataGov metadata template was a temporary solution and not a long term "Standard".

DCAT evolved as a means of connecting catalogs together and in doing so, harmonized most of the metadata catalog terms (and incorporated other namespaces like DC and FoaF). Not different enough from the original DataGov template to make a change, but notable as we considered mapping to other schemas.

When we started tying DataGov with Geospatial One-Stop (GOS), we used an API on the Geo Catalog to map and display the metadata at Data.gov using the DataGov template. This was fine as long as there was a rich metadata catalog accessible for geospatial mapping and storage of map services (I have now reached my extent of knowledge of geo-speak). The Geospatial metadata was well established before DataGov albeit with an FGDC evolution to ISO 19115 (I have now reached my extent of knowledge of FGDC vs ISO). Simple mapping from the geospatial metadata to DataGov template was never a problem. However, the DataGov template was a SUBSET of all of the fields found in either FGDC or ISO. The fields used in the geo community were way too specialized to even be considered for non-geo records.

We kept this rich catalog of metadata along with its harvesting capability as we brought together DataGov and GOS and duplicated their infrastructure at geo.data.gov. We used the functionality of the geo.data.gov catalog as a requirement as we deployed and contributed to the CKAN software. The end result is that catalog.data.gov supports harvesting of FGDC and ISO and the basic catalog requirements of GeoPlatform.gov (thank you Doug et al).

There are experts already in this discussion who know the details exponentially better than I do, so I won't dally around. It is important for those who do not know the history to poke around a bit to understand the complexity of the geo community.

So

We are asking the geospatial community to abandon their long-fought process of establishing an international metadata standard through ISO (this process is a career, not a project) and to adopt a NEW CORE schema that is better than the original DataGov metadata template but does not contain the rich catalog information needed by their community.

They just can't do that. Of course they can develop tools to spit out the JSON metadata (subset) from their records. We can do that from Data.gov. But what value is it? It doesn't contain the metadata that any geo-scientist would need. They would still need to maintain the FGDC/ISO records. It would put us in a situation where we are treading down the path to separate catalogs which is the opposite direction that we need to be headed in.

Today

Today we are preparing to harvest agency metadata using the new CORE schema which has been validated against the original DataGov metadata template and is obviously an improvement. The new CORE schema is a SUBSET of a mapping to FGDC/ISO. At DataGov, we will continue to harvest geospatial metadata using FGDC and/or ISO19115 (even if the subset of metadata is in the JSON file), because that's the requirement for GeoPlatform.gov and other geo-scientists.

Reality (from my perspective)

The new CORE metadata schema is a temporary solution, not a permanent one. It will not become an ISO standard. It will not supersede Dublin Core. It will not even supersede DCAT. It is not even based in a well known namespace.

However, it may be what we need just now: a simple way for agencies to make their data available in a way that the public understands it (and by public, I include developers.) It also flips around the manner in which these data are published in a way that applications (other than CKAN and Data.gov) can parse the information and make use of it. Shall we expand this temporary solution in a way that it meets the requirements of all? Heck no!

Instead, we should seek a longer term solution (maybe not permanent but certainly more scalable.) Is the long term solution ISO19115? Sorry no.

I think that the long term solution that we should be working toward is SCHEMA.ORG. I won't go into the reasons. That's a whole different discussion and if I don't make everybody mad, I would love to be part of that work.

So What (in my opinion)

The proposal that this simple (long) JSON file may contain pointers to more complex catalogs seems like a reasonable approach to me. Developers should be sophisticated enough to recognize a field stating that the pointer requires a different parsing mechanism, and to either ignore it or pursue it. In the meantime, let's continue our goal for a long term solution and make that solution the best it can be.


ddnebert commented 11 years ago

Regarding the statement: "CKAN's CSW syndication functionality is great, but this initiative is not about expanding CKAN support within the federal government. The goal is to "produce a single catalog or list of data managed in a single table, workspace, or other relevant location". " it is my understanding that the intention of the .json feeds was to feed the single government data search engine (powered by CKAN) to create a comprehensive view of governmental metadata. The creation of a feed (or catalog service) is only a means to an end since the metadata cache at CKAN provides the necessary interface to access all governmental metadata, or selected metadata via query. BTW, CKAN has the ability to expose its results in json and RDFa, albeit a DCAT flavor, as the result of a query. So, if it is json format that you need, it is also accessible there in addition to several other APIs.

By allowing multiple (already supported) protocols and formats, we have now created a seamless virtual catalog of government metadata. We require the option to reference the existence of metadata collections as WAF or CSW individual entries within the agency json feed. This is an operational solution that fulfills the requirements of both the geospatial and 'raw' data communities - allowing agency and federated views, exposing actionable APIs, enabling counts and tracking, and most importantly enabling access to the data, services, and applications of our federal, state, local, tribal, and academic partners.

waldoj commented 11 years ago

it is my understanding that the intention of the .json feeds was to feed the single government data search engine (powered by CKAN) to create a comprehensive view of governmental metadata.

The stated purpose of providing the specified data.json file, as described in Slash Data Catalog Requirements, is "to make [the catalog] more easily discoverable to private-sector developers and entrepreneurs." Data.gov is just one client.

ddnebert commented 11 years ago

The proposed hybrid solution that feeds a common search and retrieval API (that can return json and other formats) supports that requirement even better. With /data in every agency, developers would need to locate such directories and then visit each one individually. With a common search interface built on the json syndication, developers will have an easier time interacting with the metadata through a single entrypoint.

waldoj commented 11 years ago

I follow, but M-13-13 says:

Any datasets in the agency’s enterprise data inventory that can be made publicly available must be listed at www.[agency].gov/data in a human- and machine-readable format

That's a per-agency mandate. There are lots of details about implementation that can be altered, but fundamentally M-13-13 requires a complete inventory, on the agency website, within /data, as human- and machine-readable data.

ddnebert commented 11 years ago

One could interpret the requirement to be satisfied by including in the json entries for each collection description. You're right, that is a 'mandate' rather than explaining the objective or outcome. If the desired outcome is to produce links to all govt metadata, the hybrid solution satisfies it. We should recommend modification of M-13-13 to include the implemented capabilities that currently provide access to over half a million government data asset descriptions already.

I'm also thinking that json does not strictly satisfy the M-13-13 desire for a 'human readable' format; the query results from CKAN can be formatted and styled in many ways.

gbinal commented 11 years ago

Just to echo part of @MarionRoyal's point about the pragmatic push currently going on, I think it's worth noting that we all know that NOAA and USGS are the two 900-lb. gorillas when it comes to number of entries. They account for 5/6 of the datasets in data.gov currently (32k and 18k entries respectively). I agree with @waldoj that the move is to create a solution for them while keeping intact the simple and clear agency.gov/data.json requirement as it currently exists for the other 168 agencies reporting data in Data.gov. I'm working with those 98% of agencies that should be able to handle this fine for the short and medium term, but agree that we need to figure out something that can scale for NOAA and USGS.

I agree that each level of complexity we introduce to the structures of data.json files increases the burden for third party adoption and costs us more in the long run.

skybristol commented 11 years ago

+1 to @waldoj comment, and -1 to @ddnebert response (sorry, Doug, I have to disagree with you on this one). Is Data.gov the "one ring to rule them all"? I think the world has moved well beyond the one stop shop paradigm whether we're talking about data assets or shoes. As far as I'm concerned, the big driver that the giant comprehensive catalog addresses is the management itch (that all of us should share and appreciate) of determining whether or not we've really done right by the taxpayer in releasing all our wares in a complete, discoverable, and accessible way. If we do that job right, then we should be able to drive all manner of "stop and shop where it makes the most sense" apps across government and the private and commercial sectors.

If we follow the data, information, and knowledge idea, data in context is information, and information leads to knowledge and action. Context is really important. Seismologists, ecologists, and other scientists are interested in different things than resource managers, energy developers, environmental and social activists, and policy analysts (and every other class of data consumer we can think about). All of us probably should have more data at our fingertips when doing whatever it is we're doing so that we can develop a more robust characterization of whatever it is we're examining with data. But it's not a one size fits all world. Why not go about this in a way that better enables the unanticipated good uses of our resource "listings"?

The most important thing about this (in my mind) is that we not conduct this as yet another data call. This process has to get baked into the different agencies at a level that is sustainable and evolvable over time with changing requirements, backend processes, and increased data holdings. The implementation needs to balance between the need for some level of standardization (so that downstream consumers like Data.gov have a somewhat predictable playing field) while allowing for some reasonable variability in processes and methods such that the data providers can figure out how to make it last.

Some of us (gov agencies) have wonderfully mature catalog systems of formal metadata already in place. Others of us have dozens or hundreds of potential catalogs that might not all meet the same level of maturity. Still others have piles of "metadata" in every conceivable format and state of completeness. As with all technology, there are 50 different ways of doing anything. Perhaps we can do a little more work defining the use cases associated with the machine-readable aspect of this deal and then let the agencies come up with the creative ways of getting there.

From what I've been hearing and reading, providing a way for Data.gov to go from a "push me" to a "pull you" way of aggregating is one of those. I'd like to see that use case a little more spelled out in terms of what might be changing for catalog.data.gov. It would also be nice to understand if there is some difference in approach that is anticipated between catalog.data.gov, next.data.gov, and the various other x.data.gov things that seem to be going on.

Another use case I eventually want to pursue is specifically with the major earth science agency partners (USGS, NOAA, NASA, USDA, EPA). Being a USGS guy, I know that there are USGS data assets and derivative data products that live in the holdings of other agencies. I can search NASA's ECHO catalog or NOAA's GeoPortal and find some of them. If we had reached a level of maturity in uniquely identifying everything released with a registered DataCite DOI and referenced those everywhere, the problem of negotiating between different derivations on the same data and understanding authoritativeness might already be solved. But we ain't there yet. So, I might want to write some software to go looking for potential interconnections between things that I know about and things that NOAA knows about based on the raw inventories we are each listing publicly. Sure, I might be able to do that using a Data.gov API once everything is all aggregated there nicely, but then again, maybe I'd rather develop a whole new algorithm based on creating a linked data asset from selective crawls of source material that's not supported by how Data.gov has gone about its aggregation or the form of data provided by its API. Having hopefully established some interconnections with things known from the USGS context, I want to exploit those in different ways through data management practices to make the field cleaner, recommender systems for end users, and other methods.

skybristol commented 11 years ago

@waldoj said...

...fundamentally M-13-13 requires a complete inventory, on the agency website, within /data, as human- and machine-readable data...

It was my understanding that /data/ was fundamentally for the public data listing part of this goal we're shooting for with a data.json (or whatever we end up coming to through this discussion) and some type of human-readable interface (browse, search, etc.). But I understood the complete data inventory (both public and nonpublic data) to be another matter, potentially driven off of various data management systems, agency catalogs, etc. On a teleconference (last week, I think) there was discussion on some uses OMB might be making of the inventory that would make it desirable to also have those available in the same type of JSON format or in some way to facilitate cross-agency analysis.

I don't know who you are, but could you elaborate on your thinking if you are one of those "in the know"?

(Perhaps this issue ought to go off to a new thread.)

ddnebert commented 11 years ago

And what we propose supports both aims - exposing individual agency feeds, some with embedded references to catalogs, and a search/browse facility across this federation to enable broad access. In terms of cart-and-horse, I see the data.gov facilities as a primary means to realize the goals of the Open Data Policy since it supports the pan-governmental view that the .json enabled approach does not support by itself. We can have it both ways with very little extra work. If you read between the lines, CKAN can also expose a filter-driven json feed (or delivery of XML+RDFa files) against all harvested metadata for the whole of government.

skybristol commented 11 years ago

So, circling back to the original concept posed by @jeffdlb and then built on by @ddnebert, it seems like we have the following two proposals:

1) @jeffdlb - Follow the Project Open Data (POD) JSON schema to provide discovery-level metadata for each discrete dataset but allow a network of nodes within a given agency such that top-level data.json might point to other locations where further "data.json" files could be crawled/aggregated to create the whole.

2) @ddnebert - Allow for the use of not only the POD schema but accept established formal metadata standards in XML (ISO19115/19139 and FGDC CSDGM are mentioned), using the top-level agency data.json as an index/directory pointing to catalog services (CSW) or web accessible folders where such metadata can be harvested.

Those seem to be two widely different proposals (if I've got them right), and I wonder if they don't deserve to be restated to start separate threads for debate.

The comment from @ddnebert above seems to point to Data.gov's CKAN implementation as the solution for OMB scrutiny and any other uses of a simple JSON output of discovery-level metadata, allowing for agencies with formal metadata holdings to simply provide those as their public data listing without "dumbing down" the catalog to the more simple POD attributes.

ddnebert commented 11 years ago

Well stated. I would also add that the CKAN implementation can already support harvest of .json and CSW/WAF. It is a minor tweak to identify catalog references within the agency .json file. We can experiment with exposing the federated catalog (CKAN) as filtered .json for developer access with all entries looking the same yet provide access and indexing of the robust metadata where it exists.

mhogeweg commented 11 years ago

@ddnebert what you describe looks something like this http://gptogc.esri.com/geoportal/rest/repositories. That response is a very specific list of repositories registered in a catalog. A client could take this list for harvesting, syndication, brokering, synchronization, indexing (or whatever the term of the day is).

I'm curious to the ways people expect to use Data.gov. Yes, there is a CKAN interface and I've integrated with that to perform some basic searching, say for water quality.

Currently the CKAN API doesn't seem to return a total count of items. On providing DCAT: you can also get the [same search results as DCAT](http://gptogc.esri.com/geoportal/rest/find/document?rid=CKAN&searchText=water%20quality&start=1&max=10&f=dcat) from the same site. All of this does not prevent or mandate full verbose ISO/FGDC metadata; it would just expose to a user what is relevant given the user's search request.

What I think we all want to avoid is creating a one-of-a-kind process like the one used in Recovery.gov for transparency reporting or in the EPA CDX environmental reporting. Those are good use cases for a specific process because specific information is exchanged at set frequencies with one user. In the field of open data (geo or non-geo) we all want to find that one awesome dataset, regardless of where it is registered or what website/client we use... #different

ultrasaurus commented 11 years ago

If I understand the details of this thread correctly, I'd like to offer Libraries and Archives as another use case in support of:

2) @ddnebert - Allow for the use of not only the POD schema but accept established formal metadata standards in XML (ISO19115/19139 and FGDC CSDGM are mentioned), using the top-level agency data.json as an index/directory pointing to catalog services (CSW) or web accessible folders where such metadata can be harvested.

My understanding is that under this proposal there would be a /data.json that references pre-existing standard repositories that are in use by established communities of data publishers and developers. Archives and libraries are early adopters of open data, with MARC and EAD standards already published and in active use by various communities and partners. Notable is the Digital Public Library of America (relevant background here: http://dp.la/info/get-involved/partnerships/), which already aggregates a huge amount of public data. Many Smithsonian archives and libraries already have public data repositories that are contributed to DPLA. Like the scientific and technical government agencies, the Smithsonian Institution has a huge amount of open data, already in use by scientists and humanities researchers.

I believe that it would be a huge win to publicize these existing resources and reference well-established standards, drawing new developers and industry into existing communities of experts.

seanherron commented 11 years ago

I think that this conversation boils down to two points that everyone can probably agree on: 1) we want it to be easy for people to find government data and 2) we want it to be easy for government to make data available to the public.

With those two core objectives in mind, I'd like to highlight what I perceive to be the primary ways to make that happen:

1) If you go to agency.gov/data, you are able to view that agency's best-faith effort at indexing all of the data it has, in both a human- and machine-readable interface.

2) Agencies across the government use a single standard to publish that index so that they can be easily aggregated. That standard is straightforward, simple, and is designed to be a starting point to other linked datasets.

3) Agencies need a minimal amount of resources to publish data in that standard - the barrier to entry is low, even for someone with no familiarity with the standards world.

As @gbinal noted, we all know the geo agencies are way ahead of everyone else in terms of publishing and sharing data. However, the more complexity we add to the schema, the fewer people at other agencies will understand it. I think this is a great example of a time when it is best to make things simple. For the agencies which are already publishing metadata, it is relatively easy to convert that to a single data.json file (using either one of a number of tools listed on this site or a quick parser they can write themselves). While that data.json file may not contain the rich information they are used to publishing, it does mean that now they are talking on the same playing field as the other agencies which are publishing data.json files. Any other relevant information they want to provide can be listed as expanded fields, and any other linked data they want to publish can be listed under something like endpoint, download url, or data dictionary (which people can then scrape and pull from).
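For instance, a record converted from richer geospatial metadata could keep the common core fields and point back at the source metadata through the kinds of fields mentioned above. The sketch below is purely illustrative, not a prescribed mapping; the titles, URLs, and the choice of webService/dataDictionary as the pointer fields are assumptions.

    {
      "title": "Borehole temperature observations, State Geothermal Database",
      "description": "Discovery-level record generated from the full ISO 19139 metadata.",
      "accessURL": "https://geothermal.example.gov/downloads/borehole-temps.zip",
      "format": "application/zip",
      "accessLevel": "public",
      "webService": "https://geothermal.example.gov/csw",
      "dataDictionary": "https://geothermal.example.gov/metadata/borehole-temps-iso19139.xml"
    }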

In the interest of keeping things simple, my vote would be to, for now, focus on creating this single file. How agencies get there is up to them - if they want every internal organization to publish their own data.json file which is then aggregated in to a single, top-level file, that's fine. But to introduce what would be fairly substantial changes to the schema this close to November would, in my opinion, add unneeded complexity and ultimately make it more difficult for third party services (only one of which is data.gov) to accomplish the indexing we are trying to achieve.

:space_invader:

ddnebert commented 11 years ago

The JSON file is a means to an end, supporting syndication of content to be harvested into the index (search engine). How will programmers know where these agency files are? Programmers will not likely attach to every agency to download and parse the files, index them themselves, in order to find data. It is the search engine and API that makes these feeds and the things they point to (like catalogs or data or services) most valuable.

Our proposal is already simple and supports the common schema and indexing tools. The json file can include one or more references to metadata catalogs that contain more detail in addition to raw metadata with its simpler descriptions. All the hundreds of thousands of records get indexed into the search engine in support of the two points you identify: 1) we want it to be easy for people to find government data and 2) we want it to be easy for government to make data available to the public. The end user is given the search capabilities in catalog.data.gov that do not exist on the json file - that is not its intention. They will also see the same type of common record on initial delivery but are given the option to dive further, if interested, into the full metadata. The geospatial publishers are already up and running with CKAN support and it does not impose any burden on the 'raw' data publishers.

seanherron commented 11 years ago

@ddnebert, it may be useful if you could develop a proposed schema change and submit it as a pull request.

I'm not sure how your proposal is better than the status quo in regard to enabling programmers to know where the agency files are. With the status quo, we centralized this so that agency.gov/data.json is the known place to grab the data, or you can latch on to a service like data.gov and query via API. Your proposal seems to complicate this with additional crawling needs and more schema information; however, if you could provide an example implementation, maybe it will help me understand more accurately.

waldoj commented 11 years ago

Programmers will not likely attach to every agency to download and parse the files, index them themselves, in order to find data.

I can't see why not. It'd be trivial to loop through a list of every federal government website, grab the data.json file, and gather data from it. That's perhaps five minutes of work. More likely, though, people are going to get the data.json files from the few agencies they're interested in, or from those comprising the totality of datasets for the topic they're interested in: every spending-related dataset, every geospatial dataset, etc. That's very simple under this plan.
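A rough sketch of that "five minutes of work", assuming you already have a list of agency domains and that each file is a top-level JSON array of records (the domain list and URL pattern below are purely illustrative):

    import json
    import urllib.request

    AGENCY_SITES = ["www.doi.gov", "www.noaa.gov", "www.energy.gov"]  # illustrative, not a complete list

    catalog = []
    for site in AGENCY_SITES:
        url = "https://%s/data.json" % site
        try:
            with urllib.request.urlopen(url, timeout=30) as resp:
                catalog.extend(json.load(resp))      # assumes each file is a JSON array of records
        except Exception as err:                     # skip agencies that are down or missing the file
            print("skipping %s: %s" % (url, err))

    print("%d dataset records collected" % len(catalog))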

Your alternate proposal (as I understand it; there's no pull request to evaluate) requires that, instead, developers query a series of different types of catalog files, varying between agencies, nested within an existing dataset, to even find out which datasets are available. I can't see why we wouldn't expose the entire list of datasets at a data.json level, including of course the specialized, existing metadata catalogs that you're describing, since they are their own dataset. Anybody who wants to browse those specific endpoints to get detailed information can do so, while those who are happy with the limited data provided within data.json don't need to do so.

As I've said before, the stated goal of this endeavor is to "produce a single catalog or list of data managed in a single table, workspace, or other relevant location". Putting some data behind a different protocol runs counter to this goal (the data then ceases to be "in a single table"), and erects a significant hurdle to anybody who wants to syndicate that data (requiring parsing n catalog formats, instead of just 1).

ddnebert commented 11 years ago

I made a pull request in May to amend the implementation document. There is no schema to change, only practice to codify: https://github.com/project-open-data/project-open-data.github.io/pull/4

My point is that there is already a robust open API on CKAN that enables query (search facets) using Lucene/Solr on all government data in catalog.data.gov. We always thought that the proposal for having JSON files was primarily to syndicate records for ingest into the catalog and search API. Compared to locating all the federal JSON files, indexing them yourself, and immediately being out-of-date, doesn't it make sense to use the existing search facilities and API? The CKAN harvester already knows how to parse all the json records and all the robust geospatial metadata. The result is a single searchable index - and actually, you could attach to and request a single colossal JSON file from CKAN if you wanted to, with a common identical schema, or a subset for an agency. Or, you could use the open query API to do much more advanced things.

This has no other purpose than the stated goal: "to produce a single catalog or list of data" for the entirety of government. It is now a single, cached catalog with an API, not just a series of files at agencies. JSON and references to catalog services supply this index very nicely.

mhogeweg commented 11 years ago

@ddnebert I have asked some questions on the CKAN API over at the open data stack exchange that I hope someone can look at. I'm missing some things that would make the robust API robuster... these things will make it easier for developers to interact with the API without downloading a 1,000,000 entry data.json from data.gov and parsing that (which would negate the need for an API at data.gov to begin with).

ddnebert commented 11 years ago

The current API primarily supports search so that you can request items of interest and specify the format of the response, where options exist. The primary response is DCAT in JSON format. Being open source, we can add a US govt flavor of JSON as a response option. I would agree that downloading a very large data.json file is not typically what one is looking for. The main purpose of a catalog should be to allow filters/query to extract (meta)data in many ways.

mhogeweg commented 11 years ago

Fixed the link to the CKAN API question above. The CKAN flavor of JSON response is fine; I'd just ask that pagination is supported properly and that the API allows for at least the same kinds of queries as the UI.

rsignell-usgs commented 11 years ago

It looks like developers working with Europe's public data (http://publicdata.eu/) have figured out how to use the CKAN API -- check out the apps at http://publicdata.eu/related

I like that I see a Python toolkit for the CKAN API at https://github.com/okfn/ckanclient.

mhogeweg commented 11 years ago

@rsignell-usgs they have implemented a paging mechanism that I can't find in the API docs. Perhaps publishing an OpenSearch descriptor would help developers interact with a simple API.

waldoj commented 11 years ago

We always thought that the proposal for having JSON files was primarily to syndicate records for ingest into the catalog and search API. Compared to locating all the federal JSON files, indexing them yourself, and immediately being out-of-date, doesn't it make sense to use the existing search facilities and API?

I don't want to beat a dead horse (I'd just be repeating my previous comments here), so I'll just say again that your proposal that thousands of datasets be omitted from data.json files runs counter to the specific mandates within M-13-13, and thus requires enabling language in the form of a White House policy memo.

ddnebert commented 11 years ago

Perhaps the discussion is moot - it was clarified today on the POC call that the mandate applies to Departments and independent Agencies for execution. Which means that, at least in our case, DOI will be collating, preparing, and feeding MAX. How individual bureaus work with the Departments to create this posting can be subject to other arrangements. The result will be a www.doi.gov/data.json file emanating from a CKAN instance at DOI with all our geospatial and raw metadata in it. Meanwhile, in catalog.data.gov, all the ingested metadata will be available for live search via the query API based on opensearch.

bsweezy commented 11 years ago

Wouldn't developers prefer to query data.gov's CKAN API rather than track down every agency.gov data.json?

seanherron commented 11 years ago

@bsweezy Yes, but data.gov needs to get that data from somewhere. Data.gov will pull from each data.json file, and external orgs that want to make a competitor can do the same.

mhogeweg commented 11 years ago

@seanherron with 'competitor' you surely meant 'an additional channel to open data goodness', right? ;-)

seanherron commented 11 years ago

@mhogeweg yes - my hope is that other groups continually put a little heat on the data.gov team to keep improving ;)

waldoj commented 11 years ago

I'm pretty psyched for the possibility of a competitor, which data.json facilitates. Data.gov is great, and it keeps improving, but competition raises the possibility of better things still. There are some things that government can't do, for reasons of politics or limits of power or privacy, but that the private sector can do. I bet we'll find that some things in that area can be applied nicely to government data inventories.

MarionRoyal commented 11 years ago

Perhaps we can maintain and publish on DataGov a list of all harvest sources (with syntax/schemata capabilities) to make it easier for our competitors. I don't think it could get much warmer though.


ajturner commented 10 years ago

This conversation seems to have devolved into a bit of philosophy and diverged from the original request.

I'm interested in whether a practical decision has prevailed. Pagination is a simple capability that every developer and tool understands well. Anyone reading this thread has probably paginated over twitter/github/email/RSS feeds/etc. Catalogs are growing in size, and as @jeffdlb points out, a good, simple spec can grow adoption across multiple platforms and internally. We're already seeing catalogs in the 10k-100k+ range.

For simple practicality, OpenSearch-Atom has helpful next links:

    <link rel="self"     href="http://example.com/New+York+History?pw=3&amp;format=atom" type="application/atom+xml"/>
    <link rel="first"    href="http://example.com/New+York+History?pw=1&amp;format=atom" type="application/atom+xml"/>
    <link rel="previous" href="http://example.com/New+York+History?pw=2&amp;format=atom" type="application/atom+xml"/>
    <link rel="next"     href="http://example.com/New+York+History?pw=4&amp;format=atom" type="application/atom+xml"/>
    <link rel="last"     href="http://example.com/New+York+History?pw=42299&amp;format=atom" type="application/atom+xml"/>

gbinal commented 10 years ago

I've been discussing the initial issue raised here (single file vs. federated files) with others, and there's still conflict over the right balance. Agencies feel the need to be able to federate, but there's still a compelling interest in the simple requirement that all data from an agency be accessible in a straightforward, direct way.

gbinal commented 10 years ago

FYI - This also overlaps with Issue #308.

rebeccawilliams commented 9 years ago

@jeffdlb, I am curious if these tools satisfy your original concerns:

jeffdlb commented 9 years ago

@rebeccawilliams - Thanks for your follow-up note. The short answer is No, they unfortunately do not satisfy the original concern.

The fundamental concerns raised in my original post are that (a) a huge flat file with no structure, organization, sorting, or partial-update capability is not a very useful way to exchange information about large inventories (we have >63000 entries) whose contents change with time; and (b) we already produce metadata in ISO or FGDC XML format that is far more complete and better-structured than the data.json approach.

Again, thanks for following up.

Regards, Jeff DLB

mhogeweg commented 9 years ago

I agree with @jeffdlb's view. In Esri Geoportal Server we can harvest other data.json files and provide a single file, as well as provide pagination using data.json as one of the output formats of our OpenSearch endpoint. This would allow a harvester to page through the contents of a catalog quite easily, fetching chunks of the catalog instead of a single large file.

To overcome the updating issue raised, we generate a 'cached' version of the catalog's content in data.json on a regular basis (hourly, daily, weekly, depending on desired frequency).
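A harvester's side of that paging could look roughly like the following sketch. The endpoint is hypothetical, the start/max/f parameters simply mirror the Geoportal example URL linked earlier in this thread, and the response is assumed to be a plain JSON array of records (real DCAT output may be wrapped differently).

    import json
    import urllib.request

    BASE = "https://geoportal.example.gov/rest/find/document"   # hypothetical endpoint
    PAGE_SIZE = 100

    records, start = [], 1
    while True:
        url = "%s?f=dcat&start=%d&max=%d" % (BASE, start, PAGE_SIZE)
        with urllib.request.urlopen(url) as resp:
            page = json.load(resp)           # assumed to be a JSON array of records
        if not page:                         # an empty page means the catalog is exhausted
            break
        records.extend(page)
        start += PAGE_SIZE

    print("harvested %d records" % len(records))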

Brykerr78 commented 5 years ago

😂😂😂😂😂😂

sheriff-Nik commented 5 years ago

What?
