project-open-data / project-open-data.github.io

Open Data Policy — Managing Information as an Asset
https://project-open-data.cio.gov/

Unique Identifiers should probably be globally unique rather than only unique for an org #69

Closed willpugh closed 9 years ago

willpugh commented 11 years ago

Looking at the description for "identifier" and the guidance, the suggestions are to make the identifiers unique for an Agency or Catalog.

I think we should actually change the guidance to be globally unique. This could be something as easy as adding agency name before the agency identifier, e.g. gov.hhs.aspe.1234567 instead of 1234567. Or it could simply be a URI.
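As a rough sketch of the prefixing idea, the snippet below builds a dotted identifier from an agency path and an existing local ID. The function name `make_global_id` and its validation rules are made up for illustration; only the `gov.hhs.aspe.1234567` shape comes from the example above.

```python
def make_global_id(agency_path, local_id):
    """Prefix a locally unique ID with a dotted agency path (hypothetical helper)."""
    # Normalise each component so "HHS" and "hhs" yield the same identifier.
    parts = [p.strip().lower() for p in agency_path]
    if not parts or any(not p for p in parts):
        raise ValueError("agency_path must contain non-empty components")
    local = str(local_id).strip()
    if not local:
        raise ValueError("local_id must not be empty")
    return ".".join(parts + [local])


print(make_global_id(["gov", "hhs", "aspe"], 1234567))  # gov.hhs.aspe.1234567
```

The agency keeps its own numbering; the prefix is what makes the result unique outside that agency.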

There is no guarantee that catalogs won't aggregate or pull from other catalogs (and it is probably even likely). If this happens a single catalog could get two items from different agencies that have the same identifier.

One solution to this would be to make it a function of the catalog to add something to these datasets to give them unique IDs. The problem with this is that the "uniqueness" is created by an intermediate party (not the source of truth).

So, now imagine that I set up a catalog that is going to aggregate a lot of health data. I pull from New York State's catalog, as well as a catalog from HHS. I then get the exact same dataset (since HHS pulls from New York State's catalog, unbeknownst to me), but it has two different identifiers in my system. One is NY State's original ID. The other is the ID that HHS's catalog had to create for me (to make sure it was unique within their catalog).
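The aggregation scenario can be sketched in a few lines. Everything here is hypothetical (the `aggregate` helper, the dataset titles, and IDs such as `gov.ny.health.42` and `hhs-9001`); the point is only that a key minted at the source lets an aggregator recognise the same dataset arriving by two routes.

```python
def aggregate(*catalogs):
    """Merge catalogs into one dict keyed by identifier (hypothetical helper)."""
    merged = {}
    for catalog in catalogs:
        for dataset in catalog:
            # Keep the first record seen for each identifier.
            merged.setdefault(dataset["identifier"], dataset)
    return merged


# Case 1: the source assigns a globally unique ID and HHS passes it through.
ny_global = [{"identifier": "gov.ny.health.42", "title": "Hospital Discharges"}]
hhs_global = [{"identifier": "gov.ny.health.42", "title": "Hospital Discharges"}]

# Case 2: the source ID is only locally unique, so HHS mints its own.
ny_local = [{"identifier": "42", "title": "Hospital Discharges"}]
hhs_reminted = [{"identifier": "hhs-9001", "title": "Hospital Discharges"}]

print(len(aggregate(ny_global, hhs_global)))   # 1
print(len(aggregate(ny_local, hhs_reminted)))  # 2
```

In the second case the aggregator ends up with two records for one dataset and has no identifier-based way to tell they are the same.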

Seems like we could protect ourselves from a lot of identification problems down the line if we ask for global uniqueness for identifiers.

I would also suggest that as guidance, if an Agency already has an agency-unique identifier, they should probably make it globally unique by prefixing it with their agency (rather than using something opaque like a UUID).

jpmckinney commented 11 years ago

+1 using URIs

jqnatividad commented 11 years ago

+1 using URIs. And to go a bit further, perhaps a federated registry function for each catalog can be established to maintain these URIs.

And this registry can be simply implemented as a standard dataset within each catalog.

Going by @willpugh's example - the data.gov catalog will have a dataset listing all federal entities (with their URIs) and point to the authoritative entity for other non-federal jurisdictions (e.g. States and Cities) who maintain a similar dataset within their catalogs listing all the entities within their jurisdiction (again with their corresponding URIs).

Linked Catalogs :)

jqnatividad commented 11 years ago

DCAT Namespace System - DNS for data catalogs :)

jpmckinney commented 11 years ago

DCAT generally uses URIs for IDs already (pretty much all linked data does) - it's very unusual for this project's implementation to not use URIs.

BernHyland commented 11 years ago

++1

Unique identifiers should be unique on the Web, represented as HTTP URIs, and ideally HTTP URLs (resolvable).

Bernadette


benbalter commented 11 years ago

:+1: for URLs.

seanherron commented 11 years ago

Came here to seek guidance on this exact issue. I'm generally thinking that URLs may not be the best approach - no matter how hard we try, URLs will shift and change over time, especially when we are talking about something as relatively portable as data. We should "name" data based on its identity, not its location.

I think the schema of (gov).(agency).(agency).(etc).(id) works well - organizations can be as precise as they would like (e.g. gov.hhs.fda.ompt.ctp.4578418 or gov.hhs.fda.78714547). Ideally, people could then also query against this ID to discover what agency the data originally came from, as that's not one of the definable fields in the schema (publisher kind of covers this, but it probably won't be something you can query against since there isn't a standard convention defined).

jpmckinney commented 11 years ago

A URI doesn't need to be resolvable on the web. urn:issn:1535-3613 is a URI. The scheme you describe can very easily be a valid URI if you replace periods with colons.
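A quick way to see the difference is to check which strings parse with a URI scheme. This is only an illustration using Python's standard `urllib.parse.urlsplit`; the helper `dotted_to_colon` is made up here, and the resulting `gov:` scheme is syntactically valid but not a registered one.

```python
from urllib.parse import urlsplit


def dotted_to_colon(identifier):
    """Replace periods with colons, as suggested above (hypothetical helper)."""
    return identifier.replace(".", ":")


candidates = [
    "urn:issn:1535-3613",                     # a URN: a URI that is not a URL
    "gov.hhs.fda.78714547",                   # dotted form: no scheme at all
    dotted_to_colon("gov.hhs.fda.78714547"),  # colon form: "gov" parses as the scheme
]

for candidate in candidates:
    print(candidate, "-> scheme:", repr(urlsplit(candidate).scheme))
```

The dotted string has an empty scheme, so by itself it is a relative reference rather than a URI, while the colon form parses with a scheme like the `urn:` example.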

seanherron commented 11 years ago

Right - that's why I said URLs weren't my preferred approach. URIs are good. As for the colons - yes, you are right, I forgot the standard calls for colons rather than periods.

jahendler commented 11 years ago

Let me argue strongly for the use of URIs (and particularly URIs in the http: scheme, that is to say URLs). To start, it's a common misconception that URLs are "location sensitive": the first field, the server, is usually the location of the URL, but remember the web has underlying machinery so that if things move, the URLs can be preserved. But more importantly, it is precisely the dereferenceability of URLs that makes them so powerful. Type "gov.hhs.fda.ompt.ctp.4578418" into your browser (or app, or API, etc.) and you get an error.

If we had a good URI scheme for this, however, you could type http://data.gov/id/us/fed/agency_page/Department_of_Health_and_Human_Services and you know you are designating a particular agency. More importantly, type that into a browser and it would resolve to a page that had both human and machine readable information about the agency. And this can go deeper: we could design a schema to name common terms or elements that agencies use. For example, the EPA could have something in their control for naming chemicals, so we would standardize on something like http://data.gov/source/epa-gov/id/us/toxic-chemicals_page/1_1_1_2-tetrachloroethane to designate a particular chemical.

Those two URIs won't resolve now because they are not set up at data.gov, but working with Chris Musialek when he was with the project, we designed some pages to show the power this would allow - if you go to http://logd.tw.rpi.edu/id_page then you can see lists of states, agencies, chemicals, crops and other things in this form. There's discussion there of the URI scheme (based in part on the UK's scheme that was designed to a large degree by Tim Berners-Lee, inventor of the Web, and Nigel Shadbolt) - that discussion is at http://logd.tw.rpi.edu/instance-hub-uri-design

Based on this scheme, you can see dereferenceable versions of the two previous URIs: http://logd.tw.rpi.edu/id/us/fed/agency_page/Department_of_Health_and_Human_Services and http://logd.tw.rpi.edu/source/epa-gov/id/us/toxic-chemicals_page/1_1_1_2-tetrachloroethane
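To make the pattern concrete, here is a minimal sketch that builds the agency URI from a name. The helper `agency_id_uri` is hypothetical, and the rule "replace whitespace with underscores" is inferred from the single agency example above, not taken from the published URI design.

```python
def agency_id_uri(agency_name, base="http://logd.tw.rpi.edu"):
    """Build an agency identifier URI following the pattern in the example above.

    Assumption: spaces in the name become underscores; nothing else is escaped.
    """
    slug = "_".join(agency_name.split())
    return f"{base}/id/us/fed/agency_page/{slug}"


print(agency_id_uri("Department of Health and Human Services"))
print(agency_id_uri("Department of Health and Human Services", base="http://data.gov"))
```

Swapping `base` is all it would take to move the same path structure from the demonstration host to data.gov.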

Note that a URN scheme or other URI can be built to be very similar to this (we've done some with the CNRI Handle system for large scientific communities), but the URL-based scheme has the real value that everyone already has all the mechanisms for resolution running, for free, on every computer, phone, etc. they may own.

Oh, and some folks are worried about the length of the elements in the URL scheme, but note that schemes for dealing with this are already in place. I cannot use the government name shorteners, but, for example, the chemical page above can also be reached at http://bit.ly/1cSk0fV

There are other benefits I could go on about: for example, the URLs are search engine friendly, they work well with both mobile and web technologies, and they work without change on social media sites and others.

Rather than going on at even more length, here are a couple of pointers to more about URI stuff for open data. Jeanne Holm, George Thomas, Chris Musialek and I published a paper in IEEE Intelligent Systems about how this stuff is used in various parts of Data.gov http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=6185527&sortType%3Dasc_p_Sequence%26filter%3DAND%28p_IS_Number%3A6237448%29 and I have a number of talks and slideshares floating around the Web that I can point people at if they want excruciating detail of all this sort of stuff (a short, accessible talk that outlines some of this is the one I gave at Strata - http://www.youtube.com/watch?v=Cob5oltMGMc)

philipashlock commented 9 years ago

It seems like there's basically consensus on using URIs, but not quite agreement on URLs. Perhaps we encourage URLs that can persist via a URI resolution service.

If there's no resolution service then a URI like doi:10.7927/H4PZ56R2 would be acceptable but if there's a resolution service, you could use a URL like http://dx.doi.org/10.7927/H4PZ56R2 or http://data.datacite.org/10.7927/H4PZ56R2
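The relationship between the two forms is mechanical, as this small sketch shows. The function name `doi_to_url` is hypothetical; the resolver prefixes are the ones quoted above.

```python
def doi_to_url(identifier, resolver="http://dx.doi.org/"):
    """Turn a 'doi:' URI into a resolvable URL by prepending a resolver prefix."""
    prefix = "doi:"
    if not identifier.lower().startswith(prefix):
        raise ValueError(f"not a doi: URI: {identifier!r}")
    # Everything after "doi:" is the DOI name itself.
    return resolver + identifier[len(prefix):]


print(doi_to_url("doi:10.7927/H4PZ56R2"))
print(doi_to_url("doi:10.7927/H4PZ56R2", resolver="http://data.datacite.org/"))
```

The DOI name stays the persistent part; only the resolver prefix is tied to a particular service.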

philipashlock commented 9 years ago

@andreamedinasmith What are your thoughts on a URI vs a URL such as with the example I provided above?

mhogeweg commented 9 years ago

Just implement @timbl's suggestion from 2006:

'nough said

andreamedinasmith commented 9 years ago

The CrossRef and DataCite guidelines state that DOIs should be listed as URLs and that falls in line with what Marten suggests.

Andrea Medina-Smith Metadata Librarian Information Services Office National Institute of Standards and Technology

andrea.medina-smith@nist.gov (301) 975-2592


philipashlock commented 9 years ago

Thanks @andreamedinasmith and @mhogeweg. Do either of you know whether there are already recommendations on using URLs (or URIs more broadly) as unique identifiers within the federal government? I suspect there are, but couldn't find any.

andreamedinasmith commented 9 years ago

I don't know of any beyond what orgs have agreed to do if they mint DOIs of any sort.


mhogeweg commented 9 years ago

perhaps @tedhabermann has some suggestions for this.

philipashlock commented 9 years ago

We've included language to recommend URLs, but not require them. Here's the main diff for the change

BernHyland commented 9 years ago

We are hardly in the early days of machine readable data. Rather, we're 5+ years into teaching/mentoring government agencies on how to publish & consume machine & human readable content. Policy & implementation approaches for persistence schemes have been described in detail by federal governments in the US, UK, Netherlands and others. I can get you references if asked.

Government librarians have been dealing with issues of persistence on the Web for close to two decades. Let's take some input from these digital information experts on what they've implemented.

There are several persistence schemes, including: DOIs (used by libraries & traditional academic publishers), The Handle System (used by libraries, government), LSID (used by life sciences), INFO URIs (used by libraries & publishers), and PURLs (used by libraries, governments and life sciences).

The two Open Source Software projects that I've worked with on behalf of the US Government that incorporate PURLs are http://purlz.org and http://callimachusproject.org. See also, http://en.wikipedia.org/wiki/Persistent_uniform_resource_locator

My experience is with PURLs. PURLs provide persistent URLs for information resources on the Web. PURLs work by simple HTTP redirection. Unlike some other persistence schemes, PURLs are specifically designed using the fundamental principle of the Web, the humble but flexible HTTP URI. PURLs have evolved significantly since they were introduced in the mid 1990s by OCLC. IMHO PURLs are an open & flexible option for Web information resources, but there are no doubt merits to other schemes.
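The redirection idea can be sketched as a lookup table that maps a persistent path to wherever the resource currently lives. This is not how purlz.org is implemented; the table, paths, `example.gov` targets, and the choice of a 302 status are all assumptions made for illustration.

```python
# Hypothetical PURL table: persistent path -> current location of the resource.
PURL_TABLE = {
    "/purl/example/dataset-1": "http://example.gov/data/2014/dataset-1.csv",
}


def resolve(path, table=PURL_TABLE):
    """Return (status, headers) the way a redirecting resolver might answer."""
    target = table.get(path)
    if target is None:
        return 404, {}
    # Assumption: a temporary redirect, so clients keep using the PURL itself.
    return 302, {"Location": target}


print(resolve("/purl/example/dataset-1"))

# When the file moves, only the table entry changes; the PURL stays the same.
moved = dict(PURL_TABLE)
moved["/purl/example/dataset-1"] = "http://example.gov/archive/dataset-1.csv"
print(resolve("/purl/example/dataset-1", moved))
```

Anything that cited the persistent path keeps working after the move, which is the property the identifier field needs.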

The US Government Printing Office has been serving over 1,200 academic libraries with documents and data deemed essential to American Democracy using a mature persistent URL strategy and implementation built on the OSS project http://purlz.org. They service about 40M hits per month on their production service, the majority coming from machines.

PURLs today handle very sophisticated use cases and have management & reporting applications built on top of them, however, for the purposes of POD Metadata Schema, PURLs may be an easy win -- Open Source Software, used by US Government agencies for over 15 years serving >1,200 university libraries. If you want an intro to the program manager at GPO, let me know & I'll connect you @philipashlock.

Finally, in 2014 and beyond, if the goal of the POD Metadata Schema is to advance the discovery, access & re-use of open government data catalog info, then specifying URLs (resolvable URIs on the Web) as a "MUST" isn't a tall order (I'm referring to the catalog info only). Of course, all the data using HTTP URIs is the goal, but we can make that a "SHOULD" (aka "Recommended").

Consider the billions of dollars taxpayers spend for our favorite data agencies, e.g., NOAA, Census, Economic Affairs, HHS, EPA, DOE (and others that shall remain nameless), to collect data. The least they can be required to do is describe their data catalogues using the published standards & best practices, which include JSON-LD and, of course, URLs (there is an echo in the room ;-).

philipashlock commented 9 years ago

Thanks @BernHyland. It's easy to find government-wide guidance for this in other countries like the UK (e.g. the Persistent resolvable identifiers profile), but I've been surprised to have trouble finding anything comparable for the US. If there's anything like that you could reference, even if it's specific to offices like GPO, that would be very helpful. Since this consideration wasn't addressed in the first iteration and because identifiers in any form are meant to be persistent, we obviously have to be extra careful in changing their requirements - and if we do, we'll have to outline a migration pathway that's not too disruptive for anything depending on current non-URI identifiers.

Considering GSA manages the .gov domain system, there's some chance it could play a leadership role in this space, both in terms of providing guidance and actual infrastructure, but I think that will need to be broader than the experiments done with data.gov in the past.

philipashlock commented 9 years ago

@jahendler Do you know if the IEEE article about URI stuff for open data is available at a publicly accessible resolvable URL? ;)

If it has contributions from federal employees or as the result of federal funding, it doesn't seem like it should be behind a paywall - not to mention that the arrangement would seem to contradict the subject matter that I suspect the article covers. That kind of barrier is even more frustrating than publishing recommendations about machine readable open data as a PDF.

BernHyland commented 9 years ago

@philipashlock It is easy to find guidance on persistence policy for countries including the UK & Netherlands because the respective government stakeholders engaged with Web architects and the W3C (with government stakeholders as participants). NB: we did have some terrific input from US Gov't stakeholders from Library of Congress, EPA, HHS, NASA and research institutions including Rensselaer Polytechnic Institute (RPI).

I agree 100% with you that GSA 'could play a leadership role in this space'. In fact, I'd take it a step further and say that GSA SHOULD, as manager of the .gov domain, lead on this key piece of infrastructure. Persistence strategy, implementation, monitoring & reporting isn't glamorous per se but it's critical for consistency of all government data dependent upon the universal addressing scheme of the Web, namely HTTP URIs.

As editor of the Best Practices for Publishing Linked Data Recommendation (published by W3C 21-Dec-2013), I wrote the section titled "The Role of Good URIs for Linked Data", see http://www.w3.org/TR/ld-bp/#HTTP-URIS. The URI strategy section was informed & reviewed by the work of Professor Jim Hendler (RPI) and his team's work on URI strategy for US Government data. Best Practices also references "Architecture of the World Wide Web, Volume One" (published by W3C 15-Dec-2004), from a decade earlier. My point is, guidance on this whole URI question has been around for over a decade. You can simply reference the international specifications ("Recommendations" in W3C speak). No one need reinvent the wheel. I'll also send you an intro to the program manager at GPO who can provide advice on how they've been handling persistence strategy since the late 1990s.

RE: US GSA's involvement in URI strategy ... George Thomas (HHS) and I offered detailed advice to your predecessor at GSA, Chris Musialek, in early 2012. At that time George and I were co-chairing the W3C Government Linked Data working group and encouraged Chris and Jeanne to participate in our weekly discussions that included information architects & researchers from the UK, Netherlands, Brazil, France, etc. FWIW, I spent a half day on January 10, 2012 with Chris in a F2F meeting at GSA's office (on First Street, NE) briefing him on persistence strategy, how it works, and how GPO uses the open source PURLs project (purlz.org) with some added features for reporting & monitoring. Chris was interested but the needs for data.gov were great in 2012 & he had other pressing priorities.

Perhaps in 2014/15, GSA can make unique identifiers & a persistence strategy a priority. After all, as of May 9, 2014, in a bi-partisan effort by the 113th Congress, the DATA Act became law, (Public Law 113–101). That should make the business case for GSA to offer guidance/infrastructure on "unique identifiers applied Government-wide", (see section 4 - Data Standards in http://www.gpo.gov/fdsys/pkg/PLAW-113publ101/pdf/PLAW-113publ101.pdf), easier to support.

gbinal commented 9 years ago

Agencies are strongly encouraged to do this in current guidance. I think we can close this.