project-open-data / project-open-data.github.io

Open Data Policy — Managing Information as an Asset
https://project-open-data.cio.gov/

license field needs more guidance #196

Open regongithub opened 10 years ago

regongithub commented 10 years ago

Some datasets are copyrighted -- can't address that in current field or point to specific license guidance via URL.

gbinal commented 10 years ago

Agreed, in particular for the 'accepted values,' 'usage notes,' and 'example' material in the 'Further Metadata Field Guidance.' @benbalter @waldoj - any thoughts?

waldoj commented 10 years ago

Huh. Funny that we hadn't considered this previously. I'm reminded of the town I lived in as a kid, a planned community near D.C. Famously, the Rouse Brothers, who planned the community, didn't include any cemeteries. It hadn't occurred to them that people would die.

Even our example for this isn't good. {"license":""} isn't an example, it's a non-example. {"license":"CC-BY-SA"} is a reasonable example. I recall discussion as to whether we wanted people to name the license here, or provide a URL linking to the license, but I can't see in the documentation what we settled on.

If this is a URL then it seems like we'd want to point to the terms of use. I mean, if the data is actually copyrighted, period, then presumably it wouldn't be on an API. There must be some terms of use, a license under which people can use that data, even if it's not a license in the cookie-cutter sense like Apache, GPL, MIT, Creative Commons, etc. It seems to me that we wouldn't want agencies sharing copyrighted data via an API that has no license allowing it to be reused.

And if this is a text field, then I imagine we just need to agree on and document a string (e.g., copyrighted) that covers this. Of course, "copyrighted" isn't a license at all, so I'm not actually proposing using that.

gbinal commented 10 years ago

Taking a page from CKAN, we've actually been leaning towards a text box for the name of the license. Given the pros and cons both ways, my suggestion is to clarify in the instructions that it's a text field where the provider can write in the license or paste a URL instead. I imagine that seems heretically open to some, but is there another path that is a clear consensus winner that we should adopt instead? @konklone, @joshdata - any thoughts?

konklone commented 10 years ago

If you really have to stick with just one field, then a URL seems like the right thing to ask for. You could also split it out into license_url, license_name, and license_text fields, where at least one of them needs to be filled out. It really just depends what balance of expressiveness versus maintenance POD wants.
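A minimal sketch of the "at least one of them needs to be filled out" rule, in Python. The field names `license_url`, `license_name`, and `license_text` come from the suggestion above; the helper name `has_license_info` is hypothetical, and none of this is part of the actual POD schema.

```python
# Field names as proposed in the comment above; not part of the real schema.
LICENSE_FIELDS = ("license_url", "license_name", "license_text")


def has_license_info(dataset: dict) -> bool:
    """Return True if at least one license field holds a non-blank string."""
    return any(
        isinstance(dataset.get(field), str) and dataset[field].strip()
        for field in LICENSE_FIELDS
    )


# A URL alone satisfies the rule; whitespace-only values do not.
print(has_license_info({"license_url": "https://creativecommons.org/publicdomain/zero/1.0/"}))
print(has_license_info({"license_name": "   "}))
```

The trade-off is the one named above: three fields are more expressive, but every consumer then has to handle all three.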

JoshData commented 10 years ago

I think in the August meeting we said we'd provide (a link to?) some non-exhaustive list of possible values.

Since @waldoj mentioned "terms of use" generally, I had a bit of pause before replying. What if the terms are "by downloading this data you agree to quack like a duck"? If we allow arbitrary terms, we should not include an accessURL for the dataset. If an agency wants to create a click-wrap style agreement, they should be responsible for ensuring that the user actually agrees before getting the data. Because otherwise, who's really going to view the license page to see what they are agreeing to?

IANAL but this seems to differ from a copyright license agreement (or a public domain dedication). In these cases no terms apply until the user makes a copy. (Let's assume that the act of downloading doesn't constitute making a copy.) There are no terms just to download/view/analyze.

So I guess what I'm saying is, the guidance should be specific about this. You can point to a URL containing a copyright license agreement.

haleyvandyck commented 10 years ago

Thanks everyone for the helpful discussion. Sorry for the gap in the guidance here and the accidental "non-example." :flushed: It seems that a relative consensus is forming around the URL approach, which I would agree with. @waldoj, care to take a pass at a pull request clarifying this in the further metadata field guidance? Thanks!

waldoj commented 10 years ago

For starters, I don't see any way around this needing to be a URL. Licenses must be unambiguous, and anything short of an actual copy of the license (or, rather, a pointer to where it can be found) is ambiguous.

CC-BY-SA, the example that I used previously, is Exhibit A for why we need a URL: version 4.0 of the Creative Commons license came out a week or so ago. Which version does that CC-BY-SA refer to? Because they're all different.

The downside of a URL is that it makes it more difficult for machine recognition of the license that's in use. That is, "GPL" is an easily recognized string, if ambiguous. But most major licenses have canonical locations. GPL 2.0? Link to http://www.gnu.org/licenses/old-licenses/gpl-2.0.html. Apache 2.0? Link to http://www.apache.org/licenses/LICENSE-2.0.html. Etc. So I think the problem of machine-readability is well addressed by encouraging the use of canonical license URLs.
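As a rough illustration of machine recognition via canonical URLs, here is a small Python lookup. The two URLs are the ones quoted above; the short labels (`GPL-2.0`, `Apache-2.0`), the table, and the function name `identify_license` are assumptions made for this sketch, not an agreed list.

```python
from typing import Optional
from urllib.parse import urlsplit

# Illustrative table only: keys are host + path of the canonical URLs quoted
# above, values are hypothetical short labels chosen for this example.
CANONICAL_LICENSES = {
    "www.gnu.org/licenses/old-licenses/gpl-2.0.html": "GPL-2.0",
    "www.apache.org/licenses/LICENSE-2.0.html": "Apache-2.0",
}


def identify_license(url: str) -> Optional[str]:
    """Map a license URL to a short label, ignoring scheme and host case."""
    parts = urlsplit(url.strip())
    key = parts.netloc.lower() + parts.path
    return CANONICAL_LICENSES.get(key)


print(identify_license("http://www.gnu.org/licenses/old-licenses/gpl-2.0.html"))
print(identify_license("http://example.com/our-terms"))  # unknown URL -> None
```

Any URL outside the table comes back as `None`, which is where a non-standard license or terms-of-use page would land.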

waldoj commented 10 years ago

If we accept that we need to use URLs, then that solves the issue raised by this ticket, because a copyrighted dataset can have a URL that points to the actual license. (Of course, this necessitates that the license be available online, but that's going to be functionally true no matter what—simply writing "copyrighted" in this field, if we didn't allow URLs, would raise many questions but answer none, making the data useless to people.) But it does raise a new problem of what's to be done with work that has no license—that is, it's in the public domain. What is the canonical URL for that? Should the field be defined but left blank (the embodiment of a non-license), or is that in practice going to be ambiguous (I think so)?

Would it be inappropriate to encourage that non-standard licenses (i.e., copyrighted datasets) be marked up with Creative Commons Rights Expression Language RDF? That would make it possible to harvest license data that would otherwise be a black box.

If anybody has any feelings about any of the points I've raised in this pair of comments, please don't be shy. Otherwise, after I let this sit for a couple of weeks, I'm inclined to start filing pull requests to act on some of these ideas, and I want to make sure that every perspective has had a fair airing first.

raking08 commented 10 years ago

Hello, I am new to this discussion, as I just started supporting the CDC in this effort. However, I come from a pretty deep private-sector data governance and data quality background, so I tend to view things through that lens. My comment is that a blank is itself ambiguous. A blank really means that the field has not been populated with a value. That might be because the data was evaluated and there is no license (as in common rights), but more often it will mean that the data steward did not populate the field (didn't know, wasn't available at the time, the URL moved...). As I have looked at the available data from the three main open data sites, I am seeing this blank more often than not. So I would recommend that the URL be used, of course, but I also think there should be commonly defined terms for the Creative rights, "not applicable", "unknown", and so on, so that it's clear that a blank simply means not populated. Then it's possible to query the metadata set for blanks and send them to the data stewards for correct population. They may still populate the field with "unknown", but at least you can get them to own that decision. We have about 400 data stewards, so managing this set for quality on an ongoing basis (not just when initially created) is a challenge. Thoughts?
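The "query for blanks" step could look roughly like the Python below. It assumes the catalog has already been loaded as a list of dataset dictionaries with `identifier` and `license` keys; the function name `datasets_missing_license` and the sample records are made up for illustration.

```python
def datasets_missing_license(datasets):
    """Return identifiers of datasets whose license is absent, null, or blank."""
    missing = []
    for dataset in datasets:
        value = dataset.get("license")
        if value is None or not str(value).strip():
            missing.append(dataset.get("identifier", "<no identifier>"))
    return missing


# Hypothetical sample records, not real catalog entries.
catalog = [
    {"identifier": "ds-1", "license": "https://creativecommons.org/publicdomain/zero/1.0/"},
    {"identifier": "ds-2", "license": ""},
    {"identifier": "ds-3"},
    {"identifier": "ds-4", "license": "unknown"},
]
print(datasets_missing_license(catalog))
```

An explicit value such as "unknown" is deliberately not reported here: it records a steward's decision, whereas a blank records nothing.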

JoshData commented 10 years ago

Agreed with @raking08.

CCREL is interesting, but I think we'd need to write more guidance on how to actually do that. It's something that I think would appear on the target URL and not in the data.json file itself, so we could add that suggestion later on without affecting the POD schema.

Back to @waldoj's question about non-licensed federal government data. We usually talk about this data as being in the public domain, but it's only in the domestic public domain. Calling it a 17 USC § 105 exemption would be more accurate. So I think we have two options:

1) Have one special case where the value of the field is not a URL, e.g. exempt_by_17USC105
2) Make up a URL to use in this special case, e.g. tag:whitehouse.gov,2013-12-04:ostp/pod/copyright-status/exempt_by_17USC105

I'm agnostic between the choices. I have a feeling @gbinal would find the first easier to have implemented?

konklone commented 10 years ago

Project Open Data could also make a URL whose sole purpose is to describe the state of non-licensed federal government data -- http://project-open-data.github.io/us-data-license or something. It could just contain the text of 17 USC 105, with a human readable description accompanying it, and a more attractive/readable presentation than others I could find.

Even separately from improving the data.json schema, this would be a nice permalink for the country to have.

JoshData commented 10 years ago

I'd be a little hesitant to coin URLs on a github domain though.

waldoj commented 10 years ago

Agreed, Josh.

konklone commented 10 years ago

That is exactly what inspired me to file #216!

gbinal commented 9 years ago

There seems to be broad consensus around this field being a URL. In the above commit, I'm proposing that the field be clarified to be a string 'URL' and link to the list of example open data licenses as potential options. I think this would address this issue specifically and would suggest a new issue expanding the list of licenses (along with their authoritative URLs) on http://project-open-data.github.io/license-examples/

ajturner commented 9 years ago

I'm interested in continuing this conversation. A URL is a good first step - however it's unclear what should be at the other side of that URL.

As a developer, when enumerating over data, I want to have structured license information that enables me to understand the capabilities this license permits or restricts.

So can the license point to a JSON such as:

{
  "title": "CC0",
  "version": "1.0",
  "description": "The person who associated a work with this deed has dedicated the work to the public domain by waiving all of his or her rights to the work worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.\n\nYou can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission.",
  "capabilities": {
    "commercial_use": true,
    "attribution_required": false,
    "allows_redistribution": true,
    "share_modifications": false
  },
  "logo":  "https://creativecommons.org/images/deed/nolaw.png",
  "link": "https://creativecommons.org/publicdomain/zero/1.0/"
}

does something like that exist?

waldoj commented 9 years ago

Huh. That's such a totally obvious-in-retrospect idea that it seems like it has to exist. And yet I've never encountered it.

konklone commented 9 years ago

The package manifest in Node-land, package.json, has a license field that will technically accept any text, but asks for "SPDX" identifiers. It's the only thing I've seen in the wild like this.

philipashlock commented 9 years ago

@ajturner @waldoj @konklone Actually, @aaronsw's spirit lives on throughout the web in part because he helped establish the licensing metadata standard for this (as a teenager, mind you), and it's now part of the foundation of Creative Commons and already widely in use. @waldoj, as I understand it, this is the CCREL approach you and @JoshData were referring to. Given the era this came from, it was originally done as RDF and is now RDFa instead of JSON, but it appears embedded in the pages found at the URLs we use for Creative Commons License Deeds. For example, the CC-BY Deed includes the following RDFa:

   <h3 resource="http://creativecommons.org/ns#Reproduction"
        rel="cc:permits">You are free to:</h3>
    <ul class="license-properties">
      <li class="license share"
          rel="cc:permits"
          resource="http://creativecommons.org/ns#Distribution">
        <strong>Share</strong>  &mdash; copy and redistribute the material in any medium or format
      </li>
        <li class="license remix"
            rel="cc:permits"
            resource="http://creativecommons.org/ns#DerivativeWorks">
          <strong>Adapt</strong>  &mdash; remix, transform, and build upon the material
        </li>
        <li class="license commercial">
          for any purpose, even commercially.
        </li>
      <li id="more-container"
          class="license-hidden">
        <span id="devnations-container" />
      </li>
    </ul>
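For a sense of what parsing this out of the HTML involves, here is a small Python sketch that collects the `resource` of every element carrying `rel="cc:permits"`, as in the fragment above. It uses only the standard-library `html.parser` and is not a real RDFa processor; the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class PermitsParser(HTMLParser):
    """Collect the resource attribute of every rel="cc:permits" element."""

    def __init__(self):
        super().__init__()
        self.permits = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if attributes.get("rel") == "cc:permits" and "resource" in attributes:
            self.permits.append(attributes["resource"])


def extract_permits(html: str) -> list:
    """Return the permitted-action URIs found in an HTML fragment."""
    parser = PermitsParser()
    parser.feed(html)
    parser.close()
    return parser.permits


# Shortened version of the deed fragment quoted above.
snippet = """
<h3 resource="http://creativecommons.org/ns#Reproduction" rel="cc:permits">You are free to:</h3>
<li rel="cc:permits" resource="http://creativecommons.org/ns#Distribution">Share</li>
<li class="license commercial">for any purpose, even commercially.</li>
"""
print(extract_permits(snippet))
```

A proper RDFa parser would also resolve the `cc:` prefix and handle nesting, which this attribute match ignores.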

That said, it would be nice to not have to parse the RDFa out of the HTML here - and not to even have to parse the HTML in the first place. Within the HTML there is also a link to the raw RDF XML of the deed:

<link rel="alternate" type="application/rdf+xml" href="rdf" />

In context, that href resolves to https://creativecommons.org/licenses/by/4.0/rdf but ideally this would be provided in alternate formats including JSON or JSON-LD and ideally could be requested directly with HTTP using Content Negotiation based on an Accept header rather than parsed out of HTML, i.e. I'd want the response from curl -I https://creativecommons.org/licenses/by/4.0/ to include:

Accept: application/rdf+xml, application/ld+json
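A sketch of what such a request might look like from the client side, in Python. In HTTP content negotiation the client sends the `Accept` header and the server reports the format it chose in `Content-Type`. Whether creativecommons.org honors this header is not established here, so the fetch is defined but not called; the function names are hypothetical.

```python
from urllib.request import Request, urlopen


def build_license_request(url, accept="application/ld+json, application/rdf+xml"):
    """Build a GET request that asks for machine-readable license metadata."""
    return Request(url, headers={"Accept": accept})


def fetch_license_metadata(url):
    """Send the request and return (Content-Type, body).

    Assumption: the server may ignore Accept and return HTML anyway,
    so callers should check the returned Content-Type.
    """
    with urlopen(build_license_request(url), timeout=10) as response:
        return response.headers.get("Content-Type"), response.read()


request = build_license_request("https://creativecommons.org/licenses/by/4.0/")
print(request.get_header("Accept"))
```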

As @JoshData said, I think this is still pretty obscure and non-obvious, so guidance would be needed, but in practice I think most of these URLs will be used more as commonly understood unique identifiers and human readable landing pages rather than a bespoke set of conditions that need to be parsed.

Of course, that thought implies that we actually would have a canonical URL for something like "U.S. Public Domain." At this point, I think we should have something like that (in addition to guidance on using alternatives like CC0 for expanding use internationally) and that it should function in a very similar manner as the Creative Commons Public Domain Mark, but be specific and limited to the U.S. jurisdiction as is the case with U.S. Government Works (Title 17 § 105) and other reasons like Title 37 § 202.1, Title 17 § 1302, or expired copyrights (Title 17 § 303). This should have a simple human readable page like Creative Commons deeds, include machine readable rights, and also link to the full legal documentation.

I don't think this URL should live on the Project Open Data site since this isn't specific to data, but as we look to establish this somewhere we may need to provide an interim URL such as http://www.usa.gov/copyright.shtml or something on project-open-data.cio.gov

I'm cc'ing @peterspdx here in reference to https://github.com/creativecommons/Localized-Public-Domain-Mark

waldoj commented 9 years ago

Well, this is delightful. :) Thanks for a very informative and engaging explanation, @philipashlock!

konklone commented 9 years ago

Wow. A whole bunch of great stuff there, @philipashlock. I had no idea about license metadata, or what Creative Commons' plans for a version 2 are.

Also https://github.com/creativecommons/Localized-Public-Domain-Mark looks like exactly what we'd want (though it's new and not finalized)?

cc @tvol @JoshData

tvol commented 9 years ago

@konklone @philipashlock @JoshData yeah I believe that's in more of an idea stage right now. Diane Peters (@peterspdx) is on point.

rebeccawilliams commented 9 years ago

ICYMI, I wanted to flag the recent licensing guidance updates, created with #454 + #456.

You'll notice that a new section for U.S. Government works has been added, highlighting the importance of a worldwide public domain dedication (CC-0) and providing a default URL for U.S. Government works: http://www.usa.gov/publicdomain/label/1.0/

Interestingly, I learned today that the White House's Flickr page has separately and of their own accord been referencing the same U.S. Government Works page, see this example. Great minds point to a common temporary solution!

The content of the U.S. Government Works page is a work in progress and will ideally incorporate a Creative Commons-like solution to a U.S. specific version of the Public Domain Mark or to an equivalent legal/human-readable/machine-readable solution.

akuckartz commented 9 years ago

One minor point: RDFa is an RDF serialization. But providing a JSON-LD alternative would help.

Maybe someone can add information on the CC Wiki. There is a page on RDFa and RDF/XML material is available but no info about JSON-LD.

prototypo commented 9 years ago

@philipashlock, I would like to expound a bit on @akuckartz's comment regarding RDF. You said, "originally done as RDF and is now RDFa instead of JSON", and @akuckartz rightly pointed out that RDFa is an RDF serialization.

RDF is a format-independent data model. There are, as of last year, 7 standardized syntaxes for that data model, including a JSON one known as JSON-LD. Please see http://json-ld.org/ for more information, or http://www.w3.org/TR/json-ld/ for the standard.

A more accurate way to phrase your statement would have been something like, "was written in the RDF data model and serialized as RDF/XML. We all know that RDF/XML sucked, so is now available in other standard RDF formats, including RDFa, which is easily convertible using widely available tooling to JSON-LD. That gets us where we need to go." :-)

There is no reason to jettison the work done in describing or representing licenses on the Web because you want to use JSON. JSON-LD is JSON. That allows you to make use of the Creative Commons work, and still integrate easily into existing Web tools.
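To make "JSON-LD is JSON" concrete, the Python below loads a small JSON-LD document with the ordinary `json` module. The document is hand-written for this example using the `cc:permits` and `cc:requires` terms from the Creative Commons namespace; it is not a copy of anything Creative Commons publishes, and the helper name `permitted_actions` is hypothetical.

```python
import json

# Hand-written illustration, not an official Creative Commons document.
LICENSE_JSONLD = """
{
  "@context": {"cc": "http://creativecommons.org/ns#"},
  "@id": "https://creativecommons.org/licenses/by/4.0/",
  "cc:permits": [
    {"@id": "cc:Reproduction"},
    {"@id": "cc:Distribution"},
    {"@id": "cc:DerivativeWorks"}
  ],
  "cc:requires": [
    {"@id": "cc:Attribution"},
    {"@id": "cc:Notice"}
  ]
}
"""


def permitted_actions(doc: dict) -> list:
    """Return the compact identifiers listed under cc:permits."""
    return [item["@id"] for item in doc.get("cc:permits", [])]


license_doc = json.loads(LICENSE_JSONLD)
print(permitted_actions(license_doc))
```

Plain JSON parsing leaves the compact `cc:` identifiers as they are; a JSON-LD processor would expand them against the `@context`.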

Please don't reinvent the wheel on licensing for the Web. The CC folks have put a lot of very valuable thought into this, as have other parts of the Web community.