schemaorg / suggestions-questions-brainstorming

Suggestions, questions, and brainstorming

SPDX identifiers for licenses? #251

Open stain opened 3 years ago

stain commented 3 years ago

This thread tries to gather existing best practice, or to help establish one, and to hear other views.

license property vs SPDX identifier

https://schema.org/license refers to a CreativeWork or URL, and is useful on all kinds of https://schema.org/CreativeWork beyond documents, e.g. https://schema.org/SoftwareSourceCode and https://schema.org/ImageObject

It is now common best practice in open source software to use SPDX ids (https://spdx.dev/ids/) to identify the license of source code. You may have come across code comments like:

# SPDX-License-Identifier: GPL-2.0-or-later

But http://schema.org/license requires a URL or CreativeWork, so which one should we use? Can we still classify licenses with SPDX identifiers when a specialized license file (with copyright) is linked to? And how do we deal with dual licensing?
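The code comment convention above can be read mechanically before deciding how to express it in schema.org. The following is a minimal Python sketch, not an official SPDX tool: the function name find_spdx_expression is hypothetical, and it assumes the tag sits on a single line and that only the comment terminators */ and --> need stripping.

```python
import re

# Matches the tag anywhere on a line, whatever comment syntax precedes it.
SPDX_TAG = re.compile(r"SPDX-License-Identifier:\s*(.+)")


def find_spdx_expression(source_text):
    """Return the first SPDX license expression declared in source_text, or None.

    Hypothetical helper for illustration only.
    """
    for line in source_text.splitlines():
        match = SPDX_TAG.search(line)
        if match:
            # Drop a trailing comment terminator such as "*/" or "-->".
            return re.sub(r"\s*(\*/|-->)\s*$", "", match.group(1)).strip()
    return None


print(find_spdx_expression("# SPDX-License-Identifier: GPL-2.0-or-later\nimport os\n"))
```

The returned string is an SPDX expression rather than a URL, which is exactly the mismatch with the license property discussed in this thread.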

SPDX intro

https://spdx.org/licenses/ lists known open source licenses. These are great because they avoid confusion such as "What do you mean by 'BSD license': 2-clause, 3-clause or 4-clause?" The unambiguous BSD-3-Clause can be looked up at https://spdx.org/licenses/BSD-3-Clause

SPDX has known licenses expressed as RDF like (simplified):

<http://spdx.org/licenses/GPL-2.0-or-later>
        a                             spdx:License ;
        rdfs:comment                  "This license was released: June 1991. This license identifier refers to the choice to use code under GPL-2.0-or-later (i.e., GPL-2.0 or some later version), as distinguished from use of code under GPL-2.0-only. The license notice (as seen in the Standard License Header field below) states which of these applies the code in the file. The example in the exhibit to the license shows the license notice for the \"or later\" approach." ;
        rdfs:seeAlso                  "https://www.gnu.org/licenses/old-licenses/gpl-2.0-standalone.html" , "https://opensource.org/licenses/GPL-2.0" ;
        spdx:isFsfLibre               "true" ;
        spdx:isOsiApproved            "true" ;
        spdx:licenseId                "GPL-2.0-or-later" ;
        spdx:name                     "GNU General Public License v2.0 or later" .

(This RDF seems to exist only on GitHub. Some microdata is embedded, but it gets the subject wrong.)

Using SPDX URIs as @id

So the simple approach, shown in schemaorg/schemaorg#1928, is to use these URIs, like http://spdx.org/licenses/GPL-2.0-or-later, directly. @njh in https://www.arduinolibraries.info/libraries/arduino-json.json has opted for the https variant instead of http:

{
  "@context": "http://schema.org/",
  "@type": "SoftwareApplication",
  "name": "ArduinoJson",
  "url": "https://arduinojson.org/?utm_source=meta&utm_medium=library.properties",
  "author": {
    "@type": "Person",
    "name": "Benoit Blanchon"
  },
  "license": "https://spdx.org/licenses/MIT"
}

Many URIs

Many of the licenses have their own URIs as well, and then there is the usual http vs https variation, so there are many potential inconsistencies.

For listing/mapping https://opendefinition.org/licenses/api/ has a nice list, but it's custom JSON.

Challenges

The SPDX website is inconsistent with its own RDF: https://spdx.org/licenses/ links to https://spdx.org/licenses/MIT.html (notice https and .html), so I guess many will pick up these alternative URIs. I have also seen the variant @njh uses as the most common one, e.g. we refer to it from https://www.commonwl.org/user_guide/17-metadata/index.html
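A consumer that meets all three spellings (http, https, and the .html page) could fold them into one form before comparing. This is a minimal Python sketch under stated assumptions: the function names are hypothetical, and choosing the https form without .html as the canonical one is an assumption based on the variant reported as most common here, not something SPDX prescribes.

```python
from urllib.parse import urlparse


def spdx_id_from_uri(uri):
    """Return the SPDX short identifier for a spdx.org license URI, or None."""
    parsed = urlparse(uri)
    if parsed.scheme not in ("http", "https") or parsed.netloc != "spdx.org":
        return None
    prefix = "/licenses/"
    if not parsed.path.startswith(prefix):
        return None
    tail = parsed.path[len(prefix):]
    # The website links to the HTML page; strip that suffix if present.
    if tail.endswith(".html"):
        tail = tail[: -len(".html")]
    return tail or None


def canonical_spdx_uri(uri):
    """Map any recognized variant to https://spdx.org/licenses/<id> (assumed canonical form)."""
    spdx_id = spdx_id_from_uri(uri)
    return None if spdx_id is None else "https://spdx.org/licenses/" + spdx_id


for variant in (
    "http://spdx.org/licenses/MIT",
    "https://spdx.org/licenses/MIT",
    "https://spdx.org/licenses/MIT.html",
):
    print(variant, "->", canonical_spdx_uri(variant))
```

URIs hosted elsewhere, such as a license's own home page, return None here; mapping those would need a lookup table, which this sketch does not attempt.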

SPDX identifiers do not just identify a single license; they can also be expressions covering dual licensing, like MIT OR Apache-2.0, or exceptions. Some licenses like https://spdx.org/licenses/BSD-3-Clause are templates requiring a copyright year and copyright holder, so the actual license URL would be a specialized file, say https://github.com/seek4science/seek/blob/master/BSD-LICENSE, which would then not be immediately recognizable as the BSD 3-Clause license.
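To illustrate why an expression cannot be mapped to a single URL, the license identifiers it mentions can be pulled out with a small tokenizer. This is a rough Python sketch, not a full SPDX expression parser: license_ids is a hypothetical name, operator precedence and parentheses are ignored, operators are matched case-insensitively as a convenience, and the identifier following WITH is treated as an exception rather than a license.

```python
import re

OPERATORS = {"AND", "OR", "WITH"}


def license_ids(expression):
    """List the license identifiers mentioned in an SPDX expression, in order.

    Simplified illustration: it flattens the expression and loses its structure.
    """
    tokens = re.findall(r"[A-Za-z0-9.:+-]+", expression)
    ids = []
    skip_next = False
    for token in tokens:
        if skip_next:
            # The token following WITH names an exception, not a license.
            skip_next = False
            continue
        upper = token.upper()
        if upper in OPERATORS:
            skip_next = upper == "WITH"
            continue
        token = token.rstrip("+")  # "AGPL-3.0+" refers to the AGPL-3.0 text
        if token not in ids:
            ids.append(token)
    return ids


print(license_ids("MIT OR Apache-2.0"))
```

Each identifier returned can be turned into a spdx.org URI, but the OR/AND relationship between them is lost, which is the information a plain license URL cannot carry.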

Using identifier from CreativeWork

One way around this could be to use http://schema.org/identifier on an anonymous or local CreativeWork license resource. Setting the SPDX expression directly as the identifier would be easiest, but leaves a bit too much implied:

{ "@id": "workflow.cwl",
  "@type": "SoftwareSourceCode",
  "license": {
      "@id": "https://creativecommons.org/licenses/by/4.0/",
      "@type": "CreativeWork",
      "name": "CC BY 4.0",
      "description": "Creative Commons Attribution 4.0 International License",
      "identifier": "CC-BY-4.0"
    }
}

Using PropertyValue to capture SPDX expressions

More explicitly, using http://schema.org/PropertyValue identifiers we can better include SPDX expressions, even if there is no license file or the license file is a local specialization:

{ "@id": "dual-licensed.py",
  "@type": "SoftwareSourceCode",
  "license": {
      "@type": "CreativeWork",
      "name": "MIT or AGPL 3.0 (or later)",
      "description": "Dual-licensed as MIT or AGPL 3.0",
      "isBasedOn": [
        "https://spdx.org/licenses/MIT",
        "https://spdx.org/licenses/AGPL-3.0-or-later"
      ],
      "identifier": {
          "@type": "PropertyValue",
          "name": "SPDX-License-Identifier",
          "value": "MIT OR AGPL-3.0+",
          "propertyID": "https://spdx.github.io/spdx-spec/appendix-V-using-SPDX-short-identifiers-in-source-files/"
       }
    }
 }

We see that the SPDX expression MIT OR AGPL-3.0+ is captured. I threw in http://schema.org/isBasedOn for good measure, although it duplicates what the SPDX license expression already says, without the expression's flexibility or rigidity.

Here I used https://spdx.github.io/spdx-spec/appendix-V-using-SPDX-short-identifiers-in-source-files/ as the https://schema.org/propertyID as it explains the SPDX expressions well, and instead of just SPDX I used SPDX-License-Identifier as the name to match what they recommend for code comments. (Not sure if propertyID here should be {"@id": "https://spdx.github.io/spdx-spec/appendix-V-using-SPDX-short-identifiers-in-source-files/"} instead.)

This is much more precise - but unfortunately becomes a bit too nested/repetitive when applied to the base case of just using https://spdx.org/licenses/MIT style URIs directly:

{
  "@context": "http://schema.org/",
  "@type": "SoftwareApplication",
  "name": "ArduinoJson",
  "license": {
      "@id": "https://spdx.org/licenses/MIT",
      "@type": "CreativeWork",
      "name": "MIT",
      "identifier": {
          "@type": "PropertyValue",
          "name": "SPDX-License-Identifier",
          "value": "MIT",
          "propertyID": "https://spdx.github.io/spdx-spec/appendix-V-using-SPDX-short-identifiers-in-source-files/"
       }
    }
}
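The repetition in the expanded form could be hidden behind a small helper when generating metadata. The following is a minimal Python sketch of that idea: license_node and its parameters are hypothetical names, the propertyID URL is the one used in the examples above, and building isBasedOn from https://spdx.org/licenses/ URIs is an assumption carried over from the dual-license example.

```python
import json

# propertyID taken from the examples in this thread.
SPDX_PROPERTY_ID = (
    "https://spdx.github.io/spdx-spec/"
    "appendix-V-using-SPDX-short-identifiers-in-source-files/"
)


def license_node(expression, name, based_on_ids=(), license_url=None):
    """Build the expanded CreativeWork + PropertyValue license form.

    Hypothetical helper: expression is the SPDX expression, based_on_ids
    lists SPDX ids to expose via isBasedOn, license_url becomes the @id.
    """
    node = {"@type": "CreativeWork", "name": name}
    if license_url:
        node["@id"] = license_url
    if based_on_ids:
        node["isBasedOn"] = [
            "https://spdx.org/licenses/" + spdx_id for spdx_id in based_on_ids
        ]
    node["identifier"] = {
        "@type": "PropertyValue",
        "name": "SPDX-License-Identifier",
        "value": expression,
        "propertyID": SPDX_PROPERTY_ID,
    }
    return node


doc = {
    "@id": "dual-licensed.py",
    "@type": "SoftwareSourceCode",
    "license": license_node(
        "MIT OR AGPL-3.0+",
        "MIT or AGPL 3.0 (or later)",
        based_on_ids=["MIT", "AGPL-3.0-or-later"],
    ),
}
print(json.dumps(doc, indent=2))
```

For the base case, license_node("MIT", "MIT", license_url="https://spdx.org/licenses/MIT") yields the nested ArduinoJson form, so the verbosity stays in one place rather than in every document.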

Discussion across GitHub

(This section added to lure others in to comment with their views :grin: )

In schemaorg/schemaorg#1928 @njh concludes that https://spdx.org/licenses/MIT should be used directly as @id

In seek4science/seek#456 we tried to explore this further, as we had initially abused license as a text field with an implied SPDX identifier looked up using the https://opendefinition.org/ JSON, and we need to distinguish between "data license" and "software license". That issue suggests the PropertyValue expanded form shown above. Discussions include @fbacall @stuzart @alaninmcr

In radiantearth/stac-spec#378 @mojodna @gkellogg @m-mohr are using the variant https://spdx.org/licenses/MIT.html in JSON-LD

In galaxyproject/galaxy#10408 @jmchilton and @nsoranzo are referencing SPDX from Galaxy workflows, unclear which identifier form (custom YAML?)

In earthcubearchitecture-project418/p418Docs#6, @mbjones suggests (https://github.com/earthcubearchitecture-project418/p418Docs/issues/6#issuecomment-358169081) a PropertyValue approach as above, but less verbose, with propertyID set to the string SPDX, as https://schema.org/propertyID can be either Text or URL.

The Citation File Format (CFF) (custom YAML) uses license_url: https://spdx.org/licenses/MIT and license: "MIT" - see for instance citation-file-format/cff-converter-python#25 by @jspaaks and citation-file-format/citation-file-format#105 with @thomaskrause
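Since the projects listed above use different forms, a consumer of schema.org metadata may want to accept several of them. This is a hedged Python sketch of such a reader: spdx_expression_of is a hypothetical name, it only recognizes the forms mentioned in this thread (plain spdx.org URI with or without .html, PropertyValue named SPDX-License-Identifier or with propertyID "SPDX", and an @id fallback), and it deliberately ignores plain-string identifiers because nothing marks them as SPDX.

```python
SPDX_PREFIXES = ("https://spdx.org/licenses/", "http://spdx.org/licenses/")


def _id_from_uri(uri):
    """Return the SPDX id embedded in a spdx.org license URI, or None."""
    if not isinstance(uri, str):
        return None
    for prefix in SPDX_PREFIXES:
        if uri.startswith(prefix):
            tail = uri[len(prefix):]
            if tail.endswith(".html"):
                tail = tail[: -len(".html")]
            return tail or None
    return None


def spdx_expression_of(license_value):
    """Best-effort SPDX id or expression for a schema.org license value."""
    if isinstance(license_value, str):
        return _id_from_uri(license_value)
    if isinstance(license_value, dict):
        identifier = license_value.get("identifier")
        if isinstance(identifier, dict) and identifier.get("@type") == "PropertyValue":
            is_spdx = (
                identifier.get("name") == "SPDX-License-Identifier"
                or identifier.get("propertyID") == "SPDX"
            )
            if is_spdx and identifier.get("value"):
                return identifier["value"]
        # Fall back to the node's own @id if it is a spdx.org URI.
        return _id_from_uri(license_value.get("@id"))
    return None


print(spdx_expression_of("https://spdx.org/licenses/MIT.html"))
```

A reader like this tolerates the current inconsistency, but it does not remove the need to agree on one recommended form.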

mbjones commented 3 years ago

In the https://science-on-schema.org guidelines for Dataset metadata, we recommend using SPDX URIs from the RDF files: https://github.com/ESIPFed/science-on-schema.org/blob/master/guides/Dataset.md#license

In CodeMeta, which is a schema.org extension for software metadata, we also recommend using SPDX: https://github.com/codemeta/codemeta/issues/67 although the guidelines are not prescriptive.

m-mohr commented 3 years ago

Some quick thoughts:

bact commented 7 months ago

> In the https://science-on-schema.org guidelines for Dataset metadata, we recommend using SPDX URIs from the RDF files: https://github.com/ESIPFed/science-on-schema.org/blob/master/guides/Dataset.md#license
>
> In CodeMeta, which is a schema.org extension for software metadata, we also recommend using SPDX: codemeta/codemeta#67 although the guidelines are not prescriptive.

Just a note from Codemetapy https://github.com/proycon/codemetapy :

"For schema:license, full SPDX URIs are used where possible."