Design minimal data structure

pombredanne commented 5 years ago

This is a continuation of https://github.com/spdx/tools-python/issues/106

pombredanne commented 5 years ago

@sschuberth @nishakm Here is a suggested structure.

Given these:

a license string and/or structured data snippet (to account for old npm styles and Maven structures) as found in a package manifest.
a package manager type (e.g. a Package URL type)

Then we map to:

a SPDX license expression
some indication of confidence (say between 0 and 100) for the accuracy of this mapping, defaulting to 100.
some optional notes

And here would be an example using a YAML serialization:

- package_type: npm
  declared_license:
    license: MIT
  license_expression: MIT

- package_type: maven
  declared_license:
    name: Apache License, Version 2.0
    url: http://www.apache.org/licenses/LICENSE-2.0
  license_expression: Apache-2.0
  notes: See https://repo1.maven.org/maven2/org/springframework/hateoas/spring-hateoas/0.15.0.RELEASE/spring-hateoas-0.15.0.RELEASE.pom  

- package_type: pypi
  declared_license:
    license: http://creativecommons.org/publicdomain/zero/1.0/
    classifiers: 'License :: CC0 1.0 Universal (CC0 1.0) Public Domain Dedication'
  license_expression: CC0-1.0
  notes: See https://github.com/dchest/pyblake2/blob/master/setup.py for example

- package_type: pypi
  declared_license:
    classifiers: 'License :: CC0 1.0 Universal (CC0 1.0) Public Domain Dedication'
  license_expression: CC0-1.0
  notes: See https://github.com/dchest/pyblake2/blob/master/setup.py for example

- package_type: npm
  declared_license:
    license: bsd
  license_expression: BSD-3-Clause
  confidence: 80
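
As a rough illustration only, one parsed list item from the YAML above could be held in Python like this. The class name `LicenseMapping` and the helper `from_record` are hypothetical; the field names and the default confidence of 100 come from the proposal, and the range check is an assumption based on "say between 0 and 100".

```python
from dataclasses import dataclass
from typing import Optional, Union


# Hypothetical record type; the field names follow the YAML keys above.
@dataclass(frozen=True)
class LicenseMapping:
    package_type: str                   # e.g. a Package URL type such as "npm"
    declared_license: Union[str, dict]  # string or structured manifest snippet
    license_expression: str             # SPDX license expression
    confidence: int = 100               # defaults to 100 when omitted
    notes: Optional[str] = None

    def __post_init__(self):
        # Assumption: confidence is an integer in the closed range 0..100.
        if not 0 <= self.confidence <= 100:
            raise ValueError("confidence must be between 0 and 100")


def from_record(record: dict) -> LicenseMapping:
    """Build a LicenseMapping from one parsed YAML list item (a plain dict)."""
    return LicenseMapping(
        package_type=record["package_type"],
        declared_license=record["declared_license"],
        license_expression=record["license_expression"],
        confidence=record.get("confidence", 100),
        notes=record.get("notes"),
    )


mit = from_record({"package_type": "npm",
                   "declared_license": {"license": "MIT"},
                   "license_expression": "MIT"})
bsd = from_record({"package_type": "npm",
                   "declared_license": {"license": "bsd"},
                   "license_expression": "BSD-3-Clause",
                   "confidence": 80})
```

The point of the default is that `confidence` is always available to a consumer without being repeated in the data.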

nishakm commented 5 years ago

@pombredanne where does the confidence number come from? Is it something vetted by the legal community somewhere? It would be nice to include those numbers for all of the package_type and license combinations.

pombredanne commented 5 years ago

Another example for http://central.maven.org/maven2/com/sun/xsom/xsom/20100725/xsom-20100725.pom (from https://github.com/spdx/tools-python/issues/106#issuecomment-499911725 )

- package_type: maven
  declared_license: 
    license:
      name: CDDL v1.0 / GPL v2 dual license
      url: https://glassfish.dev.java.net/nonav/public/CDDL+GPL.html
  license_expression: CDDL-1.0 OR GPL-2.0-only

- package_type: maven
  declared_license: 
    license:
      name: CDDL v1.1 / GPL v2 dual license
      url: https://glassfish.java.net/public/CDDL+GPL_1_1.html
  license_expression: CDDL-1.1 OR GPL-2.0-only

pombredanne commented 5 years ago

@nishakm re: https://github.com/spdx/package-licenses-mapping/issues/1#issuecomment-535037645

where does the confidence number come from? Is it something vetted by the legal community somewhere?

That's just a rough evaluation provided by someone contributing a data point. I am fine to have legal folks validating this if they like, but I would not want this to be a gating item.

It would be nice to include those numbers for all of the package_type and license combinations.

As suggested here, it would default to 100 if not provided, so it would be always "there" yet would not need to be repeated if this has the default value.

pombredanne commented 5 years ago

BTW this brings up the issue of things that do not exist in SPDX such as Public domain, proprietary licenses, etc... all things that do exist in the wild in package manifest declarations.

pombredanne commented 5 years ago

Also of note:

I would consider the case, spacing and punctuation of the declared license data to have no significance (with the possible exception of the + sign).
We may need to consider the Package URL type and namespace in some cases. For instance, for RPM distros the conventions used in Fedora and SUSE are different and may not overlap gracefully... TBD, we can wait to add this when the problem comes up.
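
A minimal sketch of the normalization described in the first note, assuming ASCII input. The function name `normalize_declared` is hypothetical, and keeping `+` while dropping every other punctuation character is one possible reading of "no significance".

```python
import re


def normalize_declared(text: str) -> str:
    """Lowercase, drop punctuation except '+', and collapse whitespace."""
    text = text.lower()
    # Replace anything that is not a letter, digit, '+' or whitespace
    # with a space (assumption: non-ASCII characters are dropped too).
    text = re.sub(r"[^a-z0-9+\s]", " ", text)
    return " ".join(text.split())
```

With this, "Apache License, Version 2.0" and "apache license version 2.0" compare equal after normalization, while "GPL-2" and "GPL-2+" stay distinct.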

sschuberth commented 5 years ago

some indication of confidence (say between 0 and 100) for the accuracy of this mapping

I'm actually against a confidence and would only add uncontroversial mappings. It's generally too opaque how such a confidence level is calculated, and how to determine a suitable threshold for your particular use-case.

goneall commented 5 years ago

Any issue with using JSON rather than YAML?

If we are intending this to be primarily machine read, JSON is reported to have a higher adoption rate as a serialization format (reference https://twobithistory.org/2017/09/21/the-rise-and-rise-of-json.html), while YAML has the advantage of being more human readable.

If this was intended to be primarily human read and/or written (e.g. like a configuration file), I would agree with YAML. I think this will be primarily read by tools so JSON may be a better choice.

BTW - I don't feel strongly - I can use either format in the Java tooling, just throwing this out there before we lock down the format.

goneall commented 5 years ago

BTW this brings up the issue of things that do not exist in SPDX such as Public domain, proprietary licenses, etc... all things that do exist in the wild in package manifest declarations.

We could add "local licenses" for each section into the document using the same terms as section 6.1 in the spec using LicenseRef-[ID] and LicenseText.

If there is interest in this approach, I'll see if I can come up with an example to add.

pombredanne commented 5 years ago

@goneall re:

If this was intended to be primarily human read and/or written (e.g. like a configuration file), I would agree with YAML. I think this will be primarily read by tools so JSON may be a better choice.

I do not care too much about one or the other. That's just a data definition so we can use either one

pombredanne commented 5 years ago

@goneall re:

We could add "local licenses" for each section into the document using the same terms as section 6.1 in the spec using LicenseRef-[ID] and LicenseText.

If there is interest in this approach, I'll see if I can come up with an example to add.

That would help :+1:

goneall commented 5 years ago

From the SPDX call on 8 Oct, YAML is the preferred format due to it being more human readable and writable.

goneall commented 5 years ago

We could add "local licenses" for each section into the document using the same terms as section 6.1 in the spec using LicenseRef-[ID] and LicenseText.

Here's an example. Say we have a Maven POM file with the following license element:

<licenses>
  <license>
    <name>Android Software Development Kit License</name>
    <url>https://developer.android.com/studio/terms.html</url>
    <distribution>repo</distribution>
    <comments>This is non open source license</comments>
  </license>
</licenses>

The resultant mapping would be:

- package_type: maven
  declared_license:
    name: Android Software Development Kit License
    url: https://developer.android.com/studio/terms.html
  license_expression: LicenseRef-AndroidSDK
  notes: See https://maven.google.com/com/google/android/gms/play-services/12.0.0/play-services-12.0.0.pom
  local_licenses:
    - LicenseRef-AndroidSDK:
          ExtractedText: >
             This is the Android Software Development Kit License Agreement ...
          LicenseName: Android Software Development Kit License
          LicenseCrossReference: https://developer.android.com/studio/terms.html
          LicenseComment: This is non open source license
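
One consistency check such a structure would allow, sketched in Python under assumptions: every `LicenseRef-` identifier used in `license_expression` should be defined under `local_licenses`. The function name `undefined_refs` is hypothetical; the character set in the pattern (letters, digits, "." and "-") follows the SPDX identifier syntax.

```python
import re

# LicenseRef identifiers: "LicenseRef-" followed by letters, digits, "." or "-".
LICENSE_REF = re.compile(r"LicenseRef-[A-Za-z0-9.\-]+")


def undefined_refs(entry: dict) -> list:
    """Return LicenseRef ids used in the expression but not defined locally."""
    used = set(LICENSE_REF.findall(entry["license_expression"]))
    defined = set()
    for item in entry.get("local_licenses", []):
        # Each list item is a one-key mapping: {LicenseRef-id: {...fields...}}
        defined.update(item.keys())
    return sorted(used - defined)


entry = {
    "package_type": "maven",
    "license_expression": "LicenseRef-AndroidSDK",
    "local_licenses": [
        {"LicenseRef-AndroidSDK": {
            "LicenseName": "Android Software Development Kit License"}},
    ],
}
```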

sschuberth commented 4 years ago

Esp. since this issue is about a minimal data structure to start with, I'd like to propose to drop most fields and esp. not make the mappings package-manager-specific *).

For reference, I like the simplicity of the mapping in @stevespringett's CycloneDX library, which is very similar to ORT's hard-coded mapping. So for me, a simple structure like

- spdx_id: Apache-2.0
  alias_names:
  - Apache 2
  - Apache 2.0
  - Apache 2.0 License
  - Apache Software License, Version 2.0
  - The Apache Software License, Version 2.0
  - Apache License (v2.0)
  - Apache License 2.0
  - Apache License Version 2.0
  - Apache License, Version 2.0
  - Apache Public License 2.0
  - Apache Software License - Version 2.0
  - The Apache License, Version 2.0

would be sufficient to start with.

*) While I acknowledge that there are package-manager-specific syntaxes like the license classifiers for Python, my thinking is that we should rather require users of the mappings to strip package-manager-specific stuff (like License :: OSI Approved ::) before applying the mapping, and keep the mapping itself generic.
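
For illustration, the alias structure above could be consumed roughly as follows. This is a sketch, not how CycloneDX or ORT implement it: `build_index` and `resolve` are hypothetical names, only a subset of the alias names is shown, and case-insensitive matching is an assumption.

```python
# Hypothetical in-memory form of the alias list above (subset of the names).
ALIASES = [
    {"spdx_id": "Apache-2.0",
     "alias_names": ["Apache 2", "Apache 2.0", "Apache License, Version 2.0"]},
]


def build_index(entries):
    """Invert the list into a case-insensitive alias -> SPDX id dictionary."""
    index = {}
    for entry in entries:
        for name in entry["alias_names"]:
            key = name.casefold()
            # Refuse aliases that point at two different ids.
            if index.setdefault(key, entry["spdx_id"]) != entry["spdx_id"]:
                raise ValueError(f"ambiguous alias: {name!r}")
    return index


INDEX = build_index(ALIASES)


def resolve(declared):
    """Return the SPDX id for a declared license name, or None if unknown."""
    return INDEX.get(declared.strip().casefold())
```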

nishakm commented 4 years ago

From conversation on https://github.com/nexB/scancode-toolkit/issues/1895 @pombredanne asked how to manage 1. General patterns of licenses related to certain ecosystems/package managers 2. Random one-offs seen in the wild.

My gut reaction is to store regexes (e.g. r'\bApache\b.*(2.0|2)' to match all the above licenses including the Python ones). This will not work for made-up licenses that actually mean a certain license, in which case we store those as full strings.

Perhaps there are reasons why regex is a bad idea. I'd like to hear them :)

stevespringett commented 4 years ago

Regex has dialects, some of the most common are Perl and Java. XML Schema also supports regex but it's a subset of what other dialects support. Defining regex that meet the capabilities of the least common denominator may be difficult and would involve a lot of research and testing.

Not to discourage. Point is only to address the reality that regular expression syntax varies.

Another thing to consider is ReDoS. Regular expressions are powerful, but they can be easily misused (maliciously or not), resulting in a denial of service (or at a minimum, performance issues) when processing certain types of expressions. All regexes would need to be evaluated to ensure they are free of patterns leading to a ReDoS scenario.

In addition to the above concerns, I did not pursue regular expressions in the CycloneDX mapping because the text field being processed may contain multiple licenses. For example:

Apache 2 and BSD

I can get a positive match of Apache 2 which resolves to an SPDX license ID. But I wouldn't want to stop there. I would also want to include BSD, but it's unresolved. I don't know which specific BSD license this text is referring to. The CycloneDX mapping approach is to treat the entire string as an unresolved license. Ideally, the result should include a resolved Apache-2.0 license and an unresolved BSD license.
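
The "ideal" result described above (a resolved Apache-2.0 plus an unresolved BSD) could look roughly like this in Python. This is a sketch and not the CycloneDX behaviour: the `KNOWN` table and `resolve_parts` are hypothetical, and splitting on the word "and" is a simplification that would mis-split license names that themselves contain "and".

```python
import re

# Hypothetical alias table; only unambiguous names are listed.
KNOWN = {"apache 2": "Apache-2.0", "mit": "MIT"}


def resolve_parts(declared):
    """Split on the word 'and' and resolve each part independently.

    Returns a (resolved_ids, unresolved_strings) tuple.
    """
    resolved, unresolved = [], []
    for part in re.split(r"\s+and\s+", declared, flags=re.IGNORECASE):
        part = part.strip()
        if not part:
            continue
        spdx_id = KNOWN.get(part.lower())
        if spdx_id:
            resolved.append(spdx_id)
        else:
            unresolved.append(part)
    return resolved, unresolved
```

For the string "Apache 2 and BSD" this keeps "BSD" as an unresolved leftover instead of discarding the Apache match.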

pombredanne commented 4 years ago

@nishakm I second @stevespringett ... I would not want to use any regex for such mappings. My main argument is simplicity and maintainability and the fact that using regex would mean that any implementation has to be based on regex. I do not do regex unless I need to. Let's use plain strings instead. And you are welcome to derive regexes from a set of strings if you feel like it.

That said, there are eventually two levels of mappings:

symbols, e.g. where a string (that may include blanks) maps to a single license key
expressions e.g. where a string maps to a license expression

@stevespringett you wrote:

Ideally, the result should include a resolved Apache-2.0 license and an unresolved BSD license.

This is what scancode does btw for symbols it does not know about ... (because it is using the https://github.com/nexB/license-expression/ library) which may well run in Java FWIW through Jython. Worth a try.

And on the topic of a bare "BSD" word used as a license "id" see also the discussion here https://github.com/nexB/scancode-toolkit/issues/1901 The point is that BSD -> BSD-3-Clause is a safe approach.

nishakm commented 4 years ago

@nishakm I second @stevespringett ... I would not want to use any regex for such mappings. My main argument is simplicity and maintainability and the fact that using regex would mean that any implementation has to be based on regex. I do not do regex unless I need to. Let's use plain strings instead. And you are welcome to derive regexes from a set of strings if you feel like it.

Agreed.

That said, there are eventually two levels of mappings:

symbols, e.g. where a string (that may include blanks) maps to a single license key

expressions e.g. where a string maps to a license expression

@pombredanne Can you give an example to better understand where a symbol or expression would be in a given license from a package manager?

pombredanne commented 4 years ago

@nishakm re:

Can you give an example to better understand where a symbol or expression would be in a given license from a package manager?

An expression: Here MIT/Apache-2.0 would map to MIT OR Apache-2.0 https://github.com/rust-lang/packed_simd/blob/93af7efd9b4011b3dfb626f9cba7915d0dc98179/Cargo.toml#L11
Another expression: GPL-2 | GPL-3 [expanded from: GPL (≥ 2)] mapping to either GPL-2.0-or-later or GPL-2.0 OR GPL-3.0 in https://cran.r-project.org/web/packages/ade4TkGUI/index.html ... and as you can see R is doing some nasty things with licenses
A symbol: https://metadata.ftp-master.debian.org/changelogs//main/a/alsa-tools/alsa-tools_1.1.3-1_copyright where GPL-2+ -> GPL-2.0-or-later
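
The two levels can be sketched as two plain lookup tables, using the examples above as data. The table names, the function `map_declared`, and the choice to try expressions before symbols are all assumptions made for illustration.

```python
# Hypothetical tables built from the examples above.
SYMBOLS = {"GPL-2+": "GPL-2.0-or-later"}               # string -> single license key
EXPRESSIONS = {"MIT/Apache-2.0": "MIT OR Apache-2.0"}  # string -> license expression


def map_declared(declared):
    """Try the expression table first, then the symbol table."""
    declared = declared.strip()
    if declared in EXPRESSIONS:
        return EXPRESSIONS[declared]
    return SYMBOLS.get(declared)
```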

And there is also a third more problematic one where you have a data structure that is package-type specific such as these (which are eventually handled OK in scancode):

get and parse a package manifest
collect the structured data there
apply a package-type-specific license detection that is aware of the data structure of that manifest and of the possible conventions used to structure some expression-like data, and only then and there maybe use some mappings to resolve some ambiguous licensing declarations and/or support some custom parsing of expressions

In the end I am starting to wonder if mappings really would be of any real value outside of a license detection tool (such as ScanCode).

sschuberth commented 4 years ago

That said, there are eventually two levels of mappings:

symbols, e.g. where a string (that may include blanks) maps to a single license key

expressions e.g. where a string maps to a license expression

FYI, that's almost exactly what we're already doing in ORT with our SpdxLicenseAliasMapping (which maps to license IDs) and SpdxDeclaredLicenseMapping (which maps to license expressions).

nishakm commented 4 years ago

That said, there are eventually two levels of mappings:

symbols, e.g. where a string (that may include blanks) maps to a single license key

expressions e.g. where a string maps to a license expression

FYI, that's almost exactly what we're already doing in ORT with our SpdxLicenseAliasMapping (which maps to license IDs) and SpdxDeclaredLicenseMapping (which maps to license expressions).

Indeed! The reason why this repo exists is because I wondered if these two mappings could be converted into yaml files and an accompanying python module for use by anyone looking for a simple license mapping utility.

As for the parsing of the package metadata provided by various package managers, that may or may not be in scope for this project. From this discussion, it turns out most likely not. Personally, I wish it were ;)

sschuberth commented 4 years ago

The reason why this repo exists is because I wondered if these two mappings could be converted into yaml files and an accompanying python module for use by anyone looking for a simple license mapping utility.

I absolutely support that idea, but I'd prefer to really only have the data (i.e. YAML files) in this repo, and put any code using the data (like a Python module or Java library) in different repos, similar to how SPDX license data is separated from SPDX tools.

As for the parsing of the package metadata provided by various package managers, that may or may not be in scope for this project. From this discussion, it turns out most likely not. Personally, I wish it were ;)

I'm not necessarily arguing it should be out of scope of the data stored in this repo. But also here I'd prefer a clean separation of any package-manager-specific mappings from generic mappings, and not by adding meta-data to the mappings themselves about whether they are package manager specific or not, but by having separate mappings in separate YAML files.

I.e. instead of having a mappings.yml with something like

- package_type: python
  declared_license: "License :: OSI Approved :: Apache Software License"
  license_expression: Apache-2.0

I'd prefer to have a python-mappings.yml with something like

- declared_license: "License :: OSI Approved :: Apache Software License"
  license_expression: Apache-2.0

One advantage of this is that the package manager types are not part of the data, i.e. they do not need to be specified and maintained, and it's easier for users to pick only the mappings they want / need.

pombredanne commented 4 years ago

@nishakm you wrote:

As for the parsing of the package metadata provided by various package managers, that may or may not be in scope for this project. From this discussion, it turns out most likely not. Personally, I wish it were ;)

That's already in scancode. I am not sure we want to duplicate scancode here :dancer:

pombredanne commented 4 years ago

@sschuberth re:

I absolutely support that idea, but I'd prefer to really only have the data (i.e. YAML files) in this repo, and put any code using the data (like a Python module or Java library) in different repos, similar to how SPDX license data is separated from SPDX tools.

I second that.

pombredanne commented 4 years ago

@sschuberth re

I.e. instead of having a mappings.yml with something like [...] One advantage of this is that the package manager type are not part of the data, i.e. they do not need to be specified and maintained, and it's easier for user to pick only the mappings they want / need.

What about keeping things simple with a simple list of objects:

-  declared_license: "License :: OSI Approved :: Apache Software License"
   license_expression: Apache-2.0
   package_type: pypi
-  declared_license: The allmitty license
   license_expression: MIT

where the package_type is present or not. And you are welcome to ignore it. As for the package type, we can simply reuse the ones specified for Package URLs.

Some notes:

the number of cases where a package_type will be required is going to be super small, if any, but I am more comfy keeping this as an optional field
In the cases where the license is stored as multiple fields, e.g. structured data, this cannot work, but that's likely OK: these mappings do not have to be perfect. That will be taken care of in code by tools alright (e.g. scancode already handles it).
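
A sketch of how a consumer might use such a list with an optional package_type, in Python. The function `lookup` is hypothetical, and the policy shown (an entry with a package_type only matches that type; generic entries match any type) is one possible reading, since the proposal also allows ignoring the field entirely.

```python
MAPPINGS = [
    {"declared_license": "License :: OSI Approved :: Apache Software License",
     "license_expression": "Apache-2.0",
     "package_type": "pypi"},
    {"declared_license": "The allmitty license",
     "license_expression": "MIT"},
]


def lookup(declared, package_type=None):
    """Prefer an entry specific to package_type, else fall back to a generic one."""
    generic = None
    for entry in MAPPINGS:
        if entry["declared_license"] != declared:
            continue
        entry_type = entry.get("package_type")
        if entry_type is not None and entry_type == package_type:
            return entry["license_expression"]
        if entry_type is None and generic is None:
            generic = entry["license_expression"]
    return generic
```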

sschuberth commented 4 years ago

where the package_type is present or not.

I'm not a fan of omitting optional data. Mostly, because I like to be able to understand the full data structure by looking at any example for such data.

While I'd still prefer to separate package-manager-specific mappings out, another compromise could be to introduce a list of applicable_package_types: If that list is empty, it is a generic mapping, otherwise it lists the types of packages it is specific to. That also has the minor advantage that mappings could be shared across different package manager types, e.g. if two package managers happen to use the same alias for a license.
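
The applicable_package_types compromise could be read like this; a sketch only, with hypothetical helper names `applies_to` and `candidates`, and sample entries borrowed from earlier in this thread.

```python
MAPPINGS = [
    {"declared_license": "The allmitty license",
     "license_expression": "MIT",
     "applicable_package_types": []},        # empty list: generic mapping
    {"declared_license": "License :: OSI Approved :: Apache Software License",
     "license_expression": "Apache-2.0",
     "applicable_package_types": ["pypi"]},  # specific to the listed types
]


def applies_to(entry, package_type):
    """An empty list means the mapping applies to every package type."""
    types = entry["applicable_package_types"]
    return not types or package_type in types


def candidates(package_type):
    """License expressions of all mappings usable for the given package type."""
    return [e["license_expression"] for e in MAPPINGS if applies_to(e, package_type)]
```

Because the field is always present, every entry shows the full data structure, which is the concern raised above about omitted optional data.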

sschuberth commented 4 years ago

In the cases where the license is stored as multiple fields e.g. structured data, this cannot work

True. And as we need pre-processing code in such cases anyway, that brings me back to saying any package-manager-specific license alias should require pre-processing so that the generic mappings could be used, instead of having package-manager-specific mappings. Then at least all package-manager-specific stuff would be handled in the same way.

nishakm commented 4 years ago

@sschuberth @pombredanne so have we decided that this project is just a port of ORT's SpdxLicenseAliasMapping (which maps to license IDs) and SpdxDeclaredLicenseMapping (which maps to license expressions) with tests around formatting and downstream tools can deal with the different package managers' schema?

pombredanne commented 4 years ago

@sschuberth re:

And as we need pre-processing code in such cases anyway

I do not see that as pre-processing but rather more complex mappings where a data structure as a whole maps to a license expression. I do not see how some pre-processing could simplify that case.

@nishakm I do not think we have an agreement yet.

My best future-proof take would be that we have a single list of mappings where:

the key is either a string or some data structure (e.g. list or object, etc)
the mapped value is a license expression string
there is an optional package_type field if a mapping is specific to a package

A next best would not be future proof and be this way:

the key is a string
the mapped value is a license expression string
there is an optional package_type field if a mapping is specific to a package

A degraded option would be this way:

the key is a string
the mapped value is a license expression string

In all cases, having multiple lists does not make sense to me, especially if we have each file named after a package type: this means putting a package type schema field in a file name. Having meaning and a data field in a file name is a sure source of problems IMHO. And the first proposed solution also supports the two other cases.
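
For the first option, where the key may be a string or a data structure, one way a tool could index such a list is to serialize each key into a canonical string. This is only a sketch: `canonical_key` and `find` are hypothetical names, and using sorted-key JSON as the canonical form is an assumption, not part of the proposal.

```python
import json


def canonical_key(declared):
    """Serialize a string, list or dict into a stable, hashable lookup key."""
    return json.dumps(declared, sort_keys=True, separators=(",", ":"))


MAPPINGS = [
    {"declared_license": "The allmitty license", "license_expression": "MIT"},
    {"declared_license": {"name": "Apache License, Version 2.0",
                          "url": "http://www.apache.org/licenses/LICENSE-2.0"},
     "license_expression": "Apache-2.0",
     "package_type": "maven"},
]
INDEX = {canonical_key(m["declared_license"]): m for m in MAPPINGS}


def find(declared):
    """Return the license expression for a string or structured key, or None."""
    entry = INDEX.get(canonical_key(declared))
    return entry["license_expression"] if entry else None
```

The same index serves plain-string keys, which is how the first option also covers the two simpler ones.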

sschuberth commented 4 years ago

And as we need pre-processing code in such cases anyway

I do not see that as pre-processing but rather more complex mappings where a data structure as a whole maps to a license expression. I do not see how some pre-processing could simplify that case.

Let me give you an example: The classifiers field in the meta-data for a Python package provided by PyPI may contain a string like

"License :: OSI Approved :: GNU Library or Lesser General Public License (LGPL)"

Instead of having a package-manager specific mapping from

"License :: OSI Approved :: GNU Library or Lesser General Public License (LGPL)"

to

LGPL-2.0-or-later

we should have package-manager specific pre-processing that strips the "License :: OSI Approved :: " part and only have a generic mapping from

"GNU Library or Lesser General Public License (LGPL)"

to

LGPL-2.0-or-later

This saves us from also maintaining hard-coded package-manager-specific mappings for all kinds of variants, e.g. when the package maintainer forgets to add the "OSI Approved" part. With mappings, we would also need to have a mapping from

"License :: GNU Library or Lesser General Public License (LGPL)"

to

LGPL-2.0-or-later

in that case, whereas with pre-processing, the code can be so generic to cover that case.

So that's a typical example where some simple package-manager-specific pre-processing can greatly reduce the number of required mappings.
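
The pre-processing idea above can be sketched in Python as follows. The names `strip_classifier` and `map_classifier` are hypothetical, and keeping only the last "::"-separated segment is an assumption that happens to cover both the "OSI Approved" and the shortened variant.

```python
# Generic mapping, independent of any package manager.
GENERIC = {
    "GNU Library or Lesser General Public License (LGPL)": "LGPL-2.0-or-later",
}


def strip_classifier(classifier):
    """Keep only the last '::'-separated segment of a 'License ::' classifier."""
    parts = [p.strip() for p in classifier.split("::")]
    if len(parts) > 1 and parts[0] == "License":
        return parts[-1]
    # Not a license classifier: return the text unchanged apart from trimming.
    return classifier.strip()


def map_classifier(classifier):
    """PyPI-specific pre-processing followed by the generic lookup."""
    return GENERIC.get(strip_classifier(classifier))
```

Both classifier variants from the example then resolve through the single generic entry.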

@nishakm I do not think we have an agreement yet.

I agree that we don't have an agreement yet 😁

@pombredanne, what's the use-case for having "the key is [...] some data structure (e.g. list or object, etc)"? I'm aware that package managers like Maven support declaring a list of licenses, but is that what you mean? If so, what's the advantage of using the whole list as the key, instead of mapping all licenses individually and then combining them to a license expression?

pombredanne commented 3 years ago

FYI, in the end we ended up dropping most mappings we were using in ScanCode. See: https://github.com/nexB/scancode-toolkit/issues/1895#issuecomment-902486183
