Open pombredanne opened 5 years ago
@sschuberth @nishakm Here is a suggested structure.
Given these:
Then we map to:
And here would be an example using a YAML serialization:
- package_type: npm
  declared_license:
    license: MIT
  license_expression: MIT
- package_type: maven
  declared_license:
    name: Apache License, Version 2.0
    url: http://www.apache.org/licenses/LICENSE-2.0
  license_expression: Apache-2.0
  notes: See https://repo1.maven.org/maven2/org/springframework/hateoas/spring-hateoas/0.15.0.RELEASE/spring-hateoas-0.15.0.RELEASE.pom
- package_type: pypi
  declared_license:
    license: http://creativecommons.org/publicdomain/zero/1.0/
    classifiers: 'License :: CC0 1.0 Universal (CC0 1.0) Public Domain Dedication'
  license_expression: CC0-1.0
  notes: See https://github.com/dchest/pyblake2/blob/master/setup.py for example
- package_type: pypi
  declared_license:
    classifiers: 'License :: CC0 1.0 Universal (CC0 1.0) Public Domain Dedication'
  license_expression: CC0-1.0
  notes: See https://github.com/dchest/pyblake2/blob/master/setup.py for example
- package_type: npm
  declared_license:
    license: bsd
  license_expression: BSD-3-Clause
  confidence: 80
@pombredanne where does the confidence number come from? Is it something vetted by the legal community somewhere? It would be nice to include those numbers for all of the package_type and license combinations.
Another example for http://central.maven.org/maven2/com/sun/xsom/xsom/20100725/xsom-20100725.pom (from https://github.com/spdx/tools-python/issues/106#issuecomment-499911725 )
- package_type: maven
  declared_license:
    license:
      name: CDDL v1.0 / GPL v2 dual license
      url: https://glassfish.dev.java.net/nonav/public/CDDL+GPL.html
  license_expression: CDDL-1.0 OR GPL-2.0-only
- package_type: maven
  declared_license:
    license:
      name: CDDL v1.1 / GPL v2 dual license
      url: https://glassfish.java.net/public/CDDL+GPL_1_1.html
  license_expression: CDDL-1.1 OR GPL-2.0-only
@nishakm re: https://github.com/spdx/package-licenses-mapping/issues/1#issuecomment-535037645
where does the confidence number come from? Is it something vetted by the legal community somewhere?
That's just a rough evaluation provided by someone contributing a data point. I am fine to have legal folks validating this if they like, but I would not want this to be a gating item.
It would be nice to include those numbers for all of the package_type and license combinations.
As suggested here, it would default to 100 if not provided, so it would be always "there" yet would not need to be repeated if this has the default value.
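The default described above could be applied when the data is read. This is only a minimal sketch, assuming each mapping entry has already been parsed into a Python dict; `DEFAULT_CONFIDENCE` and `confidence_of` are hypothetical names, not part of any agreed format.

```python
DEFAULT_CONFIDENCE = 100  # assumed default when an entry omits "confidence"


def confidence_of(entry: dict) -> int:
    """Return the entry's confidence, falling back to the default."""
    return entry.get("confidence", DEFAULT_CONFIDENCE)


# Two entries taken from the YAML example earlier in the thread.
entries = [
    {"package_type": "npm", "declared_license": {"license": "MIT"},
     "license_expression": "MIT"},
    {"package_type": "npm", "declared_license": {"license": "bsd"},
     "license_expression": "BSD-3-Clause", "confidence": 80},
]

for entry in entries:
    print(entry["license_expression"], confidence_of(entry))
```

With this approach the field is always available to consumers without being repeated in the data.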
BTW this brings up the issue of things that do not exist in SPDX such as Public domain, proprietary licenses, etc... all things that do exist in the wild in package manifest declarations.
Also of note:
+ sign)
some indication of confidence (say between 0 and 100) for the accuracy of this mapping
I'm actually against a confidence and would only add uncontroversial mappings. It's generally too opaque how such a confidence level is calculated, and how to determine a suitable threshold for your particular use-case.
Any issue with using JSON rather than YAML?
If we are intending this to be primarily machine read, JSON is reported to have a higher adoption rate as a serialization format. (reference https://twobithistory.org/2017/09/21/the-rise-and-rise-of-json.html) while YAML has an advantage in being more human readable.
If this was intended to be primarily human read and/or written (e.g. like a configuration file), I would agree with YAML. I think this will be primarily read by tools so JSON may be a better choice.
BTW - I don't feel strongly - I can use either format in the Java tooling, just throwing this out there before we lock down the format.
BTW this brings up the issue of things that do not exist in SPDX such as Public domain, proprietary licenses, etc... all things that do exist in the wild in package manifest declarations.
We could add "local licenses" for each section into the document using the same terms as section 6.1 in the spec using LicenseRef-[ID] and LicenseText.
If there is interest in this approach, I'll see if I can come up with an example to add.
@goneall re:
If this was intended to be primarily human read and/or written (e.g. like a configuration file), I would agree with YAML. I think this will be primarily read by tools so JSON may be a better choice.
I do not care too much about one or the other. That's just a data definition so we can use either one
@goneall re:
We could add "local licenses" for each section into the document using the same terms as section 6.1 in the spec using LicenseRef-[ID] and LicenseText.
If there is interest in this approach, I'll see if I can come up with an example to add.
That would help :+1:
From the SPDX call on 8 Oct, YAML is the preferred format due to it being more human readable and writable.
We could add "local licenses" for each section into the document using the same terms as section 6.1 in the spec using LicenseRef-[ID] and LicenseText.
Here's an example. Say we have a Maven POM file with the following license element:
<licenses>
  <license>
    <name>Android Software Development Kit License</name>
    <url>https://developer.android.com/studio/terms.html</url>
    <distribution>repo</distribution>
    <comments>This is non open source license</comments>
  </license>
</licenses>
The resultant mapping would be:
- package_type: maven
  declared_license:
    name: Android Software Development Kit License
    url: https://developer.android.com/studio/terms.html
  license_expression: LicenseRef-AndroidSDK
  notes: See https://maven.google.com/com/google/android/gms/play-services/12.0.0/play-services-12.0.0.pom
  local_licenses:
    - LicenseRef-AndroidSDK:
        ExtractedText: >
          This is the Android Software Development Kit License Agreement ...
        LicenseName: Android Software Development Kit License
        LicenseCrossReference: https://developer.android.com/studio/terms.html
        LicenseComment: This is non open source license
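A consumer of such data might want to check that every LicenseRef used in an expression is actually defined. The following is a rough sketch only: it assumes `local_licenses` lives inside the mapping entry as a list of single-key objects, and `undefined_license_refs` is a hypothetical helper name.

```python
import re

# LicenseRef identifiers: letters, digits, "." and "-" after the prefix.
LICENSE_REF = re.compile(r"LicenseRef-[A-Za-z0-9.\-]+")


def undefined_license_refs(entry: dict) -> set:
    """Return LicenseRef IDs used in the expression but not defined locally."""
    used = set(LICENSE_REF.findall(entry["license_expression"]))
    defined = set()
    for local in entry.get("local_licenses", []):
        defined.update(local.keys())
    return used - defined


entry = {
    "package_type": "maven",
    "license_expression": "LicenseRef-AndroidSDK",
    "local_licenses": [
        {"LicenseRef-AndroidSDK": {
            "LicenseName": "Android Software Development Kit License",
        }},
    ],
}

print(undefined_license_refs(entry))
```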
Esp. since this issue is about a minimal data structure to start with, I'd like to propose to drop most fields and esp. not make the mappings package-manager-specific *).
For reference, I like the simplicity of the mapping in @stevespringett's CycloneDX library, which is very similar to ORT's hard-coded mapping. So for me, a simple structure like
- spdx_id: Apache-2.0
  alias_names:
    - Apache 2
    - Apache 2.0
    - Apache 2.0 License
    - Apache Software License, Version 2.0
    - The Apache Software License, Version 2.0
    - Apache License (v2.0)
    - Apache License 2.0
    - Apache License Version 2.0
    - Apache License, Version 2.0
    - Apache Public License 2.0
    - Apache Software License - Version 2.0
    - The Apache License, Version 2.0
would be sufficient to start with.
*) While I acknowledge that there are package-manager-specific syntaxes like the license classifiers for Python, my thinking is that we should rather require users of the mappings to strip package-manager-specific stuff (like License :: OSI Approved ::) before applying the mapping, and keep the mapping itself generic.
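To illustrate how the simple alias structure above could be consumed, here is a minimal sketch. It assumes the YAML has been parsed into a list of dicts; `build_index` and `lookup` are hypothetical names, and the case-insensitive matching is an assumption of this sketch, not something the proposal specifies.

```python
ALIASES = [
    {"spdx_id": "Apache-2.0",
     "alias_names": ["Apache 2", "Apache 2.0", "Apache License, Version 2.0"]},
]


def build_index(aliases: list) -> dict:
    """Invert the alias lists into a single alias -> SPDX ID dictionary."""
    index = {}
    for item in aliases:
        for name in item["alias_names"]:
            index[name.casefold()] = item["spdx_id"]
    return index


def lookup(index: dict, declared: str):
    """Return the SPDX ID for a declared license string, or None if unknown."""
    return index.get(declared.strip().casefold())


index = build_index(ALIASES)
print(lookup(index, "apache 2.0"))
```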
From conversation on https://github.com/nexB/scancode-toolkit/issues/1895 @pombredanne asked how to manage 1. General patterns of licenses related to certain ecosystems/package managers 2. Random one-offs seen in the wild.
My gut reaction is to store regexes (eg: r'\bApache\b.*(2.0|2)' to match all the above licenses including the Python ones). This will not work for made-up license names that actually mean a certain license, in which case we store them as full strings.
Perhaps there are reasons why regex is a bad idea. I'd like to hear them :)
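For illustration, the pattern quoted above can be tried with Python's `re` module. This is just a quick experiment: the sample strings are partly invented for the sketch and `matches` is a hypothetical helper.

```python
import re

# Pattern copied verbatim from the comment above.
PATTERN = re.compile(r"\bApache\b.*(2.0|2)")


def matches(text: str) -> bool:
    """Report whether the pattern is found anywhere in the text."""
    return PATTERN.search(text) is not None


samples = [
    "Apache License, Version 2.0",
    "License :: OSI Approved :: Apache Software License",
    "Apache Software License 1.1 (2000)",  # invented string with an unrelated "2"
    "MIT",
]

for sample in samples:
    print(sample, "->", matches(sample))
```

Note that the unescaped dot and the bare `2` alternative let the pattern accept any text where a "2" follows "Apache", while the classifier string, which carries no version number, is not matched.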
Regex has dialects, some of the most common are Perl and Java. XML Schema also supports regex but it's a subset of what other dialects support. Defining regex that meet the capabilities of the least common denominator may be difficult and would involve a lot of research and testing.
Not to discourage. Point is only to address the reality that regular expression syntax varies.
Another thing to consider is ReDoS. Regular expressions are powerful, but they can be easily misused (maliciously or not), resulting in a denial of service (or at a minimum, performance issues) when processing certain types of expressions. All regexes would need to be evaluated to ensure they are free of patterns leading to a ReDoS scenario.
In addition to the above concerns, I did not pursue regular expressions in the CycloneDX mapping because the text field being processed may contain multiple licenses. For example:
Apache 2 and BSD
I can get a positive match of Apache 2 which resolves to an SPDX license ID. But I wouldn't want to stop there. I would also want to include BSD, but it's unresolved. I don't know which specific BSD license this text is referring to. The CycloneDX mapping approach is to treat the entire string as an unresolved license. Ideally, the result should include a resolved Apache-2.0 license and an unresolved BSD license.
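The "ideal" result described above could look roughly like the following. This is a naive sketch under stated assumptions: it splits only on the word "and", the `KNOWN` alias table is invented for the example, and `resolve_parts` is a hypothetical name. It is not how CycloneDX behaves today.

```python
import re

KNOWN = {"apache 2": "Apache-2.0", "mit": "MIT"}  # hypothetical alias table


def resolve_parts(declared: str):
    """Split on 'and', then sort the parts into resolved IDs and unresolved text."""
    parts = [p.strip() for p in re.split(r"\s+and\s+", declared, flags=re.IGNORECASE)]
    resolved, unresolved = [], []
    for part in parts:
        spdx_id = KNOWN.get(part.casefold())
        if spdx_id:
            resolved.append(spdx_id)
        else:
            unresolved.append(part)
    return resolved, unresolved


print(resolve_parts("Apache 2 and BSD"))  # (['Apache-2.0'], ['BSD'])
```

A real implementation would need more care, since some license names contain the word "and" themselves.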
@nishakm I second @stevespringett ... I would not want to use any regex for such mappings. My main argument is simplicity and maintainability and the fact that using regex would mean that any implementation has to be based on regex. I do not do regex unless I need to. Let's use plain string instead. And you are welcome to derive regexes from a set of string if you feel like it.
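That said, there are eventually two levels of mappings:
- symbols, e.g. where a string (that may include blanks) maps to a single license key
- expressions e.g. where a string maps to a license expression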
@stevespringett you wrote:
Ideally, the result should include a resolved Apache-2.0 license and an unresolved BSD license.
This is what scancode does btw for symbols it does not know about ... (because it is using the https://github.com/nexB/license-expression/ library) which may well run in Java FWIW through Jython. Worth a try.
And on the topic of a bare "BSD" word used as a license "id" see also the discussion here https://github.com/nexB/scancode-toolkit/issues/1901 The point is that BSD -> BSD-3-Clause is a safe approach.
@nishakm I second @stevespringett ... I would not want to use any regex for such mappings. My main argument is simplicity and maintainability and the fact that using regex would mean that any implementation has to be based on regex. I do not do regex unless I need to. Let's use plain string instead. And you are welcome to derive regexes from a set of string if you feel like it.
Agreed.
That said, there are eventually two levels of mappings:
- symbols, e.g. where a string (that may include blanks) maps to a single license key
- expressions e.g. where a string maps to a license expression
@pombredanne Can you give an example to better understand where a symbol or expression would be in a given license from a package manager?
@nishakm re:
Can you give an example to better understand where a symbol or expression would be in a given license from a package manager?
An expression: here MIT/Apache-2.0 would map to MIT OR Apache-2.0
https://github.com/rust-lang/packed_simd/blob/93af7efd9b4011b3dfb626f9cba7915d0dc98179/Cargo.toml#L11
Another expression: GPL-2 | GPL-3 [expanded from: GPL (≥ 2)] mapping to either GPL-2.0-or-later or GPL-2.0 OR GPL-3.0
in https://cran.r-project.org/web/packages/ade4TkGUI/index.html ... and as you can see R is doing some nasty things with licenses
A symbol: https://metadata.ftp-master.debian.org/changelogs//main/a/alsa-tools/alsa-tools_1.1.3-1_copyright where GPL-2+ -> GPL-2.0-or-later
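The two levels could be kept as two separate tables. Here is a minimal sketch using the examples just given; the table names, the lookup order (symbols first) and `map_declared` are assumptions made for illustration.

```python
# Symbols: a declared string maps to a single license key.
SYMBOLS = {"GPL-2+": "GPL-2.0-or-later"}

# Expressions: a declared string maps to a full license expression.
EXPRESSIONS = {"MIT/Apache-2.0": "MIT OR Apache-2.0"}


def map_declared(declared: str):
    """Try the symbol table first, then the expression table; None if unknown."""
    if declared in SYMBOLS:
        return SYMBOLS[declared]
    return EXPRESSIONS.get(declared)


print(map_declared("GPL-2+"))
print(map_declared("MIT/Apache-2.0"))
```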
And there is also a third more problematic one where you have a data structure that is package-type specific such as these (which are eventually handled OK in scancode):
You eventually need to:
In the end I am starting to wonder if mappings really would be of any real value outside of a license detection tool (such as ScanCode).
That said, there are eventually two levels of mappings:
- symbols, e.g. where a string (that may include blanks) maps to a single license key
- expressions e.g. where a string maps to a license expression
FYI, that's almost exactly what we're already doing in ORT with our SpdxLicenseAliasMapping (which maps to license IDs) and SpdxDeclaredLicenseMapping (which maps to license expressions).
Indeed! The reason why this repo exists is because I wondered if these two mappings could be converted into yaml files and an accompanying python module for use by anyone looking for a simple license mapping utility.
As for the parsing of the package metadata provided by various package managers, that may or may not be in scope for this project. From this discussion, it turns out most likely not. Personally, I wish it were ;)
The reason why this repo exists is because I wondered if these two mappings could be converted into yaml files and an accompanying python module for use by anyone looking for a simple license mapping utility.
I absolutely support that idea, but I'd prefer to really only have the data (i.e. YAML files) in this repo, and put any code using the data (like a Python module or Java library) in different repos, similar to how SPDX license data is separated from SPDX tools.
As for the parsing of the package metadata provided by various package managers, that may or may not be in scope for this project. From this discussion, it turns out most likely not. Personally, I wish it were ;)
I'm not necessarily arguing it should be out of scope of the data stored in this repo. But also here I'd prefer a clean separation of any package-manager-specific mappings from generic mappings, and not by adding meta-data to the mappings themselves about whether they are package manager specific or not, but by having separate mappings in separate YAML files.
I.e. instead of having a mappings.yml with something like
- package_type: python
  declared_license: "License :: OSI Approved :: Apache Software License"
  license_expression: Apache-2.0
I'd prefer to have a python-mappings.yml with something like
- declared_license: "License :: OSI Approved :: Apache Software License"
  license_expression: Apache-2.0
One advantage of this is that the package manager types are not part of the data, i.e. they do not need to be specified and maintained, and it's easier for users to pick only the mappings they want / need.
@nishakm you wrote:
As for the parsing of the package metadata provided by various package managers, that may or may not be in scope for this project. From this discussion, it turns out most likely not. Personally, I wish it were ;)
That's already in scancode. I am not sure we want to duplicate scancode here :dancer:
@sschuberth re:
I absolutely support that idea, but I'd prefer to really only have the data (i.e. YAML files) in this repo, and put any code using the data (like a Python module or Java library) in different repos, similar to how SPDX license data is separated from SPDX tools.
I second that.
@sschuberth re
I.e. instead of having a mappings.yml with something like [...] One advantage of this is that the package manager type are not part of the data, i.e. they do not need to be specified and maintained, and it's easier for user to pick only the mappings they want / need.
What about keeping things simple with a simple list of objects:
- declared_license: "License :: OSI Approved :: Apache Software License"
  license_expression: Apache-2.0
  package_type: pypi
- declared_license: The allmitty license
  license_expression: MIT
where the package_type is present or not. And you are welcome to ignore it. As for the package type, we can simply reuse the ones specified for Package URLs.
Some notes:
package_type will be required is going to be super small if any, but I am more comfy to keep this as an optional field
where the package_type is present or not.
I'm not a fan of omitting optional data. Mostly, because I like to be able to understand the full data structure by looking at any example for such data.
While I'd still prefer to separate package-manager-specific mappings out, another compromise could be to introduce a list of applicable_package_types: If that list is empty, it is a generic mapping, otherwise it lists the types of packages it is specific to. That also has the minor advantage that mappings could be shared across different package manager types, e.g. if two package managers happen to use the same alias for a license.
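The compromise could be sketched as follows. This is only a sketch under assumptions: the entries reuse strings from earlier in the thread, the data is assumed to be already parsed into dicts, and `find_expression` is a hypothetical name.

```python
MAPPINGS = [
    # Empty list: a generic mapping that applies to any package type.
    {"declared_license": "The allmitty license",
     "license_expression": "MIT",
     "applicable_package_types": []},
    # Non-empty list: only applies to the listed package types.
    {"declared_license": "License :: OSI Approved :: Apache Software License",
     "license_expression": "Apache-2.0",
     "applicable_package_types": ["pypi"]},
]


def find_expression(declared: str, package_type: str):
    """Return the first matching expression that is generic or type-specific."""
    for mapping in MAPPINGS:
        types = mapping["applicable_package_types"]
        if mapping["declared_license"] == declared and (not types or package_type in types):
            return mapping["license_expression"]
    return None


print(find_expression("The allmitty license", "npm"))
```

Every entry carries the same fields, so the full data structure is visible from any single example.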
In the cases where the license is stored as multiple fields e.g. structured data, this cannot work
True. And as we need pre-processing code in such cases anyway, that brings me back to saying any package-manager-specific license alias should require pre-processing so that the generic mappings can be used, instead of having package-manager-specific mappings. Then at least all package-manager-specific stuff would be handled in the same way.
@sschuberth @pombredanne so have we decided that this project is just a port of ORT's SpdxLicenseAliasMapping (which maps to license IDs) and SpdxDeclaredLicenseMapping (which maps to license expressions) with tests around formatting and downstream tools can deal with the different package managers' schema?
@sschuberth re:
And as we need pre-processing code in such cases anyway
I do not see that as pre-processing but rather more complex mappings where a data structure as a whole maps to a license expression. I do not see how some pre-processing could simplify that case.
@nishakm I do not think we have an agreement yet.
My best future-proof take would be that we have a single list of mappings where:
- the key is either a string or some data structure (e.g. list or object, etc)
- package_type field if a mapping is specific to a package
A next best would not be future proof and be this way:
- the key is a string
- package_type field if a mapping is specific to a package
A degraded option would be this way:
- the key is a string
In all cases having multiple lists does not make sense to me, especially if we have each file named after a package type: this means putting a package type schema field in a file name. Having meaning and a data field in a file name is a sure source of problems IMHO. And the first proposed solution also supports the two other cases.
And as we need pre-processing code in such cases anyway
I do not see that as pre-processing but rather more complex mappings where a data structure as a whole maps to a license expression. I do not see how some pre-processing could simplify that case.
Let me give you an example: The classifiers field in the meta-data for a Python package provided by PyPI may contain a string like
"License :: OSI Approved :: GNU Library or Lesser General Public License (LGPL)"
Instead of having a package-manager specific mapping from
"License :: OSI Approved :: GNU Library or Lesser General Public License (LGPL)"
to
LGPL-2.0-or-later
we should have package-manager specific pre-processing that strips the "License :: OSI Approved :: " part and only have a generic mapping from
"GNU Library or Lesser General Public License (LGPL)"
to
LGPL-2.0-or-later
This saves us from also maintaining hard-coded package-manager-specific mappings for all kinds of variants, e.g. when the package maintainer forgets to add the "OSI Approved" part. With mappings, we would also need to have a mapping from
"License :: GNU Library or Lesser General Public License (LGPL)"
to
LGPL-2.0-or-later
in that case, whereas with pre-processing, the code can be generic enough to cover that case.
So that's a typical example where some simple package-manager-specific pre-processing can greatly reduce the number of required mappings.
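The pre-processing described above might look like this. It is a minimal sketch, assuming the generic mapping is a plain dict and that keeping only the last "::" segment is acceptable; `strip_classifier_prefix` and `GENERIC` are hypothetical names.

```python
# Generic mapping, free of any package-manager-specific syntax.
GENERIC = {
    "GNU Library or Lesser General Public License (LGPL)": "LGPL-2.0-or-later",
}


def strip_classifier_prefix(classifier: str) -> str:
    """Keep only the last '::' segment of a PyPI license classifier."""
    parts = [part.strip() for part in classifier.split("::")]
    if parts[0] == "License":
        return parts[-1]
    return classifier  # not a license classifier, leave untouched


with_osi = "License :: OSI Approved :: GNU Library or Lesser General Public License (LGPL)"
without_osi = "License :: GNU Library or Lesser General Public License (LGPL)"

for declared in (with_osi, without_osi):
    print(GENERIC.get(strip_classifier_prefix(declared)))
```

Both variants end up at the same generic key, so only one mapping entry is needed.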
@nishakm I do not think we have an agreement yet.
I agree that we don't have an agreement yet 😁
@pombredanne, what's the use-case for having "the key is [...] some data structure (e.g. list or object, etc)"? I'm aware that package managers like Maven support declaring a list of licenses, but is that what you mean? If so, what's the advantage of using the whole list as the key, instead of mapping all licenses individually and then combining them into a license expression?
FYI, in the end we ended up dropping most mappings we were using in ScanCode. See: https://github.com/nexB/scancode-toolkit/issues/1895#issuecomment-902486183
This is a continuation of https://github.com/spdx/tools-python/issues/106