pivotal / LicenseFinder

Find licenses for your project's dependencies.
MIT License

Processing packages with no License attribute #808

Open vr333dev opened 3 years ago

vr333dev commented 3 years ago

Hello, as we depend on License-Scanner in GitLab to perform License Compliance validation, we find that LicenseFinder is unable to deal with NuGet packages where the license type/ID is not defined. Package Polly v7.2.1, for example, shows "BSD-3-Clause" for license name. Xunit.analyzers v0.10, on the other hand, does not have this information (many packages are created this way), so the scanner falls back to license URLs in place of the missing names (see Security & Compliance -> License Compliance -> Policies; the URLs often correctly refer to Apache 2.0/MIT/BSD/etc. licenses online). This, in turn, causes many licenses to appear as unknown and not be subject to the Policies we create, since Policies depend on actual license types/names. Merging to master then skips validation for all such licenses (potential for legal issues).

Checkmarx, as a reference, keeps license URL to name relationships in the database, and so is able to perform Open Source Analysis with a minimal number of unknown licenses. Please help by expanding the functionality of LicenseFinder (we believe it will be beneficial in many cases to refer to a list of license names based on available URLs, if a name is not provided directly), or please let us know if there is a workaround.

Thank you

Vlad

cf-gitbot commented 3 years ago

We have created an issue in Pivotal Tracker to manage this. Unfortunately, the Pivotal Tracker project is private so you may be unable to view the contents of the story.

The labels on this github issue will be updated when the story is started.

vr333dev commented 3 years ago

Unable to view the story, correct.

Thanks for the update, I'll keep track of labels for this issue.

xtreme-shane-lattanzio commented 3 years ago

Hey @vr333dev! Someone just made another issue here: https://github.com/pivotal/LicenseFinder/issues/810 Is this related? It would be great if we could have a fix for this. We aren't that well versed in NuGet, so any help from the community would be appreciated!

vr333dev commented 3 years ago

Hi, it's not the same issue unfortunately, #810 is simply about returning a URL when license type/name is not available. This current issue is similar in that we also need to derive license URL. But then we propose to look up a hash-table where URL is the key and value is license name (for example), so that once we have the URL, we can return correlating license name. Because in GitLab it doesn't make sense to use License Policies otherwise, since policies are based on license names (to then let the scanner decide whether a policy has been validated or denied, before a merge request).

We are in a similar situation when it comes to reading NuGet attributes in code (we have no experience with Ruby in particular, or with adapting to the existing logic).

Thanks in advance

vr333dev commented 3 years ago

Would it help if I provided the full list of license names with correlating license URLs (any format, could be JSON etc.)?

xtreme-shane-lattanzio commented 3 years ago

I'm not sure if that will help. What we typically do is look into a package manager file, pull out the information, and log it while running the prepare commands such as nuget restore. For this case in this file, we go through each dependency we find and look for a nupkg file. The next step seems to be to unzip it and pull the licenses out of it with the def self.nuspec_license_urls(specfile_content) function on line 147.
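As a rough illustration of the extraction step described above, here is a minimal sketch that reads license URLs out of the text of a .nuspec file. It is not LicenseFinder's actual implementation: the helper name `license_urls_from_nuspec` and the sample package are hypothetical, the unzip step is left out, and the XML namespace that real .nuspec files usually declare is omitted from the sample for brevity.

```ruby
require 'rexml/document'

# Hypothetical helper (not LicenseFinder's own code): returns every
# non-empty <licenseUrl> value found under <metadata> in a .nuspec document.
def license_urls_from_nuspec(specfile_content)
  doc = REXML::Document.new(specfile_content)
  REXML::XPath.match(doc, '//metadata/licenseUrl')
              .map { |node| node.text.to_s.strip }
              .reject(&:empty?)
end

# Made-up package used only for illustration.
SAMPLE_NUSPEC = <<~XML
  <?xml version="1.0"?>
  <package>
    <metadata>
      <id>ExamplePackage</id>
      <version>1.0.0</version>
      <licenseUrl>https://opensource.org/licenses/ISC</licenseUrl>
    </metadata>
  </package>
XML

puts license_urls_from_nuspec(SAMPLE_NUSPEC).inspect
```

A package without a `<licenseUrl>` element yields an empty array, which is the case where neither a name nor a URL is available.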

Does this help clarify what is being done in that file? I am not sure how we can change this process to be more efficient for nuget but we are definitely open to suggestions. If you are saying that if that process returns an unknown we can then fall back to a hardcoded list, that could work but I do not think we have ever done something like that. Because this would not be dynamic, it could lead to incorrect reporting in the future.

I hope I am understanding this correctly, but if not, feel free to share small snippets of the files you are referring to, showing where the license names match the URLs vs. where they don't :)

vr333dev commented 3 years ago

The logic does make sense. And I agree that a license list could become stale over time, or that license URLs could differ between NuGets (even though it's the same license type/name). But I think those cases would be rare (and exceptions are going to happen) if we still try to use a hard-coded list. Currently there is no fallback when processing packages with no license names/types defined, so it seems like we could still benefit from using a static list.

A list in JSON format could look something like this:

{
  "https://opensource.org/licenses/0BSD": "0-clause BSD License (0BSD)",
  "https://opensource.org/licenses/Apache-2.0": "Apache License 2.0",
  "https://opensource.org/licenses/ISC": "ISC License"
}

That way, if a license name is not available but a URL is, we can look up and return a license name based on the URL. Otherwise, when we run a GitLab pipeline (for example), License-Compliance policies depend on license names to determine whether a license is allowed or denied. If we get URLs instead, many of the results are unknown, and software with forbidden licensing could be introduced into (or remain in) our applications.
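The lookup proposed here could be sketched roughly as follows. This is only an illustration under stated assumptions: the constant `LICENSE_NAMES_BY_URL` and the method `license_name_for_url` are hypothetical names, the mapping is the three-entry JSON example from this comment, and the only normalization applied is trimming whitespace and a trailing slash.

```ruby
require 'json'

# Hypothetical static mapping, taken from the JSON example in this comment.
LICENSE_NAMES_BY_URL = JSON.parse(<<~MAPPING).freeze
  {
    "https://opensource.org/licenses/0BSD": "0-clause BSD License (0BSD)",
    "https://opensource.org/licenses/Apache-2.0": "Apache License 2.0",
    "https://opensource.org/licenses/ISC": "ISC License"
  }
MAPPING

# Hypothetical helper: returns the license name for a URL, or nil when the
# URL is missing or not in the list (the caller can then keep the raw URL).
def license_name_for_url(url, names_by_url = LICENSE_NAMES_BY_URL)
  return nil if url.nil?

  names_by_url[url.strip.chomp('/')]
end

puts license_name_for_url('https://opensource.org/licenses/ISC').inspect
puts license_name_for_url('https://example.com/custom-license').inspect
```

Returning nil for an unknown URL keeps today's behavior as the default, so the static list only ever adds names and never overrides anything.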

vr333dev commented 3 years ago

And I could send screenshots of what we see in GitLab projects -> Security & Compliance -> License Compliance -> Detected in Projects/Policies tabs (if an email is provided), and differences between NuGets with license names vs. without.

xtreme-shane-lattanzio commented 3 years ago

@vr333dev With the example you put, I am starting to think this can actually be pretty useful. As you have written it, I doubt that the list will change much. I would leave this as an experimental feature for now, just for NuGet. Looking at it, though, the same approach is possible for any package manager that has this issue. The URL will link to the name, which will link to the actual license content. If you can make a PR for this I would be happy to check it out! :D

One more clarification to make sure I understand the problem we are solving here: can you please give a sample JSON that is currently erroring? Not sure of the format, but I imagine something like:

{
  name: packageA
  version: 1
  licenseName: ?????????
  licenseURL: https://opensource.org/licenses/0BSD
}

vr333dev commented 3 years ago

There's no JSON currently; it would be used in/by LicenseFinder (if implemented). LicenseFinder reads packages and returns the information that is there; license names are not always available, but URLs are. If a JSON file correlates URLs to license names, then LicenseFinder could return license names properly (in GitLab, when License-Compliance scans run).

https://docs.gitlab.com/ee/user/compliance/license_compliance/ "The License Finder scan tool runs as part of the CI/CD pipeline..."

To triple check, please: I should create a PR with the full list of license names and URLs (like in the JSON example I provided)? I don't see a way to provide a JSON with package names (your example), because there would be many more repetitions of URLs and license names, and the list of packages is much more dynamic than the list of license types.

xtreme-shane-lattanzio commented 3 years ago

@vr333dev Yes, that's right for the JSON you implement. My example was more about what is currently being read in from the package managers: we are getting the package name, version, and URL, but the license name is blank. The next step would then be to look at your newly created JSON to populate that license name value. Hopefully that makes sense!
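A minimal sketch of that "populate the blank name" step, under stated assumptions: the `PackageInfo` struct, the `with_license_name` method, and the one-entry mapping are hypothetical and only mirror the fields from the earlier example (name, version, license name, license URL). They are not LicenseFinder's real data model.

```ruby
# Hypothetical record mirroring the fields discussed in this thread.
PackageInfo = Struct.new(:name, :version, :license_name, :license_url, keyword_init: true)

# Hypothetical URL-to-name mapping; in the proposal this would come from the JSON file.
NAMES_BY_URL = {
  'https://opensource.org/licenses/0BSD' => '0-clause BSD License (0BSD)'
}.freeze

# Returns a copy of the package with license_name filled in from the mapping,
# but only when the name is blank and the URL is known. Otherwise the
# package is returned unchanged.
def with_license_name(package, names_by_url)
  return package unless package.license_name.to_s.strip.empty?

  looked_up = names_by_url[package.license_url]
  return package if looked_up.nil?

  updated = package.dup
  updated.license_name = looked_up
  updated
end

package_a = PackageInfo.new(name: 'packageA', version: '1', license_name: nil,
                            license_url: 'https://opensource.org/licenses/0BSD')
puts with_license_name(package_a, NAMES_BY_URL).license_name
```

A name that the package manager already reports is never overwritten, so the mapping acts purely as a fallback.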

vr333dev commented 3 years ago

Agreed, I'll work on compiling the JSON file.

Thanks

vr333dev commented 3 years ago

Hi, this is the proposed licenses.json: https://github.com/pivotal/LicenseFinder/compare/master...vr333dev:patch-1 The last 12 items in the file would be especially useful for returning license names, since those are third-party references to well-known licenses.

xtreme-shane-lattanzio commented 3 years ago

Assuming those links are correct, that looks good to me! I look forward to your PR to use the new file. Thanks again!

vr333dev commented 3 years ago

Most links come from https://opensource.org/licenses/alphabetical and I've tested the last dozen which we see after License-Compliance scanning in GitLab (and to determine actual license types/names). Sounds good, I'll work on a PR.

vr333dev commented 3 years ago

PR created: https://github.com/pivotal/LicenseFinder/pull/818 Thanks