tern-tools / tern

Tern is a software composition analysis tool and Python library that generates a Software Bill of Materials for container images and Dockerfiles. The SBOM that Tern generates will give you a layer-by-layer view of what's inside your container in a variety of formats including human-readable, JSON, HTML, SPDX and more.
BSD 2-Clause "Simplified" License

Tern output is much better in text format than in SPDX format #1188

Closed vargenau closed 1 year ago

vargenau commented 1 year ago

Describe the bug

Tern output is much better in text format than in SPDX format.

To Reproduce

tern report -i apache/airflow:2.3.0b1-python3.10 -o airflow-tern2.10.1.txt
tern report -f spdxtagvalue -i apache/airflow:2.3.0b1-python3.10 -o airflow-tern2.10.1.spdx

Text output:

    +------------------------+------------------------------+-----------------------------------------------+------------+
    | Package                | Version                      | License(s)                                    | Pkg Format |
    +------------------------+------------------------------+-----------------------------------------------+------------+
    | bsdutils               | 1:2.36.1-8+deb11u1           | BSD-2-clause, BSD-3-clause, BSD-4-clause,     | deb        |
    |                        |                              | GPL-2, GPL-2+, GPL-3+, LGPL, LGPL-2+,         |            |
    |                        |                              | LGPL-2.1+, LGPL-3+, MIT, public-domain        |            |

SPDX output:

PackageName: bsdutils
SPDXID: SPDXRef-bsdutils-1-2.36.1-8-deb11u1
PackageVersion: 1:2.36.1-8+deb11u1
PackageDownloadLocation: NOASSERTION
FilesAnalyzed: false
PackageLicenseConcluded: NOASSERTION
PackageLicenseDeclared: NONE

Expected behavior

In the text format, we have a long list of licenses that have been correctly identified by Tern. I would expect to have the same list in the SPDX output (after conversion to SPDX identifiers).

Environment you are running Tern on

tern --version
Tern version 2.10.1
   python version = 3.10.6 (main, Aug 10 2022, 11:40:04)

xxLiuxx commented 1 year ago

Agreed. Also, the dependencies in text format are grouped by layers but in spdx.json they are not. Can we also have them grouped by layer in spdx.json?

rnjudge commented 1 year ago

Hi @vargenau -- thanks for opening this issue. I apologize for the delay, I have been on maternity leave.

The conversion from the text licenses that Tern finds to SPDX identifiers has long been an issue (see discussions here and here). We have not found a library that does this reliably, but we do attempt it using the license_expression library (added when this PR was merged). I think you are what inspired this PR, actually.

This does seem like a bug, however: the LicenseRefs are not listed for the bsdutils package licenses. I suspect this is because the package is a Debian package, which means that Tern cannot use the package manager to get clear license expressions (Debian does not provide them) and instead has to scan the copyright files within a package to attempt to get package licenses. Because of this, we may have chosen not to represent these as LicenseRefs. Let me dig a little deeper and see what I find.

@xxLiuxx -- I will look into your request as well.

rnjudge commented 1 year ago

@vargenau I looked into this, and it is a problem specific to Debian-based containers. The licenses are the same in text and SPDX documents for other package managers such as apk (Alpine) and rpm (Photon):


~$ tern report -i photon:3.0

| curl                | 7.86.0-1.ph3   | MIT                           | rpm        |

~$ tern report -i photon:3.0 -f spdxtagvalue 

PackageName: curl
SPDXID: SPDXRef-curl-7.86.0-1.ph3
PackageVersion: 7.86.0-1.ph3
PackageDownloadLocation: NOASSERTION
FilesAnalyzed: false
PackageLicenseConcluded: NOASSERTION
PackageLicenseDeclared: MIT

As I mentioned in my previous comment, Debian licenses are not declared by the package/package manager, which is unfortunate. Instead, Tern has to parse the copyright files and pull out a list of license-looking text (we use debian-inspector for this), which is the long list of licenses you see. Because this is a flat list of licenses from the Debian copyright text, it's difficult to put these into separate license identifiers using SPDX's PackageLicenseDeclared field. SPDX requires using AND or OR if you have more than one license, and each operator has its own compliance implications that Tern cannot infer just from the list of extracted licenses (https://spdx.github.io/spdx-spec/v2.3/SPDX-license-expressions/#d4-composite-license-expressions).
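
To illustrate the ambiguity, here is a minimal Python sketch (not Tern code; the `join_licenses` helper and the sample list are hypothetical). The same flat list of extracted names can be joined into two expressions with different compliance meanings, and nothing in the list itself says which one is right:

```python
# Illustrative only, not Tern code. The names below are raw strings as they
# might be extracted from a Debian copyright file, not validated SPDX IDs.
found = ["GPL-2+", "MIT"]


def join_licenses(names, operator):
    """Join license names with an SPDX operator ("AND" or "OR")."""
    if operator not in ("AND", "OR"):
        raise ValueError("operator must be 'AND' or 'OR'")
    return f" {operator} ".join(names)


conjunctive = join_licenses(found, "AND")  # every listed license applies
disjunctive = join_licenses(found, "OR")   # the recipient may choose one

print(conjunctive)  # GPL-2+ AND MIT
print(disjunctive)  # GPL-2+ OR MIT
```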

Do you have any ideas on how Tern could improve this?

rnjudge commented 1 year ago

I guess one option might be to create one single LicenseRef with all of the licenses found in the copyright texts?

i.e.

PackageLicenseDeclared: LicenseRef-123456
.
.
.
LicenseID: LicenseRef-123456
ExtractedText: <text>Original license: GPL-2, GPL-2+, GPL-3+, LGPL, LGPL-3+, MIT, public-domain</text>

I'm not sure if this is allowed in SPDX, though; I would need to email the SPDX tech mailing list.

rnjudge commented 1 year ago

Hi @vargenau, looks like we can include multiple licenses with a single license ref (Reference here), so I will go ahead and make that change in Tern.
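
A rough Python sketch of the single-LicenseRef idea follows. It is not the actual Tern change: the `make_license_ref` helper is hypothetical, and deriving the ID from a SHA-256 digest of the text is an assumption made only to keep the example deterministic.

```python
import hashlib


def make_license_ref(licenses):
    """Return (ref_id, tag-value lines) for one LicenseRef covering all names.

    The ID scheme (first 8 hex characters of a SHA-256 digest) is an
    assumption for this sketch, not necessarily what Tern does.
    """
    text = "Original license: " + ", ".join(licenses)
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()[:8]
    ref_id = f"LicenseRef-{digest}"
    lines = [
        f"LicenseID: {ref_id}",
        f"ExtractedText: <text>{text}</text>",
    ]
    return ref_id, lines


ref_id, extracted = make_license_ref(["GPL-2", "GPL-2+", "MIT", "public-domain"])
print(f"PackageLicenseDeclared: {ref_id}")
print("\n".join(extracted))
```

Hashing the extracted text means two packages with the same license list share one LicenseRef, which keeps the document small.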

Thanks!

vargenau commented 1 year ago

Hi @rnjudge,

Thank you for taking my ticket into account. What you propose will improve the SPDX output in Tern.

However, I do not consider it the final solution.

Syft does it better for this package bsdutils (see airflow-syft0.62.3.spdx.txt):

PackageLicenseDeclared: BSD-2-Clause AND BSD-3-Clause AND BSD-4-Clause AND GPL-2.0-only AND GPL-2.0-or-later AND GPL-3.0-only AND GPL-3.0-or-later AND LGPL-2.0-only AND LGPL-2.0-or-later AND LGPL-2.1-only AND LGPL-2.1-or-later AND LGPL-3.0-only AND LGPL-3.0-or-later AND MIT

(but it misses public domain)

They do it by maintaining an equivalence table: https://github.com/anchore/syft/blob/main/internal/spdxlicense/license_list.go

For example, they map GPL-2 to GPL-2.0-only and GPL-2+ to GPL-2.0-or-later.
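
As a rough illustration of the idea in Python (not Syft's or Tern's code), such a table could look like the sketch below. Only the GPL-2 and GPL-2+ entries come from the examples above; the other entries and the `map_licenses` helper are assumptions added for the example. Names with no entry, such as public-domain, are returned separately rather than guessed:

```python
# Hypothetical equivalence table from Debian copyright short names to SPDX IDs.
DEBIAN_TO_SPDX = {
    "GPL-2": "GPL-2.0-only",         # from the example above
    "GPL-2+": "GPL-2.0-or-later",    # from the example above
    "GPL-3+": "GPL-3.0-or-later",    # assumed, by analogy
    "BSD-2-clause": "BSD-2-Clause",  # assumed
    "MIT": "MIT",                    # assumed
}


def map_licenses(names):
    """Split names into (mapped SPDX IDs, names with no known mapping)."""
    mapped, unmapped = [], []
    for name in names:
        spdx_id = DEBIAN_TO_SPDX.get(name)
        if spdx_id is None:
            unmapped.append(name)
        else:
            mapped.append(spdx_id)
    return mapped, unmapped


mapped, unmapped = map_licenses(["GPL-2", "GPL-2+", "MIT", "public-domain"])
print(mapped)    # ['GPL-2.0-only', 'GPL-2.0-or-later', 'MIT']
print(unmapped)  # ['public-domain']
```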

Would something like that be doable in Tern?

rnjudge commented 1 year ago

Hi @vargenau. I would argue that the Syft SBOM is not entirely correct and not exactly what Tern should strive to match. For starters, they report that the PackageLicenseConcluded is the same as PackageLicenseDeclared. This is not true in all cases, particularly for the handful of licenses found by scanning copyright text in Debian packages. Furthermore, we've discussed in the DocFest and with the SPDX community that PackageLicenseConcluded should typically not be filled in by tools, as it is designed to indicate that some type of analysis was performed on the PackageLicenseDeclared text, resulting in a decision as to the true license of the package. Syft is not doing this and is simply copying the text from PackageLicenseDeclared, which is misleading.

Second, they are using the conjunctive AND operator to join together these licenses. Normally, this is fine because this is the meaning of what you get from a Debian copyright file, but the caveat is that MIT and public-domain are NOT license keys. These are merely references in the style of a local LicenseRef, and their actual meaning is entirely determined by the license or notice text that comes after them, so including a license ref and the corresponding copyright text (which Syft omits) can be helpful.

That being said, the equivalence table that Syft uses to map license identifiers to raw license text is an interesting concept. I suspect this is similar to what the license-expression library does, though (which Tern utilizes to resolve license text to identifiers), and if we can avoid manual upkeep of a license list that would be ideal.

Lastly, I am in correspondence with Philippe, the maintainer of ScanCode, and he says that we can utilize ScanCode to detect and make sense of Debian copyright files, whether they are structured machine-readable files or legacy non-structured ones. This may take some time, so in the meantime I will make the change discussed above as a temporary workaround. Does that work for you?

vargenau commented 1 year ago

Hi @rnjudge

I agree that tools should only fill in PackageLicenseDeclared. That is exactly why I gave only PackageLicenseDeclared in my comment.

I have seen the email from Philippe on the SPDX mailing lists. It will be very good if he helps you to implement a better solution based on ScanCode. In the meantime you can merge your changes for your proposed solution with LicenseRef- containing the multiple licenses.