tern-tools / tern

Tern is a software composition analysis tool and Python library that generates a Software Bill of Materials for container images and Dockerfiles. The SBOM that Tern generates will give you a layer-by-layer view of what's inside your container in a variety of formats including human-readable, JSON, HTML, SPDX and more.
BSD 2-Clause "Simplified" License

The generated SPDX JSON does not match the json schema #1064

Closed: maxhbr closed this issue 3 years ago

maxhbr commented 3 years ago

Describe the bug

The output generated via "spdxjson" does not validate against the spdx-schema.json. I observed the following four issues.

The example was generated by scanning osadl/debian-docker-base-image:buster-amd64-211011 and the result is tern.spdx.json.gz.

The issues:

(1) The schema expects creationInfo.creators to be an array

The schema has the following section

        "creators" : {
          "description" : "Identify who (or what, in the case of a tool) created the SPDX file. If the SPDX file was created by an individual, indicate the person's name. If the SPDX file was created on behalf of a company or organization, indicate the entity name. If the SPDX file was created using a software tool, indicate the name and version for that tool. If multiple participants or tools were involved, use multiple instances of this field. Person name or organization name may be designated as “anonymous” if appropriate.",
          "minItems" : 1,
          "type" : "array",
          "items" : {
            "description" : "Identify who (or what, in the case of a tool) created the SPDX file. If the SPDX file was created by an individual, indicate the person's name. If the SPDX file was created on behalf of a company or organization, indicate the entity name. If the SPDX file was created using a software tool, indicate the name and version for that tool. If multiple participants or tools were involved, use multiple instances of this field. Person name or organization name may be designated as “anonymous” if appropriate.",
            "type" : "string"
          }
        },

The JSON output contains a string instead:

$ cat tern.spdx.json  | jq .creationInfo.creators
"Tool: tern-2.8.0"

and the schema validator complains with:

Tool: tern-2.8.0: 'Tool: tern-2.8.0' is not of type 'array'
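The mismatch can be reproduced without the full schema validator. The following is a minimal sketch using only the Python standard library. The helper name `creators_problem` is hypothetical, and it mirrors only the `type` and `minItems` constraints quoted above, not the whole schema.

```python
import json


def creators_problem(doc):
    """Return a description of what is wrong with creationInfo.creators, or None."""
    creators = doc.get("creationInfo", {}).get("creators")
    if not isinstance(creators, list):
        return f"{creators!r} is not of type 'array'"
    if len(creators) < 1:
        return "creators must contain at least one item"
    if not all(isinstance(item, str) for item in creators):
        return "every creators item must be a string"
    return None


# Shape reported by jq above: a bare string.
emitted = json.loads('{"creationInfo": {"creators": "Tool: tern-2.8.0"}}')
# Shape the schema asks for: an array of strings.
expected = json.loads('{"creationInfo": {"creators": ["Tool: tern-2.8.0"]}}')

print(creators_problem(emitted))
print(creators_problem(expected))
```

The first call reports the same kind of type error as the validator, and the second returns `None` because wrapping the string in a one-element array satisfies both constraints.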

(2) The schema expects packages[].filesAnalyzed to be a boolean, not a string

The JSON contains "false" instead of false in the serialized output:

$ cat tern.spdx.json  | jq .packages[0].filesAnalyzed
"false"

The schema expects filesAnalyzed to be a bool.
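A likely cause of this kind of output is that the value is held as a Python string before serialization, although that is an assumption about Tern's internals rather than something confirmed here. The sketch below shows how the two Python types serialize differently, plus a hypothetical `to_bool` helper for normalizing such values.

```python
import json

# A Python str becomes a JSON string; a Python bool becomes a JSON boolean.
as_string = json.dumps({"filesAnalyzed": "false"})
as_bool = json.dumps({"filesAnalyzed": False})

print(as_string)  # {"filesAnalyzed": "false"}
print(as_bool)    # {"filesAnalyzed": false}


def to_bool(value):
    """Hypothetical helper: coerce 'true'/'false' strings to real booleans."""
    if isinstance(value, bool):
        return value
    if isinstance(value, str) and value.lower() in ("true", "false"):
        return value.lower() == "true"
    raise ValueError(f"cannot interpret {value!r} as a boolean")
```

Raising on anything other than a recognized value avoids silently turning an unexpected string such as "NOASSERTION" into `True`, which a plain `bool(value)` call would do.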

(3) The packages use the key fileName, which should be packageFileName

The schema does not define a fileName key in the packages section; it defines packageFileName instead.

The JSON contains:

$ cat tern.spdx.json  | jq .packages[1].fileName
"50445ea47417946f2e6f276a78dcf8ed395df4703932a180b963c0e22d5f3478/layer.tar"
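As an illustration of the rename, here is a small sketch that post-processes a package dictionary. The function name `rename_package_file_name` is hypothetical, and this is a workaround applied to already generated output, not how Tern itself builds the document.

```python
def rename_package_file_name(package):
    """Return a copy of a package dict with fileName renamed to packageFileName."""
    fixed = dict(package)  # shallow copy so the input is left untouched
    if "fileName" in fixed and "packageFileName" not in fixed:
        fixed["packageFileName"] = fixed.pop("fileName")
    return fixed


layer_package = {
    "fileName": "50445ea47417946f2e6f276a78dcf8ed395df4703932a180b963c0e22d5f3478/layer.tar",
}
fixed_package = rename_package_file_name(layer_package)
print(sorted(fixed_package))  # ['packageFileName']
```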

(4) Not all packages contain the required key name

The package name is required and must not be null.

$ cat tern.spdx.json | jq .packages[2].name
null

The full package is:

$ cat tern.spdx.json | jq .packages[2]
{
  "name": null,
  "SPDXID": "SPDXRef-None-None",
  "versionInfo": "NOASSERTION",
  "downloadLocation": "NONE",
  "filesAnalyzed": "false",
  "licenseConcluded": "NOASSERTION",
  "licenseDeclared": "NONE",
  "copyrightText": "NONE",
  "comment": ""
}
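To find every affected package rather than checking indices one at a time, something like the following sketch could be used. The function name `packages_without_name` and the sample document are illustrative assumptions; only the shape of the third package is taken from the output above.

```python
def packages_without_name(doc):
    """Return the indices of packages whose name is missing or not a string."""
    return [
        index
        for index, package in enumerate(doc.get("packages", []))
        if not isinstance(package.get("name"), str)
    ]


# Illustrative document: the first two names are made up, the third package
# mirrors the null-named entry shown above.
sample = {
    "packages": [
        {"name": "example-image", "SPDXID": "SPDXRef-example-image"},
        {"name": "example-layer", "SPDXID": "SPDXRef-example-layer"},
        {"name": None, "SPDXID": "SPDXRef-None-None"},
    ]
}
print(packages_without_name(sample))  # [2]
```

A missing key and an explicit null are treated the same way here, since both fail the schema's requirement of a string-valued name.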

The environment

$ tern --version
Tern version 2.8.0
   python version = 3.8.11 (default, Oct 16 2021, 17:24:33)
$ uname -a
Linux x1extremeG2 5.13.0-rc6 #1-NixOS SMP Sun Jun 13 21:43:10 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
$ scancode --version
/opt/scancode/src/cluecode/copyrights.py:3382: FutureWarning: Possible set difference at position 3
  remove_tags = re.compile(
ScanCode version 30.1.0
ScanCode Output Format version 1.0.0
SPDX License list version 3.14

rnjudge commented 3 years ago

Thanks @maxhbr for catching this. Fix will be out shortly.

maxhbr commented 3 years ago

The SPDX output does not seem to contain the full paths of files, only the file names. From a conversation:

In the SPDX for photon:3.0 I can see, for example:

   {
     "fileName": "libc.so",
     "SPDXID": "SPDXRef-abd286f",
     "checksums": [
       {
         "algorithm": "SHA1",
         "checksumValue": "4229de92a0517d2b08ea21913771825d84c82977"
       }
     ],
     "licenseConcluded": "NOASSERTION",
     "copyrightText": "NOASSERTION",
     "fileTypes": [
       "TEXT"
     ],
     "licenseInfoInFiles": [
       "NONE"
     ]
   },

but the file is at $LAYER1/usr/lib/libc.so. Shouldn't the fileName contain the full path? As far as I can see, the full location is not preserved anywhere in the SPDX JSON.

hesa commented 3 years ago

Just a side note: packageFileName is not required (see https://github.com/spdx/spdx-spec/blob/239189bee6074d8228a1bd7cc24d669934585d92/schemas/spdx-schema.json#L419-419).