nexB / dejacode

Automate open source license compliance and ensure software supply chain integrity
https://dejacode.readthedocs.io
GNU Affero General Public License v3.0

BUG: SBOM import does not trigger scan of packages #121

Open ghsa-retrieval opened 2 months ago

ghsa-retrieval commented 2 months ago

Describe the bug

On a self-hosted instance of DejaCode, it appears that the current main branch of DejaCode does not scan individual packages after loading the SBOM. This feature seems to work on the public demo instance.

Tested with:

To Reproduce

Configure dataspace:

  1. In "Application Process Settings" activate "Enable package scanning"
  2. In "Application Process Settings" activate "Update packages automatically from scan"

Steps to reproduce the behavior:

  1. Create a product
  2. Open the product
  3. Click on the "Scan" dropdown and select "Load Packages from SBOMs"
  4. Select an SBOM of your choice (e.g. sbom-1-4.cdx.json)
  5. Enable "Update existing packages with discovered packages data"
  6. Enable "Scan all packages of this product post-import"

Additional information which may or may not be relevant:

Expected behavior

After loading the packages through the load_sbom pipeline in ScanCode.io, each individual package should be analyzed with a scan_single_package pipeline and the results added to the respective packages in DejaCode.

Screenshots

No screenshots, as the error is that the expected actions do not happen.

Context (OS, Browser, Device, etc.): Firefox

tdruez commented 2 months ago

@ghsa-retrieval Could you confirm that the ScanCode.io integration is properly configured on your DejaCode instance? Click on your username in the top right corner to display the dropdown menu and select "Integration Status", or go directly to the URL /integrations_status/. From this view, we can make sure that ScanCode.io is "Configured" and "Available".


I renamed and edited the nexB dataspace for this (which also locks me out of creating new dataspace, not sure if that is expected?)

You need to update the REFERENCE_DATASPACE setting https://dejacode.readthedocs.io/en/latest/application-settings.html#reference-dataspace to match the new name, so that your Dataspace and related users have those permissions.

ghsa-retrieval commented 2 months ago

@tdruez Yes, it shows both "Configured" and "Available" with a green checkmark. The load_sbom pipeline works (with limitations) and packages are being added to the project, but they are not scanned individually to get detailed license and copyright information. Scanning for those details also works if I add a single package with "Add Package" and a URL to the package's archive. So some parts of the integration are definitely working.

You need to update the REFERENCE_DATASPACE setting https://dejacode.readthedocs.io/en/latest/application-settings.html#reference-dataspace to match the new name, so that your Dataspace and related users have those permissions.

Makes sense, that was just a bit unexpected when configuring it through the UI.

ghsa-retrieval commented 2 months ago

The same issue seems to happen when using "Scan" > "Scan All Packages". The UI reports that the job has been successfully submitted, but the scans never appear in the scan list, nor does ScanCode.io list any new projects. Hence, this might not be related to the SBOM import itself.

[Screenshot: 2024-05-16-dejacode-scan-all-packages]

tdruez commented 2 months ago

@ghsa-retrieval Thanks for the details. My hunch is that the problem is located in the async task that is responsible for submitting the scan requests. Could you check the worker logs for anything that looks like an error? You can display them with: docker compose logs worker

ghsa-retrieval commented 2 months ago

@tdruez Unfortunately no errors are being reported. It looks like DejaCode thinks it has successfully submitted a job, but the ScanCode.io log does not indicate that it is receiving anything nor that it runs into errors.

Do you have any other ideas where I should look?

[Attachments: 2024-05-17-dejacode-log-censored, 2024-05-17-scancode-log-censored]

tdruez commented 2 months ago

@ghsa-retrieval Thanks for the log, that's helpful. We can see that the task dje.tasks.scancodeio_submit_scan is properly called and executed, but no URIs are provided:

INFO Entering scancodeio submit scan task with uris=[] ...

My guess is that none of your packages have a download_url defined. At the moment, a download URL is required to fetch and scan a package from DejaCode.

Some Download URLs could be generated from the Package URL using the purl2url library, but only a few package types are supported.
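
To illustrate the kind of derivation involved for a single package type, here is a minimal sketch covering npm only. It does not use or reproduce purl2url: the function name npm_download_url_from_purl is hypothetical, and the URL layout is the public npm registry tarball convention. Other package types would each need their own rule.

```python
from urllib.parse import unquote


def npm_download_url_from_purl(purl):
    """Return the npm registry tarball URL for an npm PURL, or None.

    Hypothetical helper for illustration only; not part of purl2url.
    """
    prefix = "pkg:npm/"
    if not purl.startswith(prefix):
        return None  # Other package types are out of scope for this sketch.

    # Drop the optional subpath (#...) and qualifiers (?...).
    remainder = purl[len(prefix):].split("#", 1)[0].split("?", 1)[0]

    # The version follows the last "@"; a scoped name may contain an earlier one.
    path, separator, version = remainder.rpartition("@")
    if not separator or not path or not version:
        return None  # A version is required to build a tarball URL.

    full_name = unquote(path)  # "bootstrap" or "@scope/name"
    base_name = full_name.rsplit("/", 1)[-1]
    return (
        f"https://registry.npmjs.org/{full_name}/-/"
        f"{base_name}-{unquote(version)}.tgz"
    )


print(npm_download_url_from_purl("pkg:npm/bootstrap@5.3.3"))
```

A PURL without a version, or one for another package type, returns None in this sketch, which mirrors the situation where no Download URL can be generated.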

As a side note, the UI should be improved to warn you about the missing Download URL instead of displaying a success message.

ghsa-retrieval commented 2 months ago

It seems that you're right, the packages imported from the SBOM only have the "Package URL" and "Inferred URL" populated, but not the "Download URL". The uploaded SBOM has a purl and, under properties, a ResolvedUrl entry. These are the same SBOMs as in https://github.com/nexB/scancode.io/issues/1230

[...]
"components": [
        {
            "group": "",
            "name": "bootstrap",
            "version": "5.3.3",
            "hashes": [
                {
                    "alg": "SHA-512",
                    "content": "f072c2756832a0c82e48ef68f9a1fe8ae67e6a1b7e9b35b4bb71c833356eed2aeba6fec4041c539eb165482b24c1d635f843854129bbb8c2613501e474f7268e"
                }
            ],
            "purl": "pkg:npm/bootstrap@5.3.3",
            "type": "library",
            "bom-ref": "pkg:npm/bootstrap@5.3.3",
            "evidence": {
                "identity": {
                    "field": "purl",
                    "confidence": 1,
                    "methods": [
                        {
                            "technique": "manifest-analysis",
                            "confidence": 1,
                            "value": "/builds/beta/dso/tests-and-demos/dejacode-transitive-test/package-lock.json"
                        }
                    ]
                }
            },
            "properties": [
                {
                    "name": "SrcFile",
                    "value": "/builds/beta/dso/tests-and-demos/dejacode-transitive-test/package-lock.json"
                },
                {
                    "name": "ResolvedUrl",
                    "value": "https://registry.npmjs.org/bootstrap/-/bootstrap-5.3.3.tgz"
                },
                {
                    "name": "LocalNodeModulesPath",
                    "value": "node_modules/bootstrap"
                }
            ]
        },
[...]

Shouldn't that be working though? Where does DejaCode expect the URL to come from?

tdruez commented 2 months ago

@ghsa-retrieval Unfortunately, CycloneDX does not include a clear field to store the download URL of SBOM "components".

In ScanCode.io/DejaCode the download_url field is exported in the CycloneDX SBOM as aboutcode:download_url using custom properties defined at https://github.com/nexB/aboutcode-cyclonedx-taxonomy, see also https://github.com/CycloneDX/cyclonedx-property-taxonomy

cdxgen seems to be using the same properties approach with the ResolvedUrl property. I couldn't find much documentation about it on their repo though.

It would be interesting to have the list of properties generated by cdxgen, in order to implement a mapping for importing those values during the CycloneDX resolution in ScanCode.io.
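
As a rough idea of what such a mapping could look like, here is a minimal sketch that reads a download URL from the custom properties of a CycloneDX component. The property names aboutcode:download_url and ResolvedUrl are the ones mentioned in this thread; the function name, the constant, and the order of preference are assumptions, and this is not the actual ScanCode.io code.

```python
import json

# Property names checked in order of preference.
# The names come from this discussion; the ordering is an assumption.
DOWNLOAD_URL_PROPERTIES = ("aboutcode:download_url", "ResolvedUrl")


def get_download_url(component):
    """Return the first download URL found in a component's properties, or None.

    Hypothetical helper for illustration only.
    """
    properties = {
        prop.get("name"): prop.get("value")
        for prop in component.get("properties") or []
    }
    for name in DOWNLOAD_URL_PROPERTIES:
        value = properties.get(name)
        if value:
            return value
    return None


# Trimmed version of the component posted earlier in this issue.
component = json.loads("""
{
    "name": "bootstrap",
    "version": "5.3.3",
    "purl": "pkg:npm/bootstrap@5.3.3",
    "properties": [
        {"name": "ResolvedUrl",
         "value": "https://registry.npmjs.org/bootstrap/-/bootstrap-5.3.3.tgz"},
        {"name": "LocalNodeModulesPath", "value": "node_modules/bootstrap"}
    ]
}
""")

print(get_download_url(component))
```

Components without any of the known properties return None, so only data that is already present in the SBOM is used.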

ghsa-retrieval commented 2 months ago

@tdruez There does not appear to be any documentation as far as I'm aware. The properties can be found in https://github.com/CycloneDX/cdxgen/blob/4a27933ee55914afecbd465ba4ca9a1da62a9cc1/utils.js#L818 being added through pkg.properties and apkg.properties.

Wouldn't it make more sense to derive the URL from the PURL though? I thought the PURL already uniquely identifies the package, assuming that it is for a package manager such as maven, npm, pypi and so on. That would be a general solution rather than trying to parse the custom properties of a particular SBOM generation tool.

Any solution is very much appreciated though!

tdruez commented 2 months ago

Wouldn't it make more sense to derive the URL from the PURL though?

Maybe, but in the context of loading an SBOM, generating data that is not present in the SBOM may not always be wanted. Some kind of data integrity between the input and the imported data is likely expected. This will require more discussion though.

Any solution is very much appreciated though!

I think in the very short term, we can add support for the ResolvedUrl property.

ghsa-retrieval commented 2 months ago

Maybe, but in the context of loading an SBOM, generating data that is not present in the SBOM may not always be wanted. Some kind of data integrity between the input and the imported data is likely expected. This will require more discussion though.

That is a valid point. The suggested approach would ensure that only information already present in the SBOM would be used.

I think in the very short term, we can add support for the ResolvedUrl property.

That would be great!

tdruez commented 2 months ago

@ghsa-retrieval Support for the ResolvedUrl property was added on the ScanCode.io side in https://github.com/nexB/scancode.io/pull/1241

You can update your ScanCode.io instance (no changes are needed on the DejaCode side) and try "Load Packages from SBOMs" with "Scan all packages of this product post-import" again.

Keep in mind that only the packages that end up with a value for the download_url field will be scanned.
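
To make that rule concrete, here is a minimal sketch of the filtering, which also shows how a more accurate message could be built when nothing is scannable. Packages are represented as plain dicts and both function names are hypothetical; this is not the DejaCode implementation.

```python
def split_by_download_url(packages):
    """Split packages into (scannable, skipped) based on download_url."""
    scannable, skipped = [], []
    for package in packages:
        target = scannable if package.get("download_url") else skipped
        target.append(package)
    return scannable, skipped


def submission_message(packages):
    """Build a user-facing message that does not report success for zero scans."""
    scannable, skipped = split_by_download_url(packages)
    if not scannable:
        return "No scans submitted: none of the packages has a download URL."
    message = f"{len(scannable)} scan(s) submitted."
    if skipped:
        message += f" {len(skipped)} package(s) skipped (no download URL)."
    return message


packages = [
    {
        "purl": "pkg:npm/bootstrap@5.3.3",
        "download_url": "https://registry.npmjs.org/bootstrap/-/bootstrap-5.3.3.tgz",
    },
    {"purl": "pkg:npm/example@1.0.0", "download_url": ""},  # made-up entry
]

print(submission_message(packages))
```

With a list where no package has a download_url, the message reports that nothing was submitted, which corresponds to the uris=[] case seen in the worker log.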

ghsa-retrieval commented 2 months ago

@tdruez Works like a charm.

pombredanne commented 1 month ago

@ghsa-retrieval re:

Wouldn't it make more sense to derive the URL from the PURL though? I thought the PURL already uniquely identifies the package, assuming that it is for a package manager such as maven, npm, pypi and so on. That would be a general solution rather than trying to parse the custom properties of a particular SBOM generation tool.

There is code:

So there are many ways, and what we likely need here is an explicit action to call the PurlDB to "enrich" an SBOM with these URLs... or do this in ScanCode.io... a little design is needed. https://github.com/nexB/dejacode/issues/45

ghsa-retrieval commented 1 month ago

@pombredanne That is what I suspected. From an outside perspective, it would make sense to me for this feature to live in ScanCode.io, given that we already analyze the SBOM there and try to do the same for the underlying packages.

DennisClark commented 1 month ago

Note progress on deriving a download URL from a PURL when adding a package: https://github.com/nexB/dejacode/issues/131