gg-aleblanc opened this issue 1 year ago
I have modified the proposed schema. The gist of the change is that the schema now includes the result of a scan of an entire release, not just of a given artifact.
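As a rough illustration, a release-level payload could look something like the sketch below; the field names here are placeholders for discussion, not the revised schema itself.

# Hypothetical shape only -- these field names are assumptions, not the revised schema.
release_scan_report = {
    "project": "urllib3",        # assumed: the release is identified by project name...
    "version": "2.0.3",          # ...and version
    "artifacts": [               # one entry per scanned file in the release
        {
            "filename": "urllib3-2.0.3.tar.gz",
            "digests": {"sha256": "..."},
            "secrets": [
                {
                    "type": "google_aiza",
                    "display_name": "Google API Key",
                    "filepath": "",
                    "line": 1,
                }
            ],
        }
    ],
}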
Thanks for the patience on this, I've taken a look now and have some thoughts:
@miketheman is working on the infrastructure for the Malicious Package Reporting API, and much of that work (authentication, the observation model, etc.) will be required for this API endpoint as well, so it will need to be completed before we can implement the endpoint itself.
For our own uses we'll need to decide on a "minimum" set of required fields that will actually get used by our backend; all other fields can be sent as additional information that we might use later on. The nice thing about the Observation model Mike has designed is that we can gather all the information up front and choose to use more of it down the road, so don't let our small number of required fields discourage you from sending more information in the payload.
Identifying some straightforward required fields we'll likely need:
Since we're applying these observations to individual files, not necessarily to releases, we might want to have the API endpoint be file-centric as well? Something along the lines of:
{
  "scanner_info": {
    "display_name": "GitGuardian",
    "report_issue_url": "..."
  },
  "scan_results": [
    {
      "filename": "urllib3-2.0.3.tar.gz",
      "digests": {
        "sha256": "..."
      },
      "secrets": [
        {
          "type": "google_aiza",
          "display_name": "Google API Key",
          "filepath": "",
          "line": 1,
          "documentation_url": "...",
          "validity_status": "VALID|NOT_VALID|UNKNOWN"
        }
      ]
    }
  ]
}
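As a usage sketch, a scanner could submit a payload like the one above with a single authenticated POST and treat the status code as the only signal; the endpoint URL, header, and token below are placeholders, not an actual Warehouse API.

import requests  # third-party HTTP client, used here for brevity

# Placeholder values -- the real endpoint, header name, and credential are not defined yet.
ENDPOINT = "https://pypi.example/api/secret-reports"
API_KEY = "..."

def submit_report(payload: dict) -> None:
    """POST one scan report; any non-2xx status is treated as a failure."""
    response = requests.post(
        ENDPOINT,
        json=payload,
        headers={"Authorization": f"Bearer {API_KEY}"},
        timeout=10,
    )
    # The endpoint is not expected to return a body; the status code alone signals success.
    response.raise_for_status()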
What's the problem this feature will solve?
As part of our ongoing collaboration to find exposed secrets in PyPI packages, we are working on a scanning pipeline that automatically scans newly released packages. In order to report our findings, we will need an endpoint we can call, with an agreed-upon schema.
Describe the solution you'd like
Schema
Ideally, the endpoint’s payload would be on a per-artifact basis, allowing us to include metadata about the artifact alongside the list of secrets that were found. Here is a possible schema for the payload.
Response
We do not expect the endpoint to return any data; we just need to be able to distinguish between a successful call and one that fails, and standard status codes should be more than enough.
API versioning
We have no strong requirement on this point, and will be fine with whichever solution you choose for the versioning of the schema.
Call volume and rate limiting
Since we are planning to call the endpoint once per artifact in which we find secrets, the worst case would be that we find secrets in every single artifact. In that case, our volume of calls would be directly proportional to the number of releases. We consequently don’t expect our call volume to be high enough to be restricted by rate limiting.
Authentication
This endpoint should not be publicly available. A possible approach would be to use both authentication via a secret (ideally just an API key) and an IP allowlist, to guarantee that only known entities have access to the endpoint.
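A minimal sketch of what enforcing that combination could look like on the server side, assuming a header-carried API key and a static allowlist (none of this reflects Warehouse's actual implementation):

import hmac
import ipaddress

# Assumed configuration -- in practice these would come from secret storage and deploy config.
EXPECTED_API_KEY = "..."
ALLOWED_NETWORKS = [ipaddress.ip_network("192.0.2.0/24")]  # documentation range as a stand-in

def is_authorized(presented_key: str, remote_addr: str) -> bool:
    """Require both a matching API key and a source IP inside the allowlist."""
    key_ok = hmac.compare_digest(presented_key, EXPECTED_API_KEY)
    ip_ok = any(
        ipaddress.ip_address(remote_addr) in network for network in ALLOWED_NETWORKS
    )
    return key_ok and ip_ok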
Remediation
In the case of prolonged downtime of the endpoint, we won’t be able to upload our findings. They will be persisted on our end and can be re-uploaded at a later point. We do not plan to automate this: it will be done “manually”, in an ad hoc fashion.
We would also probably need to have an automated way of revoking / renewing our own API key, to be able to remediate any leak on our end immediately.