Clarify documentation of return format for /v1/querybatch API

p3pijn commented 2 years ago

According to the Swagger documentation at https://osv.dev/docs/#operation/OSV_QueryAffectedBatch the endpoint https://api.osv.dev/v1/querybatch should return a list of osvVulnerability.

Instead it returns a list of vulnerability IDs without any vulnerability details or the actual link to the package the vulnerability belongs to.

This leads to the following issues:

- The batch endpoint does not return the OSV format, leading to clients having a need to query those vulnerability IDs. This completely goes against the idea of having a batch endpoint where you get all info for a set of libraries in a single HTTP request so that you do not overload the service.
- It is unclear and undocumented how the current response from the batch endpoint should be used. How to link these vulnerability IDs to the actual packages that were provided in the request?
- The documentation and implementation are out of sync

Example:

cat <<EOF | curl -X POST -d @- "https://api.osv.dev/v1/querybatch"
{
  "queries": [
    {
      "package": {
        "ecosystem": "Maven",
        "name": "org.apache.logging.log4j:log4j-core"
      },
      "version": "2.13.0"
    },
    {
      "package": {
        "ecosystem": "Packagist",
        "name": "noumo/easyii"
      },
      "version": "0.8"
    }
  ]
}
EOF

Returns:

{
    "results": [
        {
            "vulns": [
                {
                    "id": "GHSA-7rjr-3q55-vv33"
                },
                {
                    "id": "GHSA-jfh8-c2jp-5v3q"
                },
                {
                    "id": "GHSA-p6xc-xr62-6r2g"
                },
                {
                    "id": "GHSA-vwqq-5vrc-xw9h"
                }
            ]
        },
        {}
    ]
}

With this approach, the current service is almost impossible to use without hitting rate limits.

oliverchang commented 2 years ago

Hi! Thanks for trying our API and filing this issue.

> According to the Swagger documentation at https://osv.dev/docs/#operation/OSV_QueryAffectedBatch the endpoint https://api.osv.dev/v1/querybatch should return a list of osvVulnerability.
>
> Instead it returns a list of vulnerability IDs without any vulnerability details or the actual link to the package the vulnerability belongs to.
>
> This leads to the following issues:
>
> The batch endpoint does not return the OSV format, leading to clients having a need to query those vulnerability IDs. This completely goes against the idea of having a batch endpoint where you get all info for a set of libraries in a single HTTP request so that you do not overload the service.

Unfortunately since we allow up to 1000 packages per batch query, it's not feasible for us to return the full vulnerability details for every single package.

There is no rate limiting. GETing vulnerability details is very cheap and you can do them in parallel. We may add a batch GET endpoint in the near future if we see enough demand for it.

> It is unclear and undocumented how the current response from the batch endpoint should be used. How to link these vulnerability IDs to the actual packages that were provided in the request?

We return an array where each element corresponds to the package you provided. I.e. in your example, you provided two packages. The returned response has two entries in "results". The first shows 4 vulnerabilities for the first package you provided. There were no vulnerabilities that were matched for the second package you provided (given by the "{}"). Note that Packagist querying is not fully supported yet (tracked in https://github.com/google/osv/issues/230).

We'll clarify this in our API.
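
The positional mapping described above can be sketched in Python. This is a minimal illustration, not part of the OSV API: the function name `pair_queries_with_results` and the label format are hypothetical, and the sample data is copied from the request and response shown earlier in this issue.

```python
from typing import Any

# Sample request and response copied from the example earlier in this issue.
QUERIES = [
    {"package": {"ecosystem": "Maven", "name": "org.apache.logging.log4j:log4j-core"},
     "version": "2.13.0"},
    {"package": {"ecosystem": "Packagist", "name": "noumo/easyii"},
     "version": "0.8"},
]
RESPONSE = {
    "results": [
        {"vulns": [
            {"id": "GHSA-7rjr-3q55-vv33"},
            {"id": "GHSA-jfh8-c2jp-5v3q"},
            {"id": "GHSA-p6xc-xr62-6r2g"},
            {"id": "GHSA-vwqq-5vrc-xw9h"},
        ]},
        {},
    ]
}


def pair_queries_with_results(
    queries: list[dict[str, Any]], response: dict[str, Any]
) -> list[tuple[str, list[str]]]:
    """Match each query to its result by position (hypothetical helper)."""
    results = response.get("results", [])
    if len(results) != len(queries):
        raise ValueError("expected exactly one result entry per query")
    paired = []
    for query, result in zip(queries, results):
        pkg = query["package"]
        label = f'{pkg["ecosystem"]}:{pkg["name"]}@{query["version"]}'
        # An empty object ({}) means no vulnerabilities matched this query.
        ids = [vuln["id"] for vuln in result.get("vulns", [])]
        paired.append((label, ids))
    return paired


if __name__ == "__main__":
    for label, ids in pair_queries_with_results(QUERIES, RESPONSE):
        print(label, ids)
```

Because the only link between a query and its result is the index, a client should keep its query list in the order it was sent.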

> The documentation and implementation are out of sync

Could you please clarify this? Is this just the same issue with us not returning fully hydrated results?

> With this approach, the current service is almost impossible to use without hitting rate limits.

We have no rate limits :)

p3pijn commented 2 years ago

@oliverchang Many thanks! I’ll have a go at the /v1/query endpoint instead of the batch endpoint. Great to hear there is no rate limiting.

> > The documentation and implementation are out of sync
>
> Could you please clarify this? Is this just the same issue with us not returning fully hydrated results?

Indeed, the Swagger documentation currently suggests that the batch endpoint returns the fully hydrated vulnerabilities. Hence the confusion.

fviernau commented 2 years ago

I've just run into this same issue as well, in the context of integrating OSV into [1]. Comparing the options:

1. Query the vulnerabilities for each project separately.
2. Batch request the vulnerabilities of all M projects, plus request the details of the N distinct vulnerabilities by ID.

Which one is actually faster for my use case in [1] is not immediately obvious to me. Assuming both kinds of request execute quickly on the server, the better option depends on the number of projects M, the number of distinct vulnerabilities N, and how much the vulnerabilities overlap across the projects. Option 2 avoids transferring the same details redundantly, but it can (in some cases) also lead to a higher total number of requests.
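
To make the trade-off concrete, here is a rough request-count comparison in Python. It is only a sketch: the function names are hypothetical, it counts HTTP requests and ignores payload size and latency, and the limit of 1000 queries per batch is the figure mentioned earlier in this thread.

```python
import math

# Maximum queries per batch request, as stated earlier in this thread.
BATCH_LIMIT = 1000


def requests_per_project(m_projects: int) -> int:
    """Option 1: one query request per project."""
    return m_projects


def requests_batch_then_details(
    m_projects: int, n_distinct_vulns: int, batch_limit: int = BATCH_LIMIT
) -> int:
    """Option 2: batch requests for all projects, then one GET per distinct ID."""
    batch_requests = math.ceil(m_projects / batch_limit)
    return batch_requests + n_distinct_vulns


if __name__ == "__main__":
    # Hypothetical numbers: 50 projects sharing 10 distinct vulnerabilities.
    print(requests_per_project(50), requests_batch_then_details(50, 10))
    # Hypothetical numbers: 50 projects with 200 distinct vulnerabilities.
    print(requests_per_project(50), requests_batch_then_details(50, 200))
```

Under this count, option 2 needs fewer requests whenever N is smaller than M minus the number of batch requests, which is the case when many projects share the same vulnerabilities.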

Anyhow, what I'm after is whether option 2 would allow for caching vulnerability details, which would be beneficial. That would work, e.g., if the ID were usable as a cache key for the vulnerability details. My suspicion, however, is that the ID is not usable as a cache key, because the details have a "modified" field and the events also seem like they would change over time while the ID remains stable. So, can you tell whether

- details may change over time for any given ID
- Are there plans on supporting the use case of client side caching of vulnerability details? e.g. by
  - returning the whole cache key
  - extending the batchQuery to allow for filtering by a last modified date, like also discussed in #448.

[1] https://github.com/oss-review-toolkit/ort
[2] #448

oliverchang commented 2 years ago

Hi @fviernau,

Thanks for trying the OSV API!

Option 2 is definitely the preferred option from our perspective as well. GETing vulnerabilities by ID is extremely fast/cheap and you can do this in parallel as well to improve performance. We've also been considering adding a bulk GET (by multiple IDs) to make this easier.
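
A minimal sketch of fetching details in parallel, using only the Python standard library. It assumes the single-vulnerability endpoint `GET https://api.osv.dev/v1/vulns/{id}`; the function names, the worker count, and the injectable `fetch` parameter are hypothetical choices made for illustration.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable

# Assumed endpoint for fetching one vulnerability by ID.
OSV_VULN_URL = "https://api.osv.dev/v1/vulns/{id}"


def fetch_vuln(vuln_id: str) -> dict:
    """GET the full record for a single vulnerability ID."""
    url = OSV_VULN_URL.format(id=vuln_id)
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)


def fetch_all(
    vuln_ids: Iterable[str],
    fetch: Callable[[str], dict] = fetch_vuln,
    max_workers: int = 8,
) -> dict[str, dict]:
    """Fetch each distinct ID once, running the requests in a thread pool."""
    # Deduplicate first: several packages often share the same vulnerability.
    unique_ids = sorted(set(vuln_ids))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        records = pool.map(fetch, unique_ids)
        return dict(zip(unique_ids, records))
```

Calling `fetch_all` with the IDs collected from a batch response returns a dictionary from ID to full record. The `fetch` parameter exists so the network call can be replaced, for example in tests.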

Re your questions specifically,

> details may change over time for any given ID

That's correct.

> Are there plans on supporting the use case of client side caching of vulnerability details? e.g. by

This makes a lot of sense -- we can certainly include the "modified" timestamp with the ID as well in the batch response. The combination of ID+modified would serve as a reliable cache key. I just filed https://github.com/google/osv.dev/issues/492 to track this.
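
As a sketch of how a client might use ID+modified as a cache key: the Python below assumes the batch response would carry a "modified" value next to each ID, which at this point in the thread is only a proposal. The class name `VulnCache` and the `fetch` callback are hypothetical.

```python
from typing import Callable


class VulnCache:
    """Client-side cache of vulnerability records, validated by `modified`."""

    def __init__(self, fetch: Callable[[str], dict]):
        # `fetch` is any callable that returns the full record for an ID.
        self._fetch = fetch
        # Maps ID -> (modified value seen at fetch time, full record).
        self._entries: dict[str, tuple[str, dict]] = {}

    def get(self, vuln_id: str, modified: str) -> dict:
        """Return the cached record if `modified` is unchanged, else refetch."""
        cached = self._entries.get(vuln_id)
        if cached is not None and cached[0] == modified:
            return cached[1]
        record = self._fetch(vuln_id)
        # Replace any stale entry so only the latest version per ID is kept.
        self._entries[vuln_id] = (modified, record)
        return record
```

Keeping one entry per ID, rather than one per (ID, modified) pair, means superseded versions are dropped as soon as a newer `modified` value is seen.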

Extending batchQuery by filtering by a last modified date is likely not what we're going to go with given the complexity it would add to the API.

github-actions[bot] commented 1 month ago

This issue has not had any activity for 60 days and will be automatically closed in two weeks

google / osv.dev

Clarify documentation of return format for /v1/querybatch API #466