Support for offline operation (e.g. using local copy of PyPA advisory repo as vulnerability service)

riwoodward commented 11 months ago

Is your feature request related to a problem? Please describe.

Currently two vulnerability services are offered, pypi and osv, but both rely on pip-audit retrieving information from the Internet (e.g. from URLs specified in the code). This is a problem when operating on offline machines or with limited Internet access. As far as I can tell, pip-audit cannot operate without an Internet connection.

Describe the solution you'd like

I would like an option to execute pip-audit using a locally available copy of the advisory database. For example, I could maintain a local mirror on my network for the PyPA advisory database repo (https://github.com/pypa/advisory-database), then when pip-audit needs to be run on any offline machine on my network, I could simply retrieve from the local mirror and pass the path for this to pip-audit for it to use as a vulnerability service. This would be particularly useful for offline / air-gapped CI systems.

I hacked together a quick implementation for this and it works well. I just replaced the query function of pip_audit/_service/pypi.py with the version below, and passed the path to the local copy of the PyPA advisory database repo as an env var (e.g. export PIPAUDITDB=~/advisory-database):

  def query(self, spec: Dependency) -> tuple[Dependency, list[VulnerabilityResult]]:
      """
      Queries PyPI for the given `Dependency` specification.

      See `VulnerabilityService.query`.
      """
      if spec.is_skipped():
          return spec, []
      spec = cast(ResolvedDependency, spec)

      # Path to local PyPA Advisory Database.
      # Needs module-level imports: os, glob.glob, yaml (PyYAML) and packaging.version.Version
      repo_path = os.environ['PIPAUDITDB']

      results: list[VulnerabilityResult] = []

      # Get list of YAML files representing the advisories for the dependency
      # TODO: error checking! Check path exists, is valid PyPA database etc
      vuln_yaml_paths = glob(f'{repo_path}/{spec.canonical_name}/*')

      for vuln_yaml_path in vuln_yaml_paths:
          with open(vuln_yaml_path,'r') as f:
              data = yaml.safe_load(f)

          introduced_version = Version(data['affected'][0]['ranges'][0]['events'][0]['introduced'])
          # TODO: how to handle case when no fixed version? i.e. an active vuln?
          fixed_version = Version(data['affected'][0]['ranges'][0]['events'][1]['fixed'])
          fix_versions = [fixed_version]

          if introduced_version <= spec.version < fixed_version:
              # Normalize description into a single line
              description = data['details'].replace("\n", " ")

              results.append(
                  VulnerabilityResult(
                      id=data['id'],
                      description=description,
                      fix_versions=fix_versions,
                      # Not every advisory lists aliases
                      aliases=set(data.get('aliases', [])),
                      published=data['published'],
                  )
              )

      return spec, results
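
The two TODOs above (no fixed version, and more than one range per advisory) could be handled by walking the whole event list instead of indexing events[0] and events[1]. A minimal sketch, assuming the events follow the OSV-style introduced/fixed ordering seen in the YAML files; parse_version and is_affected are hypothetical helpers, and the version parser is deliberately simplified (real code would use packaging.version.Version):

```python
from typing import Optional


def parse_version(text: str) -> tuple[int, ...]:
    """Simplified parser for purely numeric dotted versions such as "1.2.3".

    Assumption: stands in for packaging.version.Version, which also
    understands pre-releases, post-releases, epochs and so on.
    """
    parts = [int(part) for part in text.split(".")]
    # Drop trailing zeros so that "1.2" and "1.2.0" compare equal
    while len(parts) > 1 and parts[-1] == 0:
        parts.pop()
    return tuple(parts)


def is_affected(
    version: str, events: list[dict[str, str]]
) -> tuple[bool, Optional[str]]:
    """Return (affected, fix_version) for an ordered list of range events.

    fix_version is None when the matching range has no "fixed" event,
    i.e. the vulnerability is still open.
    """
    target = parse_version(version)
    introduced: Optional[tuple[int, ...]] = None
    for event in events:
        if "introduced" in event:
            # An "introduced" value of "0" covers every version from the start
            introduced = parse_version(event["introduced"])
        elif "fixed" in event and introduced is not None:
            if introduced <= target < parse_version(event["fixed"]):
                return True, event["fixed"]
            introduced = None  # this range is closed; look for the next one
    # A trailing "introduced" with no "fixed" is an unfixed vulnerability
    if introduced is not None and target >= introduced:
        return True, None
    return False, None
```

In the query function this would replace the introduced_version / fixed_version lines, with fix_versions left empty when the second element of the result is None.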

I guess to make this an officially supported feature, the path to the PyPA repo could be specified using the -s SERVICE, --vulnerability-service SERVICE arg? Or make another option beyond osv and pypi, e.g. pypa-repo with another arg for the path to said repo?
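
To make the second option concrete, here is a standalone argparse sketch. It is not pip-audit's actual argument parser: the pypa-repo choice is the name floated above, and --advisory-db-path is a made-up option name used purely for illustration.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    # Standalone illustration only; "audit-sketch" is not a real program
    parser = argparse.ArgumentParser(prog="audit-sketch")
    parser.add_argument(
        "-s",
        "--vulnerability-service",
        choices=["osv", "pypi", "pypa-repo"],  # "pypa-repo" is the proposed addition
        default="pypi",
    )
    # Hypothetical option carrying the path to the local advisory database copy
    parser.add_argument("--advisory-db-path", type=Path, default=None)
    return parser


def parse_args(argv: list[str]) -> argparse.Namespace:
    parser = build_parser()
    args = parser.parse_args(argv)
    # The path only makes sense, and is only required, for the local service
    if args.vulnerability_service == "pypa-repo" and args.advisory_db_path is None:
        parser.error("--advisory-db-path is required with -s pypa-repo")
    return args
```

Keeping the path in its own option, rather than overloading -s, means the existing osv and pypi values keep working unchanged.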

I note that offline indexes are already supported using the --index-url arg, so this could be complementary?

If interested, I could put together a PR (i.e. taking the above approach and adding error handling, proper use of args, tidy up etc)? I wanted to see what you thought of the method / proposed approach first though.

Describe alternatives you've considered

Running a local PyPI mirror including the JSON advisory info could work but that would be considerably more effort and resource usage to achieve the same goal.

Additional context

Really great tool otherwise - thanks!

woodruffw commented 11 months ago

Thanks for the detailed report @riwoodward! We greatly appreciate it.

Your understanding is correct: pip-audit currently always needs some form of online access, whether to PyPI or OSV.

I'll let @di opine here as well, but here are my thoughts:

Unless the PyPA Advisory DB is standardized (maybe it is -- it might be OSV as YAML instead of JSON?), I don't think we want to introduce the direct dependency of "pip-audit knows how to parse the PyPA Advisory DB": doing so might hamper their ability to make breaking changes, and would also put us in the hotseat whenever they do. In other words: we get a lot of compatibility benefit out of only consuming the Advisory DB through the PyPI JSON API, which has standardized how it presents it 🙂
Running a local mirror of the JSON API is indeed a bit of a pain, and I don't have a great alternative here if being offline is a strict requirement. At the same time, having everything go through the JSON API simplifies things in pip-audit quite a bit.
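
To illustrate the format-stability concern: any direct consumer of the Advisory DB would want to fail loudly when a record stops looking the way it expects. A rough sketch, assuming only the keys that the snippet at the top of this issue reads (id, affected, ranges, events); check_advisory_shape is a hypothetical helper, not something pip-audit provides:

```python
from typing import Any

# Keys the snippet at the top of this issue depends on
REQUIRED_TOP_LEVEL = ("id", "affected")


def check_advisory_shape(data: Any) -> list[str]:
    """Return human-readable problems; an empty list means the record looks usable."""
    if not isinstance(data, dict):
        return ["advisory is not a mapping"]
    problems = [
        f"missing top-level key: {key}"
        for key in REQUIRED_TOP_LEVEL
        if key not in data
    ]
    for i, affected in enumerate(data.get("affected") or []):
        ranges = affected.get("ranges") if isinstance(affected, dict) else None
        if not ranges:
            problems.append(f"affected[{i}] has no ranges")
            continue
        for j, version_range in enumerate(ranges):
            events = (
                version_range.get("events")
                if isinstance(version_range, dict)
                else None
            )
            if not events:
                problems.append(f"affected[{i}].ranges[{j}] has no events")
    return problems
```

A check like this only detects a format change after the fact; it does not remove the maintenance burden described above.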

So TL;DR: I think we could do this if (1) we can get some kind of stability guarantees about the Advisory DB's format, and (2) doing so doesn't make our ultimate plans to integrate into pip any harder. But I'm not sure if we should, given that the "right" (but admittedly annoying) way to do this is to mirror the JSON API locally.

di commented 11 months ago

I don't think we should try to integrate against the advisory DB directly, mostly for the reasons @woodruffw mentioned.

I do think we could possibly support some kind of local file cache, but I think I need to understand the use case a bit more before I can say that would be a useful thing to add.

Overall I don't think I really understand how or why this feature will be used... assuming you have a lockfile and a snapshot of an advisory database, nothing is going to change: at the time you're online to get the snapshot, you already know about all the possible vulnerabilities for the subset of dependencies you have. Any offline audit with these two things is always going to produce the same result.

The main point of pip-audit is to find new vulnerabilities in dependencies you're using as they are discovered. The other use case is being aware of existing vulnerabilities when adopting new dependencies, but again, I don't see how you could be introducing new dependencies somewhere offline where you wouldn't be able to also query for vulnerabilities.

Is the general goal just to remove a dependency on an external service? That's why we provide PyPI as a provider, because you already have a dependency on PyPI anyways for everything else you need for installation.

riwoodward commented 11 months ago

Thanks both for the comments.

As @di suggests, perhaps I can flesh out the real-world use case more clearly.

In high-security environments, CI runners and build boxes etc. are often kept offline. Dependencies that are required can be mirrored to a dedicated local mirror machine on the LAN. Pip-audit sounds ideal for "finding new vulnerabilities in dependencies as they're discovered" as di suggests, which could be achieved using e.g. scheduled CI pipelines where a job includes the vulnerability check. The problem, however, is: how can one keep the pip-audit information source updated without Internet access? Without an up-to-date vulnerability database, new vulnerabilities in a project won't be discovered and the developer won't be alerted automatically.

Pip-audit is still useful to developers if they run the command manually during development on machines with Internet access, but I believe there is an opportunity to make this even more useful and add "offline" support.

I like the approach of mirroring the PyPA advisory database to an internal LAN mirror, since then you're only storing the vulnerability info. Mirroring the whole PyPI JSON API would take significantly more space, bandwidth etc. and most of that wouldn't be useful data. If you're aware of a good open-source tool for mirroring the API, by the way, please let me know, as it seems only partially supported in most packages that do this (e.g. devpi). At present, pip-audit hard-codes the PyPI URL anyway, so even a local JSON API mirror would still need work before pip-audit could use it.
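
For reference, keeping such a mirror current on the one machine that does have Internet access could be as small as the sketch below. It only builds the git command to run; sync_command and the mirror directory are illustrative names, and the URL is the advisory database repo linked at the top of this issue.

```python
from pathlib import Path

ADVISORY_DB_URL = "https://github.com/pypa/advisory-database"


def sync_command(mirror_dir: Path) -> list[str]:
    """Build the git command that creates or refreshes the local mirror."""
    if (mirror_dir / ".git").is_dir():
        # Existing checkout: fast-forward to the latest advisories
        return ["git", "-C", str(mirror_dir), "pull", "--ff-only"]
    # First run: a shallow clone is enough, history is not needed for auditing
    return ["git", "clone", "--depth", "1", ADVISORY_DB_URL, str(mirror_dir)]
```

A scheduled job on the mirror host would pass the result to subprocess.run(..., check=True); the offline machines then read the directory over the LAN.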

I understand the concerns @woodruffw raises about using PyPA's repo though. For now, I guess I'll just have to use my hacked approach from the start of this issue, as that does work well for me. But I'll keep an eye on the project in case this (or another offline method) becomes an official feature.

Thanks again for the great tool too!

di commented 11 months ago

In high-security environments, CI runners and build boxes etc. are often kept offline. Dependencies that are required can be mirrored to a dedicated local mirror machine on the LAN. Pip-audit sounds ideal for "finding new vulnerabilities in dependencies as they're discovered" as di suggests, which could be achieved using e.g. scheduled CI pipelines where a job includes the vulnerability check. The problem, however, is: how can one keep the pip-audit information source updated without Internet access? Without an up-to-date vulnerability database, new vulnerabilities in a project won't be discovered and the developer won't be alerted automatically.

I guess what I'm missing is: why does the check need to happen offline? Why can't it happen at the time you would "download" the cache, which you would need to be online for anyways?

bittner commented 7 months ago

I guess what I'm missing is: why does the check need to happen offline? Why can't it happen at the time you would "download" the cache, which you would need to be online for anyways?

Banks and similar environments have dedicated security teams that have their own idea of safety and convenience. They might agree to establish a very narrow channel to download configuration data, a vulnerability database, etc.

Large organizations also have a policy to "accept risk". This might translate to having a vulnerability database which is a few days or a week old. Cyber security people are held accountable for their decisions, hence they want to stay in control—they define the price.

They want to make sure that an attacker has as small an attack surface as possible. They're not totally wrong: a lot can happen with direct access to the Internet. Supply chain attacks are another enormous danger, followed by social engineering.

Bottom line: air-gapped environments are real. Solutions are needed. See GitLab's security scanning offerings for an example. They didn't invent that themselves; banks and other customers pushed them. I know it first-hand.

woodruffw commented 7 months ago

I'm sympathetic to the idea that there are contexts/setups where these checks should happen offline, but I think the engineering points in https://github.com/pypa/pip-audit/issues/698#issuecomment-1826992607 are still outstanding: we can't (reasonably) support an offline mode in pip-audit until the PyPA advisory DB is guaranteed to not change formats on us. Right now we get that property by indirectly relying on it via the PyPI JSON API, but it's not a guarantee of the DB itself. So upstream coordination is required before we can even think about an improved offline workflow here.

woodruffw commented 3 months ago

Linking things together: #805 presents a similar need.
