polywrap / wrap-cli

Used to create, build, and integrate wraps.
https://polywrap.io
MIT License

Polywrap Dependency Locking #402

Open dOrgJelli opened 3 years ago

dOrgJelli commented 3 years ago

For Polywrap imports, currently there's no differentiation between dynamic URIs (ex: ENS domains) and static URIs (ex: IPFS CIDs). The usage of dynamic URIs can present problems, since the underlying wrapper could be changed, which may not be desired.

We should support the ability to "lock down" your dependencies, meaning that all dynamic URIs would be fully resolved to their underlying static URI. These static URIs would be cached (embedded?) into the wrapper's package, ensuring that all dependencies will be resolved the same way every time.
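To make the idea concrete, here is a minimal sketch of the "lock down" step, with all names hypothetical (this is not the wrap-cli API): every dynamic URI is resolved once to its static counterpart, and the result is recorded so later lookups never re-resolve.

```typescript
// Sketch: pinning dynamic URIs to static ones. All names are hypothetical.
type UriMap = Record<string, string>;

// `resolver` stands in for whatever performs ENS/registry resolution.
function lockDependencies(
  uris: string[],
  resolver: (uri: string) => string
): UriMap {
  const lock: UriMap = {};
  for (const uri of uris) {
    // Static URIs (here, anything under w3://ipfs/) are already fully resolved.
    lock[uri] = uri.startsWith("w3://ipfs/") ? uri : resolver(uri);
  }
  return lock;
}
```

Once such a map is cached or embedded in the wrapper's package, every consumer resolves the same static URIs, regardless of later changes to the dynamic ones.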

WIP Implementation Details

TODO:
- lock file (folder?), generated at build-time (install-time?)
- `w3 build --lock`
- shallow vs recursive
- query-time functionality
- dynamic vs static URI implementation
- document benefits (speed up URI resolution, ensure packages will be the same)
- how does this affect redirects

Additional Research

pwvpwvpwvpwv commented 3 years ago

Introduction

Currently, there is no functional difference in how URIs using indirect addressing protocols, such as ENS, and URIs that use direct addressing protocols, such as IPFS (CIDs), are handled by Polywrap. This can cause an issue if the underlying wrapper is changed but the indirect URI remains the same, potentially breaking consuming apps, or causing unwanted behaviour. To prevent this class of issues from arising, Polywrap needs to somehow cache the underlying wrappers, regardless of what protocol end users use to address them, so that they are resolved the same way every time.

Background

Polywrap allows for the creation of "wrappers" which allow for easy access to Web3 protocols from calling code in any language that has an appropriate client. These wrappers can be uploaded to decentralized endpoints, such as ENS or IPFS addresses, or they can be packaged as "plugins" that are written in the same language as the client that will handle them and can be deployed locally, making use of URI redirects to query them in the same way as the usual wrappers.

While creating these wrappers, the functionality of other wrappers, either locally or externally situated, can be imported into the GraphQL schemata and made available for use. These imports are resolved at build time, and are marked as comments with the following syntax [20]:

| Import kind | Syntax |
| --- | --- |
| External Import | `import { Type, Query } into Namespace from "external.uri"` |
| Internal Import | `import { Type } from "./local/path/file.graphql"` |
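For illustration, a wrapper schema using this comment syntax might look like the following sketch. The namespace, URI, and the `Namespace_Type` naming convention for imported types are assumptions for the example, not taken from the docs:

```graphql
#import { Query, Token } into ERC20 from "w3://ens/erc20.wrapper.eth"
#import { CustomType } from "./common/types.graphql"

type Query {
  balanceOf(token: ERC20_Token!, owner: String!): String!
}
```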

Because these imports point to potentially dynamic files (as URIs are redirected and resolved), builds cannot be guaranteed to be deterministic.

NB: as an aside, this issue also exists on the 'client' side of Polywrap, though in a slightly different manner since those imports are known at runtime, as opposed to compile time, and thus pose a similar but ultimately non-trivial problem of their own.

Scope

Requirements

Prior Art

NPM & Yarn

In the JavaScript ecosystem, NPM and Yarn have become the mainstay package management solutions. Both employ a traditional registry/index + package manager architecture, with a declared dependency manifest (package.json), as well as a stored cache (/node_modules) and local index (package-lock.json or yarn.lock respectively). In general, we can look to the architecture of Yarn to see the overarching principles at play in such package managers:

1) Resolution: Yarn first resolves which dependencies already exist locally, which are missing, and which need to be downloaded (this is done based on a combination of things, including the known dependency tree and package metadata, such as requested SemVer ranges).
2) Fetch: Once the final dependency tree is ascertained, Yarn gathers the requisite dependencies, either from the registry or from various other sources (e.g. archive folders, the local filesystem, etc.).
3) Link: Finally, after all the packages have been downloaded, they are stored in the local filesystem, both as a cache for future fetch steps and for actual use in the final compiled binary. Yarn then interfaces between all the packages to provide the necessary APIs to each other (and to the final product); which, interestingly, means that packages can be written in different languages and have different runtimes.

(Taken from [17], cf. for greater detail).
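The three stages above can be sketched as a toy pipeline. Everything here is invented for illustration (the "registry" is a map, and version matching just picks the newest entry rather than doing real SemVer range matching):

```typescript
// Toy model of the resolve -> fetch -> link pipeline; shapes are invented.
interface Manifest {
  dependencies: Record<string, string>; // name -> requested version range
}

// 1) Resolution: decide which concrete versions satisfy the manifest.
function resolve(
  manifest: Manifest,
  registry: Record<string, string[]>
): Record<string, string> {
  const resolved: Record<string, string> = {};
  for (const [name, range] of Object.entries(manifest.dependencies)) {
    // Toy rule: take the newest published version; real resolvers match ranges.
    const versions = registry[name] ?? [];
    if (versions.length === 0) throw new Error(`unresolvable: ${name}@${range}`);
    resolved[name] = versions[versions.length - 1];
  }
  return resolved;
}

// 2) Fetch: plan downloads for anything not already cached locally.
function fetchMissing(
  resolved: Record<string, string>,
  cache: Set<string>
): string[] {
  return Object.entries(resolved)
    .map(([name, version]) => `${name}@${version}`)
    .filter((key) => !cache.has(key));
}

// 3) Link: expose the final package set to the build (here, just list it).
function link(resolved: Record<string, string>): string[] {
  return Object.keys(resolved).sort();
}
```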

Cargo

Inspired by the popularity, as well as the many warts and blemishes of the NPM ecosystem, the Rust community has created the Cargo package manager. Cargo operates [5] much in the same way as NPM/Yarn, solving similar problems in similar (though notably improved) manners.

Deno

What NPM, Yarn, and Cargo have in common, beyond their functionality and architecture, is their default orientation towards package sourcing. All three, and many others (e.g. Rubygems), belong to a class of package managers that rely on a registry. This contrasts them with a class of package management solutions that can be termed "registryless", such as those employed by Golang and Deno. In the particular case of Deno, dependencies are specified as URIs pointing to packages or individual files, allowing DNS resolution to act as a sort of registry without a registry.

This registryless package management strategy (NB: Deno technically does not have a standalone package manager, as it is considered part of the Deno build tool, like Cargo) relies on certain conventions, such as declaring the package version in the URI (e.g. https://deno.land/std@0.107.0/testing/asserts.ts), locking modules down in the cache (although individual modules/dependencies can be selectively reloaded as needed via deno cache --reload=<module>), runtime validation of lockfiles, and support for import maps [8].

Aside: Deno lockfiles only contain hashes of the imported files [8].
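For illustration, such a lockfile is essentially a JSON map from module URL to a hash of its contents; the hash values below are invented and truncated:

```json
{
  "https://deno.land/std@0.107.0/testing/asserts.ts": "52b4108c…",
  "https://deno.land/std@0.107.0/fmt/colors.ts": "a1b2c3d4…"
}
```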

As a convenience, Deno also allows for the use of a deps.ts file, which acts as a single point of import for all external dependencies, and as a single point of export as well — deps.ts defines which identifiers to import from the given URI, and then re-exports them so that they can be used elsewhere in the project:

export {
    assert,
    assertEquals,
    assertStrContains,
} from "https://deno.land/std@0.107.0/testing/asserts.ts";

Decentralized package managers

At immense convenience to us, the IPFS Package Managers Task Force, and Andrew Nesbitt in particular [16], have spent quite some time documenting the relationship between decentralization efforts and package managers. The amount of work and thought put into that document cannot be overstated, so I will leave it as-is — as a link worth reading. Interestingly, while much of the original document works towards integrating IPFS into existing package managers, much of the insight therein could just as well be applied to the goals of this project; in particular, given that the transfer of an existing package ecosystem is not within scope, any implementation discussed herein is free of that burden, and can instead focus solely on the problem of decentralized publishing as it is understood in the Web3 context.

One additional characteristic of decentralized package managers that must be considered, beyond the difficulties already presented, is the question of security. Certain research into registry/package management security has been conducted in recent years [2][3], but it remains an open problem with unique requirements given the decentralized approach expected of this project.

Proposals

In response to the problem stated above, three proposals have been drafted. Given the overlap between them, they can be considered separately, or all three can be implemented and offered at a developer's/end-user's discretion. Each has its own tradeoffs and considerations, so adopters should plan for the particular use cases that may arise for Polywrappers (in general, or for their own in particular). A useful approach might be to consider these three proposals by analogy to other technology, to better understand where each may best fit: the first is akin to traditional binary bundling or a vendored web application, and would have similar benefits and consequences (enumerated in part below); the second is closer to Dockerfiles or Docker Compose, in the sense that a set of requirements is specified and appropriate resources are gathered and orchestrated on-the-fly to meet them; the third, being a hybrid of the other two, is best suited to mobile or embedded environments, where memory and resources are heavily constrained.

1. Monolithic Bundle

Consider the traditional idea of an application bundled together with all of its dependencies (that is, "vendored"). Taking a traditional approach to package management in the Polywrap ecosystem would benefit from the many years of knowledge contained in the prior art on this subject, and would ostensibly prove very ergonomic for developers who are already used to engaging with traditional package managers (e.g. Yarn, or Pip). To that end, a tool that mimics the architecture and functionality of Rust's Cargo would serve to address the needs of the Polywrap ecosystem, doing so in a way that is already familiar to most of the developers that would be interacting with it (i.e. manifests, lockfiles, build pipelines, etc.).

Implementation

Such a tool would follow the three traditional package manager / build tool stages mentioned above:

1) Resolution
   1) Dependencies would be gathered, either from inline imports or a manifest (or a combination of the two)
   2) Dependencies would be resolved to their respective IPFS CID hashes and constructed into a dependency tree
      1) Dependencies which themselves specify dependencies should be able to reuse existing copies from elsewhere in the tree
   3) Once a uniform tree has been constructed, it can be flattened and encoded into a lockfile
      1) This lockfile would contain a mapping from declared dependency names to specific CID hashes, allowing developers to continue using ENS/registry-style URIs rather than remembering IPFS CIDs
2) Fetch
   1) A generated lockfile acts as a single source of truth for which IPFS-stored files to download
      1) Each CID is downloaded and stored in a local cache (if not already present) for use during build time
3) Link
   1) Downloaded dependencies are inlined (or otherwise "linked", vis-à-vis schema semantics) during compile time
      1) In existing terms, dependencies are either linked during binary building, as in Cargo, or treated like node_modules; in either case, they should be included with the final product
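The flattening step of the resolution stage can be sketched as follows; the `DepNode` shape and all names are hypothetical. Shared subtrees collapse naturally, because the same declared URI always maps to the same CID:

```typescript
// Sketch: flatten a resolved dependency tree into a lockfile map.
interface DepNode {
  uri: string;      // declared URI (ENS, registry, or direct)
  cid: string;      // resolved IPFS CID
  deps: DepNode[];  // transitive dependencies
}

function flattenToLockfile(roots: DepNode[]): Record<string, string> {
  const lock: Record<string, string> = {};
  const visit = (node: DepNode): void => {
    if (lock[node.uri] !== undefined) return; // reuse existing copies
    lock[node.uri] = node.cid;
    node.deps.forEach(visit);
  };
  roots.forEach(visit);
  return lock;
}
```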

On this understanding of a Polywrap package ecosystem, there would be three kinds of dependencies: direct links (e.g. IPFS CIDs); indirect links, such as ENS URIs; and registry links. Each would resolve to a single CID hash. Furthermore, once these hashes are resolved and stored in a lockfile, they would not be updated except when explicitly requested by the user — although no longer the case, there was a time when NPM would automatically update/re-resolve all of your packages when running npm install, which inevitably caused issues that Yarn and Cargo sought to prevent in their own designs.

Based on this information, we might imagine that an example wrapper manifest would declare some dependencies as such:

format: 0.0.X
repository: https://github.com/polywrap/monorepo
registry: https://example.link-to-polywrap.registry
# ...
dependencies:
  - 'package_name': '^1.2.3' # registry lookup
  - 'w3://ens/some.wrapper.eth'
  - 'w3://ens/different.wrapper.eth': 'custom_wrapper_rename' # import map a dependency name
  - 'w3://ipfs/bafybeigdyrzt5sfp7udm7hu76uh7y26nf3efuylqabf3oclgtqy55fbzdi'
  # ...

And a resulting lockfile may contain:

# ...
"package_name@1.2.8": bafybeierhgbz4zp2x2u67urqrgfnrnlukciupzenpqpipiz5nwtq7uxpx4
"w3://ens/some.wrapper.eth": bafybeierhgbz4zp2x2u67urqrgfnrnlukciupzenpqpipiynwgq6c5sovy
"w3://ens/different.wrapper.eth": bafybeierhgbz4zp2x2u67urqrgfnrnlukciupzenpqpipj62o2j2mnh6na
"w3://ipfs/bafybeigdyrzt5sfp7udm7hu76uh7y26nf3efuylqabf3oclgtqy55fbzdi": bafybeierhgbz4zp2x2u67urqrgfnrnlukciupzenpqpipj66dq6o5jefaa
# ...

Consequences

Vendoring

"Vendoring", or the act of bundling all of an application's dependencies along with it, has long been considered a best practice amongst many developers (particularly those dealing with desktop or enterprise paradigms), as it ensures that the final application will be guaranteed to have all of its dependencies readily available at runtime; though, at a certain cost to the end-user. These costs, namely of storage space, caching, and network-use volatility, constitute a key tradeoff in application deployment and package management. Vendoring requires more storage space for the final application, as it must include all of its dependencies as part of the final bundle (consider checking node_modules into Git, or Webpack bundles containing multiple files in addition to a main application file). And, because these dependencies are included in the application bundle, they are opaque to the end-user, and do not benefit from caching — an important consideration if the use cases of Polywrap will include running several wrappers in one runtime/session, with potentially overlapping or reusable dependencies. As a side effect of this lack of caching, end-user network requirements may potentially be increased, as the entire wrapper and associated dependencies must be downloaded in one go; however, this is not necessarily an all-around evil, as it ensures less volatile network conditions, since only one round-trip hop is needed to load full functionality.

Security

Because wrapper dependencies are vendored, they cannot be switched out or misrepresented to the end-user, preventing a large class of man-in-the-middle (MITM) style attacks. Further, given that IPFS hashes are unique, and indirect/registry type links are only resolved once / as requested, wrapper developers can trust that whatever level of security they expect from their dependencies at build time will be respected at runtime as well.

Dependencies

2. Hydratable Lockfile Only

Given the nature and particular peculiarities of IPFS, a certain optimization of the traditional approach is possible, at much lower risk than would usually be incurred: because IPFS CIDs are unique, immutable, and (relatively) permanent [15], an end-user can expect that whatever dependencies a wrapper was built with at any time in the past, they will be able to retrieve at any point in the future, without difficulty. Given these changed constraints, it is possible to diverge from the traditional approach without the usual risks of not vendoring one's dependencies. Rather than downloading and packaging all of the application's dependencies at build time, only the first stage (resolution) needs to be done at build time; the other two stages can be performed by the end-user at run time, given an appropriate lockfile describing what the wrapper depends upon. In short, because of certain properties of IPFS-based file storage, a single lockfile can contain all of the information necessary for an application to be reliably spun up and all of its dependencies gathered correctly.

Implementation

This approach should be implemented very similarly to approach (1). Once the lockfile is generated, however, no further work would need to be done, and it would be bundled with the final application in-place of any vendored dependencies. It would then be up to the consuming Polywrap SDK to correctly download and link/orchestrate the wrapper and its dependencies at run time, respecting the lockfile's declarations.
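What the consuming SDK would do at run time can be sketched as a "hydration" pass over the lockfile; `fetchCid` and the cache shape are stand-ins for illustration, not real Polywrap APIs:

```typescript
// Sketch: hydrate a wrapper's dependencies from its lockfile at run time.
type Lockfile = Record<string, string>; // declared URI -> IPFS CID

function hydrate(
  lock: Lockfile,
  fetchCid: (cid: string) => Uint8Array,
  cache: Map<string, Uint8Array>
): Map<string, Uint8Array> {
  for (const cid of Object.values(lock)) {
    // Cache hits skip the network entirely; this is the key benefit over
    // vendored bundles when multiple wrappers share dependencies.
    if (!cache.has(cid)) cache.set(cid, fetchCid(cid));
  }
  return cache;
}
```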

Consequences

By offloading the work of dependency fetching and linking to the wrapper runtime/SDK, end-users can expect to receive the benefits of caching and smaller application bundles, at the cost of more volatile/unpredictable network usage, and a potentially increased attack surface for certain classes of security vulnerabilities.

Caching & Bundle Size

Because dependencies do not need to be vendored on this understanding, wrapper developers can save on storage space by uploading smaller bundles, which in turn take fewer resources to download for end-users. This comes at the cost of end-users having to download multiple packages from potentially varied endpoints, increasing total network use. This, of course, is well-mitigated by existing caching strategies, and is well-suited to use cases where end-users can expect to be downloading and maintaining a cache of many reusable dependencies.

Security

Conversely to the security benefits of dependency vendoring, allowing/requiring end-users to download the packages themselves potentially opens them up to MITM-style attacks (especially if a malicious party changes the lockfile, possibly en-route, to point to their own resources). This can be resolved in various traditional ways, such as requiring SSL connections for all downloads, and employing some sort of tamper-proofing for lockfiles, although the exact implementation of such a thing will vary depending on the balance between security and flexibility (vis-à-vis the next section).

URI Redirects

Interestingly, this approach allows for greater flexibility in employing URI redirects on the client side: because all dependencies are known ahead of time via the lockfile, end-users can choose to enforce URI redirects for just certain dependencies. This, of course, carries its own kind of caveat emptor implications, but would nevertheless allow for increased customizability, wherein end-users could substitute specific packages that they prefer for differing ones that a wrapper might expect, to no knowledge and at no expense to the wrapper developers (save the potential headache of debugging end-user complaints later).
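Layering per-dependency redirects over a lockfile can be sketched as below; the shapes are hypothetical, and real client redirect configuration may differ:

```typescript
// Sketch: client-side redirects win over locked CIDs, but only for the
// URIs they explicitly name.
type LockMap = Record<string, string>; // declared URI -> CID

function applyRedirects(
  lock: LockMap,
  redirects: Record<string, string>
): LockMap {
  const effective: LockMap = { ...lock };
  for (const [uri, cid] of Object.entries(redirects)) {
    // Only substitute dependencies the wrapper actually declares.
    if (uri in effective) effective[uri] = cid;
  }
  return effective;
}
```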

Dependencies

3. Server-side Rendering

One possible issue involved with the two approaches described above pertains to resource constraints. Namely, where network or memory use is limited or undesirable, it can be inconvenient to stream large bundles from far away, or to gather many resources and cache them into memory. Recent work on addressing these kinds of resource constraints, particularly as they relate to network use and resources on lower-end systems, seems to point towards performing some (or all) of the required effort on an intermediary server, between the end-user and the resource origin [22]. By analogy, one might imagine the case of server-side rendering speeding up dynamic web applications for mobile users. This sort of approach, which would involve "rendering" servers gathering dependencies on-the-fly and serving final application/wrapper packages to certain opt-in users, can provide CDN-like performance benefits to end-users who do not otherwise wish to store large caches (as per approach (2)), but without the developer-side storage costs associated with approach (1). This particular architecture can act as a sort of middle-ground between approach (1) and approach (2), and may be something that would be available as an on-demand service for end-users through their Polywrap SDK.

Implementation

Assuming that approaches (1) and (2) are feasible and have been implemented, additional servers can be spun up which act as CDNs of a sort, taking requests for wrappers and gathering the appropriate resources/dependencies themselves (potentially from their own local cache, especially if they have already seen requests for that wrapper before) before serving a final bundle to end-users.
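The rendering server's request path can be sketched as a simple cache-or-build step; `buildBundle` stands in for the fetch-and-link work described above:

```typescript
// Sketch: serve a cached bundle when one exists, otherwise assemble it
// on-the-fly and cache the result for future requests.
function serveWrapper(
  uri: string,
  cache: Map<string, string>,
  buildBundle: (uri: string) => string
): string {
  const hit = cache.get(uri);
  if (hit !== undefined) return hit; // CDN-style fast path
  const bundle = buildBundle(uri);   // gather deps, link, package
  cache.set(uri, bundle);
  return bundle;
}
```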

Consequences

Storage

As mentioned, this approach offers the speed of a CDN cache by offloading the cost of storing dependency bundles from the wrapper developers onto the server hosts. These server hosts can then determine appropriate caching strategies that they believe to be most cost-effective, providing a middle ground between keeping track of common wrappers and calling out to IPFS to gather less-common wrappers and dependencies.

Dependencies

Alternatives Considered

1. Registryless

Go and Deno (among others) provide strong evidence in favor of the potential for an effective, registryless package management solution. Both tools rely on DNS resolution to locate and resolve dependencies, as opposed to a registry, and thus gain several meaningful benefits (as discussed above) in the process. This approach would seem to work well in concert with the goals of IPFS and ENS in particular; however, given that Polywrap already intends to provide for a package registry, this particular approach is out-of-scope.

2. Filesystem-based Package Management

Out-of-the-box, Rubygems and Pip do not produce lockfiles or dependency manifests. Instead, they resolve individually-specified dependencies and download their sources into the local filesystem. From there, those downloaded packages are added (either individually or as an entire folder) into the load path of the respective ecosystem (Ruby loads each package's lib folder into the main load path, and Python provides the entire package-containing folder as one of the paths searched at runtime when resolving imports). While this lockless approach ensures determinism between builds on the same system (and subsequently deterministic output for vendored builds), it does nothing to ensure deterministic builds across different systems, short of copying over all of the associated files. This, of course, could potentially be paired with a Docker-style approach of loading system state from a previous "image"; but, such a solution seems overly complex in the face of the parsimony of instead using lockfiles and similar, existing dependency locking semantics.

Unknowns

General Registry Security

While the exact implementation of the registry is out-of-scope for this proposal, certain features thereof may influence the choices made/required for an effective and secure package management solution. In particular, registry implementation details may expect reciprocal functionality in their package manager: given advances in the field of package registries [2], and recently proposed solutions to package registry security involving blockchain-like technologies [10], the integration between package manager and registry may necessitate some understanding of the security interests of each, as well as those of the end-user (independent of the package ecosystem itself).

References

dOrgJelli commented 2 years ago

Some notes from protocol sync:

# Goals:
# - support dep locking
# - support optimized caching
# - support hash verification

type UriInfo {
  supported: Boolean!
  static: Boolean!
  verifier: String!
}

type Query {
  getUriInfo(
    uri: String!
  ): UriInfo!

  tryResolveUri(
    authority: String!
    path: String!
  ): MaybeUriOrManifest

  getFile(
    path: String!
  ): Bytes
}

type MaybeUriOrManifest {
  uri: String
  manifest: String
}

## dep locking
# -> ens/some-wrapper.eth
#    -> ipfs/QmHASH + verifier = ens/verifier.eth => ipfs/QmHASH

# TODO: look into threat model for malicious resolvers
#       (ipfs gateway for ex)
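The dep-locking idea sketched in these notes could look roughly like the following: a locked entry keeps both the resolved CID and a verifier URI, and before use the verifier is re-queried and must agree with the locked CID. All names and shapes here are hypothetical:

```typescript
// Sketch: verifier-backed dep locking. A malicious resolver (e.g. a
// compromised IPFS gateway) surfaces as a mismatch between the verifier's
// answer and the locked CID.
interface LockedDep {
  uri: string;      // e.g. "ens/some-wrapper.eth"
  cid: string;      // locked resolution, e.g. "ipfs/QmHASH"
  verifier: string; // e.g. "ens/verifier.eth"
}

function verifyLockedDep(
  dep: LockedDep,
  queryVerifier: (verifierUri: string, uri: string) => string
): boolean {
  return queryVerifier(dep.verifier, dep.uri) === dep.cid;
}
```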
cbrzn commented 1 year ago

the next steps of this have been developed here: https://hackmd.io/U_kjfaAmQ9Km4ZMCzJboDA (still wip)