wasmCloud / wasmcloud-otp

wasmCloud host runtime that leverages Elixir/OTP and Rust to provide simple, secure, distributed application development using the actor model
Apache License 2.0

Actors cannot invoke providers on different hosts #183

Closed: brooksmtownsend closed this issue 3 years ago

brooksmtownsend commented 3 years ago

This is the report from the spike #148, where a KVCounter actor cannot invoke the KeyValueAdd operation with a wasmcloud:keyvalue provider running on a different host.

The cause of this issue is actually fairly simple. It comes from this line: https://github.com/wasmCloud/wasmcloud-otp/blob/main/host_core/lib/host_core/web_assembly/imports.ex#L156

TL;DR: When determining the target of an invocation, if the target is a provider, we query the local provider table, which only contains providers running on that host. We do this because at the invocation level the namespace is the contract ID and the binding is the link name. We don't want to hardcode ourselves into invoking a provider public key, so this is by design.

The fix, however, could be complicated. Essentially, when invoking a provider we need to know its public key so we can publish to the correct NATS topic for invocations. If the provider is running on the local wasmCloud host, then it's a simple cache lookup. If it's running on a remote wasmCloud host, then we need a way to find the public key of a running provider based on the contract ID and the link definition.
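To make the failure mode concrete, here is a minimal sketch in Python (not the host's actual Elixir code). It assumes each host keeps its own provider table keyed by contract ID and link name; the function name, table shape, and public keys are all hypothetical.

```python
from typing import Dict, Optional, Tuple

# Hypothetical shape: (contract_id, link_name) -> provider public key.
ProviderTable = Dict[Tuple[str, str], str]


def lookup_local_provider(
    table: ProviderTable, contract_id: str, link_name: str
) -> Optional[str]:
    """Return the public key of a provider running on this host, if any."""
    return table.get((contract_id, link_name))


# Host A runs the KVCounter actor but no providers.
host_a_providers: ProviderTable = {}
# Host B runs the key-value provider (made-up public key).
host_b_providers: ProviderTable = {
    ("wasmcloud:keyvalue", "default"): "VKVPROVIDER",
}

# The actor on host A only ever consults host A's table, so the lookup
# misses even though a matching provider is running on host B.
found_on_a = lookup_local_provider(host_a_providers, "wasmcloud:keyvalue", "default")
found_on_b = lookup_local_provider(host_b_providers, "wasmcloud:keyvalue", "default")
```

Here `found_on_a` is `None`, which corresponds to the invocation failing on host A, while the same lookup succeeds on host B.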

I see a few possible resolutions, and I'm looking for some clarity on them (cc @autodidaddict @stevelr):

  1. Add the contract ID into the claims information that is stored with a provider. I believe the capid is already stored in the claims information, so we'd simply need to add it to the claims cache that JetStream manages, and then query that cache for the proper public key.
  2. Add a separate providers cache in JetStream (feels bad).

If I'm correct about option 1, then it should be fairly simple, but the shape of the claims map in the cache will change, so it might take a bit of care to implement.

autodidaddict commented 3 years ago

Changing the shape of the cache is only a problem for durable caches. If we change the shape of the cache, all we really need to do is restart NATS (or purge the stream if you're using a disk-backed one).

At first glance, option 1 looks nice, but I think there are some problems with it. Consider a scenario with a Redis provider and a Cassandra provider that both operate on the default link name. If we blindly search the claims cache, then even with the contract ID we won't be able to tell which of the providers is the right one. The only real source of truth here is the link definition, which is actually how 0.18 did it.
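A small Python sketch of that ambiguity, assuming a claims cache that maps each provider public key to its claims and that both providers implement the same contract. The cache shape, the field name, and the keys are hypothetical.

```python
from typing import Dict, List

# Hypothetical claims cache: provider public key -> claims.
claims_cache: Dict[str, Dict[str, str]] = {
    "VREDIS": {"contract_id": "wasmcloud:keyvalue"},
    "VCASSANDRA": {"contract_id": "wasmcloud:keyvalue"},
}


def providers_for_contract(
    claims: Dict[str, Dict[str, str]], contract_id: str
) -> List[str]:
    """Return every provider key whose claims carry the given contract ID."""
    return sorted(
        key for key, claim in claims.items() if claim["contract_id"] == contract_id
    )


# Both providers match, and nothing in the claims says which one the
# actor is actually linked to.
candidates = providers_for_contract(claims_cache, "wasmcloud:keyvalue")
```

With two entries in `candidates`, the claims alone cannot pick the right target.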

What I think we should do is search the link definition cache for the contract ID and link name bound to that actor. If we find such a link definition, it gives us all the information we need to construct the outbound topic. If the provider is offline, then the call will time out.
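A rough Python sketch of that lookup, under stated assumptions: the `LinkDefinition` fields, the function names, and the topic layout are illustrative guesses, not the host's actual data structures or topic scheme.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class LinkDefinition:
    # Hypothetical fields for illustration.
    actor_id: str
    provider_id: str
    contract_id: str
    link_name: str


def find_provider_key(
    linkdefs: Iterable[LinkDefinition],
    actor_id: str,
    contract_id: str,
    link_name: str,
) -> Optional[str]:
    """Return the provider public key linked to this actor, or None."""
    for ld in linkdefs:
        if (ld.actor_id, ld.contract_id, ld.link_name) == (
            actor_id,
            contract_id,
            link_name,
        ):
            return ld.provider_id
    return None


def provider_topic(lattice_prefix: str, provider_id: str, link_name: str) -> str:
    """Build an outbound topic; this layout is an assumption."""
    return f"wasmbus.rpc.{lattice_prefix}.{provider_id}.{link_name}"


linkdefs = [
    LinkDefinition("MKVCOUNTER", "VREDIS", "wasmcloud:keyvalue", "default"),
    LinkDefinition("MOTHERACTOR", "VCASSANDRA", "wasmcloud:keyvalue", "default"),
]

key = find_provider_key(linkdefs, "MKVCOUNTER", "wasmcloud:keyvalue", "default")
topic = provider_topic("default", key, "default") if key else None
```

Because the actor ID is part of the match, two providers sharing a contract ID and link name no longer collide, and the lookup works the same whether the provider runs locally or on a remote host.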

autodidaddict commented 3 years ago

Given that all we really need to do is find a contract ID + link name + actor ID match in the link definitions cache, fixing this problem should be fairly straightforward.

autodidaddict commented 3 years ago

BTW thanks for digging into this. :1st_place_medal: