Enable Cosmos to Solomachine relaying

## Summary

This issue tracks the work needed for the new relayer to support relaying between a Cosmos chain and a solomachine chain.

## MVP Tasks
- [ ] informalsystems/hermes-sdk#30
- [x] informalsystems/hermes#2780
- [x] informalsystems/hermes#2925
- [x] informalsystems/hermes#2840

## Follow up Tasks
- [ ] informalsystems/hermes-sdk#11

## For Admin Use

- [ ] Not duplicate issue
- [ ] Appropriate labels applied
- [ ] Appropriate milestone (priority) applied
- [ ] Appropriate contributors tagged
- [ ] Contributor assigned/self-assigned

A possible challenge for performant solomachine relaying is that the solomachine client on a Cosmos chain uses a sequence as a nonce for proof generation, where a proof is essentially a digital signature.

The problem is that when constructing IBC proofs, the solomachine needs to specify the exact sequence that is stored in the client state when signing the payload. Furthermore, the solomachine is not allowed to double-sign two proofs with the same sequence.
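The exact-sequence requirement can be sketched as follows. This is a simplified model rather than the ibc-go implementation: the `ClientState`, `SignedProof`, and `verify_and_advance` names are hypothetical, signature verification is omitted, and the assumption that the stored sequence advances by one after each accepted proof is part of the model.

```rust
/// Simplified, hypothetical model of the solomachine client state stored on the Cosmos chain.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ClientState {
    pub sequence: u64,
    pub diversifier: String,
}

/// The payload the solomachine signs. The signature bytes are omitted in this sketch.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SignedProof {
    pub sequence: u64,
    pub diversifier: String,
    pub data: Vec<u8>,
}

#[derive(Debug, PartialEq, Eq)]
pub enum VerifyError {
    SequenceMismatch { expected: u64, got: u64 },
    DiversifierMismatch,
}

/// Accepts a proof only if it was signed at exactly the stored sequence,
/// then advances the sequence so the same nonce cannot be used again.
pub fn verify_and_advance(
    state: &mut ClientState,
    proof: &SignedProof,
) -> Result<(), VerifyError> {
    if proof.sequence != state.sequence {
        return Err(VerifyError::SequenceMismatch {
            expected: state.sequence,
            got: proof.sequence,
        });
    }
    if proof.diversifier != state.diversifier {
        return Err(VerifyError::DiversifierMismatch);
    }
    state.sequence += 1;
    Ok(())
}
```

In this model, a proof signed at sequence 5 is accepted only while the client state is at 5. Submitting it again, or submitting a second proof that was also built against 5, fails with `SequenceMismatch`.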

This significantly limits how we can relay concurrently from a solomachine to a Cosmos chain. In theory, we can try to construct two IBC messages in parallel with increasing sequences. However, this imposes an order on which message must be sent before the other once the messages are collected to be sent in a separate async task via channels.

Note that this problem is similar to the ordering requirement for relaying packets in ordered channels. So it is not just an issue for relaying solomachine packets. We just haven't figured out whether there is any way to support concurrent message construction while also imposing ordering on the messages being sent.

A naive workaround is to introduce a global lock so that only one packet can be relayed at a time. For solomachine relaying, the lock applies to the entire relay context from the solomachine to the Cosmos chain. This would significantly limit the relaying performance of the solomachine: regardless of how many channels there are, the solomachine can send at most one IBC message to the Cosmos chain at each height.

We will investigate whether this limitation can be relaxed in future issues, after the initial solomachine implementation is completed.
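A minimal sketch of the global-lock workaround, using `std::sync::Mutex` and plain threads in place of the relayer's async tasks. `RelayContext`, `relay_packet`, and `relay_concurrently` are hypothetical names for illustration, and signing and submission are reduced to recording the packet under its assigned sequence.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical shared state for one solomachine-to-Cosmos relay direction.
pub struct RelayContext {
    next_sequence: u64,
    submitted: Vec<(u64, String)>,
}

/// Relays one packet while holding the global lock, so that assigning the
/// sequence, signing, and submitting happen as one uninterrupted step.
pub fn relay_packet(ctx: &Mutex<RelayContext>, packet: &str) -> u64 {
    let mut guard = ctx.lock().expect("relay lock poisoned");
    let sequence = guard.next_sequence;
    // Stand-in for "sign the proof at `sequence` and submit the message".
    guard.submitted.push((sequence, packet.to_string()));
    guard.next_sequence += 1;
    sequence
}

/// Spawns one worker per packet. The lock serializes the workers, so the
/// submissions come out in strictly increasing sequence order.
pub fn relay_concurrently(packets: Vec<String>) -> Vec<(u64, String)> {
    let ctx = Arc::new(Mutex::new(RelayContext {
        next_sequence: 1,
        submitted: Vec::new(),
    }));
    let handles: Vec<_> = packets
        .into_iter()
        .map(|packet| {
            let ctx = Arc::clone(&ctx);
            thread::spawn(move || relay_packet(&ctx, &packet))
        })
        .collect();
    for handle in handles {
        handle.join().expect("relay thread panicked");
    }
    let submitted = ctx.lock().expect("relay lock poisoned").submitted.clone();
    submitted
}
```

The workers run in parallel only up to the point where they take the lock, which is the throughput cost of this approach.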

We also uncovered a potential security vulnerability in the solomachine implementation that is related to misbehavior handling. Essentially, a solomachine client can be considered to be misbehaving if two signed proofs are found to have the same sequence. This can be very punishing, especially when recovering from network or full node failures.

The fundamental problem is that once a solomachine signs a proof at a given sequence, it must be fully committed to submitting that proof to the Cosmos chain regardless of possible failure. However, after an IBC message is constructed, there can be many ways the submission can fail in transit. Although the relayer may attempt to retry in such cases, it still does not rule out the case when the relayer itself has crashed, or if the signed data is somehow invalid.

If a signed message fails to deliver, the solomachine is in the tricky position of having to work out which message failed to send and retry it. If the solomachine decides to send a different message instead, it risks being treated as misbehaving, in case the previous message somehow ended up in the mempool or was picked up by an adversary.

## Mitigation

To protect the solomachine from being marked as misbehaving, a potential mitigation is to always perform UpdateClient with a new diversifier after an IBC message is sent. This works because the misbehavior logic is based on the current diversifier of the client state: even if misbehavior evidence is found for a previous diversifier, it would be rejected by the Cosmos chain.
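The effect of rotating the diversifier can be illustrated with a simplified misbehavior check. This is a sketch of the rule as described here, not the ibc-go code: `SignedPayload` and `is_misbehavior` are hypothetical names, and signature verification is left out.

```rust
/// A signed payload as it might appear in misbehavior evidence.
/// The signature bytes are omitted in this sketch.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SignedPayload {
    pub sequence: u64,
    pub diversifier: String,
    pub data: Vec<u8>,
}

/// Evidence counts as misbehavior only if both payloads share a sequence,
/// carry different data, and were signed under the client's current diversifier.
pub fn is_misbehavior(current_diversifier: &str, a: &SignedPayload, b: &SignedPayload) -> bool {
    let same_sequence = a.sequence == b.sequence;
    let conflicting = a.data != b.data;
    let under_current_diversifier =
        a.diversifier == current_diversifier && b.diversifier == current_diversifier;
    same_sequence && conflicting && under_current_diversifier
}
```

With two conflicting payloads signed at sequence 7 under diversifier `"d1"`, the check returns `true` while the client state still holds `"d1"` and `false` once an UpdateClient has moved it to `"d2"`.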

Note that this mitigation is not perfect, as it is possible for the UpdateClient itself to fail. However, the chance of misbehaving is reduced if there is a deterministic way to generate the next diversifier, and if the solomachine always includes an UpdateClient with each IBC message. Then even if the solomachine regenerates the UpdateClient message from scratch, it should be exactly the same as the previous attempt. In this way, we can avoid a case of misbehavior evidence arising from two different signed UpdateClients at the same sequence.
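One way to make the regenerated UpdateClient identical is to derive the next diversifier purely from data the solomachine already has. The sketch below assumes a hypothetical scheme that combines a fixed prefix with the sequence. The scheme and the `UpdateClientMsg`, `next_diversifier`, and `build_update_client` names are illustrative assumptions, not something defined by this issue or by ibc-go.

```rust
/// Hypothetical, simplified UpdateClient message for a solomachine.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct UpdateClientMsg {
    pub sequence: u64,
    pub new_diversifier: String,
}

/// Derives the next diversifier from the prefix and the sequence alone,
/// so that no randomness or wall-clock time enters the message.
pub fn next_diversifier(prefix: &str, sequence: u64) -> String {
    format!("{}-{}", prefix, sequence + 1)
}

/// Builds the UpdateClient for a given sequence. Calling this again after a
/// crash yields a message that is equal field by field to the first attempt.
pub fn build_update_client(prefix: &str, sequence: u64) -> UpdateClientMsg {
    UpdateClientMsg {
        sequence,
        new_diversifier: next_diversifier(prefix, sequence),
    }
}
```

Because both attempts at the same sequence produce the same message, the retry signs the same payload rather than a conflicting one. This only holds if the signing scheme itself is deterministic as well.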

Even with that, it is still possible for misbehavior to be reported, since the IBC message that immediately follows an UpdateClient inside a transaction may be different. To avoid that, we must also append another UpdateClient message to the end of the transaction. In this way, no adversary can report misbehavior based on a different signed IBC message sandwiched between the two UpdateClients.
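The sandwiched transaction layout could look like the sketch below. It assumes that each signed message consumes one sequence and that diversifiers are derived deterministically from the sequence. The `Msg` enum and the `diversifier_for` and `sandwich` helpers are hypothetical.

```rust
/// Hypothetical message kinds inside one transaction sent to the Cosmos chain.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Msg {
    UpdateClient { sequence: u64, new_diversifier: String },
    Ibc { sequence: u64, diversifier: String, payload: String },
}

/// Assumed deterministic diversifier scheme (illustrative only).
pub fn diversifier_for(sequence: u64) -> String {
    format!("div-{}", sequence)
}

/// Wraps one IBC message between two UpdateClients. The IBC message is signed
/// under the diversifier set by the first update, and the last update rotates
/// the diversifier again before the transaction ends.
pub fn sandwich(start_sequence: u64, payload: &str) -> Vec<Msg> {
    let first = diversifier_for(start_sequence);
    let last = diversifier_for(start_sequence + 2);
    vec![
        Msg::UpdateClient { sequence: start_sequence, new_diversifier: first.clone() },
        Msg::Ibc {
            sequence: start_sequence + 1,
            diversifier: first,
            payload: payload.to_string(),
        },
        Msg::UpdateClient { sequence: start_sequence + 2, new_diversifier: last },
    ]
}
```

Two attempts that start at the same sequence with different payloads differ only in the middle message. That message is signed under a diversifier that is no longer current once the transaction has executed.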

## Long term fix

All in all, the misbehavior mechanics for solomachine clients are unintuitive and vulnerable to abuse. Misbehavior detection depends only on the current diversifier set in the client state, making it impossible to report authentic past misbehavior. Furthermore, the system makes it very difficult for solomachines to recover from failure without risking being reported as misbehaving. The solomachine sequence works much like regular account sequences, but the penalty for having two different signed messages at the same sequence is much more severe than for regular transactions.

To fix that, there are several ways to improve the solomachine design.

The simplest way to fix the false misbehavior reports is to disable misbehavior detection for solomachines altogether. After all, the current system is more likely to falsely flag accidental misbehavior and to miss authentic misbehavior. Instead, the system should allow the solomachine to voluntarily freeze the IBC client in case its private key is stolen. It should also remain possible to freeze a client using a previous key that was changed recently, to address the case where an attacker steals a solomachine private key and then performs UpdateClient to change it to a different key.

Secondly, the system should accept proofs whose solomachine signature carries a sequence higher than the current sequence stored in the client state. This ensures that if a solomachine fails to send a signed message for some reason, it can skip past the sequence at which it already generated a message and use a higher sequence for the retry.
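The proposed relaxation amounts to replacing an equality check with a lower bound. The following is a sketch of the proposal only, not of current chain behavior. `verify_relaxed` is a hypothetical name, and it returns the sequence the client state would hold after accepting the proof.

```rust
/// Accepts any proof signed at or above the current sequence and returns the
/// sequence the client state would move to. Lower sequences are rejected so
/// that an already-consumed nonce can never be replayed.
pub fn verify_relaxed(current_sequence: u64, proof_sequence: u64) -> Result<u64, String> {
    if proof_sequence < current_sequence {
        return Err(format!(
            "stale sequence {}: client state is already at {}",
            proof_sequence, current_sequence
        ));
    }
    Ok(proof_sequence + 1)
}
```

For example, if the solomachine signed messages at sequences 5 through 7 that never reached the chain, it could retry at sequence 8 while the client state is still at 5, and the state would move to 9.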

## Implementation Strategy

In the initial implementation of solomachine relaying covered by this issue, we will not handle the potential misbehavior vulnerability described in this comment. Instead, we will just perform naive signing of proofs and retry on error.

As a simple mitigation, solomachine relaying will always include an UpdateClient with a new diversifier for every IBC message sent. However, we won't attach an UpdateClient to the end of a transaction, as that would require special-case handling in the relayer logic.

It would be worthwhile to discuss with the IBC-go team to investigate if there is any better way to protect a solomachine client from misbehaving in the current implementation.

Note that the above concern about misbehavior rests on an assumption: the relayer manages proof and signature generation in a stateless way on behalf of the concrete solomachine. In our design, the concrete solomachine does not need to worry about how to generate and sign proofs, or how to persist them. This makes it significantly easier to implement custom solomachines, but the burden of correctness is on the relayer.

An alternative design is to have the solomachine manage its own proof generation and persistence, and have the relayer merely poll and forward the proofs to the target chain. However, this would require the concrete solomachine to implement much of the functionality of the relayer and IBC stack, acting almost like a single-node blockchain.

Although this alternative design would eliminate the issue of double signing, the solomachine can easily suffer from irrecoverable errors. Consider the case where the solomachine generates and persists an invalid proof. The relayer would then be unable to submit that particular proof to the target chain successfully. However, the relayer also cannot skip that sequence and continue relaying the next one, as the Cosmos chain only accepts proofs with the exact sequence. As a result, the only way to resolve the deadlock is to request the concrete solomachine to regenerate all proofs, which in turn re-introduces the possibility of double signing.
