mit-dci / opencbdc-tx

A transaction processor for a hypothetical, general-purpose, central bank digital currency
Other
895 stars 200 forks source link

Part of "Run the Code" section in README.md appears broken #186

Closed mszulcz-mitre closed 1 year ago

mszulcz-mitre commented 1 year ago

Affected Branch

trunk

Basic Diagnostics

Description

In the "Run the Code" section of README.md, the following step hangs:

  • Send currency from the first wallet to the second wallet created in the previous step (e.g., 30 atomic units of currency)
    # ./build/src/uhs/client/client-cli 2pc-compose.cfg mempool0.dat wallet0.dat send 30 <wallet address>

    where <wallet address> is the address returned from running the client-cli with the newaddress keyword.

I followed the instructions in the sections "Launch the System" and "Setup test wallets and test them" for the 2PC architecture. The instructions are:

  1. Run the System
    # docker compose --file docker-compose-2pc.yml up --build
  2. Launch a container in which to run wallet commands (use --network atomizer-network instead of --network 2pc-network if using the atomizer architecture)
    # docker run --network 2pc-network -ti opencbdc-tx /bin/bash
    • Mint new coins (e.g., 10 new UTXOs each with a value of 5 atomic units of currency)
      # ./build/src/uhs/client/client-cli 2pc-compose.cfg mempool0.dat wallet0.dat mint 10 5
      [2021-08-17 15:11:57.686] [WARN ] Existing wallet file not found
      [2021-08-17 15:11:57.686] [WARN ] Existing mempool not found
      4bc23da407c3a8110145c5b6c38199c8ec3b0e35ea66bbfd78f0ed65304ce6fa
  • Inspect the balance of a wallet

    # ./build/src/uhs/client/client-cli 2pc-compose.cfg mempool0.dat wallet0.dat info
    Balance: $0.50, UTXOs: 10, pending TXs: 0
  • Make a new wallet

    # ./build/src/uhs/client/client-cli 2pc-compose.cfg mempool1.dat wallet1.dat newaddress
    [2021-08-17 15:13:16.148] [WARN ] Existing wallet file not found
    [2021-08-17 15:13:16.148] [WARN ] Existing mempool not found
    usd1qrw038lx5n4wxx3yvuwdndpr7gnm347d6pn37uywgudzq90w7fsuk52kd5u
  • Send currency from the first wallet to the second wallet created in the previous step (e.g., 30 atomic units of currency)

    
    # ./build/src/uhs/client/client-cli 2pc-compose.cfg mempool0.dat wallet0.dat send 30 usd1qrw038lx5n4wxx3yvuwdndpr7gnm347d6pn37uywgudzq90w7fsuk52kd5u

Code of Conduct

mszulcz-mitre commented 1 year ago

@ykaravas repeated the steps above and also observed that the send-currency step hangs.

kylecrawshaw commented 1 year ago

I've also observed this recently when running in a non-replicated configuration. This issue does not seem to occur when running RAFT components in a replicated configuration.

HalosGhost commented 1 year ago

I can also confirm running into this. @kylecrawshaw it's interesting that it doesn't occur in replicated configurations. Does anyone have any log output that would suggest where the issue is? At least for me, it seems to result in the components hanging as well (so no logs following the hang); is that true for everyone else?

mszulcz-mitre commented 1 year ago

This commit seems to have caused the break:
b9a259718f1ccc4c0283fc8fbd7d9e1922640a9f

This previous commit is the last one for which the send-currency step succeeds: cb0e78df568652c58b270ec7ecfbfada0617fc8f

HalosGhost commented 1 year ago

CC @metalicjames and @wadagso-gertjaap.

The commit that appears to have broken this process (b9a2597) really only has two effects: changing the level at which a lot of init-failure messages are logged, and removing early-returns which caused components to shutdown.

It seems fair to rule out the log-level being related. So, the most likely issue is that one of the components is failing to init properly, but doesn't recover (in a non-replicated environment), resulting in the send step hanging.

In a strange development, this issue is also not reproducible for me using the pre-built docker images from ghcr.io, only locally-built images.

Digging in a bit deeper, I was finally able to see the specific failure:

tx-sentinel0-1     | [2022-09-29 16:10:29.923] [WARN ] Failed to start coordinator client

Namely, the sentinel is failing to connect to the coordinator and at this point prints nothing further. Because the early-returns were removed, this failure leads to the system no longer being able to handle transactions at all because the sentinel (despite not fatally-erroring out) is non-functional.

This is also supported by the fact that the mint transaction still completes successfully (which, at the moment, bypasses the sentinel layer entirely). The specific failure appears to be this conditional (given the error message). However, reverting just that conditional does not result in a working environment (because now the sentinel just fatally-errors, which isn't surprising).

Should this be another place where we put an explicit init-retry to give the sentinel multiple chances to setup its coordinator client correctly?

mszulcz-mitre commented 1 year ago

@HalosGhost Allowing the sentinel controller to retry initializing the coordinator client makes the send command work as expected. Here's the code I used to replace the conditional you pointed out:

        static constexpr auto retry_delay = std::chrono::seconds(1);
        while(!m_coordinator_client.init()) {
            m_logger->warn("Failed to start coordinator client.  Retrying...");
            std::this_thread::sleep_for(retry_delay);
        }

Here's the result:

root@da310df40f5a:/opt/tx-processor# ./build/src/uhs/client/client-cli 2pc-compose.cfg mempool0.dat wallet0.dat mint 10 5 
[2022-09-29 17:16:57.011] [WARN ] Existing wallet file not found
[2022-09-29 17:16:57.011] [WARN ] Existing client file not found
4d5375705720de249b700cacb28fc6217141a77ca85d728f79754590d2a46b29
root@da310df40f5a:/opt/tx-processor# ./build/src/uhs/client/client-cli 2pc-compose.cfg mempool0.dat wallet0.dat info
Balance: $0.50, UTXOs: 10, pending TXs: 0
root@da310df40f5a:/opt/tx-processor# ./build/src/uhs/client/client-cli 2pc-compose.cfg mempool1.dat wallet1.dat newaddress
[2022-09-29 17:17:43.320] [WARN ] Existing wallet file not found
[2022-09-29 17:17:43.321] [WARN ] Existing client file not found
usd1qz3mm8x3rgzdstcpuvmlh22270qvtg2laxxd2pxn2rwyvnjs9s9tknj33nl
root@da310df40f5a:/opt/tx-processor# ./build/src/uhs/client/client-cli 2pc-compose.cfg mempool0.dat wallet0.dat send 30 usd1qz3mm8x3rgzdstcpuvmlh22270qvtg2laxxd2pxn2rwyvnjs9s9tknj33nl
tx_id:
de7bff2a620231a313f9c2ac940f721d6ecdc8ad65e3c16dcd2ca030c2a8d29a
Data for recipient importinput:
de7bff2a620231a313f9c2ac940f721d6ecdc8ad65e3c16dcd2ca030c2a8d29a00000000000000006cb2356962cbae73edfcb137e7fef19ced8f67342f0fb09b6ae970681702ee1e1e00000000000000
Sentinel responded: Confirmed
HalosGhost commented 1 year ago

We probably don't want infinite-retry… though maybe that's actually the right choice. In other places, we tend to put a delayed-retry loop (just like you have above), but with a retry-threshold (after which it would probably fatally-error out).

@metalicjames thoughts?

mszulcz-mitre commented 1 year ago

This issue seems related to Issue #127, Issue #158, and Pull Request #157. All involve the question, "What should be done when a controller fails to initialize a component: log a warning and keep going or retry the initialization some number of times?". Retrying seems to solve the problem here and in Issue #158--and potentially at least one aspect of #127. But I don't know enough about all the code to know whether it's a reasonable approach for most or all initialization attempts, or if the approach has to be decided case by case. Since those Issues have been active for a while and are relevant, maybe it's useful to have a broader discussion here.

HalosGhost commented 1 year ago

I'm also unsure about this. I think, for system stability, it probably makes sense for most initialization to use a delayed-retry-with-threshold (potentially using exponential back-off for the delay). But, we also don't want to spend too much time on this since it's really only useful for us to spend cycles on to get the system stable for testing (production-grade use-cases aren't necessarily the focus).

karlovskiy commented 1 year ago

I had same issue and have to change controller code to support retries as @mszulcz-mitre proposed earlier.

HalosGhost commented 1 year ago

Yeah, this has sat for longer than I'm comfortable with. I'll open a PR sometime in the next day or so (@mszulcz-mitre you're welcome to open it if you'd like, but you'll be credited as a co-author if not) to get this merged in.