tezos-reward-distributor-organization / tezos-reward-distributor

Tezos Reward Distributor (TRD): A reward distribution software for tezos bakers.
https://tezos-reward-distributor-organization.github.io/tezos-reward-distributor/
GNU General Public License v3.0
87 stars 51 forks source link

fix #678 - Fix single-shot mode always exiting in error #679

Closed nicolasochem closed 5 months ago

nicolasochem commented 11 months ago

We added code in #653 to have TRD send a non-zero exit status when anything wrong happens - including payout failures, or (relevant here) keyboard interrupts.

However, it turns out that when using TRD in runmode -2, -3, -4, the regular course of operations is to have the producer thread send a KeyboardInterrupt when done... which is then treated as error!

This is done with _thread.interrupt_main(). It's bad: humans send interrupts, programs should not. Instead, I'm suggesting that the producer use SIGUSR1 to trigger a normal interruption.

This triggers the same code path as today, just with an exit status of SUCCESS. So I expect that the main thread will still wait for the consumer, and it won't actually exit before payouts are done, if there are any.

This allows kubernetes bakers (for example) to use a k8s cronjob and get alerted automatically in case of payment failures.

For the record, I still believe this entire producer/consumer logic should be thrown away and we should simply do things in order (calculate, then pay). I've ranted about this before (#491).

Work effort: 3 hours

jmo-staked commented 11 months ago

Hmm there might still be an error in the logic, we tested this with a payout address that had insufficient funds (so we'd expect it to fail), and it did fail and logged that the exit code should be 6 (INSUFFICIENT_FUNDS = 6), but the overall exit code was SUCCESS = 0. Maybe that 6 just isn't bubbling up properly?

2023-08-23 20:11:49,438 - consumer0 - INFO - Total estimated amount to pay out is 2,730,732,539 mutez.
2023-08-23 20:11:49,439 - consumer0 - INFO - 2580 payments will be done in 11 batches
2023-08-23 20:11:49,439 - consumer0 - INFO - Current balance in payout address is 0 mutez.
2023-08-23 20:11:49,440 - consumer0 - ERROR - Payment attempt failed because of insufficient funds in the payout address. The current balance of 0 mutez is insufficient to pay for cycle rewards of 2,730,732,539 mutez.
2023-08-23 20:11:49,440 - consumer0 - INFO - [Plugins] Not sending notification; no plugins enabled
2023-08-23 20:11:49,442 - consumer0 - INFO - Processing completed for 2585 payment items, 2580 failed.
2023-08-23 20:11:49,444 - consumer0 - INFO - Payment report is created at '/home/tezos/pymnt/simulations/.../payments/done/643.csv'
2023-08-23 20:11:49,452 - consumer0 - INFO - Payment report is created at '/home/tezos/pymnt/simulations/.../payments/failed/643.csv'
2023-08-23 20:11:49,611 - consumer0 - INFO - Anonymous statistics disabled, (Dry run)
2023-08-23 20:11:49,611 - consumer0 - INFO - Unknown Error at payment consumer. Please consult the verbose logs! Exit code: 6
2023-08-23 20:11:53,046 - producer  - INFO - Reward creation is done for cycle 643, created 3328 rewards.
2023-08-23 20:11:53,047 - producer  - INFO - Run mode ONETIME satisfied. Terminating...
2023-08-23 20:11:53,047 - producer  - INFO - TRD Exit triggered by producer Exit code: 0
2023-08-23 20:11:53,047 - MainThread - INFO - Application stop handler called by producer: 10
2023-08-23 20:11:53,048 - MainThread - INFO - TRD is shutting down...
2023-08-23 20:11:53,048 - MainThread - INFO - --------------------------------------------------------
2023-08-23 20:11:53,048 - MainThread - INFO - Sensitive operations are in progress!
2023-08-23 20:11:53,048 - MainThread - INFO - Please wait while the application is being shut down!
2023-08-23 20:11:53,048 - MainThread - INFO - --------------------------------------------------------
2023-08-23 20:11:53,048 - MainThread - INFO - Shutdown. Exit code: 0
nicolasochem commented 11 months ago

@jmo-staked good catch! I just pushed a fix, can you please try again?

jmo-staked commented 11 months ago

That's looking better, the consumer fails with a 6, the producer exits with a 0, and the app itself exits with a 1. But all that is really critical is the Zero-or-Notzero nature of shell exit codes so this patch looks like it'll work fine for us, thanks again!

2023-08-24 00:48:25,910 - consumer0 - INFO - Anonymous statistics disabled, (Dry run)
2023-08-24 00:48:25,910 - consumer0 - INFO - Unknown Error at payment consumer. Please consult the verbose logs! Exit code: 6
2023-08-24 00:48:29,458 - producer  - INFO - Reward creation is done for cycle 643, created 3328 rewards.
2023-08-24 00:48:29,459 - producer  - INFO - Run mode ONETIME satisfied. Terminating...
2023-08-24 00:48:29,459 - MainThread - INFO - Application stop handler called: 12
2023-08-24 00:48:29,459 - producer  - INFO - TRD Exit triggered by producer Exit code: 0
2023-08-24 00:48:29,459 - MainThread - INFO - TRD is shutting down...
2023-08-24 00:48:29,460 - MainThread - INFO - --------------------------------------------------------
2023-08-24 00:48:29,460 - MainThread - INFO - Sensitive operations are in progress!
2023-08-24 00:48:29,460 - MainThread - INFO - Please wait while the application is being shut down!
2023-08-24 00:48:29,460 - MainThread - INFO - --------------------------------------------------------
2023-08-24 00:48:29,460 - MainThread - INFO - Shutdown due to error! Exit code: 1
$ kubectl describe po ...
State:          Terminated
  Reason:       Error
  Exit Code:    1
nicolasochem commented 11 months ago

yes I'm using two signals, SIGUSR1 indicates success and SIGUSR2 is failure... it's fine.

Does it return 0 when payout succeeds?

jmo-staked commented 11 months ago

We won't have a payout to test that against until cycle 644 completes

jmo-staked commented 11 months ago

Two more tests passed swimmingly, both with exit code 0:

jmo-staked commented 11 months ago

Might have hit one minor other exit error, in this case the MainThread exit code was logged incorrectly:

2023-08-29 16:00:53,401 - MainThread - INFO - [Plugins] No plugins enabled
2023-08-29 16:00:53,402 - MainThread - INFO - Initial cycle set to -1
2023-08-29 16:00:53,411 - MainThread - INFO - Application is READY!
2023-08-29 16:00:53,411 - MainThread - INFO - --------------------------------------------
2023-08-29 16:00:54,489 - producer  - ERROR - Unable to fetch current cycle from provider tzkt, Not synced. Exiting.
2023-08-29 16:00:54,490 - consumer0 - WARNING - Exit signal received. Terminating...
2023-08-29 16:00:54,490 - producer  - INFO - TRD Exit triggered by producer Exit code: 8
2023-08-29 16:00:54,490 - MainThread - INFO - Application stop handler called by producer: 10
2023-08-29 16:00:54,491 - MainThread - INFO - TRD is shutting down...
2023-08-29 16:00:54,492 - MainThread - INFO - --------------------------------------------------------
2023-08-29 16:00:54,492 - MainThread - INFO - Sensitive operations are in progress!
2023-08-29 16:00:54,492 - MainThread - INFO - Please wait while the application is being shut down!
2023-08-29 16:00:54,492 - MainThread - INFO - --------------------------------------------------------
2023-08-29 16:00:54,492 - MainThread - INFO - Lock file removed!
2023-08-29 16:00:54,493 - MainThread - INFO - Shutdown. Exit code: 0
$ kubectl describe po ...
    State:          Terminated
      Reason:       Error
      Exit Code:    1

MainThread reports it's exiting with code 0, even though the app is exiting due to error, but the actual exit signal k8s sees is 1 which is good, so the job correctly restarted.

nicolasochem commented 6 months ago

Might have hit one minor other exit error, in this case the MainThread exit code was logged incorrectly:

2023-08-29 16:00:53,401 - MainThread - INFO - [Plugins] No plugins enabled
2023-08-29 16:00:53,402 - MainThread - INFO - Initial cycle set to -1
2023-08-29 16:00:53,411 - MainThread - INFO - Application is READY!
2023-08-29 16:00:53,411 - MainThread - INFO - --------------------------------------------
2023-08-29 16:00:54,489 - producer  - ERROR - Unable to fetch current cycle from provider tzkt, Not synced. Exiting.
2023-08-29 16:00:54,490 - consumer0 - WARNING - Exit signal received. Terminating...
2023-08-29 16:00:54,490 - producer  - INFO - TRD Exit triggered by producer Exit code: 8
2023-08-29 16:00:54,490 - MainThread - INFO - Application stop handler called by producer: 10
2023-08-29 16:00:54,491 - MainThread - INFO - TRD is shutting down...
2023-08-29 16:00:54,492 - MainThread - INFO - --------------------------------------------------------
2023-08-29 16:00:54,492 - MainThread - INFO - Sensitive operations are in progress!
2023-08-29 16:00:54,492 - MainThread - INFO - Please wait while the application is being shut down!
2023-08-29 16:00:54,492 - MainThread - INFO - --------------------------------------------------------
2023-08-29 16:00:54,492 - MainThread - INFO - Lock file removed!
2023-08-29 16:00:54,493 - MainThread - INFO - Shutdown. Exit code: 0
$ kubectl describe po ...
    State:          Terminated
      Reason:       Error
      Exit Code:    1

MainThread reports it's exiting with code 0, even though the app is exiting due to error, but the actual exit signal k8s sees is 1 which is good, so the job correctly restarted.

I don't think this is true. I think that actually, in case of tzkt unresponsive, the app exited succesfully, which is not what should happen. I have now modified the code to propagate producer errors (unresponsive tzkt, disk full etc)... to the main thread. @jmo-staked @jdsika could you please check urgently?

nicolasochem commented 5 months ago

@jdsika pointed out that these signals are not available on windows, more work is needed

https://stackoverflow.com/questions/16429480/what-is-equivalent-to-sigusr1-2-signals-in-windows-using-python