tezos-reward-distributor-organization / tezos-reward-distributor

Tezos Reward Distributor (TRD): A reward distribution software for tezos bakers.
https://tezos-reward-distributor-organization.github.io/tezos-reward-distributor/
GNU General Public License v3.0

re-running failed operations with TzKT backend is so slow, it hangs #674

Open nicolasochem opened 1 year ago

nicolasochem commented 1 year ago

This was described on slack and I've seen it as well:

When TRD fails, the next run attempts the failed payments again. But this results in a lot of calls to the TzKT API. Recently, it seems that this endpoint has been rate-limited, causing failures like the example below (taken on a ghostnet TRD).

A workaround is to delete the failed payment folder (assuming it's a solid failure and not a partial payment). But it would be good to look into why it's querying every

2023-06-22 21:39:11,754 - producer  - INFO - Summary 4 paid, 0 done, 0 injected, 9 fail, 1 avoided
Exception in thread producer:
Traceback (most recent call last):
  File "/app/.local/lib/python3.10/site-packages/urllib3/connection.py", line 174, in _new_conn
    conn = connection.create_connection(
  File "/app/.local/lib/python3.10/site-packages/urllib3/util/connection.py", line 72, in create_connection
    for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
  File "/usr/local/lib/python3.10/socket.py", line 955, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -3] Try again

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/app/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 703, in urlopen
    httplib_response = self._make_request(
  File "/app/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 386, in _make_request
    self._validate_conn(conn)
  File "/app/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 1042, in _validate_conn
    conn.connect()
  File "/app/.local/lib/python3.10/site-packages/urllib3/connection.py", line 363, in connect
    self.sock = conn = self._new_conn()
  File "/app/.local/lib/python3.10/site-packages/urllib3/connection.py", line 186, in _new_conn
    raise NewConnectionError(
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPSConnection object at 0x7ff7a2064460>: Failed to establish a new connection: [Errno -3] Try again

During handling of the above exception, another exception occurred:

Link to baker slack discussion: https://tezos-baking.slack.com/archives/CQ35AM8KE/p1685978794781209

nicolasochem commented 5 months ago

An update on this.

When running in retry mode, for every delegator with a failed payout, we update the balance:

https://github.com/tezos-reward-distributor-organization/tezos-reward-distributor/blob/master/src/pay/retry_producer.py#L106-L107

but the only other occurrence of the update_current_balances() function in the producer logic is in CalculatePhase4, which is for founders/owners.

During the initial payment, we do not actually query individual delegators' balances with the TzKT API: we bulk-query the indexer and get a list of delegator balances, self.reward_provider_model.delegator_balance_dict.

We only query balances one delegator at a time during retries, which takes a very long time and often fails.

A solution would be to modify the retry logic to query all balances again using the same API as initial payment. I'm not going to do this now, instead I'll just record my observations here.
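To make the idea concrete, here is a minimal sketch of a batched balance refresh, not TRD's actual code. The names refresh_balances_bulk, BulkFetcher and fake_fetch_many are hypothetical, and the fetcher is injected so the sketch makes no claim about which TzKT endpoint or reward provider method the real fix would use.

```python
from typing import Callable, Dict, Iterable, List

# Hypothetical type: takes a batch of addresses and returns {address: balance}.
# In TRD this would wrap whatever bulk query the reward provider already uses.
BulkFetcher = Callable[[List[str]], Dict[str, int]]


def refresh_balances_bulk(
    addresses: Iterable[str],
    fetch_many: BulkFetcher,
    batch_size: int = 100,
) -> Dict[str, int]:
    """Refresh balances in batches instead of one request per address."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    # Deduplicate while preserving order, so each address is requested once.
    unique = list(dict.fromkeys(addresses))
    balances: Dict[str, int] = {}
    for start in range(0, len(unique), batch_size):
        batch = unique[start:start + batch_size]
        balances.update(fetch_many(batch))
    return balances


# Stand-in fetcher that records each call (assumption: no real network access).
calls: List[List[str]] = []


def fake_fetch_many(batch: List[str]) -> Dict[str, int]:
    calls.append(list(batch))
    return {address: 1_000_000 for address in batch}


# Nine failed payouts plus one duplicate address (made-up addresses).
failed = [f"tz1example{i}" for i in range(9)] + ["tz1example0"]
balances = refresh_balances_bulk(failed, fake_fetch_many, batch_size=5)
print(len(balances), "balances fetched in", len(calls), "request(s)")
```

With a batch size that covers all failed payouts, the retry path would make a handful of requests rather than one per delegator, which should make it much less likely to hit the rate limit.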

nicolasochem commented 5 months ago

This is what's in the verbose log when this happens:

2024-02-15 17:46:15,839 - producer  - DEBUG - Requesting https://api.tzkt.io/v1/accounts/tz2xxxxx
2024-02-15 17:46:15,950 - producer  - DEBUG - Response from TzKT is:
{'activeRefutationGamesCount': 0,
 'activeTicketsCount': 0,
 'activeTokensCount': 60,