opengsn / gsn

GSN v3.0.0-beta.10 - Ethereum Gas Station Network
https://opengsn.org/
592 stars 210 forks source link

Is something wrong with mainnet1.ddns.tabookey.com? #241

Closed edukaded closed 5 years ago

edukaded commented 5 years ago

I was looking over the blockchain to see if my node is behaving and perhaps getting some traffic. I noticed there was a flurry of activity some hours ago on the contract all from mainnet1.ddns.tabookey.com

In the space of a few minutes it created the following transactions all of which were calls to register relay

https://etherscan.io/tx/0x974b13fcb9dcc7e27b2643f1a92cd8c4f68f5870b726ab56ab24b382b3851688 https://etherscan.io/tx/0xf0511dd58646714ae5798d54722b71df1d915579515e466bbb0988d5843802b7 https://etherscan.io/tx/0x43e4f29f87a12140159518a16977c90b232cd24f4e495e9080791b9128ee4eca https://etherscan.io/tx/0x08181eb8723627aaa85871022202f0d8e203b1ec09cfe78fe6f9c216fa4f685e https://etherscan.io/tx/0xf39fb63c2dc3d7ceb869edccb1dc6f18e1d552f8f0c73d40f289bb882e5cdf0e https://etherscan.io/tx/0x1e5c0c51813940f99be125d88ac4a1561a4690083fdd04c118ab50b7efbb5222 https://etherscan.io/tx/0xff73e9468a8de366dd4cf5fc8834f65c0af4672e476e71587093d8787d56853c

Did something go wrong with the node or is this a bug?

drortirosh commented 5 years ago

Yes, this is a known issue in the relay. @shahafn looked into it, but we're still not sure why: Relays sometimes perform register, but don't see the event RelayAdded, so they have to register again... Looking at the code, there's no reason why the relay would not see its own registration (and after all, it USUALLY does see it) - looks like a node bug (or synchronization between infura nodes) - but we're not sure. After some time (minutes, or hours), it gets resolved. at most, we've seen relays registering 10-20 times a day (instead of once) A possible overcome to this is for a relay to keep in-memory state of the last time it had completed a registration, but we would rather understand why the event query failed..

edukaded commented 5 years ago

Looks like mine is doing it as well.

Oct 13 19:26:12 gsnrelay RelayHttpServer[1986]: 2019/10/13 19:26:12 RelayHttpServer.go:253: Could not receive RelayAdded events for our relay
Oct 13 19:26:16 gsnrelay RelayHttpServer[1986]: 2019/10/13 19:26:16 RelayHttpServer.go:280: Updating unconfirmed txs...
Oct 13 19:26:19 gsnrelay RelayHttpServer[1986]: 2019/10/13 19:26:19 RelayHttpServer.go:130: address 0x9DC7642Dde1C0A0BCe00ae70Bb6650E23465FfdA
 sent
Oct 13 19:26:27 gsnrelay RelayHttpServer[1986]: 2019/10/13 19:26:27 RelayHttpServer.go:253: Could not receive RelayAdded events for our relay
Oct 13 19:26:42 gsnrelay RelayHttpServer[1986]: 2019/10/13 19:26:42 RelayHttpServer.go:253: Could not receive RelayAdded events for our relay
Oct 13 19:26:55 gsnrelay RelayHttpServer[1986]: 2019/10/13 19:26:55 RelayHttpServer.go:323: Could not receive RelayAdded events for our relay
Oct 13 19:26:55 gsnrelay RelayHttpServer[1986]: 2019/10/13 19:26:55 RelayHttpServer.go:328: Registering relay...
Oct 13 19:26:55 gsnrelay RelayHttpServer[1986]: 2019/10/13 19:26:55 relay_server.go:651: RegisterRelay(address=0xD216153c06E857cD7f72665E0aF1d
7D82172F494, url=https://gsnrelay.tqnext.com) tx sending
Oct 13 19:26:56 gsnrelay RelayHttpServer[1986]: 2019/10/13 19:26:56 relay_server.go:667: RegisterRelay(address=0xD216153c06E857cD7f72665E0aF1d
7D82172F494, url=https://gsnrelay.tqnext.com) tx sent: 0x692be8332b7fe6d80f1bb4f19bf9efdcaa5669deeede62ce2d320971f9983d41 (7)
Oct 13 19:26:58 gsnrelay RelayHttpServer[1986]: 2019/10/13 19:26:58 RelayHttpServer.go:253: Could not receive RelayAdded events for our relay
Oct 13 19:27:13 gsnrelay RelayHttpServer[1986]: 2019/10/13 19:27:13 RelayHttpServer.go:253: Could not receive RelayAdded events for our relay
Oct 13 19:27:16 gsnrelay RelayHttpServer[1986]: 2019/10/13 19:27:16 RelayHttpServer.go:280: Updating unconfirmed txs...
Oct 13 19:27:16 gsnrelay RelayHttpServer[1986]: 2019/10/13 19:27:16 relay_server.go:822: UpdateUnconfirmedTransactions: awaiting transaction t
o be mined 7 0x692be8332b7fe6d80f1bb4f19bf9efdcaa5669deeede62ce2d320971f9983d41
Oct 13 19:27:19 gsnrelay RelayHttpServer[1986]: 2019/10/13 19:27:19 RelayHttpServer.go:130: address 0x9DC7642Dde1C0A0BCe00ae70Bb6650E23465FfdA
 sent
Oct 13 19:27:28 gsnrelay RelayHttpServer[1986]: 2019/10/13 19:27:28 RelayHttpServer.go:253: Could not receive RelayAdded events for our relay
Oct 13 19:27:43 gsnrelay RelayHttpServer[1986]: 2019/10/13 19:27:43 RelayHttpServer.go:253: Could not receive RelayAdded events for our relay
Oct 13 19:27:56 gsnrelay RelayHttpServer[1986]: 2019/10/13 19:27:56 relay_server.go:725: Could not get tx receipt not found
Oct 13 19:27:56 gsnrelay RelayHttpServer[1986]: 2019/10/13 19:27:56 RelayHttpServer.go:334: not found
Oct 13 19:27:56 gsnrelay RelayHttpServer[1986]: 2019/10/13 19:27:56 RelayHttpServer.go:335: Trying to register again...
Oct 13 19:27:59 gsnrelay RelayHttpServer[1986]: 2019/10/13 19:27:59 RelayHttpServer.go:253: Could not receive RelayAdded events for our relay
Oct 13 19:28:14 gsnrelay RelayHttpServer[1986]: 2019/10/13 19:28:14 RelayHttpServer.go:253: Could not receive RelayAdded events for our relay
Oct 13 19:28:17 gsnrelay RelayHttpServer[1986]: 2019/10/13 19:28:17 RelayHttpServer.go:280: Updating unconfirmed txs...
Oct 13 19:28:17 gsnrelay RelayHttpServer[1986]: 2019/10/13 19:28:17 relay_server.go:822: UpdateUnconfirmedTransactions: awaiting transaction to be mined 7 0x692be8332b7fe6d80f1bb4f19bf9efdcaa5669deeede62ce2d320971f9983d41
Oct 13 19:28:19 gsnrelay RelayHttpServer[1986]: 2019/10/13 19:28:19 RelayHttpServer.go:130: address 0x9DC7642Dde1C0A0BCe00ae70Bb6650E23465FfdA sent
Oct 13 19:28:29 gsnrelay RelayHttpServer[1986]: 2019/10/13 19:28:29 RelayHttpServer.go:268: GasPrice: 1100000000
Oct 13 19:28:56 gsnrelay RelayHttpServer[1986]: 2019/10/13 19:28:56 relay_server.go:651: RegisterRelay(address=0xD216153c06E857cD7f72665E0aF1d7D82172F494, url=https://gsnrelay.tqnext.com) tx sending
Oct 13 19:28:57 gsnrelay RelayHttpServer[1986]: 2019/10/13 19:28:57 relay_server.go:667: RegisterRelay(address=0xD216153c06E857cD7f72665E0aF1d7D82172F494, url=https://gsnrelay.tqnext.com) tx sent: 0x605775cfa4aaddee9fa8fb7e4e6fe6345dae07f7d6674166762e16f7df653f14 (8)
Oct 13 19:29:15 gsnrelay RelayHttpServer[1986]: 2019/10/13 19:29:15 RelayHttpServer.go:338: Done registering
Oct 13 19:29:18 gsnrelay RelayHttpServer[1986]: 2019/10/13 19:29:18 RelayHttpServer.go:280: Updating unconfirmed txs...
Oct 13 19:29:18 gsnrelay RelayHttpServer[1986]: 2019/10/13 19:29:18 relay_server.go:816: UpdateUnconfirmedTransactions: awaiting confirmations for next mined transaction 9 7 0x692be8332b7fe6d80f1bb4f19bf9efdcaa5669deeede62ce2d320971f9983d41
Oct 13 19:29:19 gsnrelay RelayHttpServer[1986]: 2019/10/13 19:29:19 RelayHttpServer.go:130: address 0x9DC7642Dde1C0A0BCe00ae70Bb6650E23465FfdA sent

I'm honestly unsure what's causing it, but it feels like it's related to caching. It might be best to assume success and check once per hour to ensure we are still registered.

edukaded commented 5 years ago

I think I've narrowed it down to around line 330 in HttpServer.go

    for {
        err := relay.RegisterRelay()
        if err == nil {
            break
        }
        log.Println(err)
        log.Println("Trying to register again...")
        sleep(1*time.Minute, devMode)
    }

This for loop is a problem because it doesn't check to see if we even need to register again. Replacing it with something along the lines of this works for me...

    log.Println("Registering relay...")

    err = relay.RegisterRelay()
    if err == nil {
        return
    }
    log.Println(err)
    log.Println("Trying to register again...")
    sleep(10*time.Minute, devMode)
    keepAlive();

    log.Println("Done registering")

This expands the sleep from 1 minute to 10, thereby giving time for the cache to expire (cloudflare has a default cache period of 5 minutes on API routes). It also forces the code back through the entire keepAlive function, thereby ensuring we check that registration is in fact still needed.

If you like this solution I'll submit a PR.

edukaded commented 5 years ago

Actually after looking at the code again, it turns out the keepAlive function (of which this is part) is on a 10 minute timer. Therefore we can dispense with the sleep and retry altogether, resulting in code that is more like this...

    err = relay.RegisterRelay()
    if err == nil {
        log.Println("Done registering")
        return
    }
    log.Println(err)

This way if it does fail to register for some reason it will be picked up under the next natural pass and thereby avoid deepening the stack via recursive calls to itself.

edukaded commented 5 years ago

So 36 hours of running the patch and it looks good except there was a 2 hour gap where the relay remained unregistered. It looks like it sent it but without a high enough gas limit and thus the transaction got delayed waiting to enter a block. We may need to babysit the transaction from within the keepAlive function and nudge it along by upping the gas if it still hasn't made it into a block. However since the problem we are fixing appears to be related to not getting a receipt in time, I'm unsure as to what the correct approach to babysitting the tx would look like.

drortirosh commented 5 years ago

@shahafn : so maybe this is the reason it doesn't get registered: the transaction is sent, but with too little gas. since we already have a "gas-fee babysitting" for relayed transaction, maybe the same mechanism can be used for registration. In general, its good to send registration at low-fee times, but if it doesn't get mined, then the gas price should be increased.

edukaded commented 5 years ago

@drortirosh I think you are at least partially correct, look here... https://github.com/tabookey/tabookey-gasless/search?q=sendRegisterTransaction&unscoped_q=sendRegisterTransaction

relay_server.go line 274

return relay.awaitTransactionMined(tx)

Then starting at line 718

func (relay *RelayServer) awaitTransactionMined(tx *types.Transaction) (err error) {
    start := time.Now()
    var receipt *types.Receipt
    for ; (receipt == nil || err != nil) && time.Since(start) < TxReceiptTimeout; receipt, err = relay.Client.TransactionReceipt(context.Background(), tx.Hash()) {
        time.Sleep(500 * time.Millisecond)
    }
    if err != nil {
        log.Println("Could not get tx receipt", err)
        return
    }
    if receipt.Status != 1 {
        log.Println("tx failed: tx receipt status", receipt.Status)
        return
    }

    return nil
}

This looks like it's polling every 500ms to check to see if there is a transaction receipt ready. If the receipt isn't ready we should be seeing an error here, and we do.

Oct 13 19:27:28 gsnrelay RelayHttpServer[1986]: 2019/10/13 19:27:28 RelayHttpServer.go:253: Could not receive RelayAdded events for our relay
Oct 13 19:27:43 gsnrelay RelayHttpServer[1986]: 2019/10/13 19:27:43 RelayHttpServer.go:253: Could not receive RelayAdded events for our relay
Oct 13 19:27:56 gsnrelay RelayHttpServer[1986]: 2019/10/13 19:27:56 relay_server.go:725: Could not get tx receipt not found
Oct 13 19:27:56 gsnrelay RelayHttpServer[1986]: 2019/10/13 19:27:56 RelayHttpServer.go:334: not found
Oct 13 19:27:56 gsnrelay RelayHttpServer[1986]: 2019/10/13 19:27:56 RelayHttpServer.go:335: Trying to register again..

Interestingly there are 2 problems being shown here. First of all the node we are connecting to isn't giving us a transaction receipt. Since we are calling every half second and since there is a limit of 30 requests per second from infura, it is possible although unlikely that we are being rate limited.

https://infura.io/docs/ethereum/json-rpc/ratelimits

Infura does however sit behind cloudflare proxies which have a default cache on API calls of something like 200 seconds, but my experience has been the cache time is closer to a minute, ergo the 2 problems are...

Firstly that a transaction receipt is sometimes not returned because sometimes we are pestering infura too much (or they are just overloaded) and secondly that since there is no transaction receipt there is nothing for the babysitter function awaitTransactionMined to watch.

This means that when we come to that for loop I initially identified, each minute that passes we are firing off a new series of events that creates a new transaction and pegs infura 120 times for a transaction receipt on the existing transaction before giving up, but then it never comes back around to check on the txhash.

What we should be doing is ditching the for loop. Create the transaction and not worry too much about the receipt, returning the txhash and storing that in the txstore, then relying on updatePendingTxs (which is on a 1 min timer anyways), to shove it through since that is already checking and calling resend if needed with more gas.

@shahafn do you agree with my analysis?

shahafn commented 5 years ago

@edukaded I completely agree, the loop to register isn't necessary anymore, it should be removed as per PR #242