terra-money / fcd-classic

Terra ETL + RESTful API Server
https://fcd.terra.dev/swagger
MIT License

FCD collector stuck in collecting validator information #153

Open LCyson opened 2 years ago

LCyson commented 2 years ago

Hi there, we are starting an FCD node with a database backfilled up to 11/24/21, but our collector keeps collecting validators' info and has been doing so for 12+ hours. We are using the bombay branch. Does anyone know how to make it start picking up blocks from 11/24 (height ≈ 5417391)? Thanks in advance!

Logs from collector:

01-02 19:26 [INFO]: Trying to run process ProposalCollector
01-02 19:26 [INFO]: Process ProposalCollector starting...
01-02 19:26 [INFO]: Proposal collector started.
01-02 19:26 [INFO]: Checking for deleted proposals
01-02 19:26 [INFO]: Saved proposal 137
01-02 19:26 [INFO]: Proposal collector completed.
01-02 19:26 [INFO]: Process ProposalCollector ended.
01-02 19:26 [INFO]: collectValidator: Talis Protocol(terravaloper1qd0uk3wrw73x662y2gx4kaulrzlcky6275gl5s)
01-02 19:26 [INFO]: collectValidator: Accomplice Blockchain(terravaloper1p72vswk5zfzzr7myhrerm78ty5tjc8ypl259tm)
01-02 19:26 [INFO]: collectValidator: Smart Stake(terravaloper1alpf6snw2d76kkwjv3dp4l7pcl6cn9uyt0tcj9)
01-02 19:26 [INFO]: collectValidator: Orion.Money(terravaloper1259cmu5zyklsdkmgstxhwqpe0utfe5hhyty0at)
01-02 19:26 [INFO]: collectValidator: SolidStake(terravaloper1fhx7y75643tze8dxf4m9gwhkxn955q8r7vxjel)
01-02 19:26 [INFO]: collectValidator: Marte Cloud(terravaloper1dg7zhmt4g4zq74y4tksq4xfzf5pwx4cnngavjk)
01-02 19:26 [INFO]: collectValidator: stake.systems(terravaloper1a9q6jl792qg36cp025ccjtgyf4qxrwzqjkmk5d)
01-02 19:26 [INFO]: collectValidator: Orion.Money(terravaloper1259cmu5zyklsdkmgstxhwqpe0utfe5hhyty0at)
01-02 19:26 [INFO]: collectValidator: Bit Cat🐱(terravaloper1k4ef8m95t7eq522evmmuzvfkpla04pezmu4j7k)
01-02 19:26 [INFO]: collectValidator: SolidStake(terravaloper1fhx7y75643tze8dxf4m9gwhkxn955q8r7vxjel)
01-02 19:26 [INFO]: collectValidator: OneStar(terravaloper18hpew39uymssr52w8euxqh4zrrjt02x7k0jmhk)
01-02 19:26 [INFO]: collectValidator: Inotel(terravaloper1vqegsqhe8q06t6jwgvww0qcr2u6v6g9xrwjnmw)
...

Our .envrc:

export TYPEORM_CONNECTION=postgres
export TYPEORM_HOST=xxxxxxxxxxxx
export TYPEORM_USERNAME=postgres
export TYPEORM_PASSWORD=xxxxxxxxx
export TYPEORM_DATABASE=fcd
export TYPEORM_PORT=5432
export TYPEORM_SYNCHRONIZE=false
export TYPEORM_LOGGING=false
export TYPEORM_ENTITIES=src/orm/*Entity.ts
export TYPEORM_MIGRATIONS=src/orm/migration/*.ts

export SERVER_PORT=3060
export CHAIN_ID=columbus-5
export LCD_URI=http://localhost:1317
export FCD_URI=https://tequila-fcd.terra.dev
export RPC_URI=
export BYPASS_URI=https://tequila-fcd.terra.dev
export MIRROR_GRAPH_URI=https://tequila-graph.mirror.finance/graphql
export STATION_STATUS_JSON=https://terra.money/station/version-web.json
export SENTRY_DSN=
#export USE_LOG_FILE=true
export INITIAL_HEIGHT=5417391
export TOKEN_NETWORK=mainnet

Our ormconfig:

module.exports = {
        name: 'default',
        type: 'postgres',
        host: 'xxxxxxxx',
        database: 'fcd',
        username: 'postgres',
        password: 'xxxxxxxxx',
        synchronize: false
}

hanjukim commented 2 years ago

My guess is that your RPC node data is corrupted. You need to find out which query is stalling. Try adding a console.log at https://github.com/terra-money/fcd/blob/c4bd2cebd0a6361bc474ef8d53c9797a1e459369/src/lib/lcd.ts#L30
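
A minimal sketch of the kind of logging suggested here, assuming the LCD calls can be wrapped in a helper. The name `timedRequest` and the `log` parameter are hypothetical and not part of the FCD codebase; the idea is only to print each request before and after it runs, so that a request that never completes stands out in the collector output.

```typescript
type Logger = (msg: string) => void

// Hypothetical helper: wraps any async LCD call and logs its start, end, and duration.
// A request that logs "start" but never logs "done" or "failed" is the one that stalls.
async function timedRequest<T>(
  label: string,
  request: () => Promise<T>,
  log: Logger = console.log
): Promise<T> {
  const startedAt = Date.now()
  log(`LCD request start: ${label}`)
  try {
    const result = await request()
    log(`LCD request done: ${label} (${Date.now() - startedAt} ms)`)
    return result
  } catch (err) {
    log(`LCD request failed: ${label} (${Date.now() - startedAt} ms): ${String(err)}`)
    throw err
  }
}
```

The label would typically be the request path, so the last "start" line without a matching "done" line points at the query to investigate.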

LCyson commented 2 years ago

One thing I observed is that the block height our LCD is currently at is lower than the height at 11/24/21 (the latest backfilled block). We are downloading complete Columbus-5 node data right now. Do you think that could be the problem?

LCyson commented 2 years ago

Hi @hanjukim, we were able to make the collector run successfully after I backfilled our Terra RPC node's data. We ran into another issue, though: we modified our AWS RDS database without stopping the collector. It now seems to be stuck at the height it was processing when the DB was modified, with this error message:

01-04 04:42 [INFO]: collectBlock: begin transaction for block 5442812
01-04 04:42 [ERROR]: Cannot destructure property 'tx_response' of '(intermediate value)' as it is undefined.
TypeError: Cannot destructure property 'tx_response' of '(intermediate value)' as it is undefined.
    at Object.getTx (/home/ec2-user/efcd/fcd/src/lib/lcd.ts:55:11)
    at async generateLcdTransactionToTxEntity (/home/ec2-user/efcd/fcd/src/collector/block/tx.ts:166:14)

Do you by any chance know how we can reset the status of our collector so that it can reload that height?

LCyson commented 2 years ago

^ I'm looking through our Postgres DB for the corrupted row but can't locate it. Do you know which row in which table we should delete so that the collector restarts parsing that height? Or does the FCD collector modify the Terra LCD data and possibly corrupt that as well?

muratso commented 2 years ago

> Hi @hanjukim, we were able to make the collector run successfully after I backfilled our Terra RPC node's data. We ran into another issue, though: we modified our AWS RDS database without stopping the collector. It now seems to be stuck at the height it was processing when the DB was modified, with this error message:
>
> 01-04 04:42 [INFO]: collectBlock: begin transaction for block 5442812
> 01-04 04:42 [ERROR]: Cannot destructure property 'tx_response' of '(intermediate value)' as it is undefined.
> TypeError: Cannot destructure property 'tx_response' of '(intermediate value)' as it is undefined.
>     at Object.getTx (/home/ec2-user/efcd/fcd/src/lib/lcd.ts:55:11)
>     at async generateLcdTransactionToTxEntity (/home/ec2-user/efcd/fcd/src/collector/block/tx.ts:166:14)
>
> Do you by any chance know how we can reset the status of our collector so that it can reload that height?

This seems to be the same error I was getting after restarting the FCD (https://github.com/terra-money/fcd/issues/146). I restarted it a few times, but at some point it simply stopped working and threw this error. I have since dropped and recreated my DB twice, and that didn't help (well, it worked at first, but I eventually ended up with the same error). I have now restored my node from a snapshot, dropped and recreated the DB again, and am waiting for the FCD to sync from scratch to see whether it works this time. :(

PS: since some of the features I needed from the bombay branch seem to have been merged into the main branch, I'm giving main a try.

hanjukim commented 2 years ago

I will add logs to see which specific tx is not found in the node.

hanjukim commented 2 years ago

https://github.com/terra-money/fcd/blob/main/src/lib/lcd.ts#L53 Could you add some error logs here for debugging? I cannot reproduce your problem on my nodes.
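
For illustration, a hedged sketch of what such error logging could look like. It is not the actual code in `lcd.ts`: the function name `extractTxResponse` and the `TxLookupBody` type are assumptions made for this example. It replaces the bare destructuring that produces "Cannot destructure property 'tx_response'" with a check that reports which transaction hash the node failed to return.

```typescript
// Assumed (simplified) shape of the body returned by a tx lookup.
interface TxLookupBody {
  tx_response?: { txhash: string; height: string }
}

// Hypothetical guard: fail with the tx hash in the message instead of a bare TypeError.
function extractTxResponse(hash: string, body: TxLookupBody | undefined) {
  if (!body || !body.tx_response) {
    throw new Error(
      `LCD returned no tx_response for tx ${hash}; the node may not have this transaction`
    )
  }
  return body.tx_response
}
```

With a message like this in the collector log, the failing hash can be queried against the node directly to check whether the transaction is really absent.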

LCyson commented 2 years ago

Hi @hanjukim, I think I have some clues about why this happens. I hit this error each time the internet connection is cut while a tx is being written to the FCD DB. The non-atomic write of a height leaves partial data in the FCD DB, and the collector then gets stuck there because it uses that corrupted data to query the LCD.

Is it possible for the FCD team to add a crash-protection mechanism (basically, make the write to the FCD DB atomic)? That should eliminate this issue. Right now we manually backfill that height in the Postgres DB to unblock the collector, which takes a lot of time on each occurrence. Thanks a lot!

LCyson commented 2 years ago

> Hi @hanjukim, I think I have some clues about why this happens. I hit this error each time the internet connection is cut while a tx is being written to the FCD DB. The non-atomic write of a height leaves partial data in the FCD DB, and the collector then gets stuck there because it uses that corrupted data to query the LCD.
>
> Is it possible for the FCD team to add a crash-protection mechanism (basically, make the write to the FCD DB atomic)? That should eliminate this issue. Right now we manually backfill that height in the Postgres DB to unblock the collector, which takes a lot of time on each occurrence. Thanks a lot!

Actually, I think the hypothesis above is wrong: I did the backfill again with a new DB, and it fails on the same block. I now suspect the Terra RPC node data.

roccomuso commented 2 years ago

@LCyson what if we skip that block? Will the FCD make some progress? Edit: actually, I think the issue is not the block itself but some transaction in the block that is not fetched correctly. So maybe skipping that tx would help.

LCyson commented 2 years ago

@roccomuso haha yeah, that's what I tried. The collector works for some blocks, but unfortunately it gets stuck again a few blocks later, and eventually there was a block where simply skipping didn't help (I can't remember the height). I guess some data may be corrupted in the Terra RPC node; I'm not sure whether backfilling from scratch or using another snapshot will help.

muratso commented 2 years ago

@LCyson yeah, I can confirm on my side that some data is corrupted in our Terra RPC node. I restored my node from a full archive snapshot; the blocks are there, but some transactions are missing for some reason. And the FCD throws this error exactly on the blocks where the txs are missing, which is kind of annoying considering I restored my node from a full snapshot. I'm trying again with the latest snapshot, but this time with the default snapshot instead of the archive one.
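
One way to check for this condition is sketched below, under stated assumptions. A Tendermint transaction hash is the SHA-256 of the raw transaction bytes in uppercase hex, so the hashes of a block's base64-encoded txs can be computed locally and then looked up on the node one by one. The names `txHashFromBase64`, `findMissingTxs`, and the `hasTx` callback are hypothetical; in practice `hasTx` would query the node's tx lookup endpoint (for example the Cosmos SDK path `/cosmos/tx/v1beta1/txs/{hash}`) and report whether a `tx_response` came back.

```typescript
import { createHash } from 'crypto'

// Tendermint tx hash: SHA-256 of the raw tx bytes, as uppercase hex.
function txHashFromBase64(txBase64: string): string {
  const raw = Buffer.from(txBase64, 'base64')
  return createHash('sha256').update(raw).digest('hex').toUpperCase()
}

// hasTx is a hypothetical callback that asks the node whether it can return the tx.
// Returns the hashes of all txs in the block that the node could not return.
async function findMissingTxs(
  txsBase64: string[],
  hasTx: (hash: string) => Promise<boolean>
): Promise<string[]> {
  const missing: string[] = []
  for (const tx of txsBase64) {
    const hash = txHashFromBase64(tx)
    if (!(await hasTx(hash))) {
      missing.push(hash)
    }
  }
  return missing
}
```

Running this over the block where the collector stops would show whether the block is present but its transactions are not, which is the situation described above.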

kamsz commented 2 years ago

I'm experiencing the same issue. I've backfilled nodes multiple times, and it eventually starts to fail anyway.