stellar / stellar-core

Reference implementation for the peer-to-peer agent that manages the Stellar network.
https://www.stellar.org

race (that could lead to data loss) and confusing log messages with captive core #3248

Closed: MonsieurNicolas closed this issue 3 years ago

MonsieurNicolas commented 3 years ago

There are a couple of issues with captive-core logging.

First, the logs start with a bunch of messages related to temporary application objects, leading to a sequence like:

2021-10-19T18:40:54.601 GCP7A [default INFO] Application destructing
2021-10-19T18:40:54.602 GCP7A [default INFO] Application destroyed
2021-10-19T18:40:54.602 GCP7A [default INFO] Minimal database not found, creating one...
2021-10-19T18:40:54.603 GCP7A [Database INFO] Connecting to: sqlite3://buckets/minimal.db
2021-10-19T18:40:54.611 GCP7A [SCP INFO] LocalNode::LocalNode@GCP7A qSet: 59d361
2021-10-19T18:40:54.654 GCP7A [default INFO] *
2021-10-19T18:40:54.655 GCP7A [default INFO] * The database has been initialized
2021-10-19T18:40:54.655 GCP7A [default INFO] *
2021-10-19T18:40:54.667 GCP7A [Database INFO] Applying DB schema upgrade to version 14
2021-10-19T18:40:54.669 GCP7A [Database INFO] Applying DB schema upgrade to version 15
2021-10-19T18:40:54.672 GCP7A [Database INFO] Applying DB schema upgrade to version 16
2021-10-19T18:40:54.674 GCP7A [Database INFO] DB schema is in current version
2021-10-19T18:40:54.687 GCP7A [Ledger INFO] Established genesis ledger, closing
2021-10-19T18:40:54.687 GCP7A [Ledger INFO] Root account: GBRPYHIL2CI3FNQ4BXLFMNDLFJUNPU2HY3ZMFSHONUCEOASW7QC7OX2H
2021-10-19T18:40:54.687 GCP7A [Ledger INFO] Root account seed: SDHOAMBNLGCE2MV5ZKIVZAQD3VCLGP53P3OBSBI6UN5L5XZI5TKHFQL4
2021-10-19T18:40:54.697 GCP7A [default INFO] Listening on 0.0.0.0:11626 for HTTP requests
2021-10-19T18:40:54.700 GCP7A [default INFO] Connection effective settings:
2021-10-19T18:40:54.700 GCP7A [default INFO] TARGET_PEER_CONNECTIONS: 8
2021-10-19T18:40:54.700 GCP7A [default INFO] MAX_ADDITIONAL_PEER_CONNECTIONS: 64
2021-10-19T18:40:54.700 GCP7A [default INFO] MAX_PENDING_CONNECTIONS: 500
2021-10-19T18:40:54.700 GCP7A [default INFO] MAX_OUTBOUND_PENDING_CONNECTIONS: 56
2021-10-19T18:40:54.700 GCP7A [default INFO] MAX_INBOUND_PENDING_CONNECTIONS: 444
2021-10-19T18:40:54.700 GCP7A [default INFO] Application destructing
2021-10-19T18:40:54.700 GCP7A [default INFO] Application destroyed
2021-10-19T18:40:54.704 GCP7A [default INFO] Starting stellar-core stellar-core 18.1.0.rc1 (dc5f5a392098b82bd9453a2aa4259e7af600ad9d)

Note that the logs only say "Starting stellar-core..." after spewing a bunch of messages about temporary applications getting created and torn down.
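For context, here is a minimal sketch of the kind of startup flow that produces this ordering (all names here are illustrative stand-ins, not the actual stellar-core API): a short-lived application object is spun up just to prepare the minimal database, and only afterwards is the long-lived instance constructed, so the startup banner prints last.

```cpp
// Illustrative sketch only (hypothetical names, not the real stellar-core
// API): the log ordering above falls out of creating and destroying a
// short-lived application before the long-lived one.
#include <cstdio>
#include <memory>

struct Application
{
    ~Application() { std::puts("Application destructing"); }
    void initializeMinimalDB()
    {
        std::puts("Minimal database not found, creating one...");
    }
    void start() { std::puts("Starting stellar-core ..."); }
};

int
main()
{
    {
        // Temporary app used only to prepare the minimal database; its
        // teardown is what logs "Application destructing" / "Application
        // destroyed" before the startup banner is ever printed.
        auto setupApp = std::make_unique<Application>();
        setupApp->initializeMinimalDB();
    } // setupApp destroyed here

    // The long-lived instance is created last, so the
    // "Starting stellar-core ..." banner appears only after the noise above.
    auto app = std::make_unique<Application>();
    app->start();
    return 0;
}
```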

Second, if there is a conflicting core instance running, the error messages are:

2021-10-19T18:40:50.543 GB2VY [default INFO] Application destructing
2021-10-19T18:40:50.543 GB2VY [default INFO] Application destroyed
2021-10-19T18:40:50.543 GB2VY [default INFO] Minimal database not found, creating one...
2021-10-19T18:40:50.544 GB2VY [default INFO] Application destructing
2021-10-19T18:40:50.544 GB2VY [default INFO] Application destroyed
2021-10-19T18:40:50.544 GB2VY [default FATAL] Got an exception: unable to flock file: buckets/stellar-core.lock (Resource temporarily unavailable). This can be caused by access rights issues or another stellar-core process already running
2021-10-19T18:40:50.544 GB2VY [default FATAL] Please report this bug along with this log file if this was not expected

What is wrong here is that the "unable to flock file: buckets/stellar-core.lock" error is being silently handled and then triggers "Minimal database not found, creating one...". This is confusing and dangerous: the database already exists (it's just that another instance is already running). There is also an actual race: if the other instance stops, the current one will happily call new-db, thinking it needs to create a new database. In this situation we should rethrow the exception instead of handling it like this.
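A minimal POSIX sketch of what the fix amounts to (the helper names acquireLock and setupDatabase are hypothetical, not the actual stellar-core code): the non-blocking flock failure should surface as a fatal error before any "database not found" logic ever runs, so a second instance fails fast instead of reaching new-db.

```cpp
// Hypothetical sketch, assuming the lock is a non-blocking exclusive flock
// on the lock file (helper names are made up; the real code differs).
#include <cerrno>
#include <cstring>
#include <fcntl.h>
#include <stdexcept>
#include <string>
#include <sys/file.h>

// Take an exclusive, non-blocking lock; throw if another process holds it.
// The descriptor is deliberately kept open so the lock is held until exit.
void
acquireLock(std::string const& path)
{
    int fd = ::open(path.c_str(), O_CREAT | O_RDWR, 0644);
    if (fd < 0 || ::flock(fd, LOCK_EX | LOCK_NB) != 0)
    {
        throw std::runtime_error("unable to flock file: " + path + " (" +
                                 std::strerror(errno) + ")");
    }
}

void
setupDatabase(std::string const& lockPath, bool dbExists)
{
    // FIX: let the exception from acquireLock propagate. The buggy flow was
    // equivalent to catching it and falling through to database creation,
    // which races with the instance actually holding the lock: once that
    // instance exits, this one would run new-db over a valid database.
    acquireLock(lockPath);

    if (!dbExists)
    {
        // Safe only because we hold the exclusive lock at this point.
        // createNewDatabase();  // hypothetical
    }
}
```

Rethrowing means the second instance dies with the flock error (the expected outcome) rather than silently falling into the database-creation path.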

marta-lokhova commented 3 years ago

Opened https://github.com/stellar/stellar-core/pull/3268 to deal with the race in captive core. I didn't change the logging: we're using the same app-creation path to spin up temp apps, and I'm not really sure this is a big problem? I mean, we log things like "did not find the DB, so initializing a new one" or "found the DB" to give users an idea of what's happening, so maybe that's fine?