
[Bug] Graphman copy operation is stuck in 'queued' phase. #5052

AnPiakhota opened this issue 10 months ago

AnPiakhota commented 10 months ago

Bug report

A self-hosted graph-node instance has been configured to use a number of chains. With one DB instance and multiple stores linked to the configured chains, subgraphs are successfully deployed and queried. There is only one default indexer.

Problems begin when database sharding needs to be applied. For this, a separate physical server has been allocated with another installation of PostgreSQL. The latter has an identical configuration: user, extensions, and so forth. The TOML configuration file has been adjusted to add another shard to the [store] section, although neither the [deployment] nor the [chains] section references this shard.

[store]
# ethereum
[store.primary]
connection = "postgresql://graphpguser:pass@host:5432/graphnode"
pool_size = 10
[store.primary_svr18]
connection = "postgresql://graphpguser:pass@host:5432/graphnode"
pool_size = 10

[store.mantle]
connection = "postgresql://graphpguser:pass@host:5432/mantle"
pool_size = 10
[store.mantle_svr18]
connection = "postgresql://graphpguser:pass@host:5432/mantle"
pool_size = 10

Making a copy of one of the successfully deployed subgraphs with

graphman --config $GRAPH_NODE_CONF copy create mantle/onft-mints-1 mantle_svr18 default

produces no errors. Running

graphman --config $GRAPH_NODE_CONF copy list

outputs the following (a number of attempts were made):

------------------------------------------------------------------------------
deployment           | QmdQixo5pjYtFspEJLWLP5A2iYJ9b6aF3gA7QEUi9BTGWD
action               | sgd2 -> sgd3 (mantle_svr18)
queued               | 2023-12-05T11:14:02+00:00
------------------------------------------------------------------------------
deployment           | QmfPrkcWi7U1kCHnoNgeM8RVk8TCy54mua3RRM5bM6JL2V
action               | sgd6 -> sgd7 (mantle_svr18)
queued               | 2023-12-05T16:41:09+00:00
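
There is also a copy status subcommand; assuming it takes the destination's sgdNNN namespace (our reading of the help output, not something the docs confirm), the check for the first stuck copy would be:

graphman --config $GRAPH_NODE_CONF copy status sgd3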

The copies have been stuck in the queued phase for more than a day. svr18 is just a separate machine running only PostgreSQL. When metis_svr18 is specified as the shard for a given network, new subgraphs are deployed to that server and stored in that DB as expected; no issues there. Moreover, when copying a subgraph from the shard (svr18) back to the main database, the copy finishes almost immediately and everything seems fine. The --activate and --replace options also work in that case.

The issue is that we cannot offload data from the main database by moving subgraphs to another shard. Accessibility of the shard DB is not the problem, because it works fine throughout the deployment of new subgraphs; that is, new subgraphs get deployed to the shard without any trouble.
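
For reference, routing new subgraphs to the shard is done with a deployment rule roughly like the sketch below; the match pattern here is made up for illustration rather than copied from our config:

[deployment]
[[deployment.rule]]
match = { network = "mantle" }
shard = "mantle_svr18"
indexers = [ "default" ]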

Another issue is that we cannot unassign, remove, or drop such hanging subgraphs, because they are now duplicated:

name             | mantle/onft-mints-1
status           | current
id               | QmdQixo5pjYtFspEJLWLP5A2iYJ9b6aF3gA7QEUi9BTGWD
namespace        | sgd3
shard            | mantle_svr18
active           | false
chain            | mantle
node_id          | default
paused           | false
synced           | false
health           | healthy
earliest block   | 0
latest block     | -
chain head block | 25113446
-----------------+------------------------------------------------------------
name             | mantle/onft-mints-1
status           | current
id               | QmdQixo5pjYtFspEJLWLP5A2iYJ9b6aF3gA7QEUi9BTGWD
namespace        | sgd2
shard            | mantle
active           | true
chain            | mantle
node_id          | default
paused           | false
synced           | true
health           | healthy
earliest block   | 0
latest block     | 25113445
chain head block | 25113446

and any attempt to remove them produces the following output:

Found 2 deployment(s) to remove:
name       | mantle/onft-mints-1
deployment | QmdQixo5pjYtFspEJLWLP5A2iYJ9b6aF3gA7QEUi9BTGWD
-----------+------------------------------------------------------------------
name       | mantle/onft-mints-1
deployment | QmdQixo5pjYtFspEJLWLP5A2iYJ9b6aF3gA7QEUi9BTGWD

Continue? [y/N] y
Error: Found 2 deployments for `mantle/onft-mints-1`
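
If graphman accepts the sgdNNN namespace as a deployment selector (we are not certain it does; the docs do not say), something along these lines might target only the queued copy and side-step the duplicate name, but we have not been able to verify it:

graphman --config $GRAPH_NODE_CONF unassign sgd3
graphman --config $GRAPH_NODE_CONF drop sgd3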

We are aware of:

Either we are mishandling graphman somehow, because the documentation is very scarce, or there is definitely an issue. Please see the relevant logs attached. The entries about successful copying relate to the copy operation done from the shard back to the main, pre-existing database. A database remap does not help either:

graphman --config $GRAPH_NODE_CONF database remap 

The overall impression is that graphman is an excellent tool, but quite buggy and unfinished. Could you please help resolve the issue? Maybe every database needs to be handled by at least one graph-node instance (horizontal scaling) in order to cache the data, and attaching an additional DB as a shard to a single graph-node is simply not supposed to work, which would hardly be viable. The docs do not mention this either way.
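
To make that speculation concrete: what we have in mind is a second graph-node process dedicated to the shard, roughly like the sketch below, with a deployment rule whose indexers list points at it. The node id index_node_svr18 and the $IPFS_NODE variable are placeholders, not values from our setup:

graph-node --config $GRAPH_NODE_CONF --node-id index_node_svr18 --ipfs $IPFS_NODE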

Relevant log output

Dec 06 10:26:53 server1 cargo[2688199]: Dec 06 10:26:53.610 INFO Initializing graft by copying data from sgd1 to sgd4, sgd: 4, subgraph_id: QmSSVSfyCNNWhwGAA6q1QYXenSi85nUBejjuSxTsLZt3vJ, component: SubgraphInstanceManager
Dec 06 10:26:53 server1 cargo[2688199]: Dec 06 10:26:53.615 INFO Initializing graft by copying data from sgd6 to sgd7, sgd: 7, subgraph_id: QmfPrkcWi7U1kCHnoNgeM8RVk8TCy54mua3RRM5bM6JL2V, component: SubgraphInstanceManager
Dec 06 10:26:53 server1 cargo[2688199]: Dec 06 10:26:53.615 INFO Initializing graft by copying data from sgd2 to sgd3, sgd: 3, subgraph_id: QmdQixo5pjYtFspEJLWLP5A2iYJ9b6aF3gA7QEUi9BTGWD, component: SubgraphInstanceManager
Dec 06 10:26:53 server1 cargo[2688199]: Dec 06 10:26:53.815 INFO Obtaining copy lock (this might take a long time if another process is still copying), dst: sgd4, sgd: 4, subgraph_id: QmSSVSfyCNNWhwGAA6q1QYXenSi85nUBejjuSxTsLZt3vJ, component: SubgraphInstanceManager
Dec 06 10:26:53 server1 cargo[2688199]: Dec 06 10:26:53.821 INFO Obtaining copy lock (this might take a long time if another process is still copying), dst: sgd7, sgd: 7, subgraph_id: QmfPrkcWi7U1kCHnoNgeM8RVk8TCy54mua3RRM5bM6JL2V, component: SubgraphInstanceManager
Dec 06 10:26:54 server1 cargo[2688199]: Dec 06 10:26:54.123 INFO Obtaining copy lock (this might take a long time if another process is still copying), dst: sgd3, sgd: 3, subgraph_id: QmdQixo5pjYtFspEJLWLP5A2iYJ9b6aF3gA7QEUi9BTGWD, component: SubgraphInstanceManager
Dec 06 10:41:32 server1 cargo[2696895]: Dec 06 10:41:32.982 INFO Initializing graft by copying data from sgd1 to sgd4, sgd: 4, subgraph_id: QmSSVSfyCNNWhwGAA6q1QYXenSi85nUBejjuSxTsLZt3vJ, component: SubgraphInstanceManager
Dec 06 10:41:32 server1 cargo[2696895]: Dec 06 10:41:32.983 INFO Initializing graft by copying data from sgd6 to sgd7, sgd: 7, subgraph_id: QmfPrkcWi7U1kCHnoNgeM8RVk8TCy54mua3RRM5bM6JL2V, component: SubgraphInstanceManager
Dec 06 10:41:32 server1 cargo[2696895]: Dec 06 10:41:32.983 INFO Initializing graft by copying data from sgd2 to sgd3, sgd: 3, subgraph_id: QmdQixo5pjYtFspEJLWLP5A2iYJ9b6aF3gA7QEUi9BTGWD, component: SubgraphInstanceManager
Dec 06 10:41:33 server1 cargo[2696895]: Dec 06 10:41:33.187 INFO Obtaining copy lock (this might take a long time if another process is still copying), dst: sgd7, sgd: 7, subgraph_id: QmfPrkcWi7U1kCHnoNgeM8RVk8TCy54mua3RRM5bM6JL2V, component: SubgraphInstanceManager
Dec 06 10:41:33 server1 cargo[2696895]: Dec 06 10:41:33.187 INFO Obtaining copy lock (this might take a long time if another process is still copying), dst: sgd4, sgd: 4, subgraph_id: QmSSVSfyCNNWhwGAA6q1QYXenSi85nUBejjuSxTsLZt3vJ, component: SubgraphInstanceManager
Dec 06 10:41:33 server1 cargo[2696895]: Dec 06 10:41:33.488 INFO Obtaining copy lock (this might take a long time if another process is still copying), dst: sgd3, sgd: 3, subgraph_id: QmdQixo5pjYtFspEJLWLP5A2iYJ9b6aF3gA7QEUi9BTGWD, component: SubgraphInstanceManager
Dec 06 10:55:28 server1 cargo[2696895]: Dec 06 10:55:28.520 INFO Initializing graft by copying data from sgd8 to sgd9, sgd: 9, subgraph_id: Qmaqr5gsaF7r9WZQwh5KLVR7iAnAQHRvfCibn2YoswjQrR, component: SubgraphInstanceManager
Dec 06 10:55:28 server1 cargo[2696895]: Dec 06 10:55:28.522 INFO Obtaining copy lock (this might take a long time if another process is still copying), dst: sgd9, sgd: 9, subgraph_id: Qmaqr5gsaF7r9WZQwh5KLVR7iAnAQHRvfCibn2YoswjQrR, component: SubgraphInstanceManager
Dec 06 10:55:29 server1 cargo[2696895]: Dec 06 10:55:29.142 INFO Initialize data copy from Qmaqr5gsaF7r9WZQwh5KLVR7iAnAQHRvfCibn2YoswjQrR[sgd8] to Qmaqr5gsaF7r9WZQwh5KLVR7iAnAQHRvfCibn2YoswjQrR[sgd9], dst: sgd9, sgd: 9, subgraph_id: Qmaqr5gsaF7r9WZQwh5KLVR7iAnAQHRvfCibn2YoswjQrR, component: SubgraphInstanceManager
Dec 06 10:55:29 server1 cargo[2696895]: Dec 06 10:55:29.756 INFO Finished copying data into Qmaqr5gsaF7r9WZQwh5KLVR7iAnAQHRvfCibn2YoswjQrR[sgd9], dst: sgd9, sgd: 9, subgraph_id: Qmaqr5gsaF7r9WZQwh5KLVR7iAnAQHRvfCibn2YoswjQrR, component: SubgraphInstanceManager

IPFS hash

No response

Subgraph name or link to explorer

No response

Some information to help us out

OS information

Linux

paymog commented 9 months ago

https://github.com/graphprotocol/graph-node/issues/4719 might be helpful

github-actions[bot] commented 1 month ago

Looks like this issue has been open for 6 months with no activity. Is it still relevant? If not, please remember to close it.