jharveyb opened 12 months ago
Hi!
I've set up tapd 0.4.1-alpha with a local bitcoind 27.1 and am trying to sync the global default universe. I use the default SQLite backend, and after about a week of syncing my tapd.db is only about 150 MB and I've only managed to sync a few assets; I think the stats haven't changed for several days:
root@server:~# tapcli universe stats
{
"num_total_assets": "6440",
"num_total_groups": "37",
"num_total_syncs": "0",
"num_total_proofs": "6930"
}
I get the "db tx retries exceeded" error as well:
2024-08-15 13:17:09.938 [DBG] UNIV: Looking up root node for base Universe (universe.Identifier) issuance-5cef8faf4ce2d667bb31ada76ff6d0a3c1d7944cf36a39f4af4d4e55084257a7
[... about 895 similar lines in the same second... ]
2024-08-15 13:17:09.941 [DBG] UNIV: Looking up root node for base Universe (universe.Identifier) issuance-8ab9ad693975feaabb386772753217a8d0122570610ca8a6aeec8ca900c745a0
2024-08-15 13:17:09.942 [WRN] UNIV: encountered an error whilst syncing with server=(universe.ServerAddr) {
ID: (int64) 1,
addrStr: (string) (len=32) "universe.lightning.finance:10029",
addr: (net.Addr) <nil>
}
: unable to register proofs: unable to register proofs: unable to register new group anchor issuance proofs: db tx retries exceeded
There is neither high I/O on the local disk nor high CPU load. Is a global universe sync only supported/working with a Postgres DB? I've used SQLite for other projects with multi-gigabyte databases without issues. I have this issue on both testnet3 and mainnet.
Hi @freerko. The sync should work for SQLite users. Not sure why you run into this particular issue consistently. I assume you're running with the --universe.sync-all-assets flag (or universe.sync-all-assets=true config option)?
If you're able (and comfortable) building from source, could you try increasing this number to 50 and see if that changes things?
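For context, the retry cap being discussed guards a bounded-retry transaction wrapper: a write transaction is retried a fixed number of times when SQLite reports a busy/locked conflict, and once the cap is hit you get the "db tx retries exceeded" error. The sketch below shows the general shape of such a wrapper with hypothetical names (executeTx, isSerializationError); it is not tapd's actual code.

```go
package txretry

import (
	"database/sql"
	"errors"
	"math/rand"
	"strings"
	"time"
)

// executeTx retries a DB transaction up to maxRetries times when it fails
// with a retryable serialization/lock conflict. Illustrative only.
func executeTx(db *sql.DB, maxRetries int, txBody func(*sql.Tx) error) error {
	for attempt := 0; attempt < maxRetries; attempt++ {
		tx, err := db.Begin()
		if err != nil {
			return err
		}

		err = txBody(tx)
		if err == nil {
			err = tx.Commit()
		} else {
			_ = tx.Rollback()
		}

		if err == nil {
			return nil
		}
		if !isSerializationError(err) {
			return err
		}

		// Jittered backoff so concurrent writers stop colliding on the
		// same database lock before the next attempt.
		time.Sleep(time.Duration(rand.Intn(50*(attempt+1))) * time.Millisecond)
	}

	return errors.New("db tx retries exceeded")
}

// isSerializationError is a placeholder check for SQLite busy/locked style
// conflicts that are safe to retry.
func isSerializationError(err error) bool {
	return strings.Contains(err.Error(), "locked")
}
```

Raising the cap buys more headroom when many goroutines write concurrently, but it does not remove the underlying contention.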
Thanks @guggero! I didn't use the universe.sync-all-assets=true config option yet, but set it now since it sounds like what I want :) I've changed the DefaultNumTxRetries to 50, compiled tapd, and let it run on the same db as before. Now it adds new assets; I see the numbers from "tapcli universe stats" increasing. But after running the universe sync command I still run into the same error, after half an hour or so:
root@server:~# tapcli universe sync --universe_host universe.lightning.finance
[tapcli] rpc error: code = Unknown desc = unable to sync universe: unable to register proofs: unable to register proofs: unable to register new group anchor issuance proofs: db tx retries exceeded
But it continues to sync in the background, so I'll let it run over the next few days and report back what happens.
On the other node, on testnet3, after setting the retries to 50 I got this error every 10 minutes or so, with different hashes, and the numbers in the stats command didn't increase:
2024-08-16 09:44:02.763 [DBG] UNIV: Looking up root node for base Universe (universe.Identifier) issuance-bda186d39d9707ec78a639659de8fe067d0163a42a0850f75c224cb9dbe00ae3
2024-08-16 09:44:02.763 [DBG] UNIV: Looking up root node for base Universe (universe.Identifier) issuance-4d248e3497ebfc47f01eee146aded9b1d321b5a33ebb3f8e4f5a145b8cc7fb43
2024-08-16 09:44:02.763 [DBG] UNIV: Looking up root node for base Universe (universe.Identifier) issuance-b6e4311901cadc3d11286d231a5e4c4585041934bdfd3261d64a37937c1330de
2024-08-16 09:44:02.764 [WRN] UNIV: encountered an error whilst syncing with server=(universe.ServerAddr) { ID: (int64) 1, addrStr: (string) (len=40) "testnet.universe.lightning.finance:10029", addr: (net.Addr) <nil>
} : unable to register proofs: unable to register proofs: unable to verify issuance proofs: unable to fetch previous asset snapshot: unable to fetch previous proof: no universe proof found, id=(universe.Identifier) issuance-a823a0096bd93c6683fb0189272521a4f7a2b57baec1558c60aa22a3afe446c5 , leaf_key=(universe.LeafKey) { OutPoint: (wire.OutPoint) 2dcc508b05efd21f594f1632a63a68d81d157e76fb22f64c9a15a61f2356148b:0, ScriptKey: (asset.ScriptKey)(0xc001004de0)({ PubKey: (secp256k1.PublicKey)(0xc00101e320)({ x: (secp256k1.FieldVal) 9c75753189074f4fa8cd322c9ca409d8d472a5e7ae0f7dd466226c93cc79a5c6, y: (secp256k1.FieldVal) d4501219d06825c66444e82b367d06eb8f84a61a4b97e1a1f715cfa4fa320258 }), TweakedScriptKey: (*asset.TweakedScriptKey)( ) }) } , new_script_key=02540e4eff752cbd7cd9e963e69bea5ba45c73909928dd171ff98e1b4adc03fc24
So I removed the data folder with the db and started from scratch on testnet3. I got some assets synced, but then I got another error:
2024-08-16 10:57:37.117 [DBG] UNIV: Looking up root node for base Universe (universe.Identifier) issuance-dfa1a1bf7bfee242ba635bcd966393d8df48beccdc4be9fecd386728b563edcd
2024-08-16 10:57:37.117 [DBG] UNIV: Looking up root node for base Universe (universe.Identifier) issuance-3a8e10323168e7ebf19fbc59bcb607071f7cb4f195f041522e9ec576f49624ee
2024-08-16 10:57:37.165 [WRN] UNIV: encountered an error whilst syncing with server=(universe.ServerAddr) { ID: (int64) 1, addrStr: (string) (len=40) "testnet.universe.lightning.finance:10029", addr: (net.Addr) <nil>
} : unable to register proofs: unable to register proofs: unable to verify issuance proofs: unable to verify proof: failed to validate proof block header: block hash and block height mismatch; (height: 2867248, hashAtHeight: 00000000d1e5841c9873c053fafc6972da94549198268d1b9d73ddfa184cb6bf, expectedHash: 0000000075f28b79d00152765fcf6feb4d28b4270f8000fd396b675797826d3a)
2024-08-16 10:57:47.284 [INF] TADB: Refreshing stats cache, duration=30m0s
2024-08-16 10:57:47.310 [DBG] TADB: Refreshed stats cache, interval=30m0s, took=26.134612ms
I get the error with the same hashes every 10 minutes now and the numbers in the stats command are not increasing anymore. The hashAtHeight seems to be fine for that block: https://mempool.space/testnet/block/00000000d1e5841c9873c053fafc6972da94549198268d1b9d73ddfa184cb6bf I don't know where that expectedHash comes from.
block hash and block height mismatch; (height: 2867248, hashAtHeight: 00000000d1e5841c9873c053fafc6972da94549198268d1b9d73ddfa184cb6bf, expectedHash: 0000000075f28b79d00152765fcf6feb4d28b4270f8000fd396b675797826d3a)
I think someone uploaded a proof referencing a block that was later re-organized (testnet3 is a bit of a mess when it comes to re-orgs). We have since increased the number of blocks we watch for re-orgs on testnet. But it seems like we probably have to remove that proof to fix the sync. I'll look into it.
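To illustrate what that failing check does (and where expectedHash likely comes from): the proof commits to a block hash and height, and the verifier compares that hash against the hash its own chain backend reports at the same height, so after a re-org the proof's stale hash (expectedHash) no longer matches the backend's hashAtHeight. Below is a minimal sketch of that comparison, assuming a generic, hypothetical chain-lookup interface rather than tapd's actual verifier:

```go
package proofsketch

import (
	"context"
	"fmt"
)

// ChainLookup is a hypothetical stand-in for whatever chain backend
// (bitcoind, btcd, neutrino) the verifier queries.
type ChainLookup interface {
	// BlockHashAtHeight returns the hash of the block currently at the
	// given height on the backend's best chain.
	BlockHashAtHeight(ctx context.Context, height uint32) (string, error)
}

// checkProofHeader reproduces the gist of the check: the hash committed to in
// the proof must match the hash the chain backend sees at that height. After
// a re-org the two diverge and verification fails.
func checkProofHeader(ctx context.Context, chain ChainLookup,
	proofBlockHash string, proofHeight uint32) error {

	hashAtHeight, err := chain.BlockHashAtHeight(ctx, proofHeight)
	if err != nil {
		return err
	}
	if hashAtHeight != proofBlockHash {
		return fmt.Errorf("block hash and block height mismatch; "+
			"(height: %d, hashAtHeight: %s, expectedHash: %s)",
			proofHeight, hashAtHeight, proofBlockHash)
	}
	return nil
}
```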
@freerko I couldn't reproduce the block height mismatch on testnet... Do you have more information from the log about which asset caused the problem?
Hmm, how can I see in the logs which asset caused the problem? Does the group key in the error from the sync command help?
I'm trying to sync everything (with universe.sync-all-assets=true) like this and get this error:
root@server:~# tapcli universe sync --universe_host testnet.universe.lightning.finance
[tapcli] rpc error: code = Unknown desc = unable to sync universe: unable to register proofs: unable to register proofs: unable to verify issuance proofs: unable to verify proof: group key not known: 02d8394c926be907c30af34e7b998afd09cb2638c69950a604d5a54711142bc97d: group verifier: no matching asset group: sql: no rows in result set: group key not known
In the tapd.log it looks like this; I always get different "Looking up root node for base Universe..." lines before the sync error.
2024-08-26 12:04:38.276 [DBG] UNIV: Looking up root node for base Universe (universe.Identifier) issuance-012b09f5ce832145b2cc4302bca5d23f6beb2b7125826120621561f94c59957f
2024-08-26 12:04:38.276 [DBG] UNIV: Looking up root node for base Universe (universe.Identifier) issuance-a7c4183f78cf3c09e7d85d49844c5dec36610c4feae02524368fb1d17da73d7a
2024-08-26 12:04:38.276 [DBG] UNIV: Looking up root node for base Universe (universe.Identifier) issuance-6362b41fd0201dabe36cd02c5c02d7b8d6b76062d8b6a7d2f1ae0fd7c4cc493d
2024-08-26 12:04:38.276 [DBG] UNIV: Looking up root node for base Universe (universe.Identifier) issuance-5a1afa5667055ea76951cc913065606c8f4d539d9d35176a27559e31822f509e
2024-08-26 12:04:38.276 [DBG] UNIV: Looking up root node for base Universe (universe.Identifier) issuance-d421c41e5cccc7a46853a67a77f9bfbff0e46b636be45080b70d6228430c2675
2024-08-26 12:04:38.276 [DBG] UNIV: Looking up root node for base Universe (universe.Identifier) transfer-4d248e3497ebfc47f01eee146aded9b1d321b5a33ebb3f8e4f5a145b8cc7fb43
2024-08-26 12:04:38.276 [DBG] UNIV: Looking up root node for base Universe (universe.Identifier) issuance-8e2feef6d0ef9f0b544140b8f74ceb327a84b0442a1d2e2a15bef8686755efd3
2024-08-26 12:04:38.310 [WRN] UNIV: encountered an error whilst syncing with server=(universe.ServerAddr) { ID: (int64) 1, addrStr: (string) (len=40) "testnet.universe.lightning.finance:10029", addr: (net.Addr) <nil>
} : unable to register proofs: unable to register proofs: unable to verify issuance proofs: unable to verify proof: failed to validate proof block header: block hash and block height mismatch; (height: 2867248, hashAtHeight: 00000000d1e5841c9873c053fafc6972da94549198268d1b9d73ddfa184cb6bf, expectedHash: 0000000075f28b79d00152765fcf6feb4d28b4270f8000fd396b675797826d3a)
2024-08-26 12:04:40.658 [INF] GRDN: New block at height 2875486
My local stats look similar to the ones from the Lightning Labs universe; at least it isn't missing a lot:
root@server:~# tapcli universe stats
{
"num_total_assets": "1240",
"num_total_groups": "189",
"num_total_syncs": "0",
"num_total_proofs": "1703"
}
root@server:~# curl -s https://testnet.universe.lightning.finance/v1/taproot-assets/universe/stats | jq
{
"num_total_assets": "1244",
"num_total_groups": "190",
"num_total_syncs": "2538359",
"num_total_proofs": "6081"
}
Now looking at my mainnet node again, it seems like the sync is stuck; the stats haven't changed for a few days:
root@server:~# tapcli universe stats
{
"num_total_assets": "16189",
"num_total_groups": "65",
"num_total_syncs": "0",
"num_total_proofs": "21606"
}
The data directory looks weird as well: apparently tapd hasn't written to tapd.db for more than a week, and last wrote to the big tapd.db-wal file two days ago:
root@server:~# ls -alh /root/.tapd/data/mainnet/
total 2.9G
drwx------ 3 root root 4.0K Aug 19 14:49 .
drwx------ 4 root root 4.0K Aug  7 13:48 ..
-rw-r--r-- 1 root root  248 Aug  7 13:35 admin.macaroon
drwxr-x--- 2 root root 4.0K Aug  7 13:35 proofs
-rw-r--r-- 1 root root 200M Aug 19 15:14 tapd.db
-rw-r--r-- 1 root root  12K Aug  7 13:35 tapd.db.1723030556431609440.backup
-rw-r--r-- 1 root root 148M Aug 16 11:08 tapd.db.1723799284664422550.backup
-rw-r--r-- 1 root root 5.1M Aug 25 20:51 tapd.db-shm
-rw-r--r-- 1 root root 2.6G Aug 25 20:51 tapd.db-wal
root@server:~# date
Tue Aug 27 03:39:14 PM CEST 2024
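One note on that listing: with SQLite in WAL mode, recent writes land in tapd.db-wal and are only folded back into tapd.db at checkpoint time, so an old mtime on tapd.db next to a multi-gigabyte -wal file usually means checkpoints aren't keeping up, not that writes stopped entirely. If you want to fold the WAL back manually (only with tapd stopped), the sketch below shows the generic SQLite maintenance step; it assumes the modernc.org/sqlite driver and is not a tapd command.

```go
package walcheckpoint

import (
	"database/sql"
	"fmt"

	_ "modernc.org/sqlite" // pure-Go SQLite driver; any SQLite driver works
)

// Checkpoint folds the -wal file back into the main database file and
// truncates the WAL. Run it only while tapd is stopped.
func Checkpoint(dbPath string) error {
	db, err := sql.Open("sqlite", dbPath)
	if err != nil {
		return err
	}
	defer db.Close()

	// wal_checkpoint(TRUNCATE) writes all WAL frames into the main file
	// and resets the WAL to zero bytes; it returns (busy, log, checkpointed).
	var busy, logFrames, checkpointed int
	row := db.QueryRow("PRAGMA wal_checkpoint(TRUNCATE);")
	if err := row.Scan(&busy, &logFrames, &checkpointed); err != nil {
		return err
	}
	fmt.Printf("busy=%d log=%d checkpointed=%d\n", busy, logFrames, checkpointed)
	return nil
}
```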
But when I look at the "db cache universe_roots hits/misses hit ratio" in the tapd.log, I see a very slowly increasing hit ratio, currently 2.46% (yesterday it was about 1.6%). The node has been syncing from the default universe for about two weeks now. In the tapd.log I still see some new "Verifying 200 new proofs for insertion into Universe" lines, but the stats are not increasing. The system is a VM with 8 vCores of an AMD EPYC 7282, 24 GB RAM, and an SSD disk, running several other services as well, but in my monitoring I don't see the system limited by CPU, RAM, network, or disk I/O.
Is this to be expected? Does it take months to sync? Or is my system too slow? Should I restart tapd regularly?
Logs look ok for me:
That is a bit weird; I would expect the sync to make progress after each run at least. For context, if I pull the stats from the mainnet Universe server:
curl -s https://universe.lightning.finance/v1/taproot-assets/universe/stats | jq
{
"num_total_assets": "108011",
"num_total_groups": "176",
"num_total_syncs": "80016341",
"num_total_proofs": "174907"
}
The system is a VM with 8 vCores of AMD EPYC 7282 and 24GB RAM and SSD disk
That should be totally fine, the sync should not be particularly resource intensive, and I think the total storage used for all proofs is still relatively small also.
IIRC there is some server-side rate limiting that may slow syncing progress if you're trying to sync all assets, but it isn't intended to prevent that.
Do all full universe servers encounter this issue when initiating?

Do all full universe servers encounter this issue when initiating?
Working on repro for mainnet now.
With DefaultNumTxRetries set to 999 it now syncs again, albeit very slowly at about 1000 assets per day, which would mean about 90 days left. Let's see :)
Can you try the default number of retries, with GOMAXPROCS=1, running this PR or later?
https://github.com/lightninglabs/taproot-assets/pull/1123
We're also checking to see if some traffic rate limiting may be affecting this / syncing a new full universe from scratch.
With the default number of retries and GOMAXPROCS=1 the proofs are increasing but the assets are not.
This was about 20 hours ago:
root@server:~# tapcli universe stats
{
"num_total_assets": "20842",
"num_total_groups": "67",
"num_total_syncs": "0",
"num_total_proofs": "40238"
}
and this is now:
root@server:~# tapcli universe stats
{
"num_total_assets": "20842",
"num_total_groups": "67",
"num_total_syncs": "0",
"num_total_proofs": "44889"
}
There are no "DB TX retries exceeded" errors in the log. After the restart tapd inserts 200 new leaves about every hour, beginning with 2024-09-26 14:36:34.803 [DBG] UNIV: UniverseRoot(issuance-87225111b8571dcd09e00116238f61c71091485538801b2d6b6457a75119861e): Inserting 200 new leaves (200 of 4452)
up to 2024-09-27 05:32:35.791 [DBG] UNIV: UniverseRoot(issuance-87225111b8571dcd09e00116238f61c71091485538801b2d6b6457a75119861e): Inserting 51 new leaves (4451 of 4452)
. And then continuing with a new root?: 2024-09-27 06:03:08.000 [DBG] UNIV: UniverseRoot(issuance-997e28940f643b41a3d937bcc842b01c3ba863d66dbd365db015d62a79dd32fc): Inserting 200 new leaves (200 of 3563)
.
Should I increase the DefaultNumTxRetries again?
Hmm, sounds like an issue with how we are passing work to the worker pool then; that single worker should end up receiving all the leaves anyway, not just batches of 200.
Will have to look into this deeper; I don't think NumRetries is the core issue.
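To illustrate that suspicion: the usual pattern is to fan leaf batches out to a bounded worker pool, and if the producer instead blocks on each batch until it is fully written before submitting the next one, throughput collapses to one batch at a time, which would match the roughly hourly "Inserting 200 new leaves" cadence above. The sketch below uses golang.org/x/sync/errgroup with hypothetical names and a placeholder leaf type; it is not tapd's actual sync code.

```go
package syncsketch

import (
	"context"

	"golang.org/x/sync/errgroup"
)

// registerBatches fans batches of leaves out to at most maxWorkers concurrent
// writers. The producer never waits on an individual batch, only on the whole
// group at the end.
func registerBatches(ctx context.Context, leaves []int, batchSize, maxWorkers int,
	writeBatch func(context.Context, []int) error) error {

	g, ctx := errgroup.WithContext(ctx)
	g.SetLimit(maxWorkers)

	for start := 0; start < len(leaves); start += batchSize {
		end := start + batchSize
		if end > len(leaves) {
			end = len(leaves)
		}
		batch := leaves[start:end]

		// Each batch is one unit of work; the group caps how many run
		// at once instead of serializing the producer per batch.
		g.Go(func() error {
			return writeBatch(ctx, batch)
		})
	}

	return g.Wait()
}
```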
Background
When running with tapd v0.3.0 and a global sync of all issuances on, I get numerous log messages about DB TX retries due to serialization errors. In some cases, these operations hit the 10 retry limit and cause sync to fail.
I know that from 0.3.1 onwards global sync is not the default, but IMO this is an underlying issue that could appear even without global sync.
Your environment
Steps to reproduce
Perform global issuance sync with the Lightning Labs default universe, with a fresh tapd.
From the log lines preceding these DB issues, I'm pretty sure this is triggered when multiple large (3000+ leaf) asset groups are being synced and those proofs are being written to disk. The default batch size here is 200 elements, which can add up to 200 MiB or more per batch. Since the sync is parallel, we can have multiple goroutines attempting writes at the same time.
The log lines aren't tagged with a goroutine ID or any other way to distinguish writers, but based on the attempt counter it looks like there are 10+ concurrent writers.
Partial logs, generated with cat tapd.log | rg serializ -C20:
Possible fixes
I expect most end users with smaller wallets (not universe operators or companies running larger nodes) will run with the defaults and on an SQLite backend.
We could decrease the default DB write batch size, or the # of parallel goroutines during sync.
Alternatively, we could add another synchronization layer between concurrent DB writers, like a counting semaphore, to rate-limit DB access while still keeping many goroutines for the other work performed during sync.
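A minimal sketch of that last idea, using golang.org/x/sync/semaphore to cap concurrent DB writers while the rest of the sync pipeline (verification, network fetches) stays fully parallel. All names here are illustrative, not tapd's API.

```go
package semsketch

import (
	"context"

	"golang.org/x/sync/semaphore"
)

// dbWriteGate caps how many goroutines may be inside a DB write transaction
// at once.
type dbWriteGate struct {
	sem *semaphore.Weighted
}

func newDBWriteGate(maxConcurrentWriters int64) *dbWriteGate {
	return &dbWriteGate{sem: semaphore.NewWeighted(maxConcurrentWriters)}
}

// withWriteSlot runs write while holding one of the limited writer slots, so
// SQLite sees far fewer conflicting transactions and the retry limit is
// rarely reached.
func (g *dbWriteGate) withWriteSlot(ctx context.Context, write func() error) error {
	if err := g.sem.Acquire(ctx, 1); err != nil {
		return err
	}
	defer g.sem.Release(1)

	return write()
}
```

Callers would wrap only the proof-insertion transaction in withWriteSlot and leave the surrounding goroutines untouched, which keeps sync concurrency for CPU- and network-bound work while serializing (or lightly limiting) the contended SQLite writes.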