status-im / infra-status

Infrastructure for Status fleets
https://github.com/status-im/nim-waku
0 stars 2 forks source link

Duplicate key value errors for messages table #26

Closed jakubgs closed 7 months ago

jakubgs commented 10 months ago

These logs have started appearing in database logs on shards.test nodes:

[1006408] ERROR:  duplicate key value violates unique constraint "messageindex"
[1006408] DETAIL:  Key (messagehash)=(0da6c0e74e9193f537f154d5a62331c3a561c4313486bd300c5020410a4583cb) already exists.
[1006408] STATEMENT:  INSERT INTO messages (id, messageHash, storedAt, contentTopic, payload, pubsubTopic,
#011  version, timestamp) VALUES ($1, $2, $3, $4, $5, $6, $7, $8);

And the numbers are worrying:

admin@store-db-01.do-ams3.shards.test:~ % zgrep 'messageindex' /var/log/docker/nim-waku-store-db/docker.* | wc -l
985303
admin@store-db-01.do-ams3.shards.test:~ % zgrep 'duplicate key value' /var/log/docker/nim-waku-store-db/docker.* | wc -l        
985303

Node version used:

admin@store-01.do-ams3.shards.test:~ % /docker/nim-waku-store/rpc.sh get_waku_v2_debug_v1_version | jq -c
{"jsonrpc":"2.0","id":1,"result":"v0.22.0-rc.0-29-gbd60bf"}
jakubgs commented 10 months ago

The gbd60bf commit doesn't appear to exist in the repo: https://github.com/waku-org/nwaku/commit/gbd60bf

jakubgs commented 10 months ago

This fleet currently does not use the layout where a database is shared: https://github.com/status-im/infra-shards/blob/ae8ff33aa207241cdf57a3029e1ff022ae6548b8/ansible/group_vars/store.yml#L40

jakubgs commented 10 months ago

If we looks at ElasticSearch logs for shards.test fleet we can see the issue started on 2023.12.11:

 > ./elasticsearch/esclean.py -i 'logstash-2023.12.*' -f shards.test -p 'docker/nim-waku-store-db' -m 'duplicate key value' 
Cluster: status-logs-search
logstash-2023.12.01    count:        0
logstash-2023.12.02    count:        0
logstash-2023.12.03    count:        0
logstash-2023.12.04    count:        0
logstash-2023.12.05    count:        0
logstash-2023.12.06    count:        0
logstash-2023.12.07    count:        0
logstash-2023.12.08    count:        0
logstash-2023.12.09    count:        0
logstash-2023.12.10    count:        0
logstash-2023.12.11    count:     1427
logstash-2023.12.12    count:     5191
logstash-2023.12.13    count:     8664
logstash-2023.12.14    count:     9390
logstash-2023.12.15    count:    10977
logstash-2023.12.16    count:     8047
logstash-2023.12.17    count:     3624
logstash-2023.12.18    count:    27613
logstash-2023.12.19    count:    33225
logstash-2023.12.20    count:    34919
logstash-2023.12.21    count:    82543
logstash-2023.12.22    count:    73998
logstash-2023.12.23    count:     6123
logstash-2023.12.24    count:     8733
logstash-2023.12.25    count:     8880
logstash-2023.12.26    count:    41259
logstash-2023.12.27    count:  1229433
logstash-2023.12.28    count:  1329964
logstash-2023.12.29    count:    69862
logstash-2023.12.30    count:     9576
logstash-2023.12.31    count:    12972

And reached a peak on 2023.12.28.

jakubgs commented 10 months ago

Oh, sorry, my mistake, the DB is actually used by two nodes per DC:

store-01.do-ams3.shards.test
store-02.do-ams3.shards.test

But that doesn't explain why the warnings started showing up on 2023.12.11.

jakubgs commented 10 months ago

We could avoid these by first checking if a given message is already in the table: https://stackoverflow.com/a/12388511

But in theory that could have a performance impact, but not sure, since the DB still has to check the uniqueness of the record anyway, even if you just insert.

jakubgs commented 10 months ago

For now I have decided to filter out these logs on Logstash level to avoid unnecessary ElasticSearch spam:

And opened an issue to possibly query before inserting:

jakubgs commented 10 months ago

Also took a look at waku node logs and found out most come from two files: pubsub.nim and pubsubpeer.nim

image

Unrelated but interesting.

jakubgs commented 7 months ago

There is one thing we could use, not sure what the performance impact would be:

INSERT ... ON CONFLICT DO NOTHING;

The ON CONFLICT clause could prevent these warnings: https://www.postgresql.org/docs/current/sql-insert.html#SQL-ON-CONFLICT

Opened an issue:

No idea if it would work.