jenkins-infra / helpdesk

Open your Infrastructure related issues here for the Jenkins project
https://github.com/jenkins-infra/helpdesk/issues/new/choose
16 stars 10 forks source link

[uplink] Download failing for `JavaSystemProperties` with `error: missing chunk number 0 for toast value xx in pg_toast_xxx` #3883

Open dduportal opened 7 months ago

dduportal commented 7 months ago

Service(s)

Uplink

Summary

As reported by @daniel-beck in IM discussion (ref. https://matrix.to/#/!JLUOInpEYmxJIYXlzs:matrix.org/$VlAjMDeZOPjLhMwkevPtM7P1zklTyH_wvJ9eXfpkHaE?via=g4v.dev&via=gitter.im&via=matrix.org):

Hi folks, I am unable to download the JavaSystemProperties telemetry trial data from uplink.jenkins.io, downloads abort. Other trials' data can be downloaded. Do we have data(base) problems? Anything interesting being logged? (...) Specifically, any date range including Dec 24 fails. Earlier and later work. (...) Correction, 2023-12-31 also fails.

Reproduction steps

dduportal commented 7 months ago

Reproduced the error:

error: missing chunk number 0 for toast value 21681973 in pg_toast_17895
    at Connection.parseE (/srv/uplink/node_modules/pg/lib/connection.js:553:11)
    at Connection.parseMessage (/srv/uplink/node_modules/pg/lib/connection.js:378:19)
    at Socket.<anonymous> (/srv/uplink/node_modules/pg/lib/connection.js:119:22)
    at Socket.emit (events.js:180:13)
    at Socket.emit (domain.js:422:20)
    at addChunk (_stream_readable.js:274:12)
    at readableAddChunk (_stream_readable.js:261:11)
    at Socket.Readable.push (_stream_readable.js:218:10)
    at TCP.onread (net.js:581:20)
dduportal commented 7 months ago

Triggered a REINDEX of the table events (will take some time and resources) from a screen named helpdesk-3883 on the VPN private machine)

dduportal commented 7 months ago

Triggered a REINDEX of the table events (will take some time and resources) from a screen named helpdesk-3883 on the VPN private machine)

Update:

> select reltoastrelid::regclass from pg_class where relname = 'events';
      reltoastrelid      
-------------------------
 pg_toast.pg_toast_17895
> REINDEX table pg_toast.pg_toast_17895;

TODO:

dduportal commented 7 months ago

Update:

dduportal commented 7 months ago

Update:

dduportal commented 7 months ago

Update: still trying to find the corrupted table segment

dduportal commented 7 months ago

Update:

> select reltoastrelid::regclass from pg_class where relname = 'events';
reltoastrelid      
-------------------------
pg_toast.pg_toast_17895

> select count(*) from events;
  count
----------
 34673551
(1 row)

# Some retries
> select * from events order by id limit 5000000 offset 30000000;
ERROR:  missing chunk number 0 for toast value 19019672 in pg_toast_17895

=> now to use dichotomy to locate the failing segment

dduportal commented 7 months ago

Update: dichotomy

> select * from events order by id limit 5000000 offset 30000000;
ERROR:  missing chunk number 0 for toast value 19019672 in pg_toast_17895

> select * from events order by id limit 2500000 offset 30000000;
ERROR:  missing chunk number 0 for toast value 19019672 in pg_toast_17895

> select * from events order by id limit 1250000 offset 30000000; # 30000000 -> 31250000
ERROR:  missing chunk number 0 for toast value 19019672 in pg_toast_17895

> select * from events order by id limit 625000 offset 30000000; # 30000000 -> 30625000
(no error)

> select * from events order by id limit 625000 offset 30625000; # 30625000 -> 31250000
ERROR:  missing chunk number 0 for toast value 19019672 in pg_toast_17895

> select * from events order by id limit 312500 offset 30625000; # 30625000 -> 30937500
ERROR:  missing chunk number 0 for toast value 19019672 in pg_toast_17895
dduportal commented 7 months ago

Update:

> select * from events order by id limit 156250 offset 30625000; # 30625000 -> 30781250
ERROR:  missing chunk number 0 for toast value 19019672 in pg_toast_17895

> select * from events order by id limit 78125 offset 30625000; # 30625000 -> 30703125
ERROR:  missing chunk number 0 for toast value 19019672 in pg_toast_17895

> select * from events order by id limit 35000 offset 30625000; # 30625000 -> 30660000
ERROR:  missing chunk number 0 for toast value 19019672 in pg_toast_17895

> select * from events order by id limit 10000 offset 30625000; # 30625000 -> 30635000
ERROR:  missing chunk number 0 for toast value 19019672 in pg_toast_17895

> select * from events order by id limit 5000 offset 30625000; # 30625000 -> 30630000
No error

Almost there!

dduportal commented 7 months ago

Update:

> select * from events order by id limit 5000 offset 30630000; # 30630000 -> 30635000
ERROR:  missing chunk number 0 for toast value 19019672 in pg_toast_17895

> select * from events order by id limit 2500 offset 30630000; # 30630000 -> 30632500
ERROR:  missing chunk number 0 for toast value 19019672 in pg_toast_17895

> select * from events order by id limit 1000 offset 30630000; # 30630000 -> 30631000
ERROR:  missing chunk number 0 for toast value 19019672 in pg_toast_17895

> select * from events order by id limit 500 offset 30630000; # 30630000 -> 30630500
ERROR:  missing chunk number 0 for toast value 19019672 in pg_toast_17895

> select * from events order by id limit 100 offset 30630000; # 30630000 -> 30630100
No error

> select * from events order by id limit 100 offset 30630100; # 30630100 -> 30630200
ERROR:  missing chunk number 0 for toast value 19019672 in pg_toast_17895

> select * from events order by id limit 50 offset 30630100; # 30630100 -> 30630150
# In progress
dduportal commented 7 months ago

Finally!

> select * from events order by id limit 50 offset 30630100; # 30630100 -> 30630150
No error

> select * from events order by id limit 50 offset 30630150; # 30630150 -> 30630200
ERROR:  missing chunk number 0 for toast value 19019672 in pg_toast_17895

> select * from events order by id limit 25 offset 30630150; # 30630150 -> 30630175
ERROR:  missing chunk number 0 for toast value 19019672 in pg_toast_17895

> select * from events order by id limit 10 offset 30630150; # 30630150 -> 30630160
No error

> select * from events order by id limit 10 offset 30630160; # 30630160 -> 30630170
ERROR:  missing chunk number 0 for toast value 19019672 in pg_toast_17895

> select * from events order by id limit 5 offset 30630160; # 30630160 -> 30630165
ERROR:  missing chunk number 0 for toast value 19019672 in pg_toast_17895

> select * from events order by id limit 2 offset 30630160;
ERROR:  missing chunk number 0 for toast value 19019672 in pg_toast_17895

> select * from events order by id limit 1 offset 30630160;
ERROR:  missing chunk number 0 for toast value 19019672 in pg_toast_17895
dduportal commented 7 months ago

Update:

> select id from events order by id limit 1 offset 30630160;
id
----------
 30648884
(1 row)

> delete from events where id = 30648884;
DELETE 1
uplink=> select id from events order by id limit 1 offset 30630160;
    id
----------
 30648885
(1 row)
dduportal commented 7 months ago

Here we go again, let's fin the other corrupted records (same screen session, same VM):

> select * from events order by id limit 78125 offset 30625000; # 30625000 -> 30703125
# WiP
dduportal commented 6 months ago

Deleted a few more (corrupted) records and ran a VACUUM:

> delete from events where id = 30648885;
DELETE 1
> delete from events where id = 30648886;
DELETE 1
> delete from events where id = 30648887;
DELETE 1
> delete from events where id = 30648888;
DELETE 1

> VACUUM analyze events;

=> download goes further but still errors. Had to reboot the machine (and loose the screen session) so the research of other corrupted records will have to wait after FOSDEM.

daniel-beck commented 6 months ago

FWIW this particular data collection is now (mostly) finished. It was how I got file.encoding stats for the advisory. The other system properties might be interesting but not an immediate concern.

Since it's not just one record it might be interesting to find out what's happening here though, to prevent it from occurring in the future.

dduportal commented 6 months ago

Reseach for corrupted records still running

dduportal commented 6 months ago

Update:

dduportal commented 6 months ago

Of course:

> REINDEX table pg_toast.pg_toast_17895;
ERROR:  permission denied for schema pg_toast

because Azure Flexible Server does NOT provide super user access.

Gotta check https://learn.microsoft.com/en-us/azure/postgresql/flexible-server/how-to-perform-fullvacuum-pg-repack