MaxMaximus opened 4 weeks ago
Update: Strangely, I tried to reproduce the problem following my own sequence from the 1st post, and on the 2nd (as well as the 3rd) restart of the node, GC successfully continued working from the saved position!
I haven't changed anything in the config or setup between these restarts. Only three differences between the first restart and the subsequent ones come to mind so far:
1. At the time of the first restart the node had been working for a long time without restarts (more than a week; if necessary, I can later find the exact uptime by scanning the log), while the 2nd and 3rd restarts came after only a few dozen minutes of work.
2. Judging by the number of sub-directories already created in `\trash\ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa\2024-07-06\`, GC progress was almost finished at the first restart (845 of 1024 prefixes done), while at the 2nd and 3rd restarts it had managed to process only a few prefixes (a counting sketch is shown after this list).
3. Also, because the progress was reset, on the 2nd and 3rd passes GC did NOT move any files from `\blobs\` to `\trash\`: all files matching the Bloom filter had already been moved during the first pass.
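For reference, counting the already-created prefix sub-directories can be done with a small helper like the one below (just a sketch, not part of storagenode; the path is an example and must be adjusted to your node's storage location):

```go
// countprefixes.go - counts how many 2-character prefix sub-directories a GC
// run has already created in a trash date folder (out of 1024 possible
// prefixes). Sketch only; the path below is an example and must be adjusted.
package main

import (
	"fmt"
	"log"
	"os"
)

func main() {
	dateDir := `D:\Storj_Data\Storage Node\trash\ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa\2024-07-06`

	entries, err := os.ReadDir(dateDir)
	if err != nil {
		log.Fatal(err)
	}

	prefixes := 0
	for _, e := range entries {
		if e.IsDir() {
			prefixes++
		}
	}
	fmt.Printf("%d of 1024 prefixes created in %s\n", prefixes, dateDir)
}
```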
P.S. This is not the first time I have noticed GC losing progress when restarting a node, just the first time I decided to document it. And it is the first time I have seen the progress saving actually work.
UPDATE2: I wanted to catch the moment near the end of a GC run and restart the node there, to check whether that was the trigger. But I missed the right moment: the re-run of GC completed significantly faster than I expected (below it will be clear why it finished so "quickly", in ~1.5 hours). Still, I managed to catch something very interesting: in addition to losing progress on the first restart, the lazy GC also reported incorrect information to the parent process about the amount of garbage collected:
2024-07-07T09:44:17+03:00 INFO lazyfilewalker.gc-filewalker.subprocess gc-filewalker completed {"satelliteID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Process": "storagenode", "piecesCount": 7408653, "trashPiecesCount": 0, "piecesTrashed": 0, "piecesSkippedCount": 0}
2024-07-07T09:44:17+03:00 INFO lazyfilewalker.gc-filewalker subprocess finished successfully {"satelliteID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S"}
2024-07-07T09:44:17+03:00 INFO retain Moved pieces to trash during retain {"cachePath": "D:\\Storj_Data\\Storage Node/retain", "Deleted pieces": 0, "Failed to delete": 0, "Pieces failed to read": 0, "Pieces count": 7408653, "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Duration": "1h32m50.0489607s", "Retain Status": "enabled"}
So it reports ZERO deleted/trashed pieces, while the actual count is 1 684 268 pieces / 270 465 252 064 bytes moved from `\blobs\` to `\trash\` (I counted this by the size of the `\trash\ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa\2024-07-06\` folder). Wow, that's pretty extreme: 1 684 268 out of the 7 408 653 pieces reported in the log is more than 20% of all files stored on the node for this satellite, deleted in a single GC pass.
And it also gives a hint to the source of the wrong stats: the `\2024-07-06\` folder contains only 845 sub-folders/prefixes, so GC actually collected ZERO pieces in the second pass; all of these ~1.7M pieces were collected in the first pass, which it simply "forgot" about. I think this is one of the important reasons for the continuing large discrepancies in disk space usage statistics that a lot of SNOs keep complaining about. Such or similar loss of stats during garbage collection, or during the subsequent trash deletion, could easily be the reason.
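For completeness, this is roughly how I obtained the piece and byte counts above: a small sketch that walks the trash date folder and sums the piece files it finds (again, not part of storagenode; the path is an example and must be adjusted):

```go
// counttrash.go - sums the number of piece files and their total size under a
// trash date folder, to compare against the "piecesTrashed" value in the log.
// Sketch only; the path below is an example and must be adjusted.
package main

import (
	"fmt"
	"io/fs"
	"log"
	"path/filepath"
)

func main() {
	dateDir := `D:\Storj_Data\Storage Node\trash\ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa\2024-07-06`

	var pieces, totalBytes int64
	err := filepath.WalkDir(dateDir, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if d.IsDir() {
			return nil
		}
		info, err := d.Info()
		if err != nil {
			return err
		}
		pieces++
		totalBytes += info.Size()
		return nil
	})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("pieces: %d, bytes: %d\n", pieces, totalBytes)
}
```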
So we have a whole package of problems here at once, caused by incorrect operation (in exactly which cases is not yet clear) of the save-state-resume function of the storj filewalkers.
This issue has been mentioned on Storj Community Forum (official). There might be relevant details there:
https://forum.storj.io/t/two-weeks-working-for-free-in-the-waste-storage-business/26854/61
Hello.
A very useful save-state-resume feature for the GC filewalker was marked as done some time ago: https://github.com/storj/storj/issues/6708 https://review.dev.storj.io/plugins/gitiles/storj/storj/+/0f90f061b028a9c877dbed3c01d8c3d95e4bc518
And it is listed as merged and in production since storagenode v1.102: https://github.com/storj/storj/commit/0f90f06
But tests on my nodes (v1.104.5 and v1.105.4, running as a Windows service) show that this feature is not working properly. GC (running in "lazy mode" as a separate process) DOES save its current progress to the db (`garbage_collection_filewalker_progress.db`), but it does not use this information after a node restart and begins the GC process from scratch instead.

Steps to reproduce the issue:
1. I took a node currently in the process of GC for satelliteID 12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S.
2. I checked which prefix it was processing at the moment (by monitoring the disk I/O of the GC process). It was `\blobs\ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa\2m\`
3. I also opened `garbage_collection_filewalker_progress.db` (I use the sqlite3 tool to view it). It looks like the progress is saved. But here is an important note: I noticed that lazy GC writes a prefix to this database immediately when it STARTS working on it, whereas the name of the field in the db table, `last_checked_prefix` (as well as the description of the changes on github), implies that the last prefix already processed should be stored there. Either the name and description of the field are incorrect, or this could be another bug in the code, recording the current (just started) prefix instead of the previous (last completed) one. (A query sketch is shown after this list.)
4. I stopped the node correctly: `sc stop storagenode`, and waited until all storj processes finished and exited.
5. I started the node again: `sc start storagenode`
6. I checked which prefix GC is processing now. It was `\blobs\ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa\aa\` That is, it started its work from the beginning; the prefix stored in the database (`\2m`), which was being processed during the previous run, was ignored.
7. After some time, the progress in `garbage_collection_filewalker_progress.db` was also reset.
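For step 3, this is roughly how the saved progress can be inspected. I use the sqlite3 CLI myself; the Go sketch below (using the github.com/mattn/go-sqlite3 driver) shows the same kind of query. The table name `progress` is my assumption about the schema, so verify it with `.schema` in sqlite3 if your version differs, and run it only against a copy of the db or while the node is stopped:

```go
// readprogress.go - dumps the rows of the GC filewalker progress database.
// Assumes the table is named "progress" (verify with ".schema" in the sqlite3
// CLI if your storagenode version differs). The db path below is an example.
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/mattn/go-sqlite3" // sqlite driver
)

func main() {
	dbPath := `D:\Storj_Data\Storage Node\garbage_collection_filewalker_progress.db`

	db, err := sql.Open("sqlite3", dbPath)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// satellite_id is stored as a BLOB, so print it hex-encoded.
	rows, err := db.Query(`SELECT hex(satellite_id), last_checked_prefix FROM progress`)
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	for rows.Next() {
		var satelliteID, prefix string
		if err := rows.Scan(&satelliteID, &prefix); err != nil {
			log.Fatal(err)
		}
		fmt.Printf("satellite %s: last_checked_prefix=%q\n", satelliteID, prefix)
	}
	if err := rows.Err(); err != nil {
		log.Fatal(err)
	}
}
```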
Logs: storagenode.log around restart
I also recorded a full trace of the disk I/O of the GC processes (two PIDs, before and after the restart) using Process Monitor. Attaching it as a zipped .csv since it is large (a log of ~5k IOPs):
Garbage_Collector_IO.zip