storj / storj

Ongoing Storj v3 development. Decentralized cloud object storage that is affordable, easy to use, private, and secure.
https://storj.io
GNU Affero General Public License v3.0

Lazyfilewalker GC does not resume from previously saved progress point after storagenode restart #7041

Open MaxMaximus opened 4 weeks ago

MaxMaximus commented 4 weeks ago

Hello.

The very useful save-state-resume feature for the GC filewalker was marked as done some time ago: https://github.com/storj/storj/issues/6708 https://review.dev.storj.io/plugins/gitiles/storj/storj/+/0f90f061b028a9c877dbed3c01d8c3d95e4bc518

It is listed as merged and in production since storagenode v1.102: https://github.com/storj/storj/commit/0f90f06

But tests on my nodes (v1.104.5 and v1.105.4, running as a Windows service) show that this feature is not working properly. GC (running in "lazy mode" as a separate process) DOES save its current progress to the db (garbage_collection_filewalker_progress.db), but it does not use this information after a node restart and instead begins the GC process from scratch.
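For anyone who wants to check this on their own node: here is a small script (not part of the storagenode, just a quick check) that dumps whatever the progress db contains, so its contents can be compared right before and right after a restart. The db path is from my setup and is an assumption for yours; copy the db file first or stop the node if it is locked.

```python
import sqlite3

# Assumed location of the progress db (my storage path); adjust for your node.
DB = r"D:\Storj_Data\Storage Node\garbage_collection_filewalker_progress.db"

con = sqlite3.connect(DB)
# List every table and print its rows; no assumptions about the schema are made,
# the table names come straight from sqlite_master.
for (table,) in con.execute("SELECT name FROM sqlite_master WHERE type='table'"):
    print(f"-- {table}")
    for row in con.execute(f"SELECT * FROM {table}"):
        print(row)
con.close()
```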

Steps to reproduce the issue:

Logs: storagenode.log around restart

2024-07-07T05:32:53+03:00   INFO    piecestore  downloaded  {"Piece ID": "ZOM7ROCSA4MGKZIUWNRHJIT6OUZIMEE7Y7MPXPKBUNB5CTQBQV4A", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET", "Offset": 0, "Size": 181248, "Remote Address": "109.61.92.73:60450"}
2024-07-07T05:32:53+03:00   INFO    piecestore  downloaded  {"Piece ID": "PP6ULDTCKKKZSHRN3EHXHOILK33OR4X7KF2TRS5EQT5AU3PBYGDQ", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "GET", "Offset": 0, "Size": 1792, "Remote Address": "79.127.213.33:55368"}
2024-07-07T05:32:53+03:00   INFO    Stop/Shutdown request received.
2024-07-07T05:32:53+03:00   ERROR   piecestore  download failed {"Piece ID": "TUHGHHLPRCLJ773QCZZO7YP6LDBKQV4EWAHKRUYRD54NPL2E3OYA", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET_REPAIR", "Offset": 0, "Size": 46336, "Remote Address": "5.161.196.36:38340", "error": "write tcp 192.168.0.2:28967->5.161.196.36:38340: use of closed network connection", "errorVerbose": "write tcp 192.168.0.2:28967->5.161.196.36:38340: use of closed network connection\n\tstorj.io/drpc/drpcstream.(*Stream).rawFlushLocked:409\n\tstorj.io/drpc/drpcstream.(*Stream).MsgSend:470\n\tstorj.io/common/pb.(*drpcPiecestore_DownloadStream).Send:408\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).sendData.func1:860\n\tstorj.io/common/rpc/rpctimeout.Run.func1:22"}
2024-07-07T05:32:54+03:00   INFO    lazyfilewalker.gc-filewalker    subprocess exited with status   {"satelliteID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "status": 1, "error": "exit status 1"}
2024-07-07T05:32:54+03:00   ERROR   pieces  lazyfilewalker failed   {"error": "lazyfilewalker: exit status 1", "errorVerbose": "lazyfilewalker: exit status 1\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*process).run:85\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*Supervisor).WalkSatellitePiecesToTrash:160\n\tstorj.io/storj/storagenode/pieces.(*Store).WalkSatellitePiecesToTrash:561\n\tstorj.io/storj/storagenode/retain.(*Service).retainPieces:373\n\tstorj.io/storj/storagenode/retain.(*Service).Run.func2:259\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}
2024-07-07T05:32:54+03:00   ERROR   filewalker  failed to get progress from database
2024-07-07T05:32:54+03:00   ERROR   retain  retain pieces failed    {"cachePath": "D:\\Storj_Data\\Storage Node/retain", "error": "retain: filewalker: context canceled", "errorVerbose": "retain: filewalker: context canceled\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces:74\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePiecesToTrash:181\n\tstorj.io/storj/storagenode/pieces.(*Store).WalkSatellitePiecesToTrash:568\n\tstorj.io/storj/storagenode/retain.(*Service).retainPieces:373\n\tstorj.io/storj/storagenode/retain.(*Service).Run.func2:259\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}
2024-07-07T05:32:54+03:00   INFO    lazyfilewalker.used-space-filewalker    subprocess exited with status   {"satelliteID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "status": 1, "error": "exit status 1"}
2024-07-07T05:32:54+03:00   ERROR   pieces  failed to lazywalk space used by satellite  {"error": "lazyfilewalker: exit status 1", "errorVerbose": "lazyfilewalker: exit status 1\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*process).run:85\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*Supervisor).WalkAndComputeSpaceUsedBySatellite:130\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:707\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:58\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE"}
2024-07-07T05:32:54+03:00   INFO    lazyfilewalker.used-space-filewalker    starting subprocess {"satelliteID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6"}
2024-07-07T05:32:54+03:00   ERROR   lazyfilewalker.used-space-filewalker    failed to start subprocess  {"satelliteID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "error": "context canceled"}
2024-07-07T05:32:54+03:00   ERROR   pieces  failed to lazywalk space used by satellite  {"error": "lazyfilewalker: context canceled", "errorVerbose": "lazyfilewalker: context canceled\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*process).run:73\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*Supervisor).WalkAndComputeSpaceUsedBySatellite:130\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:707\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:58\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78", "Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6"}
2024-07-07T05:32:54+03:00   INFO    lazyfilewalker.used-space-filewalker    starting subprocess {"satelliteID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S"}
2024-07-07T05:32:54+03:00   ERROR   lazyfilewalker.used-space-filewalker    failed to start subprocess  {"satelliteID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "error": "context canceled"}
2024-07-07T05:32:54+03:00   ERROR   pieces  failed to lazywalk space used by satellite  {"error": "lazyfilewalker: context canceled", "errorVerbose": "lazyfilewalker: context canceled\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*process).run:73\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*Supervisor).WalkAndComputeSpaceUsedBySatellite:130\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:707\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:58\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S"}
2024-07-07T05:32:54+03:00   INFO    lazyfilewalker.used-space-filewalker    starting subprocess {"satelliteID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs"}
2024-07-07T05:32:54+03:00   ERROR   lazyfilewalker.used-space-filewalker    failed to start subprocess  {"satelliteID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "error": "context canceled"}
2024-07-07T05:32:54+03:00   ERROR   pieces  failed to lazywalk space used by satellite  {"error": "lazyfilewalker: context canceled", "errorVerbose": "lazyfilewalker: context canceled\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*process).run:73\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*Supervisor).WalkAndComputeSpaceUsedBySatellite:130\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:707\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:58\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs"}
2024-07-07T05:32:54+03:00   ERROR   piecestore:cache    error getting current used space:   {"error": "filewalker: context canceled; filewalker: context canceled; filewalker: context canceled; filewalker: context canceled", "errorVerbose": "group:\n--- filewalker: context canceled\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces:74\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatellite:79\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:716\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:58\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78\n--- filewalker: context canceled\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces:74\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatellite:79\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:716\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:58\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78\n--- filewalker: context canceled\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces:74\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatellite:79\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:716\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:58\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78\n--- filewalker: context canceled\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces:74\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatellite:79\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:716\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:58\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}
2024-07-07T05:35:03+03:00   INFO    Configuration loaded    {"Location": "D:\\Storj_Data\\Storage Node\\config.yaml"}
2024-07-07T05:35:04+03:00   INFO    Anonymized tracing enabled
2024-07-07T05:35:04+03:00   INFO    Operator email  {"Address": "***"}
2024-07-07T05:35:04+03:00   INFO    Operator wallet {"Address": "***"}
2024-07-07T05:35:04+03:00   INFO    Telemetry enabled   {"instance ID": "***"}
2024-07-07T05:35:04+03:00   INFO    Event collection enabled    {"***"}
2024-07-07T05:35:04+03:00   INFO    db.migration    Database Version    {"version": 57}
2024-07-07T05:35:04+03:00   INFO    preflight:localtime start checking local system clock with trusted satellites' system clock.
2024-07-07T05:35:05+03:00   INFO    preflight:localtime local system clock is in sync with trusted satellites' system clock.
2024-07-07T05:35:05+03:00   INFO    Node **** started
2024-07-07T05:35:05+03:00   INFO    Public server started on [::]:28967
2024-07-07T05:35:05+03:00   INFO    Private server started on 127.0.0.1:7778
2024-07-07T05:35:05+03:00   INFO    bandwidth   Persisting bandwidth usage cache to db
2024-07-07T05:35:05+03:00   INFO    trust   Scheduling next refresh {"after": "6h26m36.718354629s"}
2024-07-07T05:35:05+03:00   INFO    pieces:trash    emptying trash started  {"Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs"}
2024-07-07T05:35:05+03:00   INFO    lazyfilewalker.trash-cleanup-filewalker starting subprocess {"satelliteID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs"}
2024-07-07T05:35:05+03:00   INFO    retain  Prepared to run a Retain request.   {"cachePath": "D:\\Storj_Data\\Storage Node/retain", "Created Before": "2024-06-29T20:59:59+03:00", "Filter Size": 10861233, "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S"}
2024-07-07T05:35:05+03:00   WARN    piecestore:monitor  Disk space is less than requested. Allocated space is   {"bytes": 5305067636272}
2024-07-07T05:35:05+03:00   INFO    lazyfilewalker.used-space-filewalker    starting subprocess {"satelliteID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE"}
2024-07-07T05:35:05+03:00   INFO    piecestore  download started    {"Piece ID": "PP6ULDTCKKKZSHRN3EHXHOILK33OR4X7KF2TRS5EQT5AU3PBYGDQ", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "GET", "Offset": 0, "Size": 1792, "Remote Address": "79.127.226.100:54196"}
2024-07-07T05:35:05+03:00   INFO    piecestore  download started    {"Piece ID": "O4YOG4ZJUGK53W55FQ3I3TWS4P6XCGJL2CQEUFSLIFKY2ZDS5LJA", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET", "Offset": 0, "Size": 9984, "Remote Address": "109.61.92.78:57744"}
2024-07-07T05:35:05+03:00   INFO    lazyfilewalker.used-space-filewalker    subprocess started  {"satelliteID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE"}
2024-07-07T05:35:05+03:00   INFO    piecestore  download started    {"Piece ID": "PP6ULDTCKKKZSHRN3EHXHOILK33OR4X7KF2TRS5EQT5AU3PBYGDQ", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "GET", "Offset": 0, "Size": 1792, "Remote Address": "79.127.226.100:54208"}
2024-07-07T05:35:05+03:00   INFO    lazyfilewalker.trash-cleanup-filewalker subprocess started  {"satelliteID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs"}
2024-07-07T05:35:05+03:00   ERROR   piecestore  download failed {"Piece ID": "O4YOG4ZJUGK53W55FQ3I3TWS4P6XCGJL2CQEUFSLIFKY2ZDS5LJA", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET", "Offset": 0, "Size": 9984, "Remote Address": "109.61.92.78:57744", "error": "trust: rpc: tcp connector failed: rpc: dial tcp 34.150.199.48:7777: operation was canceled", "errorVerbose": "trust: rpc: tcp connector failed: rpc: dial tcp 34.150.199.48:7777: operation was canceled\n\tstorj.io/common/rpc.HybridConnector.DialContext.func1:190"}
2024-07-07T05:35:05+03:00   ERROR   piecestore  download failed {"Piece ID": "PP6ULDTCKKKZSHRN3EHXHOILK33OR4X7KF2TRS5EQT5AU3PBYGDQ", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "GET", "Offset": 0, "Size": 1792, "Remote Address": "79.127.226.100:54196", "error": "trust: rpc: tcp connector failed: rpc: dial tcp 34.159.134.91:7777: operation was canceled", "errorVerbose": "trust: rpc: tcp connector failed: rpc: dial tcp 34.159.134.91:7777: operation was canceled\n\tstorj.io/common/rpc.HybridConnector.DialContext.func1:190"}
2024-07-07T05:35:05+03:00   INFO    lazyfilewalker.gc-filewalker    starting subprocess {"satelliteID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S"}
2024-07-07T05:35:05+03:00   ERROR   piecestore  download failed {"Piece ID": "PP6ULDTCKKKZSHRN3EHXHOILK33OR4X7KF2TRS5EQT5AU3PBYGDQ", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "GET", "Offset": 0, "Size": 1792, "Remote Address": "79.127.226.100:54208", "error": "trust: rpc: tcp connector failed: rpc: dial tcp 34.159.134.91:7777: operation was canceled", "errorVerbose": "trust: rpc: tcp connector failed: rpc: dial tcp 34.159.134.91:7777: operation was canceled\n\tstorj.io/common/rpc.HybridConnector.DialContext.func1:190"}
2024-07-07T05:35:05+03:00   INFO    lazyfilewalker.gc-filewalker    subprocess started  {"satelliteID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S"}
2024-07-07T05:35:05+03:00   INFO    piecestore  download started    {"Piece ID": "Z34OT3LZDEYU7S7NMGZSJD2GJCIOHAOKJNII5KKRZFRVFDGIB2VA", "Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "Action": "GET_AUDIT", "Offset": 1358080, "Size": 256, "Remote Address": "34.124.189.53:52220"}
2024-07-07T05:35:05+03:00   INFO    piecestore  download started    {"Piece ID": "MQN2PAHQJAUTWJFZAB3PS64LHKD43CIAYZDFLTK5SNOYX3JF2EGA", "Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "Action": "GET_AUDIT", "Offset": 109824, "Size": 256, "Remote Address": "34.124.189.53:33711"}
2024-07-07T05:35:05+03:00   INFO    lazyfilewalker.used-space-filewalker.subprocess Database started    {"satelliteID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Process": "storagenode"}
2024-07-07T05:35:05+03:00   INFO    lazyfilewalker.used-space-filewalker.subprocess used-space-filewalker started   {"satelliteID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Process": "storagenode"}
2024-07-07T05:35:06+03:00   INFO    lazyfilewalker.trash-cleanup-filewalker.subprocess  trash-filewalker started    {"satelliteID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "dateBefore": "2024-06-30T05:35:05+03:00", "Process": "storagenode"}
2024-07-07T05:35:06+03:00   INFO    lazyfilewalker.trash-cleanup-filewalker.subprocess  Database started    {"satelliteID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Process": "storagenode"}
2024-07-07T05:35:06+03:00   INFO    lazyfilewalker.trash-cleanup-filewalker.subprocess  trash-filewalker completed  {"satelliteID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Process": "storagenode", "bytesDeleted": 0, "numKeysDeleted": 0}
2024-07-07T05:35:06+03:00   INFO    lazyfilewalker.trash-cleanup-filewalker subprocess finished successfully    {"satelliteID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs"}
2024-07-07T05:35:06+03:00   INFO    pieces:trash    emptying trash finished {"Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "elapsed": "448.0257ms"}
2024-07-07T05:35:06+03:00   INFO    pieces:trash    emptying trash started  {"Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE"}
2024-07-07T05:35:06+03:00   INFO    lazyfilewalker.trash-cleanup-filewalker starting subprocess {"satelliteID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE"}
2024-07-07T05:35:06+03:00   INFO    lazyfilewalker.trash-cleanup-filewalker subprocess started  {"satelliteID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE"}
2024-07-07T05:35:06+03:00   INFO    lazyfilewalker.trash-cleanup-filewalker.subprocess  trash-filewalker started    {"satelliteID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Process": "storagenode", "dateBefore": "2024-06-30T05:35:06+03:00"}
2024-07-07T05:35:06+03:00   INFO    lazyfilewalker.trash-cleanup-filewalker.subprocess  Database started    {"satelliteID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Process": "storagenode"}
2024-07-07T05:35:06+03:00   INFO    lazyfilewalker.trash-cleanup-filewalker.subprocess  trash-filewalker completed  {"satelliteID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Process": "storagenode", "bytesDeleted": 0, "numKeysDeleted": 0}
2024-07-07T05:35:06+03:00   INFO    lazyfilewalker.trash-cleanup-filewalker subprocess finished successfully    {"satelliteID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE"}
2024-07-07T05:35:06+03:00   INFO    pieces:trash    emptying trash finished {"Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "elapsed": "251.0143ms"}
2024-07-07T05:35:06+03:00   INFO    pieces:trash    emptying trash started  {"Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S"}
2024-07-07T05:35:06+03:00   INFO    lazyfilewalker.trash-cleanup-filewalker starting subprocess {"satelliteID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S"}
2024-07-07T05:35:06+03:00   INFO    lazyfilewalker.trash-cleanup-filewalker subprocess started  {"satelliteID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S"}
2024-07-07T05:35:06+03:00   INFO    piecestore  downloaded  {"Piece ID": "MQN2PAHQJAUTWJFZAB3PS64LHKD43CIAYZDFLTK5SNOYX3JF2EGA", "Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "Action": "GET_AUDIT", "Offset": 109824, "Size": 256, "Remote Address": "34.124.189.53:33711"}
2024-07-07T05:35:06+03:00   INFO    piecestore  download started    {"Piece ID": "PP6ULDTCKKKZSHRN3EHXHOILK33OR4X7KF2TRS5EQT5AU3PBYGDQ", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "GET", "Offset": 0, "Size": 1792, "Remote Address": "79.127.219.46:58210"}
2024-07-07T05:35:06+03:00   INFO    piecestore  downloaded  {"Piece ID": "Z34OT3LZDEYU7S7NMGZSJD2GJCIOHAOKJNII5KKRZFRVFDGIB2VA", "Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "Action": "GET_AUDIT", "Offset": 1358080, "Size": 256, "Remote Address": "34.124.189.53:52220"}
2024-07-07T05:35:06+03:00   INFO    lazyfilewalker.trash-cleanup-filewalker.subprocess  trash-filewalker started    {"satelliteID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Process": "storagenode", "dateBefore": "2024-06-30T05:35:06+03:00"}
2024-07-07T05:35:06+03:00   INFO    lazyfilewalker.trash-cleanup-filewalker.subprocess  Database started    {"satelliteID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Process": "storagenode"}
2024-07-07T05:35:06+03:00   INFO    lazyfilewalker.trash-cleanup-filewalker.subprocess  trash-filewalker completed  {"satelliteID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "numKeysDeleted": 0, "Process": "storagenode", "bytesDeleted": 0}
2024-07-07T05:35:06+03:00   INFO    piecestore  downloaded  {"Piece ID": "PP6ULDTCKKKZSHRN3EHXHOILK33OR4X7KF2TRS5EQT5AU3PBYGDQ", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "GET", "Offset": 0, "Size": 1792, "Remote Address": "79.127.219.46:58210"}
2024-07-07T05:35:06+03:00   INFO    lazyfilewalker.trash-cleanup-filewalker subprocess finished successfully    {"satelliteID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S"}
2024-07-07T05:35:06+03:00   INFO    pieces:trash    emptying trash finished {"Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "elapsed": "331.019ms"}
2024-07-07T05:35:06+03:00   INFO    pieces:trash    emptying trash started  {"Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6"}
2024-07-07T05:35:06+03:00   INFO    lazyfilewalker.trash-cleanup-filewalker starting subprocess {"satelliteID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6"}
2024-07-07T05:35:06+03:00   INFO    lazyfilewalker.trash-cleanup-filewalker subprocess started  {"satelliteID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6"}
2024-07-07T05:35:06+03:00   INFO    lazyfilewalker.gc-filewalker.subprocess Database started    {"satelliteID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Process": "storagenode"}
2024-07-07T05:35:06+03:00   INFO    lazyfilewalker.gc-filewalker.subprocess gc-filewalker started   {"satelliteID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Process": "storagenode", "createdBefore": "2024-06-29T20:59:59+03:00", "bloomFilterSize": 10861233}
2024-07-07T05:35:06+03:00   INFO    lazyfilewalker.trash-cleanup-filewalker.subprocess  trash-filewalker started    {"satelliteID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "Process": "storagenode", "dateBefore": "2024-06-30T05:35:06+03:00"}
2024-07-07T05:35:06+03:00   INFO    lazyfilewalker.trash-cleanup-filewalker.subprocess  Database started    {"satelliteID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "Process": "storagenode"}
2024-07-07T05:35:06+03:00   INFO    lazyfilewalker.trash-cleanup-filewalker.subprocess  trash-filewalker completed  {"satelliteID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "Process": "storagenode", "bytesDeleted": 0, "numKeysDeleted": 0}
2024-07-07T05:35:06+03:00   INFO    lazyfilewalker.trash-cleanup-filewalker subprocess finished successfully    {"satelliteID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6"}
2024-07-07T05:35:06+03:00   INFO    pieces:trash    emptying trash finished {"Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "elapsed": "267.0152ms"}
2024-07-07T05:35:07+03:00   INFO    piecestore  download started    {"Piece ID": "PP6ULDTCKKKZSHRN3EHXHOILK33OR4X7KF2TRS5EQT5AU3PBYGDQ", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "GET", "Offset": 0, "Size": 1792, "Remote Address": "79.127.219.37:44700"}
2024-07-07T05:35:07+03:00   INFO    piecestore  downloaded  {"Piece ID": "PP6ULDTCKKKZSHRN3EHXHOILK33OR4X7KF2TRS5EQT5AU3PBYGDQ", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "GET", "Offset": 0, "Size": 1792, "Remote Address": "79.127.219.37:44700"}
2024-07-07T05:35:07+03:00   INFO    piecestore  download started    {"Piece ID": "VCBCOUI4ZXAAPT4EXULQTWVIMZJUS5JM7HLXMDLGEITNWS5QI5LA", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET", "Offset": 0, "Size": 6912, "Remote Address": "109.61.92.73:36386"}
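(In case someone wants to extract the same excerpt from their own logs, here is a small helper that pulls only the lines relevant to this issue — gc-filewalker, retain, and the progress-db error — out of a large storagenode.log. The log path is just my setup.)

```python
from pathlib import Path

LOG = Path(r"D:\Storj_Data\Storage Node\storagenode.log")  # assumed log location
KEYWORDS = ("gc-filewalker", "retain", "failed to get progress")

# Print only the log lines that mention one of the keywords above.
with LOG.open(encoding="utf-8", errors="replace") as f:
    for line in f:
        if any(keyword in line for keyword in KEYWORDS):
            print(line.rstrip())
```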

I also recorded a full trace of the disk I/O of the GC processes (two PIDs, before and after the restart) using Process Monitor. I am attaching it as a zipped .csv because it is large (a log of ~5k I/O operations).

Garbage_Collector_IO.zip

MaxMaximus commented 4 weeks ago

Update. Strangely, I tried to reproduce the problem following my own sequence from the first post, and on the 2nd (as well as the 3rd) restart of the node, GC successfully continued working from the saved position!

I haven't changed anything in the config or setup between them. Only three things come to mind so far for how the first restart differed from the subsequent ones:

1. The node had been running for a long time without restarts at the time of the first restart (more than a week; if necessary, I can later find the exact uptime by scanning the log), while the 2nd and 3rd restarts came after only a few dozen minutes of work.
2. GC progress at the first restart, judging by the number of sub-directories already created in \trash\ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxv5qaaaaaa\2024-07-06\, was almost finished (845 of 1024 prefixes done), while at the 2nd and 3rd restarts it had managed to process only a few prefixes (the sketch below shows how I count them).
3. Because of the progress reset, on the 2nd and 3rd passes GC did NOT move any files from \blobs\ to \trash\, since all files matching the Bloom filter had already been moved during the first pass.
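This is how I count the processed prefixes: each two-character prefix already handled by GC shows up as a sub-directory under the per-day trash folder. The path below is from my node and is an assumption for yours.

```python
from pathlib import Path

# Per-day trash folder for the satellite in question (adjust blob folder and date).
trash_day = Path(r"D:\Storj_Data\Storage Node\trash"
                 r"\ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxv5qaaaaaa\2024-07-06")

# Count the prefix sub-directories created so far (out of 1024 possible prefixes).
prefixes = sum(1 for p in trash_day.iterdir() if p.is_dir())
print(f"{prefixes} of 1024 prefixes processed")
```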

P.S. This is not the first time I've noticed GC losing progress when the node is restarted; it's just the first time I decided to document it. And it's also the first time I've seen progress saving actually work.

MaxMaximus commented 4 weeks ago

UPDATE 2. I wanted to catch a moment near the end of the GC run and restart the node there to check whether that was the decisive factor. But I missed the right moment: the re-run of GC completed significantly faster than I expected, and below it will become clear why it finished so "quickly" (in ~1.5 hours). Still, I managed to catch something very interesting: in addition to losing progress on the first restart, the lazy GC also reported incorrect information to the parent process about the amount of garbage collected:

2024-07-07T09:44:17+03:00   INFO    lazyfilewalker.gc-filewalker.subprocess gc-filewalker completed {"satelliteID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Process": "storagenode", "piecesCount": 7408653, "trashPiecesCount": 0, "piecesTrashed": 0, "piecesSkippedCount": 0}
2024-07-07T09:44:17+03:00   INFO    lazyfilewalker.gc-filewalker    subprocess finished successfully    {"satelliteID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S"}
2024-07-07T09:44:17+03:00   INFO    retain  Moved pieces to trash during retain {"cachePath": "D:\\Storj_Data\\Storage Node/retain", "Deleted pieces": 0, "Failed to delete": 0, "Pieces failed to read": 0, "Pieces count": 7408653, "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Duration": "1h32m50.0489607s", "Retain Status": "enabled"}

So it reports ZERO deleted/trashed pieces, while in reality 1,684,268 pieces / 270,465,252,064 bytes were moved from \blobs\ to \trash\. I counted this from the contents of the \trash\ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa\2024-07-06\ folder. Wow, that's pretty extreme: more than 20% of all files stored on the node for this satellite were deleted in one pass of GC.
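The count itself is straightforward: sum the number and size of all files under the per-day trash folder and compare with the counters in the "Moved pieces to trash during retain" log line. Again, the path below is my setup.

```python
from pathlib import Path

# Per-day trash folder whose contents are compared against the retain counters.
trash_day = Path(r"D:\Storj_Data\Storage Node\trash"
                 r"\ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa\2024-07-06")

# Walk every file moved to trash for that day and total the pieces and bytes.
files = [f for f in trash_day.rglob("*") if f.is_file()]
total_bytes = sum(f.stat().st_size for f in files)
print(f"{len(files)} pieces, {total_bytes} bytes moved to trash")
```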

This also gives a hint about the source of the wrong stats: the \2024-07-06\ folder contains only 845 sub-folders/prefixes, so GC actually collected ZERO pieces in the second pass. All of these ~1.7M pieces were collected during the first pass, which it simply "forgot" about. I think this is one of the important reasons for the continuing large discrepancies in disk space usage statistics that many SNOs keep complaining about. Losses of stats like this during garbage collection, or during the subsequent trash deletion, could easily be the cause.

So we have a whole package of problems here at once, all caused by incorrect operation (in which cases exactly is not yet clear) of the save-state-resume feature of the storj filewalkers.

storjrobot commented 4 weeks ago

This issue has been mentioned on Storj Community Forum (official). There might be relevant details there:

https://forum.storj.io/t/two-weeks-working-for-free-in-the-waste-storage-business/26854/61