spacemeshos / post

Spacemesh POST protocol implementation
MIT License
19 stars 20 forks source link

Post initialization stops at wrong places causing the post to be invalid. #118

Closed pigmej closed 1 year ago

pigmej commented 1 year ago

Environment is macbook pro M1

Consider following logs:

3476:2023-03-29T15:01:06.595+0200   INFO    7b957.app.7b957.post    initialization: file #28 completed; number of labels written: 0 {"node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "app", "node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "post"}
23477:2023-03-29T15:01:06.595+0200  INFO    7b957.app.7b957.post    initialization: starting to write file #29; target number of labels: 625000, start position: 18125000   {"node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "app", "node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "post"}
23478:2023-03-29T15:01:06.595+0200  INFO    7b957.app.7b957.post    initialization: file #29 completed; number of labels written: 0 {"node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "app", "node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "post"}
23479:2023-03-29T15:01:06.595+0200  INFO    7b957.app.7b957.post    initialization: starting to write file #30; target number of labels: 625000, start position: 18750000   {"node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "app", "node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "post"}
23480:2023-03-29T15:01:06.595+0200  INFO    7b957.app.7b957.post    initialization: file #30 completed; number of labels written: 0 {"node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "app", "node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "post"}
23481:2023-03-29T15:01:06.595+0200  INFO    7b957.app.7b957.post    initialization: starting to write file #31; target number of labels: 625000, start position: 19375000   {"node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "app", "node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "post"}
[cut cut cut all was the same]
23600:2023-03-29T15:01:06.605+0200  INFO    7b957.app.7b957.post    initialization: file #90 completed; number of labels written: 0 {"node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "app", "node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "post"}
23601:2023-03-29T15:01:06.605+0200  INFO    7b957.app.7b957.post    initialization: starting to write file #91; target number of labels: 625000, start position: 56875000   {"node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "app", "node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "post"}
23602:2023-03-29T15:01:06.605+0200  INFO    7b957.app.7b957.post    initialization: file #91 completed; number of labels written: 0 {"node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "app", "node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "post"}
23603:2023-03-29T15:01:06.605+0200  INFO    7b957.app.7b957.post    initialization: starting to write file #92; target number of labels: 625000, start position: 57500000   {"node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "app", "node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "post"}
23604:2023-03-29T15:01:06.605+0200  INFO    7b957.app.7b957.post    initialization: file #92 completed; number of labels written: 0 {"node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "app", "node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "post"}
23605:2023-03-29T15:01:06.605+0200  INFO    7b957.app.7b957.post    initialization: starting to write file #93; target number of labels: 625000, start position: 58125000   {"node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "app", "node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "post"}
23606:2023-03-29T15:01:06.606+0200  INFO    7b957.app.7b957.post    initialization: file #93 completed; number of labels written: 0 {"node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "app", "node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "post"}
23607:2023-03-29T15:01:06.606+0200  INFO    7b957.app.7b957.post    initialization: starting to write file #94; target number of labels: 625000, start position: 58750000   {"node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "app", "node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "post"}
23608:2023-03-29T15:01:06.606+0200  INFO    7b957.app.7b957.post    initialization: file #94 completed; number of labels written: 0 {"node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "app", "node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "post"}
23609:2023-03-29T15:01:06.606+0200  INFO    7b957.app.7b957.post    initialization: starting to write file #95; target number of labels: 625000, start position: 59375000   {"node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "app", "node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "post"}
23610:2023-03-29T15:01:06.606+0200  INFO    7b957.app.7b957.post    initialization: file #95 completed; number of labels written: 0 {"node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "app", "node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "post"}
23611:2023-03-29T15:01:06.606+0200  INFO    7b957.app.7b957.post    initialization: starting to write file #96; target number of labels: 625000, start position: 60000000   {"node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "app", "node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "post"}
23612:2023-03-29T15:01:06.606+0200  INFO    7b957.app.7b957.post    initialization: file #96 completed; number of labels written: 0 {"node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "app", "node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "post"}
23613:2023-03-29T15:01:06.606+0200  INFO    7b957.app.7b957.post    initialization: starting to write file #97; target number of labels: 625000, start position: 60625000   {"node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "app", "node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "post"}
23614:2023-03-29T15:01:06.606+0200  INFO    7b957.app.7b957.post    initialization: file #97 completed; number of labels written: 0 {"node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "app", "node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "post"}
23615:2023-03-29T15:01:06.606+0200  INFO    7b957.app.7b957.post    initialization: starting to write file #98; target number of labels: 625000, start position: 61250000   {"node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "app", "node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "post"}
23616:2023-03-29T15:01:06.606+0200  INFO    7b957.app.7b957.post    initialization: file #98 completed; number of labels written: 0 {"node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "app", "node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "post"}
23617:2023-03-29T15:01:06.606+0200  INFO    7b957.app.7b957.post    initialization: starting to write file #99; target number of labels: 625000, start position: 61875000   {"node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "app", "node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "post"}
23618:2023-03-29T15:01:06.606+0200  INFO    7b957.app.7b957.post    initialization: file #99 completed; number of labels written: 0 {"node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "app", "node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "post"}
23619:2023-03-29T15:01:06.606+0200  INFO    7b957.app.7b957.post    initialization: starting to write file #100; target number of labels: 414560, start position: 62500000  {"node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "app", "node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "post"}
23620:2023-03-29T15:01:06.607+0200  INFO    7b957.app.7b957.post    initialization: file #100 completed; number of labels written: 0    {"node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "app", "node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "post"}
23621:2023-03-29T15:01:06.607+0200  INFO    7b957.app.7b957.post    post setup completed    {"node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "app", "node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "post", "node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "commitment_atx": "0657ad4e89", "data_dir": "../pos_data/0", "num_units": "3", "labels_per_unit": "20971520", "provider": "1", "name": "post"}
23622:2023-03-29T15:01:06.607+0200  ERROR   7b957.app.7b957.atxBuilder  Failed to generate proof: %!w(*fmt.wrapError=&{post execution: generate proof: not completed 0x14000c556c0})    {"node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "app", "node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "atxBuilder"}

When I restart the node it starts to initailize again with the following log

46428:2023-03-29T15:11:29.205+0200  INFO    7b957.app.7b957.post    initialization: file #17 already initialized; number of labels: 625000, start position: 10625000    {"node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "app", "node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "post"}
46429:2023-03-29T15:11:29.205+0200  INFO    7b957.app.7b957.post    initialization: file #18 already initialized; number of labels: 625000, start position: 11250000    {"node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "app", "node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "post"}
46430:2023-03-29T15:11:29.205+0200  INFO    7b957.app.7b957.post    initialization: file #19 already initialized; number of labels: 625000, start position: 11875000    {"node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "app", "node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "post"}
46431:2023-03-29T15:11:29.205+0200  INFO    7b957.app.7b957.post    initialization: file #20 already initialized; number of labels: 625000, start position: 12500000    {"node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "app", "node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "post"}
46432:2023-03-29T15:11:29.205+0200  INFO    7b957.app.7b957.post    initialization: continuing to write file #21; current number of labels: 573440, target number of labels: 625000, start position: 13125000   {"node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "app", "node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "post"}

The network config is:

            "post-labels-per-unit": 20971520,
            "post-max-numunits": 5000
        }

node config for smashing is:

            "smeshing-opts-maxfilesize": 10000000,
            "smeshing-opts-numunits": 3,
            "smeshing-opts-provider": 1,
            "smeshing-opts-throttle": false

before the first log what happens is:

2023-03-29T15:01:06.593+0200    INFO    7b957.app.7b957.post    initialization: file #21 completed; number of labels written: 573440    {"node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "app", "node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "post"}
2023-03-29T15:01:06.593+0200    INFO    7b957.app.7b957.post    initialization: starting to write file #22; target number of labels: 625000, start position: 13750000   {"node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "app", "node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "post"}
2023-03-29T15:01:06.593+0200    INFO    7b957.app.7b957.post    initialization: file #22 completed; number of labels written: 0 {"node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "app", "node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "post"}
2023-03-29T15:01:06.593+0200    INFO    7b957.app.7b957.post    initialization: starting to write file #23; target number of labels: 625000, start position: 14375000   {"node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "app", "node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "post"}
2023-03-29T15:01:06.594+0200    INFO    7b957.app.7b957.post    initialization: file #23 completed; number of labels written: 0 {"node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "app", "node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "post"}
pigmej commented 1 year ago

[mvk-error] VK_ERROR_DEVICE_LOST: MTLCommandBuffer "vkQueueSubmit CommandBuffer on Queue 3-0" execution failed (code 1): Internal Error (0000000e:Internal Error) that's an error there.

pigmej commented 1 year ago

Bigger log file err.txt

Short log above. Clearly visible what crashes and that when it crashes that we start to generate 0 everywhere. It's also visible that already the first file is corrupted. It happens in different moments etc.

That's the "device lost" https://registry.khronos.org/vulkan/specs/1.3-extensions/html/vkspec.html#devsandqueues-lost-device

pigmej commented 1 year ago

That's the exact sequence:

[mvk-error] VK_ERROR_DEVICE_LOST: MTLCommandBuffer "vkQueueSubmit CommandBuffer on Queue 3-0" execution failed (code 1): Internal Error (0000000e:Internal Error)
2023-03-30T18:09:39.809+0200    INFO    0a0c7.app.0a0c7.post    initialization: file #5 completed; number of labels written: 15360  {"node_id": "0a0c7f128ed9507ee3d8cc05b613d9800dde44a3a530d5db74ee06e93a904f04", "module": "app", "node_id": "0a0c7f128ed9507ee3d8cc05b613d9800dde44a3a530d5db74ee06e93a904f04", "module": "post"}
2023-03-30T18:09:39.809+0200    INFO    0a0c7.app.0a0c7.post    post setup completed    {"node_id": "0a0c7f128ed9507ee3d8cc05b613d9800dde44a3a530d5db74ee06e93a904f04", "module": "app", "node_id": "0a0c7f128ed9507ee3d8cc05b613d9800dde44a3a530d5db74ee06e93a904f04", "module": "post", "node_id": "0a0c7f128ed9507ee3d8cc05b613d9800dde44a3a530d5db74ee06e93a904f04", "commitment_atx": "dc8053f637", "data_dir": "../pos_data/0", "num_units": "5", "labels_per_unit": "1310720", "provider": "1", "name": "post"}
2023-03-30T18:09:39.809+0200    ERROR   0a0c7.app.0a0c7.atxBuilder  Failed to generate proof: %!w(*fmt.wrapError=&{post execution: generate proof: not completed 0x1400109f700}){"node_id": "0a0c7f128ed9507ee3d8cc05b613d9800dde44a3a530d5db74ee06e93a904f04", "module": "app", "node_id": "0a0c7f128ed9507ee3d8cc05b613d9800dde44a3a530d5db74ee06e93a904f04", "module": "atxBuilder"}
pigmej commented 1 year ago

m2 pro behaves the same way, on m1 max did not reproduced the error.

pigmej commented 1 year ago

This one is connected to: https://github.com/spacemeshos/gpu-post/issues/93

pigmej commented 1 year ago

Most of the stuff was fixed there are some gpus misbehaving but let's keep them as separate issues.