madMAx43v3r / chia-plotter

Apache License 2.0

thread failed with: thread failed with: small_delta >= 256 #979

Open ALardu opened 2 years ago

ALardu commented 2 years ago

Debian 11 without GUI + Mad Max Chia Plotter without GUI + what(): thread failed with: thread failed with: small_delta

    cd chia-plotter
    sudo ./build/chia_plot -n 1 -r 30 -u 128 -t /mnt/nvme/ -d /mnt/ssd/share/ -c xch1xxxxx -f 8bebxxxxx

NVMe: 943 GB, SSD: 235 GB, RAM: 64 GB DDR4, CPU: 2x Xeon 2667 v3

Problem: during P1-P2, Mad Max fills the entire /mnt/nvme/ disk (943 GB!), and the start of phase 3 (P3) is then interrupted by an error:

    Wrote plot header with 252 bytes
    [P3-1] Table 2 took 112.406 sec, wrote 3429424212 right entries
    terminate called after throwing an instance of 'std::runtime_error'
      what():  thread failed with: thread failed with: small_delta >= 256 (34292629513)
    Aborted

RAM memtest: perfect. NVMe health: perfect.

Help me please!!!
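One way to confirm how much temp space the run actually consumes in each phase is to watch the working directories while the plotter runs (sketch only; paths are taken from the command above, and it assumes the plotter's temp files end in .tmp):

    # Print free space and total temp-file size every 60 seconds while plotting
    watch -n 60 'df -h /mnt/nvme /mnt/ssd/share; du -sch /mnt/nvme/*.tmp 2>/dev/null | tail -n 1'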

ALardu commented 2 years ago

(screenshot attached)

ALardu commented 2 years ago

(screenshot attached)

ALardu commented 2 years ago

(screenshot attached)

bladeuserpi commented 2 years ago

The same error message occurred here:

    Multi-threaded pipelined Chia k34 plotter - ecec17d
    (Sponsored by Flexpool.io - Check them out if you're looking for a secure and scalable Chia pool)

    Network Port: 8444 [chia]
    Final Directory: /farm
    Number of Plots: 40
    Crafting plot 1 out of 40 (2022/04/22 12:38:43)
    Process ID: 10272
    Number of Threads: 36
    Number of Buckets P1:    2^8 (256)
    Number of Buckets P3+P4: 2^8 (256)
    Pool Puzzle Hash:  ...
    Farmer Public Key: ...
    Working Directory:   /plot4/
    Working Directory 2: /plot4/
    Plot Name: plot-k34-2022-04-22-12-38-...
    [P1] Table 1 took 80.5456 sec
    [P1] Table 2 took 421.855 sec, found 17179895526 matches
    [P1] Table 3 took 692.764 sec, found 17180041892 matches
    [P1] Table 4 took 849.356 sec, found 17180079257 matches
    [P1] Table 5 took 850.095 sec, found 17180234707 matches
    [P1] Table 6 took 808.098 sec, found 17180657540 matches
    [P1] Table 7 took 632.536 sec, found 17181473914 matches
    Phase 1 took 4335.28 sec
    [P2] max_table_size = 17181473914
    [P2] Table 7 scan took 64.9589 sec
    [P2] Table 7 rewrite took 350.204 sec, dropped 0 entries (0 %)
    [P2] Table 6 scan took 116.656 sec
    [P2] Table 6 rewrite took 173.673 sec, dropped 2324880433 entries (13.532 %)
    [P2] Table 5 scan took 111.72 sec
    [P2] Table 5 rewrite took 168.228 sec, dropped 3047672881 entries (17.7394 %)
    [P2] Table 4 scan took 108.128 sec
    [P2] Table 4 rewrite took 164.688 sec, dropped 3315222793 entries (19.2969 %)
    [P2] Table 3 scan took 107.269 sec
    [P2] Table 3 rewrite took 165.895 sec, dropped 3420069244 entries (19.9072 %)
    [P2] Table 2 scan took 108.445 sec
    [P2] Table 2 rewrite took 164.671 sec, dropped 3462035936 entries (20.1517 %)
    Phase 2 took 1865.99 sec
    Wrote plot header with 252 bytes
    [P3-1] Table 2 took 257.428 sec, wrote 13717859590 right entries
    [P3-2] Table 2 took 214.897 sec, wrote 13717859590 left entries, 13717859590 final
    [P3-1] Table 3 took 263.115 sec, wrote 13759972648 right entries
    [P3-2] Table 3 took 213.227 sec, wrote 13759972648 left entries, 13759972648 final
    [P3-1] Table 4 took 267.454 sec, wrote 13864856464 right entries
    [P3-2] Table 4 took 412.839 sec, wrote 13864856464 left entries, 13864856464 final
    [P3-1] Table 5 took 288.155 sec, wrote 14132561826 right entries
    [P3-2] Table 5 took 401.173 sec, wrote 14132561826 left entries, 14132561826 final
    [P3-1] Table 6 took 290.407 sec, wrote 14855777107 right entries
    [P3-2] Table 6 took 385.246 sec, wrote 14855777107 left entries, 14855777107 final
    [P3-1] Table 7 took 305.812 sec, wrote 17181473914 right entries
    terminate called after throwing an instance of 'std::runtime_error'
      what():  thread failed with: thread failed with: small_delta >= 256 (407)
    Command terminated by signal 6
    198313.73user 15600.11system 2:43:47elapsed 2176%CPU (0avgtext+0avgdata 125113576maxresident)k

bladeuserpi commented 2 years ago

In my case I root-caused this to:
- I recently enabled the XFS "discard" mount option (the default for RHEL 8.5 is no discard)
- I am also using RAID0 with 3x NVMe plus another RAID0 with 2x SATA SSD
- Initially I suspected that a P3700 engineering sample with older firmware was a contributing factor, but the error also reproduced after replacing it with another SSD

There are also other error messages when running multiple times:

    [P3-1] Table 7 took 344.503 sec, wrote 17183710592 right entries
    terminate called after throwing an instance of 'std::runtime_error'
      what():  thread failed with: thread failed with: small_delta >= 256 (950)

    [P3-1] Table 7 took 343.122 sec, wrote 17182007329 right entries
    free(): invalid next size (normal)
    Command terminated by signal 6

    [P1] Table 2 took 790.174 sec, found 17179640565 matches
    terminate called after throwing an instance of 'std::runtime_error'
      what():  thread failed with: input not sorted

    [P1] Table 2 took 711.174 sec, found 17180065469 matches
    Command terminated by signal 11

These were sometimes logged (but not for all crashes) with the P3700 ES, so initially I suspected it was a contributing factor:

    blk_update_request: critical target error, dev nvme2n1, sector 1558722560 op 0x3:(DISCARD) flags 0x0 phys_seg 1 prio class 0
    blk_update_request: critical target error, dev nvme2n1, sector 1560120320 op 0x3:(DISCARD) flags 0x0 phys_seg 1 prio class 0

Conclusion:
- When mounting without discard, the problem goes away and madmax runs without errors.
- When mounting with discard, the problem comes back on my stacked mdraid configuration.
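One way to check and switch this (sketch only; the device /dev/md0 and the mount point /plot4 are assumptions based on the log above, adjust to your setup):

    # Show the current mount options for the plotting filesystem
    findmnt -o TARGET,FSTYPE,OPTIONS /plot4

    # Mount again without online discard (XFS accepts discard/nodiscard as mount options)
    umount /plot4
    mount -o nodiscard /dev/md0 /plot4

    # If trimming is still wanted, run it out-of-band instead of inline
    fstrim -v /plot4
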
bladeuserpi commented 2 years ago

A closer look shows this might be related to mdadm RAID0 with different-size SSDs:
- 1.2 TB
- 750 GB
- 1.6 TB

I expected mdadm RAID0 would be limited by the smallest member, i.e. 3x 750 GB, but it actually uses the full capacity of 1.2 + 0.75 + 1.6 TB. From my understanding it stripes across all 3 disks until the smallest disk is full, then stripes across the remaining disks, and so on.
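A quick way to see this on a live system (sketch; /dev/md0 is an example device name, not taken from the report):

    # RAID0 total size is the sum of the members, not 3x the smallest:
    #   0.75 + 1.2 + 1.6 TB ~= 3.55 TB  (not 3 x 0.75 TB = 2.25 TB)
    cat /proc/mdstat
    mdadm --detail /dev/md0
    lsblk -o NAME,SIZE,TYPE,MOUNTPOINT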

My current assumption is that "mdadm-raid0 + discard" has a data corruption bug when combining different-size disks (at least with the RHEL 8.5 kernel in my testing; I have not yet tested a newer/upstream kernel).