ALardu opened this issue 2 years ago:
The same error message occurred here:
Multi-threaded pipelined Chia k34 plotter - ecec17d
(Sponsored by Flexpool.io - Check them out if you're looking for a secure and scalable Chia pool)

Network Port: 8444 [chia]
Final Directory: /farm
Number of Plots: 40
Crafting plot 1 out of 40 (2022/04/22 12:38:43)
Process ID: 10272
Number of Threads: 36
Number of Buckets P1: 2^8 (256)
Number of Buckets P3+P4: 2^8 (256)
Pool Puzzle Hash: ...
Farmer Public Key: ...
Working Directory: /plot4/
Working Directory 2: /plot4/
Plot Name: plot-k34-2022-04-22-12-38-...
[P1] Table 1 took 80.5456 sec
[P1] Table 2 took 421.855 sec, found 17179895526 matches
[P1] Table 3 took 692.764 sec, found 17180041892 matches
[P1] Table 4 took 849.356 sec, found 17180079257 matches
[P1] Table 5 took 850.095 sec, found 17180234707 matches
[P1] Table 6 took 808.098 sec, found 17180657540 matches
[P1] Table 7 took 632.536 sec, found 17181473914 matches
Phase 1 took 4335.28 sec
[P2] max_table_size = 17181473914
[P2] Table 7 scan took 64.9589 sec
[P2] Table 7 rewrite took 350.204 sec, dropped 0 entries (0 %)
[P2] Table 6 scan took 116.656 sec
[P2] Table 6 rewrite took 173.673 sec, dropped 2324880433 entries (13.532 %)
[P2] Table 5 scan took 111.72 sec
[P2] Table 5 rewrite took 168.228 sec, dropped 3047672881 entries (17.7394 %)
[P2] Table 4 scan took 108.128 sec
[P2] Table 4 rewrite took 164.688 sec, dropped 3315222793 entries (19.2969 %)
[P2] Table 3 scan took 107.269 sec
[P2] Table 3 rewrite took 165.895 sec, dropped 3420069244 entries (19.9072 %)
[P2] Table 2 scan took 108.445 sec
[P2] Table 2 rewrite took 164.671 sec, dropped 3462035936 entries (20.1517 %)
Phase 2 took 1865.99 sec
Wrote plot header with 252 bytes
[P3-1] Table 2 took 257.428 sec, wrote 13717859590 right entries
[P3-2] Table 2 took 214.897 sec, wrote 13717859590 left entries, 13717859590 final
[P3-1] Table 3 took 263.115 sec, wrote 13759972648 right entries
[P3-2] Table 3 took 213.227 sec, wrote 13759972648 left entries, 13759972648 final
[P3-1] Table 4 took 267.454 sec, wrote 13864856464 right entries
[P3-2] Table 4 took 412.839 sec, wrote 13864856464 left entries, 13864856464 final
[P3-1] Table 5 took 288.155 sec, wrote 14132561826 right entries
[P3-2] Table 5 took 401.173 sec, wrote 14132561826 left entries, 14132561826 final
[P3-1] Table 6 took 290.407 sec, wrote 14855777107 right entries
[P3-2] Table 6 took 385.246 sec, wrote 14855777107 left entries, 14855777107 final
[P3-1] Table 7 took 305.812 sec, wrote 17181473914 right entries
terminate called after throwing an instance of 'std::runtime_error'
what(): thread failed with: thread failed with: small_delta >= 256 (407)
Command terminated by signal 6
198313.73user 15600.11system 2:43:47elapsed 2176%CPU (0avgtext+0avgdata 125113576maxresident)k
In my case I root-caused this to:
- I had recently enabled the XFS "discard" mount option; the RHEL 8.5 default is no discard (see the check below).
- I am also using a RAID0 of 3x NVMe plus a RAID0 of 2x SATA SSD.
- Initially I suspected that a P3700 engineering sample with older firmware was a contributing factor, but the error also reproduced after replacing it with another SSD.
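To verify whether online discard is actually in effect, a minimal check (assuming the plot temp filesystem is mounted at /plot4, as in the log above):

# Show the mount options for the plotting filesystem; look for "discard"
findmnt -no OPTIONS /plot4

# Show which block devices advertise discard support (non-zero DISC-GRAN / DISC-MAX)
lsblk -D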
Running it multiple times also produced other error messages:
[P3-1] Table 7 took 344.503 sec, wrote 17183710592 right entries
terminate called after throwing an instance of 'std::runtime_error'
what(): thread failed with: thread failed with: small_delta >= 256 (950)
[P3-1] Table 7 took 343.122 sec, wrote 17182007329 right entries
free(): invalid next size (normal)
Command terminated by signal 6
[P1] Table 2 took 790.174 sec, found 17179640565 matches
terminate called after throwing an instance of 'std::runtime_error'
what(): thread failed with: input not sorted
[P1] Table 2 took 711.174 sec, found 17180065469 matches
Command terminated by signal 11
These were sometimes logged (but not for all crashes) with the P3700 ES, which is why I initially suspected it was a contributing factor:
blk_update_request: critical target error, dev nvme2n1, sector 1558722560 op 0x3:(DISCARD) flags 0x0 phys_seg 1 prio class 0
blk_update_request: critical target error, dev nvme2n1, sector 1560120320 op 0x3:(DISCARD) flags 0x0 phys_seg 1 prio class 0
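For reference, one way to watch for these I/O errors while a plot is running (a hedged example, not part of the original report):

# Follow kernel messages and flag discard / target errors as they appear
journalctl -k -f | grep -Ei 'blk_update_request|critical target error'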
Conclusion:
- When mounting without discard, the problem goes away and madmax runs without errors.
- When mounting with discard, the problem comes back for my stacked mdraid configuration.
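A minimal sketch of the workaround, assuming the array is /dev/md0 mounted at /plot4 (placeholder names, not taken from the report): mount without the discard option and trim free space periodically instead.

# /etc/fstab entry without "discard"
/dev/md0  /plot4  xfs  defaults  0 0

# Trim free space manually, or enable the weekly timer shipped with util-linux
fstrim -v /plot4
systemctl enable --now fstrim.timer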
A closer look suggests this might be related to an mdadm RAID0 built from SSDs of different sizes:
- 1.2 TB
- 750 GB
- 1.6 TB
I expected mdadm RAID0 to be limited by the smallest member, i.e. 3x 750 GB, but it actually exposes the full capacity of 1.2 + 0.75 + 1.6 TB.
From my understanding it stripes across all three disks until the smallest disk is full, then continues striping over the remaining disks, and so on.
My current assumption is that "mdadm RAID0 + discard" has a data corruption bug when combining different-size disks (at least for the RHEL 8.5 kernel in my testing; I have not yet tested a newer/upstream kernel).
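To inspect the array layout and per-member discard capabilities (again assuming /dev/md0 as a placeholder array name):

# Show array members and their sizes; a RAID0 over different-size disks is
# internally split into zones once the smallest member is exhausted
mdadm --detail /dev/md0
cat /proc/mdstat

# DISC-GRAN / DISC-MAX are non-zero when a device accepts discard/TRIM
lsblk -D -o NAME,SIZE,DISC-GRAN,DISC-MAX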
Debian 11 without GUI + MadMax Chia plotter without GUI + what(): thread failed with: thread failed with: small_delta

Command:
cd chia-plotter
sudo ./build/chia_plot -n 1 -r 30 -u 128 -t /mnt/nvme/ -d /mnt/ssd/share/ -c xch1xxxxx -f 8bebxxxxx

Hardware: NVMe 943 GB (temp), SSD 235 GB (destination), 64 GB DDR4 RAM, CPU 2x Xeon 2667 v3.

Problem: during P1-P2, MadMax fills the entire /mnt/nvme/ temp disk (943 GB!), and the start of the third stage (P3) is interrupted by an error:

Wrote plot header with 252 bytes
[P3-1] Table 2 took 112.406 sec, wrote 3429424212 right entries
terminate called after throwing an instance of 'std::runtime_error'
what(): thread failed with: thread failed with: small_delta >= 256 (34292629513)
Aborted

RAM memtest: perfect. NVMe health: perfect.
Help me please!