madMAx43v3r / chia-plotter

Apache License 2.0
2.27k stars 662 forks source link

Massive drop in CPU usage when computing P1 Table 4. #746

Closed johnyoungberg closed 3 years ago

johnyoungberg commented 3 years ago

During P1 Table 1,2,3 my CPU utilization is very high. Something is happening when it gets to Table 4 and the CPU utilization drops to less than 10 on each core and my plotting times increase significantly.

Plotting on Ubuntu 20.04. System specs: 2x E5-2650 v4 32GB DDR4 1tb NVME connected via PCIe tmp1 and tmp2 are both on the NVMe destination on the NVMe as well during testing

Running plotter with this command: ./chia-plotter/build/chia_plot -n 1 -r 24 -t /mnt/chia_tmp/tmp1/ -2 /mnt/chia_tmp/tmp2/ -d /mnt/chia_tmp/chia_plots/

Output (some unnecessary lines removed)

Multi-threaded pipelined Chia k32 plotter - 17fd633

Final Directory: /mnt/chia_tmp/chia_plots/
Number of Plots: 1
Crafting plot 1 out of 1
Process ID: 9294
Number of Threads: 24
Number of Buckets P1:    2^8 (256)
Number of Buckets P3+P4: 2^8 (256)
Working Directory:   /mnt/chia_tmp/tmp1/
Working Directory 2: /mnt/chia_tmp/tmp2/
Plot Name: plot-k32-2021-07-07-05-04-61efd6475cc7b24fe77614f36477e23f09976f0867e438376737a90f0a2763b5
[P1] Table 1 took 18.7 sec
[P1] Table 2 took 100.447 sec, found 4294908851 matches
[P1] Table 3 took 120.906 sec, found 4294773715 matches
[P1] Table 4 took 1349.4 sec, found 4294600475 matches
[P1] Table 5 took 1256.3 sec, found 4294290395 matches
[P1] Table 6 took 1076.9 sec, found 4293689380 matches
[P1] Table 7 took 725.934 sec, found 4292468802 matches
Phase 1 took 4648.64 sec
[P2] max_table_size = 4294967296
[P2] Table 7 scan took 73.8469 sec
[P2] Table 7 rewrite took 388.723 sec, dropped 0 entries (0 %)
[P2] Table 6 scan took 76.8732 sec
[P2] Table 6 rewrite took 325.186 sec, dropped 581453851 entries (13.5421 %)
[P2] Table 5 scan took 72.0342 sec
[P2] Table 5 rewrite took 308.24 sec, dropped 762099415 entries (17.7468 %)
[P2] Table 4 scan took 71.4251 sec
[P2] Table 4 rewrite took 303.067 sec, dropped 828957993 entries (19.3023 %)
[P2] Table 3 scan took 68.3854 sec
[P2] Table 3 rewrite took 298.946 sec, dropped 855131329 entries (19.911 %)
[P2] Table 2 scan took 68.7507 sec
[P2] Table 2 rewrite took 298.131 sec, dropped 865663893 entries (20.1556 %)
Phase 2 took 2379.23 sec
Wrote plot header with 268 bytes
[P3-1] Table 2 took 420.394 sec, wrote 3429244958 right entries
[P3-2] Table 2 took 412.605 sec, wrote 3429244958 left entries, 3429244958 final
[P3-1] Table 3 took 428.575 sec, wrote 3439642386 right entries
[P3-2] Table 3 took 427.051 sec, wrote 3439642386 left entries, 3439642386 final
[P3-1] Table 4 took 420.703 sec, wrote 3465642482 right entries
[P3-2] Table 4 took 414.693 sec, wrote 3465642482 left entries, 3465642482 final
[P3-1] Table 5 took 410.38 sec, wrote 3532190980 right entries
[P3-2] Table 5 took 432.01 sec, wrote 3532190980 left entries, 3532190980 final
[P3-1] Table 6 took 439.167 sec, wrote 3712235529 right entries
[P3-2] Table 6 took 441.107 sec, wrote 3712235529 left entries, 3712235529 final
[P3-1] Table 7 took 501.625 sec, wrote 4292468802 right entries
[P3-2] Table 7 took 508.74 sec, wrote 4292468802 left entries, 4292468802 final
Phase 3 took 5262.13 sec, wrote 21871425137 entries to final plot
[P4] Starting to write C1 and C3 tables
[P4] Finished writing C1 and C3 tables
[P4] Writing C2 table
[P4] Finished writing C2 table
Phase 4 took 174.012 sec, final plot size is 108801169664 bytes
Total plot creation time was 12464.1 sec (207.735 min)

I haven't let the full plot finish yet running like this but I will add the remainder of the output when this plot completes. The full log has been added.

gryan315 commented 3 years ago

Why are you using both -t, -2, and -d if they are on the same drive? Just use -t. It looks to me like your drive can't handle the IO, and has exhausted its cache. What model are you using? Is it TLC NAND?

johnyoungberg commented 3 years ago

Yes it is TLC NAND (Inland). While I believe you are right, I'm trying to make sense of why the same issue isn't occurring on my other plotter that is using the same SSD and has 32GB RAM but uses a Ryzen 3700x single CPU

johnyoungberg commented 3 years ago

This issue has now spread to the Ryzen 3700x machine as well after updating to the current build. Using the windows binary 0.0.7 I was able to build plots with this exact setup in about 50-55 min. After updating to 0.1.1 for pool compatibility this system exhibits the same symptom.

Both plotters are plotting to the Inland 1TB NVMe which is TLC NAND.

madMAx43v3r commented 3 years ago

can you check top and see what the wa (waiting) percentage is? (next to idle percentage)

johnyoungberg commented 3 years ago

It is less than 1% before the slow down and after it fluctuates between 5% and 15%

stevekm commented 3 years ago

possibly related to #785

madMAx43v3r commented 3 years ago

It is less than 1% before the slow down and after it fluctuates between 5% and 15%

That would indicate a problem with your SSDs

johnyoungberg commented 3 years ago

Did something change between v .0.0.6 and 0.1.1 that would have caused this SSD issue? Multiple systems with no hardware change and significant increases in plotting time.

johnyoungberg commented 3 years ago

Closing due to likely being hardware related.