madMAx43v3r / chia-gigahorse


Farm crashing after 14.5 PiB #187

Open · mrwarp opened this issue 1 year ago

mrwarp commented 1 year ago

We've been running Gigahorse for months with no issues. Once we added enough plots to cross the ~14.5 PiB (physical) threshold, some of our harvesters start submitting progressively later and later partials. If we remove plots to get back under 14.5 PiB, everything normalizes and runs fine. It does not matter which harvester we add or remove plots from, nor how many harvesters we use.

At first we thought the farmer had exceeded its hardware limits, but we moved the farmer to a much larger machine and the problem still persists. I am happy to provide whatever information you require to look into this issue, just ask! CPU info for the new farmer (running with 256 GB of RAM) is posted below.

```
Architecture:             x86_64
  CPU op-mode(s):         32-bit, 64-bit
  Address sizes:          45 bits physical, 48 bits virtual
  Byte Order:             Little Endian
CPU(s):                   64
  On-line CPU(s) list:    0-63
Vendor ID:                GenuineIntel
  Model name:             Intel(R) Xeon(R) CPU E5-4640 v2 @ 2.20GHz
    CPU family:           6
    Model:                62
    Thread(s) per core:   1
    Core(s) per socket:   32
    Socket(s):            2
    Stepping:             4
    BogoMIPS:             4399.99
    Flags:                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx rdtscp lm constant_tsc arch_perfmon nopl xtopology tsc_reliable nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm pti ssbd ibrs ibpb stibp fsgsbase tsc_adjust smep arat md_clear flush_l1d arch_capabilities
Virtualization features:
  Hypervisor vendor:      VMware
  Virtualization type:    full
Caches (sum of all):
  L1d:                    2 MiB (64 instances)
  L1i:                    2 MiB (64 instances)
  L2:                     16 MiB (64 instances)
  L3:                     40 MiB (2 instances)
NUMA:
  NUMA node(s):           4
  NUMA node0 CPU(s):      0-15
  NUMA node1 CPU(s):      16-31
  NUMA node2 CPU(s):      32-47
  NUMA node3 CPU(s):      48-63
```

madMAx43v3r commented 1 year ago

hmm I manage a 16 PiB physical farm without issues, but that's using flexfarmer.

are you using remote compute to share the load between harvesters?

mrwarp commented 1 year ago

We figured out the issue. Some machines were using NFSv3 instead of v4; we were hitting v3 limits, which caused the bottlenecks.
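For anyone hitting the same wall, a quick way to confirm which NFS protocol version a mount negotiated, and to pin it to v4, is sketched below. The server name, export path, and mount point are placeholders; adjust the mount options to match your own setup.

```sh
# Show the negotiated options for every NFS mount (look for vers=3 vs vers=4.x)
nfsstat -m

# Or check a single mount point
mount | grep /mnt/plots

# Example /etc/fstab entry that forces NFSv4.x (hypothetical server/export/mountpoint):
# nas01:/export/plots  /mnt/plots  nfs4  vers=4.2,ro,noatime,nofail  0  0

# Remount after editing fstab
sudo umount /mnt/plots && sudo mount /mnt/plots
```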
