madMAx43v3r / chia-plotter

Apache License 2.0

P3 "left entry" very slow #44

Open therealflinchy opened 3 years ago

therealflinchy commented 3 years ago

32 threads on a 2990wx, ramdisk, 970 evo+ in raid0

others seem to be getting more consistent speeds between left and right entries, 50s+/- for either one.

[Image: Screenshot from 2021-06-09 01-45-29]

madMAx43v3r commented 3 years ago

P3-2 is writing the left entries to RAM disk, but also the final plot to tmp_dir. Usually P3-2 is faster because it's not reading from tmp_dir, like P3-1. P3-1 Table 2 is always a bit slower because it has to do phase 2 work on table 1.

It's possible that during phase 1 your SSDs were able to trim enough space for phase 2, but then in phase 3 they ran out of trimmed space which will cause write speeds to go down a lot...

Also it's possible you're running out of RAM, phase 3 needs about 80GB of that RAM disk (40GB thereof is table 7 from phase 2, which is read at the end of phase 3).

therealflinchy commented 3 years ago

> P3-2 is writing the left entries to RAM disk, but also the final plot to tmp_dir. Usually P3-2 is faster because it's not reading from tmp_dir, like P3-1. P3-1 Table 2 is always a bit slower because it has to do phase 2 work on table 1.
>
> It's possible that during phase 1 your SSDs were able to trim enough space for phase 2, but then in phase 3 they ran out of trimmed space which will cause write speeds to go down a lot...
>
> Also it's possible you're running out of RAM, phase 3 needs about 80GB of that RAM disk (40GB thereof is table 7 from phase 2, which is read at the end of phase 3).

The SSDs are empty. Sustained write speed for each drive separately is ~1700 MB/s after cache, so I'm fairly certain it isn't that. iotop only shows ~50 MB/s of writes at any point in P3 that I've observed, so they're far from saturated.

Not running out of RAM: 188 GB installed, 110 GB for the ramdisk.

Edit: the SSDs are actually a bit slower than they should be, but still, 700 MB/s should be OK?
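The two checks in this comment (are the SSDs saturated, and does the RAM disk have room for phase 3) come down to simple arithmetic on the figures quoted in the thread. A minimal sketch follows; the helper names are hypothetical, and the utilization ratio is unit-independent as long as both numbers use the same unit.

```python
# Figures are the ones quoted in the thread; helper names are hypothetical.

def utilization(observed, capacity):
    """Fraction of capacity in use (both values in the same unit)."""
    return observed / capacity

def headroom(available, required):
    """Space left after the requirement is met (same unit for both)."""
    return available - required

# ~50 observed in iotop vs. the pessimistic ~700 sustained write figure.
ssd_util = utilization(50, 700)

# 110 GB RAM disk vs. the ~80 GB that phase 3 is said to need.
ramdisk_spare_gb = headroom(110, 80)

print(f"SSD write utilization: {ssd_util:.0%}")
print(f"RAM disk headroom: {ramdisk_spare_gb} GB")
```

Even with the slower 700 figure, the drives would be well under a tenth of their sustained write rate, which supports the poster's reading that neither disk throughput nor RAM disk size is the limit here.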

madMAx43v3r commented 3 years ago

Can you run top -H during P3-2 and show me a snapshot?

JUNQINGV587 commented 3 years ago

[Image: 微信图片_20210609014622 (WeChat image)]

madMAx43v3r commented 3 years ago

That's from P3-1 of the next table 3, improving the merge thread is on the ToDo list. When you see P3-1 on the terminal, that means it's currently processing P3-2.

therealflinchy commented 3 years ago

> Can you run top -H during P3-2 and show me a snapshot?

I've checked that my NVMe drives are performing as expected (maybe some of them aren't quite), but I've tried a few temp drives, RAID and non-RAID, and performance is... consistent inasmuch as it's bad on all of them, no matter how fast or slow they benchmark post-cache.

it ended in a 300 second P3-2 just after i took this screenshot.

[Image: ss]

madMAx43v3r commented 3 years ago

That's pretty bad indeed, those two threads at the top are the bottleneck of phase 3 stage 2, one of them should be close to 100% with that many threads. But they aren't, which means the bottleneck is somewhere else...

Try reducing the number of threads, maybe the other threads are stealing CPU time from those two critical ones.

madMAx43v3r commented 3 years ago

Ah shit, your CPU is 95% idle... and load average is ~5, something is up...

therealflinchy commented 3 years ago

It's a 2990wx so it's... It's its own thing, 32c/64t, the 2 NUMA node tempers performance expectations somewhat, but Linux is supposed to be good at allocating threads. P1+2 I'm pretty happy with, performance is consistent enough there even if I start a plot back to back after one fails/I end one.

My whole plotting experience with this system is pretty mediocre, so I was hesitant to post this, assuming it's likely my hardware somewhere, but I'm pretty stumped. I'm 95% sure I've eliminated it being NVMe related: I tried it on 2 separate (different model) single drives, a 2-drive RAID and a 4-drive RAID which benched at 2500-3000 MB/s or more sustained write with dd, and that's not helping.

madMAx43v3r commented 3 years ago

Yeah, top shows 0% wa, ie. 0% waiting for disks...
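The reasoning in the last few comments (CPU about 95% idle, load average around 5, 0% wa) can be sketched as a small parser for the CPU summary line that procps `top` prints. The sample line below is illustrative, not copied from the screenshot, and the thresholds in `diagnose` are arbitrary assumptions chosen only to make the logic concrete.

```python
import re

def parse_cpu_summary(line):
    """Parse a procps `top` CPU summary line into {field: percent}."""
    pairs = re.findall(r"([\d.]+)\s+([a-z]{2})\b", line)
    return {name: float(value) for value, name in pairs}

def diagnose(fields, load_avg, logical_cpus):
    """Rough classification; all thresholds are assumed, not measured."""
    if fields["wa"] >= 10.0:
        return "disk-bound"        # CPUs spend real time waiting for I/O
    if fields["id"] <= 20.0:
        return "cpu-bound"         # CPUs are mostly busy
    if load_avg / logical_cpus < 0.25:
        return "stalled: neither CPU nor disk is busy"
    return "inconclusive"

# Illustrative line in the usual `top` layout (values are made up to
# match the "95% idle, 0% wa" description in the thread).
sample = ("%Cpu(s):  4.6 us,  0.4 sy,  0.0 ni, 95.0 id,"
          "  0.0 wa,  0.0 hi,  0.0 si,  0.0 st")

fields = parse_cpu_summary(sample)
print(diagnose(fields, load_avg=5.0, logical_cpus=64))
```

A load average of about 5 on 64 logical CPUs means under a tenth of the machine is runnable at any moment, so with no I/O wait the time is most likely going to threads blocking on each other rather than on the disks.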

therealflinchy commented 3 years ago

Ahhhh right yep thanks, wasn't sure what to look for but that answers that for me!

i might also try swapping cpu's back to my 16 core 1950x to rule out silly 2990wx issues.

therealflinchy commented 3 years ago

Also, I'm experiencing the same issue in phase 4: others get ~60 seconds, mine takes some 430 seconds, at about 25-30 MB/s write.

madMAx43v3r commented 3 years ago

yeah phase 4 is very similar to phase 3...

madMAx43v3r commented 3 years ago

what seems to be slow for you is the final plot creation, which happens in phase 3 stage 2 and phase 4... there is a lot going on there, which I didn't code myself, just copied from chiapos.

therealflinchy commented 3 years ago

I guess I'll be able to test if it's CPU related tomorrow, hopefully that's the answer even though it means I'll be selling it. No idea why it would be, but would explain the somewhat disappointing overall plotting I've experienced if that's what it turns out to be.

madMAx43v3r commented 3 years ago

#89 might fix your performance issue

madMAx43v3r commented 3 years ago

P3-2 is now 30% faster on my machine. But it could make a bigger difference on your machine, since now there is much less context switching going on.

madMAx43v3r commented 3 years ago

Lol, did another test and this time it was slower... I think there is a trap: once it falls into it, it never recovers. Fixing it now...

therealflinchy commented 3 years ago

> P3-2 is now 30% faster on my machine. But it could make a bigger difference on your machine, since now there is much less context switching going on.

Oh man it's so much better! Only ran one plot but so far it looks great.

P3-1 50s, P3-2 35s, 630 total, shaved off about 10 mins. P3-1 table 7 was a little off at 110, but I don't have screenshots to tell whether it was consistently like that before.

P4 still abysmal, almost 600 seconds

madMAx43v3r commented 3 years ago

yeah P4 doesn't have the optimization yet

madMAx43v3r commented 3 years ago

P3-1 table 7 is a bit special, it reads about 2.5 times more data from phase 2 output (40GB vs <16GB for the other tables).
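As a rough sanity check on the 110 s reported for P3-1 table 7, one can scale the roughly 50 s seen for the other tables by the ratio of data read. This is only a sketch: it assumes P3-1 time grows linearly with bytes read from the phase 2 output, which the thread does not state, and the function name is hypothetical.

```python
def expected_seconds(baseline_seconds, baseline_gb, table_gb):
    """Scale a baseline time by the ratio of data read (assumed linear)."""
    return baseline_seconds * table_gb / baseline_gb

# ~50 s for an ordinary table reading up to 16 GB; table 7 reads ~40 GB.
estimate = expected_seconds(50, 16, 40)
print(f"Estimated P3-1 table 7 time: {estimate:.0f} s")
```

Under that assumption the estimate lands in the same range as the observed 110 s, so the slower table 7 looks like expected behaviour rather than a separate problem.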

madMAx43v3r commented 3 years ago

Ah I know what might be happening... since I split the "slice" thread into two threads (besides another optimization, passing data in much bigger chunks) it's now possible for them to end up on two different CPUs! And that is going to kill performance... but if you're lucky it's a lot faster.

therealflinchy commented 3 years ago

As in, 2 different CPUs in a multi-socket system? Probably a similar-ish potential pitfall for me with NUMA nodes on a single socket, but not an issue for most people with a normal setup?

madMAx43v3r commented 3 years ago

Yes
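One way to experiment with the pitfall described here is to keep a process on the CPUs of a single NUMA node, so that cooperating threads cannot land on different nodes. The sketch below is not what the plotter does (the thread does not say how, or whether, it pins threads); it is a Linux-only illustration that reads the node's CPU list from sysfs and uses Python's `os.sched_setaffinity`. The function names are hypothetical.

```python
import os

def parse_cpulist(text):
    """Parse a Linux cpulist string such as '0-15,32-47' into a set of ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-")
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return cpus

def pin_to_node(node=0):
    """Restrict the calling process to one NUMA node's CPUs (Linux only)."""
    path = f"/sys/devices/system/node/node{node}/cpulist"
    with open(path) as f:
        cpus = parse_cpulist(f.read())
    os.sched_setaffinity(0, cpus)  # pid 0 means the calling process
    return cpus

# Only the parser is exercised here; pin_to_node() needs a Linux host.
print(sorted(parse_cpulist("0-3,8-9")))
```

Calling `pin_to_node(0)` before launching worker threads or child processes would confine them to node 0, since the affinity mask is inherited. Whether that helps or hurts overall plot time on a given machine would need measuring, because it also limits how many cores the other phases can use.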

madMAx43v3r commented 3 years ago

check latest master, should be even faster now, P4 also optimized

therealflinchy commented 3 years ago

> check latest master, should be even faster now, P4 also optimized

much faster: about 50 more seconds shaved off P3, and P4 went from 650 seconds to 180 (writing at about 100 MB/s average; not sure what others are getting here, but it's not a big deal if i can just queue another plot to start when P4 starts, since P4 time is then almost irrelevant.)

looking at IO, some phases are maxing out my single nvme, see if i can save any time with a raid0 now

can't thank you enough for the hard work you're putting in and these massive improvements, now i've just gotta become chia rich and give some back lol

[Image: Screenshot from 2021-06-11 05-23-38]

madMAx43v3r commented 3 years ago

awesome times, P3-1 is next for an update, but it won't be as much I think

therealflinchy commented 3 years ago

i think my P4 is still a little behind others i've asked but at this point i'm putting that down to my finicky hardware configuration that i can keep tinkering with, given you managed to shave 70% off previous time, basically a miracle worker already lol. damn near similar plots per day to parallel plotting for me, likely a lot faster for others with more normal hardware, with a bunch of upsides anyway.
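The "70%" figure can be checked against the P4 times reported earlier in the thread (650 seconds before, 180 seconds after). A minimal sketch, with a hypothetical helper name:

```python
def reduction(before, after):
    """Fractional time saved going from `before` to `after` seconds."""
    return (before - after) / before

p4_saved = reduction(650, 180)
print(f"P4 time reduced by {p4_saved:.0%}")
```

That works out to a little over 70%, in line with the number quoted in the comment.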

madMAx43v3r commented 3 years ago

yeah I've noticed, P4 is a bit complex to optimize, there is still potential

therealflinchy commented 3 years ago

> yeah I've noticed, P4 is a bit complex to optimize, there is still potential

turns out i'm mostly talking to ballers with big ram servers plotting both tmp and tmp2 entirely to ramdisk so that explains the 60s times