therealflinchy opened this issue 3 years ago
P3-2 is writing the left entries to RAM disk, but also the final plot to `tmp_dir`. Usually P3-2 is faster because it's not reading from `tmp_dir`, like P3-1. P3-1 Table 2 is always a bit slower because it has to do phase 2 work on table 1.
It's possible that during phase 1 your SSDs were able to trim enough space for phase 2, but then in phase 3 they ran out of trimmed space, which will cause write speeds to go down a lot...
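(If untrimmed flash is the suspect, one way to rule it out, a sketch assuming the temp drives are mounted filesystems and `fstrim` from util-linux is available, is to trim them manually between plots:)

```bash
# Manually trim all free space on the SSD-backed temp mounts
# (mount points are placeholders for your actual tmp_dir drives).
sudo fstrim -v /mnt/ssd1
sudo fstrim -v /mnt/ssd2

# Or trim every mounted filesystem that supports it:
sudo fstrim -va
```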
Also it's possible you're running out of RAM; phase 3 needs about 80 GB of that RAM disk (40 GB thereof is table 7 from phase 2, which is read at the end of phase 3).
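(For reference, a minimal sketch of setting up a ramdisk big enough for phase 3, assuming tmpfs and a placeholder mount point `/mnt/ram`:)

```bash
# Create a 110G tmpfs ramdisk, comfortably above the ~80G phase 3 needs;
# tmpfs only consumes physical RAM for what is actually written.
sudo mkdir -p /mnt/ram
sudo mount -t tmpfs -o size=110G tmpfs /mnt/ram

# Verify size and current usage.
df -h /mnt/ram
```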
SSDs are empty. Sustained write speeds for each separately are ~1700 MB/s after cache, so I'm fairly certain it isn't that. `iotop` is only showing ~50 MB/s of writes at any point in P3 that I've observed, so they're far from saturated.
Not running out of RAM: 188 GB installed, 110 GB for the ramdisk.
Edit: the SSDs are actually a bit slower than they should be, but still, ~700 MB/s should be OK?
Can you run `top -H` during P3-2 and show me a snapshot?
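(A convenient way to capture such a snapshot non-interactively, using `top`'s batch mode:)

```bash
# -H lists individual threads, -b runs in batch mode, -n 1 takes a
# single snapshot; head keeps the busiest threads at the top.
top -H -b -n 1 | head -n 40
```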
That's from P3-1 of the next table (table 3); improving the merge thread is on the ToDo list.
When you see P3-1 on the terminal, that means it's currently processing P3-2.
Can you run `top -H` during P3-2 and show me a snapshot?
I've checked that my NVMe drives are performing as expected (maybe some of them aren't quite), but I've tried it on a few temp drives, RAID and non-RAID, and performance is consistent, in as much as it's bad on all of them, no matter how fast or slow they benchmark post-cache.
It ended in a 300-second P3-2 just after I took this screenshot.
That's pretty bad indeed. Those two threads at the top are the bottleneck of phase 3 stage 2; one of them should be close to 100% with that many threads. But they aren't, which means the bottleneck is somewhere else...
Try reducing the number of threads, maybe the other threads are stealing CPU time from those two critical ones.
Ah shit, your CPU is 95% idle... and load average is ~5, something is up...
It's a 2990WX so it's... its own thing: 32c/64t, and the NUMA nodes temper performance expectations somewhat, but Linux is supposed to be good at allocating threads. P1+2 I'm pretty happy with; performance is consistent enough there, even if I start a plot back to back after one fails or I end one.
My whole plotting experience with this system is pretty mediocre, so I was hesitant to post this, assuming it's likely my hardware somewhere, but I'm pretty stumped. I'm 95% sure I've eliminated it being NVMe related: tried it on 2 separate (different model) single drives, a 2-drive RAID and a 4-drive RAID, which benched as at least 2500-3000 MB/s sustained write with dd, and that's not helping.
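(For anyone reproducing this, a sketch of a sustained-write benchmark with `dd` that bypasses the page cache; the test path is a placeholder:)

```bash
# Write 16 GiB with O_DIRECT so the page cache doesn't inflate the result;
# status=progress prints the running throughput.
dd if=/dev/zero of=/mnt/ssd1/ddtest bs=1M count=16384 oflag=direct status=progress
rm /mnt/ssd1/ddtest
```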
Yeah, `top` shows 0% wa, i.e. 0% waiting for disks...
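(The per-device view tells the same story; `iostat` from the sysstat package would show a saturated drive immediately:)

```bash
# -x adds extended stats including %util, refreshed every second;
# a disk-bound process would push %util on the temp drives toward 100.
iostat -x 1
```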
Ahhhh right yep thanks, wasn't sure what to look for but that answers that for me!
I might also try swapping CPUs back to my 16-core 1950X to rule out silly 2990WX issues.
Also experiencing the same slowdown for phase 4: others get ~60 seconds, mine takes some 430 seconds, at about 25-30 MB/s write.
Yeah, phase 4 is very similar to phase 3...
What seems to be slow for you is the final plot creation, which happens in phase 3 stage 2 and phase 4... there is a lot going on there, which I didn't code myself, just copied from chiapos.
I guess I'll be able to test if it's CPU-related tomorrow; hopefully that's the answer, even though it means I'll be selling it. No idea why it would be, but it would explain the somewhat disappointing overall plotting I've experienced if that's what it turns out to be.
P3-2 is now 30% faster on my machine. But it could make a bigger difference on your machine, since now there is much less context switching going on.
Lol, did another test and this time it was slower... I think there is a trap; once it falls into it, it never recovers. Fixing it now...
Oh man, it's so much better! Only run one plot so far, but it looks great.
P3-1 50s, P3-2 35s, 630s total; shaved off about 10 mins. P3-1 table 7 was a little off at 110s, but I don't have screenshots to tell whether it was consistently similar before or not.
P4 still abysmal, almost 600 seconds
Yeah, P4 doesn't have the optimization yet.
P3-1 table 7 is a bit special: it reads about 2.5 times more data from the phase 2 output (40 GB vs. <16 GB for the other tables).
Ah I know what might be happening... since I split the "slice" thread into two threads (besides another optimization, passing data in much bigger chunks) it's now possible for them to end up on two different CPUs! And that is going to kill performance... but if you're lucky it's a lot faster.
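(One way to check where the plotter's threads actually land, a sketch assuming the binary is named `chia_plot` as in the default build:)

```bash
# List every thread of the running plotter with the processor (PSR)
# it last ran on; a tightly-coupled pair sitting on cores of different
# NUMA nodes would explain the slowdown.
ps -T -o tid,psr,pcpu,comm -p "$(pgrep -d, chia_plot)"
```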
As in, 2 different CPUs in a multi-socket system? Probably a similarish potential pitfall for me with NUMA nodes on a single socket, but not an issue for most people with a normal setup?
Yes
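(On a multi-node part like the 2990WX, a common mitigation is to pin the whole plotter to one NUMA node; a sketch assuming `numactl` is installed, with the node number and paths as placeholders; check `numactl --hardware` for the actual topology:)

```bash
# Bind both the threads and their memory allocations to NUMA node 0,
# so the critical thread pair can't be scheduled across nodes.
numactl --cpunodebind=0 --membind=0 \
    ./build/chia_plot -r 32 -t /mnt/ssd1/ -2 /mnt/ram/ -d /mnt/dest/
```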
Check latest master, should be even faster now; P4 is also optimized.
Much faster: about 50 more seconds shaved off P3, and P4 went from 650 seconds to 180 (writing at about 100 MB/s average; not sure what others are getting here, but it's not a big deal if I can just queue another plot to start when P4 starts, so P4 time is almost irrelevant).
Looking at IO, some phases are maxing out my single NVMe; I'll see if I can save any time with a RAID 0 now.
Can't thank you enough for the hard work you're putting in and these massive improvements; now I've just gotta become Chia rich and give some back lol.
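(For the RAID 0 experiment mentioned above, a minimal sketch with `mdadm`; the device names are placeholders and this destroys any data on them:)

```bash
# Stripe two NVMe drives into one md device (wipes existing data!).
sudo mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/nvme0n1 /dev/nvme1n1

# Format and mount it as the plotter's tmp_dir.
sudo mkfs.xfs /dev/md0
sudo mkdir -p /mnt/raid
sudo mount /dev/md0 /mnt/raid
```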
Awesome times. P3-1 is next for an update, but it won't be as much, I think.
I think my P4 is still a little behind others I've asked, but at this point I'm putting that down to my finicky hardware configuration, which I can keep tinkering with; given you managed to shave 70% off the previous time, you're basically a miracle worker already lol. Damn near similar plots per day to parallel plotting for me, likely a lot faster for others with more normal hardware, and with a bunch of upsides anyway.
Yeah, I've noticed. P4 is a bit complex to optimize; there is still potential.
Turns out I'm mostly talking to ballers with big-RAM servers plotting both tmp and tmp2 entirely to ramdisk, so that explains the 60s times.
32 threads on a 2990WX, ramdisk, 970 EVO Plus in RAID 0.
Others seem to be getting more consistent speeds between left and right entries, around 50s for either one.