Can plotter utilise multiple NUMA nodes?

madMAx43v3r / chia-plotter

Apache License 2.0

Can plotter utilise multiple NUMA nodes? #490

Open athena9 opened 3 years ago

athena9 commented 3 years ago

I'm using the Windows binary compiled by stotiks on a TR 3995WX.

Only cores from Node 0 are used.

Is there a way to utilize the full CPU?

Thanks

scerbera commented 3 years ago

The 3990 is sometimes quicker with SMT off. Look for a program called Process Lasso; it's essential for getting the most out of these processors. You can use it to set the CPU affinity to whatever you want.

If you do want to use the full CPU, then I'd suggest copying chia_plot.exe and renaming the copy to chia_plot_2.exe, for example. Then have your script launch them as 2 separate processes, and assign chia_plot.exe to run on node 1, cores 0-63, and chia_plot_2.exe to run on cores 64-127.
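A minimal sketch of how the two launch commands could be generated, assuming Windows reports two NUMA nodes numbered 0 and 1 with 64 logical processors each (check this on your own machine). It relies on the `/NODE` and `/AFFINITY` options of the cmd.exe `start` command; when both are given, the mask is interpreted relative to the node, so bit 0 is that node's first logical processor. The helper names `full_mask` and `start_command` are made up for illustration, and the plotter's own arguments are left out.

```python
def full_mask(n_logical: int) -> int:
    """Bitmask with the lowest n_logical bits set, one bit per logical processor."""
    return (1 << n_logical) - 1


def start_command(exe: str, node: int, n_logical: int) -> str:
    """Build a cmd.exe `start` line that pins exe to every logical processor of a node.

    The empty "" is the window title that `start` expects before its options.
    """
    return f'start "" /NODE {node} /AFFINITY {full_mask(n_logical):#x} {exe}'


# Assumption: 2 nodes x 64 logical processors; append your usual plotter arguments.
commands = [
    start_command("chia_plot.exe", node=0, n_logical=64),
    start_command("chia_plot_2.exe", node=1, n_logical=64),
]

for line in commands:
    print(line)
```

The printed lines can be pasted into a batch file. Process Lasso or Task Manager can apply the same pinning after the processes have started.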

Also be aware that these chips are made up of chiplets, which complicates things further.

I'd be really interested in your results given you have twice the memory bandwidth of my 3990.

lordzamura commented 3 years ago

You will see group 1 and group 2. You can set the affinity manually.

athena9 commented 3 years ago

Running two plotters, each pinned to one node with Windows affinity, the two plots completed in about 40 minutes, so 1 per 20 minutes. I think that's no different from a 3990X. I've got lots of setting variants to run through, and will test with SMT options, Process Lasso, etc.

One plot per 20 minutes is significantly slower than the output I get from plotting in parallel with the vanilla CLI, so I'm hoping to shave a chunk off the madMAx times.
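To put that figure in daily terms, here is a rough conversion from time per plot to plots per day. The plot size is an assumption (a finished k=32 plot is taken as roughly 108.9 GB); the function names are illustrative only.

```python
SECONDS_PER_DAY = 24 * 60 * 60
PLOT_SIZE_TB = 0.1089  # assumption: one k=32 plot is about 108.9 GB


def plots_per_day(seconds_per_plot: float) -> float:
    """Sustained plots per day if one plot finishes every seconds_per_plot seconds."""
    return SECONDS_PER_DAY / seconds_per_plot


def tb_per_day(seconds_per_plot: float, plot_size_tb: float = PLOT_SIZE_TB) -> float:
    """Daily plotted capacity in TB for the same rate."""
    return plots_per_day(seconds_per_plot) * plot_size_tb


# Two pinned plotters finishing together every ~40 minutes: one plot per 20 minutes.
effective_seconds = 40 * 60 / 2
print(plots_per_day(effective_seconds))         # 72.0
print(round(tb_per_day(effective_seconds), 2))  # 7.84
```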

What works best on your 3990x?

scerbera commented 3 years ago

I have yet to beat my vanilla plotting output. I use that rig for other things, not just plotting, so I settled on a stagger of 1050 seconds, which ran really stably. I was able to push to 10 TB a day if I didn't use the machine. My best times so far are 1600 seconds, using PrimoCache as a RAM cache. I suspect that, with your memory I/O, setting up 8 processes and pinning each one to one of the 8 CCDs would yield the best result, probably with -r at around 6.
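The 8-process idea could be planned out roughly as follows. This is only a sketch under several assumptions: SMT is on, each CCD holds 8 cores (16 logical processors), the logical processors of a CCD have contiguous indices, and Windows splits the 128 logical processors into processor groups of 64. None of this is confirmed in the thread, so verify the real topology before relying on the masks. `ccd_affinity` is a hypothetical helper name.

```python
N_CCDS = 8              # from the suggestion above
LOGICAL_PER_CCD = 16    # assumption: 8 cores per CCD with SMT on
LOGICAL_PER_GROUP = 64  # assumption: size of a Windows processor group


def ccd_affinity(ccd: int) -> tuple[int, int]:
    """Return (processor group, affinity mask within that group) for one CCD.

    Assumes the CCD's logical processors are numbered contiguously.
    """
    first = ccd * LOGICAL_PER_CCD
    group = first // LOGICAL_PER_GROUP
    offset = first % LOGICAL_PER_GROUP
    mask = ((1 << LOGICAL_PER_CCD) - 1) << offset
    return group, mask


# One plotter process per CCD; print the pinning plan.
for ccd in range(N_CCDS):
    group, mask = ccd_affinity(ccd)
    print(f"CCD {ccd}: group {group}, mask {mask:#018x}")
```

Each line gives the group and mask to enter for one process in whichever affinity tool you use.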

scerbera commented 3 years ago

Also, -u at 256 is working well. 512 is much quicker in phase 1 but slower in phase 3; it might work better with more RAM, though.