smartbitcoin / arc_plot

CPU & GPU plotter

arc_plot crashes in phase two after ~25 plots #11

Closed WillBVT closed 1 year ago

WillBVT commented 1 year ago

Hi there,

First off thanks so much for your work on arc_plot, the results are very impressive so far!

I've noticed that arc_plot crashes with an error after about 25 plots. I've attached a couple of screenshots of the errors I've gotten so far. This always seems to happen in phase 2. So far I've only been CPU plotting, but I'm going to test GPU plotting soon. Do you have any idea what could be causing these errors? Is it something in my setup that needs to be optimized?

System Info:

- Ubuntu 22.10
- R9 5900
- 128GB RAM
- Temp: 2x 980 Pro RAID 0 array
- Stage: 2x 870 Pro RAID 0 array

Thanks again for your work on this!

[Screenshots: Arc_plot error_2, Arc_plot error]

DudSeb commented 1 year ago

Hello,

I've got the same issue here. I've tried the following:

- memtest: no errors
- discard option enabled in /etc/fstab
- fresh install
- a few combinations of arc_plot settings, such as changing buckets, GPU threads, etc.

None of those helped.

My setup:

- Ubuntu 22.04
- 3970X
- 256 GB RAM
- Temp & WD: WD_BLACK SN850X HS 1000 GB

But here is what I've observed: when I started with 63 threads, the program crashed in phase 2 during plot 1. When I lowered it to 60 threads, it crashed during plot 8. The last setup was 58 threads, and the program crashed on plot 25. So it seems that with fewer threads it can make more plots before crashing.

Another observation: the -2 argument does not seem to work. I set up another NVMe drive for it, and during plotting it stays empty.

I'm really curious what could be wrong :/

smartbitcoin commented 1 year ago

If it's the bitfield assert crash, it's most likely a hardware instability issue. If the program removed that assert it would keep running, but the generated plot would be corrupted. Hardware here means RAM or SSD specifically (the CPU can be ruled out unless it is overclocked). Lowering the RAM overclock, using ECC RAM, or using a more stable SSD will help in those cases. (Fewer threads also put less stress on memory operations, which causes less RAM corruption too.)

SSD instability occurs more often when you copy a plot file out while plotting the next one at the same time.

One solution, if your SSD is really too slow, is to write a shell script that runs arc_plot -n 1 to create only one plot, then moves the plot out before running arc_plot again. Another way is to build a RAID 0 array from multiple SSDs, so that each SSD has enough resources for the parallel workflow. Running fstrim /your_ssd_mount_point after every plot sometimes helps too.

As for the temp directories: if you run in -R full-RAM mode, the -2 temp sorting files are created directly in RAM, which is why the -2 folder is empty. (GPU plotting behaves the same way.)

For the Threadripper 3970X, which has 32 cores and 64 threads, the best performance comes from setting -r 32 rather than anything above 32. Plot time is best and it is more stable.
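
One way to pick the -r value from the physical core count rather than the logical thread count, following the advice above. This is only a sketch: `physical_cores` is a hypothetical helper name, and it assumes `lscpu` from util-linux is installed (it falls back to `nproc`, which counts logical CPUs and may therefore be too high on SMT systems).

```shell
#!/bin/bash
# Hypothetical helper: count physical cores (not SMT threads).
physical_cores() {
  local n=0
  if command -v lscpu >/dev/null 2>&1; then
    # lscpu -p prints one CSV line per logical CPU;
    # each unique core,socket pair is one physical core.
    n=$(lscpu -p=CORE,SOCKET 2>/dev/null | grep -v '^#' | sort -u | wc -l)
  fi
  # Fallback: logical CPU count if lscpu is missing or printed nothing.
  if [ "${n:-0}" -lt 1 ]; then
    n=$(nproc)
  fi
  echo "$n"
}

echo "suggested -r value: $(physical_cores)"
```

On a 3970X with SMT enabled this would be expected to suggest 32, but treat the number as a starting point and compare plot times yourself.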

DudSeb commented 1 year ago

Hello,

Many thanks for your reply and great work with the app.

I've made a RAID 0 array from two 1TB WD Black SN750 drives. The BIOS is on default settings, so there is no RAM or CPU overclock. I've also set 32 cores on the arc_plot command line.

None of this resolved the issue. The program made 23 plots and crashed on number 24.

I don't think I'm going to buy 256 GB of ECC RAM, because I can't be sure it would resolve the problem.

I'm not so good at Linux. Could you please send me the script you mentioned in your first e-mail?

Best Regards, Sesbatian

In reply to SmartBitCoin's comment of Sun, 25.12.2022, 05:22.


smartbitcoin commented 1 year ago

Could you try without -R and point -2 to your ramdisk before switching to the script solution?

Create the ramdisk:
sudo mount -t tmpfs -o size=120G tmpfs /mnt/your_ramdisk_path/
./arc_plot ... -2 /mnt/your_ramdisk_path/
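
Before mounting a 120G tmpfs, it may be worth checking that the machine has that much memory available, since tmpfs pages are held in RAM. A small sketch: `mem_available_gib` is a hypothetical helper, and it assumes a Linux kernel that exposes `MemAvailable` in `/proc/meminfo`.

```shell
#!/bin/bash
# Hypothetical helper: print available memory in whole GiB (Linux only).
mem_available_gib() {
  # MemAvailable is reported in kB; 1 GiB = 1048576 kB.
  awk '/^MemAvailable:/ { print int($2 / 1048576) }' /proc/meminfo
}

want_gib=120   # tmpfs size from the mount command above
have_gib=$(mem_available_gib)
have_gib=${have_gib:-0}   # treat a missing field as 0
if [ "$have_gib" -lt "$want_gib" ]; then
  echo "only ${have_gib} GiB available; a ${want_gib}G tmpfs may not fit in RAM"
else
  echo "${have_gib} GiB available; a ${want_gib}G tmpfs should fit"
fi
```

This only compares sizes at one moment; the plotter's own memory use comes on top of whatever the ramdisk holds.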

DudSeb commented 1 year ago

Yes, I will test the configuration you sent, but earlier I started it like this:

./arc_plot -r 32 -G -n 165 -c -f -t /mnt/RAID/ -2 /mnt/RAID/ -d /mnt/DYSK/

I'm plotting on a 3090 Ti.

However, the result now is that the process crashed on plot number 19 with this error:

thread failed with: thread failed with: small_delta >= 256 (30601641984)

In reply to SmartBitCoin's comment of Sun, 25.12.2022, 17:11.


DudSeb commented 1 year ago

More results:

During plot number 1:

arc_plot: /home/jin/source/arc_plotter/include/chia/bitfield.hpp:156: int64_t bitfield::count(uint64_t, uint64_t) const: Assertion `start_bit <= end_bit' failed.
./arc.sh: line 1: 4637 Aborted (core dumped) ./arc_plot -r 63 -n 125 -c -f -t /mnt/RAID/ -2 /mnt/ramdisk/ -d /mnt/DYSK/

During plot number 8:

arc_plot: /home/jin/source/arc_plotter/include/chia/bitfield.hpp:156: int64_t bitfield::count(uint64_t, uint64_t) const: Assertion `start_bit <= end_bit' failed.
./arc.sh: line 1: 7854 Aborted (core dumped) ./arc_plot -n 125 -c -f -t /mnt/RAID/ -2 /mnt/ramdisk/ -d /mnt/DYSK/

smartbitcoin commented 1 year ago

It's now highly likely that your /mnt/RAID, which stores the tmp files, has an instability issue if plot files pile up on that drive. How fast do you copy a plot file from /mnt/RAID to your destination /mnt/DYSK? For regular SSDs like your 2x 1TB, when occupancy is high, performance drops, which causes write errors and later read failures. So please make sure each plot file is moved out of /mnt/RAID before you continue plotting.
Staging design is the key to stable plotting on your system. If you have a 3090 Ti, the new version 0.7 will be much faster, and it also requires even higher sustained sequential SSD read/write performance. I use three PCIe 4 SSDs in RAID 0 and run fstrim every time after a plot is moved out. By the way, GPU plotting doesn't need a ramdisk, so -2 can be skipped. If your SSD was heavily used before, you can also check its health status with sudo smartctl -a /dev/nvme1n1 (and likewise for each other drive) to confirm the SSD has no errors.

If smartctl doesn't detect any errors, the way to get your SSD ready for plotting is: sudo rm /mnt/RAID/* ; sudo fstrim /mnt/RAID

You can put all of this into a bash script like this:

#!/bin/bash
# Create 10 plots one at a time, emptying the temp drive between plots.
for i in {1..10}
do
  # Run just one plot and leave the final plot file on the temp drive.
  echo "start plot ${i}"
  ./arc_plot -r 32 -G -n 1 -c -f -t /mnt/RAID/ -d /mnt/RAID/
  # Pause for a few seconds.
  sleep 3
  # Move the plot file to the destination.
  echo "move plot ${i}"
  mv /mnt/RAID/*.plot /mnt/DYSK/
  sleep 3
  # Clean up the temp drive for the next plot.
  echo "clean up tmp drive"
  rm -f /mnt/RAID/*
  sudo fstrim /mnt/RAID
done

The example above is just for reference; you can add more to it. Also keep in mind that if you want to run a script that calls sudo, you should add yourself to the sudo group and configure it not to ask for a password, or just run the script as root.
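
A slightly more defensive variant of that loop, sketched under assumptions: it moves and cleans up only when the plotter exits successfully, and it waits until the temp drive's usage is at or below a threshold before starting the next plot. The function names, the `TMP_DIR`, `DST_DIR` and `MAX_USED_PCT` variables, and the 50% threshold are all hypothetical; the arc_plot flags are copied from the commands in this thread. It relies on GNU `df --output=pcent`.

```shell
#!/bin/bash
# Hypothetical settings; adjust to your own mount points.
TMP_DIR=${TMP_DIR:-/mnt/RAID}
DST_DIR=${DST_DIR:-/mnt/DYSK}
MAX_USED_PCT=${MAX_USED_PCT:-50}   # arbitrary example threshold

# Print the used-space percentage (integer) of the filesystem holding $1.
used_pct() {
  df --output=pcent "$1" | tail -n 1 | tr -dc '0-9'
}

# Return 0 when the filesystem holding $1 is at or below $2 percent used.
has_room() {
  [ "$(used_pct "$1")" -le "$2" ]
}

# Run one plot, then move it out; return non-zero on the first failing step.
plot_once() {
  ./arc_plot -r 32 -G -n 1 -c -f -t "$TMP_DIR/" -d "$TMP_DIR/" || return 1
  mv "$TMP_DIR"/*.plot "$DST_DIR"/ || return 1
  rm -f "$TMP_DIR"/*
  sudo fstrim "$TMP_DIR"
}

# Usage (not executed here):
#   for i in {1..10}; do
#     until has_room "$TMP_DIR" "$MAX_USED_PCT"; do sleep 30; done
#     plot_once || { echo "plot $i failed, stopping"; break; }
#   done
```

Stopping on the first failure keeps a crashed run from deleting temp files you might want to inspect, at the cost of needing a manual restart.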

WillBVT commented 1 year ago

Hey there,

Thanks a lot for your responses, much appreciated.

So far I haven't detected any issues with my memory or CPU stability when running the usual tests. I haven't had issues running hundreds of plots with BB or MMX on the same system. Arc_plot times are quite a bit better for CPU plotting, though.

As you suggested, I tried running with a mounted RAM disk rather than the -R parameter, but it doesn't seem to make a difference. I also tried setting the final directory to my second SSD array and using my -t as the stage, so the plot transfer was only taking about 1.5 minutes. This doesn't seem to help either.

When I get some time I may try a version of the script you suggested above and report back.

Thanks again!

smartbitcoin commented 1 year ago

A worn-out SSD is like an HDD with bad sectors. Most plotting instability is because an aged SSD has problems. The script above is meant to help refresh an old SSD: it removes all fragments and re-trims the drive for the plotting task. But an SSD will still eventually die if it is too badly worn.
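
Since drive wear is the suspected cause, the NVMe health fields printed by smartctl -a can be checked before a plotting run. A minimal sketch: `nvme_health_ok` is a hypothetical helper that reads smartctl's text output on stdin and flags a non-zero `Critical Warning` or `Media and Data Integrity Errors` value. The field names are the ones smartmontools prints for NVMe devices; SATA drives report different attributes, which this does not cover. Missing fields are treated as a failure.

```shell
#!/bin/bash
# Hypothetical helper: read `smartctl -a` output for an NVMe drive on stdin.
# Exit status 0 if no critical warning and no media errors are reported, 1 otherwise.
nvme_health_ok() {
  awk -F: '
    /^Critical Warning:/                { gsub(/[ \t]/, "", $2);  warn = $2 }
    /^Media and Data Integrity Errors:/ { gsub(/[ \t,]/, "", $2); media = $2 }
    /^Percentage Used:/                 { gsub(/[ \t]/, "", $2);  used = $2 }
    END {
      printf "critical_warning=%s media_errors=%s percentage_used=%s\n", warn, media, used
      exit !(warn == "0x00" && media == "0")
    }'
}

# Usage (needs root and smartmontools; device name as used earlier in this thread):
#   sudo smartctl -a /dev/nvme1n1 | nvme_health_ok || echo "drive reports errors"
```

A clean result here does not prove the drive is fine under sustained plotting load; it only rules out errors the drive has already logged.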