madMAx43v3r / chia-plotter

Apache License 2.0
2.27k stars · 662 forks

mdadm RAID0 gives fread error, but individual drives work fine with no errors #518

Open vvavepacket opened 3 years ago

vvavepacket commented 3 years ago

I have 2 SSDs, and using them individually (2 parallel madmax plots), everything completes fine.

I combined them in RAID 0 using Ubuntu software RAID (mdadm), but I'm getting this error once in a while (seemingly at random).

People say the drive is dying... but if it's dying, how come it works just fine individually?

People say it has something to do with TRIM. Should we disable TRIM? How do we do a proper TRIM on Ubuntu for RAID devices?

Plot Name: plot-k32-2021-06-15-23-40-madapaka
[P1] Table 1 took 20.6869 sec
[P1] Table 2 took 158.572 sec, found 4294899676 matches
[P1] Table 3 took 248.297 sec, found 4294794896 matches
terminate called after throwing an instance of 'std::runtime_error'
  what():  thread failed with: fread() failed
Aborted (core dumped)
GTANAdam commented 3 years ago

Avoid using mdadm; there's a software bug somewhere, and it seems like it can't handle multiple I/O streams. If possible, use hardware-accelerated RAID0.

vvavepacket commented 3 years ago

What's the alternative?


GTANAdam commented 3 years ago

What's the alternative?

H/W RAID

SNGDude commented 3 years ago

A hardware RAID card.

cyperbg commented 3 years ago

I did fake RAID in the BIOS (X570 Aorus Ultra) and it didn't show up in Ubuntu. Disks only showed the individual drives, not the RAID volume. Any idea why?

aznboy84 commented 3 years ago

I got this problem as well, but only on 1/5 of the plotting machines. I cleaned the memory sticks with 90% alcohol and the problem was gone (the memory test didn't report anything anyway).

andyvk85 commented 3 years ago

@vvavepacket Do you use external USB SSDs for your RAID0 with mdadm?

vvavepacket commented 3 years ago

I use internal NVMe drives.


andyvk85 commented 3 years ago

I use internal NVMe drives.

Ah okay, then I cannot help, unfortunately :/

vvavepacket commented 3 years ago

What if it's USB drives? What would be your advice, or any tweaks?


andyvk85 commented 3 years ago

At the beginning, I had the same errors with a lot of USB drives connected to USB hubs and so on. It was very important to consider the USB channels, which USB ports are assigned to which channel, the USB hub devices, and the kind of USB drives... but in your case you use internal drives, so you don't have to worry about any of that.

aznboy84 commented 3 years ago

@vvavepacket Do you use external USB SSDs for ur RAID0 with mdadm?

I did something even weirder: RAIDing the HDDs with a ramdisk to increase HDD performance. It works well and gives roughly a 20% performance boost with a 23G ramdisk.

[root@localhost build]# parted
GNU Parted 3.1
Using /dev/sda
Welcome to GNU Parted! Type 'help' to view a list of commands.
(parted) print devices
/dev/sda (80.0GB)
/dev/sdb (500GB)
/dev/sdc (500GB)
/dev/sdd (500GB)
/dev/sde (500GB)
/dev/mapper/centos-home (18.5GB)
/dev/mapper/centos-root (38.0GB)
/dev/md0 (123GB)
/dev/md1 (1901GB)
(parted)
[1]+  Stopped                 parted
[root@localhost build]# cat /proc/mdstat
Personalities : [raid0]
md0 : active raid0 ram0[4] sde1[3] sdd1[2] sdc1[1] sdb1[0]
      120499200 blocks super 1.2 512k chunks

md1 : active raid0 sdd2[2] sdc2[1] sdb2[0] sde2[3]
      1856532480 blocks super 1.2 512k chunks

unused devices:

vvavepacket commented 3 years ago

Do you think I could RAID a ramdisk with an HDD then? I have 64GB of RAM. Would it offer the same 20% performance improvement?


aznboy84 commented 3 years ago

Technically you can; mdadm works with partitions as well. For example, make two partitions of 40GB each, plus a 40GB ramdisk, and just make a RAID 0 out of them:

sudo modprobe brd rd_nr=1 rd_size=41943040   (rd_size is in KiB: 1024 x 1024 x 40 = 40GiB)

and then something like

sudo mdadm --create --verbose /dev/md0 --level=0 --raid-devices=3 /dev/sdb2 /dev/sdb3 /dev/ram0 (where sdb2 and sdb3 are the two partitions on the HDD/SSD; you can also use a ramdisk RAID to reduce SSD wear / increase speed, but its performance is nowhere near full-RAM plotting)

and then something like

sudo mkfs -t xfs -f /dev/md0
sudo mkdir -p /mnt/md0
sudo mount /dev/md0 /mnt/md0
sudo chmod 777 /mnt/md0

and then you have a RAID0 disk at /mnt/md0
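
A minimal teardown sketch, assuming the same device names as above (note that the brd ramdisk contents are lost on reboot, so the array has to be rebuilt each time):

sudo umount /mnt/md0
sudo mdadm --stop /dev/md0                          # stop and release the array
sudo mdadm --zero-superblock /dev/sdb2 /dev/sdb3    # clear the RAID metadata from the real partitions
sudo rmmod brd                                      # unload the ramdisk module and free its RAM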

carlfarrington commented 3 years ago

I did fake RAID in the BIOS (X570 Aorus Ultra) and it didn't show in Ubuntu. Disks didn't show the raid disk, just individual disks. Any idea why?

I think it's dmraid that supports this, vs. md-raid for Linux software RAID. You might have to modprobe dmraid, or pass a kernel parameter like dmraid=true.

andyvk85 commented 3 years ago

Technically you can, the mdadm works with partition as well, for example make 2 partitions with the size of 40GB, then a ramdisk with 40GB size too, then just make a raid 0 out of them

At first glance it seems like a good idea, but with a deeper understanding of Linux and RAID-0 you will recognize that the idea is not really suitable!

1. You don't need a ramdisk to get a performance boost in plotting

* just use the Linux page cache for all file write/read handles:

sudo sysctl -w vm.vfs_cache_pressure=0
sudo sysctl -w vm.swappiness=0
sudo swapoff -a

* from now on, your swap should be disabled and the Linux page cache will use all your available RAM dynamically! (a sketch for making these sysctl settings persistent follows after point 2)
* when starting a new plot, check if the buffer/cache is really used:

watch -n1 "free -h"

2. In RAID-0 the weakest / slowest device dictates the speed of the others

* obviously, a very fast device combined with two slower ones does not make sense at all for RAID-0
* in RAID-1 it would make sense; there you would also have the --write-mostly option when creating your RAID
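
A minimal sketch for making the two sysctl settings from point 1 persistent across reboots, assuming a hypothetical drop-in file name:

cat <<'EOF' | sudo tee /etc/sysctl.d/90-chia-plotting.conf
vm.vfs_cache_pressure = 0
vm.swappiness = 0
EOF
sudo sysctl --system    # reload all sysctl configuration files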

vvavepacket commented 3 years ago

I'd like to correct you on the swappiness: if you have lots of RAM, don't bother configuring or turning off swap, since the system will never actually use the swap files for any file handles; it will just rely on RAM. But if you get close to the edge, turning swap off might help, at the cost of crashing your system if you go over.


andyvk85 commented 3 years ago

That's right: if you have more RAM than the task requires, then you don't have to worry about swapping. I just assumed that not everyone here has at least 128GB of RAM available in a single system ;)

andyvk85 commented 3 years ago

Btw: if you really want more explicit control over your RAM resources, then use an LVM cache assigned to an LV. It works quite well when you also use your system for other things that require file write/read ops, so it can be beneficial in such cases.

(A German tutorial, sorry ^^; instead of an SSD cache you can use your ramdisk) https://www.thomas-krenn.com/de/wiki/LVM_Caching_mit_SSDs_einrichten

EDIT: use write-back mode instead of write-through mode!
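
A minimal sketch of such a setup, assuming a hypothetical volume group plotvg with an origin LV tmplv holding the plot temp directory and a brd ramdisk as the cache device (keep in mind the ramdisk is volatile, so a write-back cache on it risks data loss on a crash or power failure):

sudo modprobe brd rd_nr=1 rd_size=33554432      # ~32GiB ramdisk (rd_size is in KiB)
sudo pvcreate /dev/ram0                         # turn the ramdisk into an LVM physical volume
sudo vgextend plotvg /dev/ram0                  # add it to the existing volume group
sudo lvcreate --type cache-pool -L 30G -n rampool plotvg /dev/ram0    # cache pool placed on the ramdisk
sudo lvconvert --type cache --cachemode writeback --cachepool plotvg/rampool plotvg/tmplv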

therealflinchy commented 3 years ago

+1, mdadm makes my PC crash after a few hours. Pity.

vvavepacket commented 3 years ago

What's a good alternative to mdadm?

Hardware RAID is not an option :/


wallentx commented 3 years ago

Everyone overlooks it:

sudo mkfs.f2fs -f \
    -l f2fs-collection \
    -O extra_attr,inode_checksum,sb_checksum \
    /dev/sdd \
    -c /dev/sde \
    -c /dev/sdf \
    -c /dev/sdg \
    -c /dev/sdh \
    -c /dev/sdi \
    -c /dev/sdj \
    -c /dev/sdk

In the above command, you are setting /dev/sdd as the meta device, which is the device that you mount. You ignore the others.
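
A short usage sketch, reusing the device names from the command above (the mount point is arbitrary):

sudo mkdir -p /mnt/f2fs-collection
sudo mount /dev/sdd /mnt/f2fs-collection    # mount only the meta device; the -c members belong to the same volume
df -h /mnt/f2fs-collection                  # should report the combined capacity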

From man mkfs.f2fs

-c device-list Build f2fs with these additional comma separated devices, so that the user can see all the devices as one big volume. Supports up to 7 devices except meta device.

The part about passing the devices in a comma separated list is actually incorrect. You have to do it as I had shown.

An f2fs disk collection is basically RAID0, without mdadm, except that it doesn't care what size the individual volumes are, or even what those volumes are. You can combine a HDD, a hardware RAID0 volume, an mdadm volume, SSDs, and thumb drives into a single volume. Not that there'd ever be a reason to do that, as it would perform horribly, but there's no function stopping you from doing it.

I've been running an f2fs disk collection instead of mdadm purely due to getting better IOPS.

It has several quirks of its own, however. It doesn't like it when you build a volume under one kernel version, then switch to an older kernel. Requires some experimenting.

Bonus f2fs fact! I picked up 14 10TB HDDs for $100 each last month, because they were host-managed SMR zoned drives. Not gonna lie, I had no idea how to use these or how they work, but with sudo mkfs.f2fs -m /dev/sdbx you can format one and mount it like any normal HDD.

-m
Specify that the f2fs filesystem supports the block zoned feature. Without it, the filesystem doesn't support the feature.

vvavepacket commented 3 years ago

I've been using mdadm formatted as f2fs and the performance is good.

You are saying f2fs can do RAID on its own, without mdadm, and is even better?

Do you have benchmarks to back this up? How many minutes of plotting time did you save?


wallentx commented 3 years ago

Yes. I had a bunch of stuff written here, but I accidentally closed the tab. I've been plotting since December and I'll have over 2k plots in a few hours; I've basically been testing various things the whole time. Just on its own, f2fs is pretty fast. I don't have any screenshots of my f2fs disk collection benchmarks though. [screenshot: 2021-05-28_18-39-1]

Then recently, I discovered the speed benefits of a well-tuned XFS filesystem on a hardware RAID0 array. XFS is what I use now, and it has given me the best performance. You can probably see here where I switched to XFS. And then I ran the multithreaded chiapos.

[screenshots: 2021-jun07-window_2, 2021-jun08-window, 2021-jun09-window_2]

This plotter turns everything on its head. It hardly needs any space at all, so my 11.5TB RAID0 SSD array that I was using as -t with 24 parallel plots over 48 cores is now quiet, and I just do one plot at a time. I'm not using a RAM tmpfs because I get slower times for whatever reason. [screenshot: Screenshot_20210618-024907] The fastest time I've gotten is ~27 min.

If your -t is the bottleneck and you want an alternative to mdadm, make an f2fs disk collection, run a plot, and see what you get. Phoronix has some benchmark data comparing it with mdadm and various other filesystems.

andyvk85 commented 3 years ago

I'm not sure whether your "bench-chia" script uses direct access to your drives, or whether it also uses the Linux page cache (RAM) for a certain time to handle the writes and reads. (I often use the dd tool with oflag=direct, which gives me reliable results.)
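
For reference, a minimal dd sketch that bypasses the page cache; the target path is just an example:

dd if=/dev/zero of=/mnt/md0/ddtest bs=1M count=4096 oflag=direct status=progress   # sequential write, no page cache
dd if=/mnt/md0/ddtest of=/dev/null bs=1M iflag=direct status=progress              # sequential read back
rm /mnt/md0/ddtest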

XFS has some nice features; the most important one for RAID users is the automatic alignment of physical, chunk and logical sizes. So in most cases (not always!) you don't have to worry about the stride and stripe width. (https://wiki.archlinux.org/title/RAID#Calculating_the_stride_and_stripe_width)

aznboy84 commented 3 years ago

Technically you can, the mdadm works with partition as well, for example make 2 partitions with the size of 40GB, then a ramdisk with 40GB size too, then just make a raid 0 out of them

At the first glance it seems to be a good idea, but with a deeper understanding of Linux and RAID-0 you will recognize that the idea is not really suitable!

1. You don't need a ramdisk to bring in a performance boost in plotting

* just use the Linux page cache at all for file write/read handles
sudo sysctl -w vm.vfs_cache_pressure=0
sudo sysctl -w vm.swappiness=0
sudo swapoff -a
* from now on, your swap should be disabled and the Linux page cache will use all your available RAM dynamically!

* when starting a new plot, check if the buffer/cache is really used:
watch -n1 "free -h"

2. In RAID-0 the weakest / slowest device dictates the speed of the others

* obviously, a very fast device combined with two slower ones does not make sense at all for RAID-0

* in RAID-1 it would be make sense, there you would have also the `--write-mostly` option in creating your raid

Chia plotting doesn't gain any benefit from disk caching; if you really want to know how the plotting process works, observe it with FileActivityWatch. While it is true that RAID0 of a very fast device with a very slow device slows the overall performance, it is still faster than the slow device alone. I managed to reduce plotting time from 3 hours to 2 hours on my SSD potatoes, and by something like 20% on the HDD potatoes. You can also configure mdadm to work in linear mode, for example taking the first 40GB of the ~110GB temp space from a ramdisk and the remaining 70GB from an SSD.
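
For illustration, a minimal sketch of such a linear array; the device names and sizes are examples only:

sudo modprobe brd rd_nr=1 rd_size=41943040                        # 40GiB ramdisk, as above
sudo mdadm --create --verbose /dev/md2 --level=linear --raid-devices=2 /dev/ram0 /dev/nvme0n1p2
sudo mkfs -t xfs -f /dev/md2                                      # linear mode concatenates the devices in the given order
sudo mkdir -p /mnt/md2
sudo mount /dev/md2 /mnt/md2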

andyvk85 commented 3 years ago


I used "dstat" tool to check the final writes to my hdd RAID-0, and I used "free -h" to check what really happend with my RAM (page cache), so with the new madmax plotter I could achieve 44mins for a single plot.

The reason why you get a little(!) performance boost is that, that in RAID-0 the effort for each device will decrease with the amount of devices in the RAID-0. Using a RAM drive in this way is highly inefficient, you can have better performance in using LVM cache or Linux page cache.

The linear mode of mdadm sounds bad, seems so that you are using just only a device at the same time, where is the benefit!?

EDIT: my specs: Intel 11900, 64GB 3200 RAM, 10x external USB HDDs with mdadm RAID-0 as XFS (with normal mount options, not tuned!) LinuxMint 20.1, Kernel 5.10 OEM, Linux Page Cache

aznboy84 commented 3 years ago

Do you have a way to see the actual hit rate of the Linux cache? Last time I checked, using PrimoCache on Windows with a ~20GB cache, the actual hit rate during the plotting process was below 0.1% (and yes, PrimoCache does increase plotting performance by maximizing HDD activity). The ramdisk RAID isn't as bad as you say: with a single HDD/SSD you can get double the read/write performance, though the more HDDs you add, the less performance you gain... and it is true that with a 10+ HDD RAID and a single ramdisk you gain pretty much nothing :D

andyvk85 commented 3 years ago

As I said, I just used the "free -h" command, observing the "buff/cache" column in a "watch -n1" loop. At the beginning of plotting it increases by ~2GB/s up to the maximum available RAM (in my case ~62GB, because I use a different page cache regime than the standard one!). The nice thing is that the page cache reacts dynamically to the available memory: if the plotting process needs more RAM it can get it, and the page cache shrinks dynamically, and vice versa. (Btw: on Linux, the GNOME system monitor will not show you the page cache.)

Example: "free -h" with vm.vfs_cache_pressure=0 set via sysctl: [screenshot]

aznboy84 commented 3 years ago

[screenshots: ssd_potato, hdd_potato]

atop + htop + lm_sensors + hddtemp... The average bucket size for the tables in each phase is ~100GB, except for P1T1 at ~70GB; any cache smaller than the bucket size won't get hit, because it's constantly updated with new buckets.

andyvk85 commented 3 years ago

You are confusing me :D That's good, it offers a chance to learn! :D

You mentioned that the cache is not hit? Maybe that's because it's not observed by atop? Let's look into the man page of atop: "(...) the memory for the page cache ('cache' and 'buff' in the MEM-line) is not implied! (...)" I think you cannot see or measure hits with your method... and your argument is a little bit inconsistent (to me): when new buckets are written to the page cache, then it's a hit, right?

andyvk85 commented 3 years ago

maybe this helps you a little bit: https://www.thomas-krenn.com/en/wiki/Linux_Page_Cache_Basics https://learning.oreilly.com/library/view/understanding-the-linux/0596005652/ch15s01.html

aznboy84 commented 3 years ago

You are confusing me :D that's good, offers a chance to learn! :D

You mentioned that the cache is not hit? Maybe it's because it's not observed by atop? Let's look into MAN of atop... "(...) the memory for the page cache ('cache' and 'buff' in the MEM-line) is not implied! (...)" I think that you cannot see/measure hits with your method.. and your argument is a little bit inconsistent (to me).. when new buckets are written to the page cache, then it's hit, right?

I don't have a way to see the amount of cache hits on Linux, but it's not important. Let's have a look at how the plotting process works, for example with 32 buckets:

There isn't caching software smart enough to handle the plotting process, not if the cache size is smaller than the total bucket size of each phase. It can't keep a generated bucket in memory until it gets hit, because some of the generated data never gets hit and the cache is stuck holding it. And if it updates the cache with newly generated buckets, those also never get hit, because the data that gets read is >100GB behind the last data written.

andyvk85 commented 3 years ago

There isn't a caching software smart enough to handle the plotting process

That's right, but it doesn't have to be smart to provide a benefit. It's a write-back cache in my setup.

At P1T1, the plotting software generates 32 buckets of ~2400MB each, or 76800MB in total.

So the first ~62000MB are written to the page cache at >2GB/s; once the available RAM is exhausted, the page cache writes to my RAID-0 at about 1.2GB/s. Sometimes old buckets in the page cache are dropped because they are not used anymore, and the page cache then accepts new writes at 2GB/s again (instead of 1.2GB/s) because RAM is available again... and so on.

For my setup it's more beneficial than using a 62GB ramdisk. The results are great: 44 min instead of ~2 h.

andyvk85 commented 3 years ago

[screenshot: 44min_single_plot]

aznboy84 commented 3 years ago

@andyvk85 Did you try different write-back cache sizes? On my potatoes it gets worse when I increase or decrease the cache size; maybe I'll disable the ramdisk and give it another try...

The default setting is 20% of free memory:

echo 20 > /proc/sys/vm/dirty_ratio
echo 10 > /proc/sys/vm/dirty_background_ratio

andyvk85 commented 3 years ago

I'm on LinuxMint 20.1, I don't use swap files/partitions, and I run a very aggressive page cache by doing the following:

sudo sysctl -w vm.vfs_cache_pressure=0
sudo sysctl -w vm.swappiness=0    # I know this is not really required if you don't have swap areas!

and then please set the final plot dir to another drive

It would be nice to hear from you again after a test. I also came up with another thought: what will happen when k32 is no longer valid for the Chia network? Then you would need a lot more RAM to use a ramdisk for k33/k34, right? So this solution with an aggressive page cache could be applied there too.

aznboy84 commented 3 years ago

@andyvk85 The test is still running, but it already seems better than the ramdisk setup: CPU utilization is higher, and tables 2/3 of phase 1 finished 1 or 2 minutes faster... Maybe you just saved my day... too bad I fried a lot of brain cells on the ramdisk setup and it didn't work as expected :(

andyvk85 commented 3 years ago

That sounds great! :D Some months ago I also started with different kinds of ramdisks (like rapiddisk) and so on, so I did a lot of research, and I believe this is the better solution ;) I'm happy for you! Could you please post a comparison of the plots here, so everyone is informed? ;)

wallentx commented 3 years ago

nch-chia" script uses "direct access" to your drives or if it uses also Linux page cache (RAM) for a certain time to handle the writes and reads? (I often use the dd tool with oflag=direct, which gives me reliable results).

https://github.com/wallentx/farm-and-ranch-supply-depot/blob/main/extra/bench-chia, which is just an execution of a fio "profile" taken from https://github.com/Chia-Network/chia-blockchain/wiki/Storage-Benchmarks. @jmhands mentioned that he has an even better version of this that more closely simulates the plot creation cycle, but I haven't seen him share it anywhere.
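
The wiki profile is the reference; as a rough illustration of what running fio against a temp directory looks like (the parameters here are made up, not the wiki's):

fio --name=plot-temp-sim --directory=/mnt/md0 --size=4G \
    --rw=randrw --rwmixwrite=70 --bs=128k --direct=1 \
    --numjobs=4 --group_reporting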

XFS has some nice features, the most important for RAID users is the automatic alignment of physical, chunk and logical sizes. So, you don't have to care about the stride and stripe width, in most cases (not always!!).

For me, it was actually the attention put toward specifying the details of the underlying stripe width and stripe unit that made all the difference.

[screenshot: xfs-bench]
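
A sketch of what specifying that geometry at mkfs time can look like, assuming a 4-member RAID0 with a 512K chunk (the values and the device name are examples and must match your actual array):

sudo mkfs.xfs -f -d su=512k,sw=4 /dev/md0   # su = stripe unit (chunk size), sw = number of data disks
sudo mount -o noatime /dev/md0 /mnt/md0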

andyvk85 commented 3 years ago

nch-chia" script uses "direct access" to your drives or if it uses also Linux page cache (RAM) for a certain time to handle the writes and reads? (I often use the dd tool with oflag=direct, which gives me reliable results).

https://github.com/wallentx/farm-and-ranch-supply-depot/blob/main/extra/bench-chia which is just an execution of a fio "profile" taken from https://github.com/Chia-Network/chia-blockchain/wiki/Storage-Benchmarks @jmhands mentioned that he has an even better version of this that closer simulates the plot creation cycle, but I haven't seen him share it anywhere.

XFS has some nice features, the most important for RAID users is the automatic alignment of physical, chunk and logical sizes. So, you don't have to care about the stride and stripe width, in most cases (not always!!).

For me, it was actually the attention put toward specifying the details of the underlying stripe width and stripe unit that made all the difference.

xfs-bench

The XFS tuning is interesting, thanks for sharing!

aznboy84 commented 3 years ago

@andyvk85, can you check the amount of memory that is actually used by the write-back cache?

Using this command:

watch grep -e Dirty: -e Writeback: /proc/meminfo

Also, can you check the write-back parameters on your plotting machine?

cat /proc/sys/vm/dirty_ratio
cat /proc/sys/vm/dirty_background_ratio
cat /proc/sys/vm/dirty_expire_centisecs
cat /proc/sys/vm/dirty_writeback_centisecs

I use these, telling the system that it's fine to delay disk writes for ~10 minutes:

sudo sysctl -w vm.dirty_ratio=100
sudo sysctl -w vm.dirty_background_ratio=60
sudo sysctl -w vm.dirty_expire_centisecs=60000
sudo sysctl -w vm.dirty_writeback_centisecs=500

Anyway, my potatoes' performance isn't stable on either CentOS or Ubuntu. I ended up with Tiny Core Linux and shaved off ~1000s... it uses a weird I/O scheduler but seems more stable, and this distro doesn't touch the HDD at all. It bothers me a lot that the tmux borders aren't drawn properly... [screenshot: hdd_potato2]

EDIT : Potatoes spec : i5-3570 / 32GB RAM @1333MHz / 4x500GB WD

andyvk85 commented 3 years ago


Hi! Yes, I also ended up using these "dirty_*" params! They have a great impact on my system. My params:

sudo sysctl -w vm.dirty_ratio=99
sudo sysctl -w vm.dirty_background_ratio=99
sudo sysctl -w vm.dirty_expire_centisecs=25000
sudo sysctl -w vm.dirty_writeback_centisecs=25000   # 250s, set above the plotting step that takes the most time (only my guess!)

I also experienced some instability in the plotting process: after ~8 single plots I got a read/write error, even though the RAID-0 was working fine at that moment, so maybe I have to set the ratios to a lower level.

I will check the "watch grep -e Dirty: -e Writeback: /proc/meminfo" command later.