raspiblitz / raspiblitz

Get your own Bitcoin & Lightning Node running - on a RaspberryPi with a nice LCD
MIT License

BTRFS with optional USB thumb drive as RAID1 #329

Closed rootzoll closed 4 years ago

rootzoll commented 5 years ago

To better protect LND data against corruption and loss (because that's where your funds are), the idea to research is adding two 4GB micro USB sticks as a RAID to the RaspberryPi, like this:

[Image: 20190218_233414]

It's a super cheap option (both sticks for around 10 USD), and flash drives should be more robust against undervoltage.

A RAID USB setup on the Pi is new to me, so all research and tips are welcome. I have this as a starting point: https://unix.stackexchange.com/questions/120874/how-to-setup-a-raid-system-using-usb-sticks-as-storage-media

rootzoll commented 5 years ago

see ;) https://twitter.com/CandleHater/status/1097671449207324672

seth586 commented 5 years ago

Hardware improvement - I've run a FreeNAS server for ~5 years now, booting off a ZFS mirror volume on USB flash drives. They tend to fail after 6-18 months. I got tired of replacing them and bought cheap used Intel SSDs.

This might be an affordable solution:

USB to m.2 SATA adaptor ($15 each)

https://www.amazon.com/SHINESTAR-Adapter-Portable-Performance-Samsung/dp/B0768V1SK7#customerReviews

m.2 SATA SSD ($22 each)

https://www.amazon.com/Transcend-MTS400-Solid-State-TS64GMTS400S/dp/B077H276GQ/

seth586 commented 5 years ago

Software improvement - software raid should feature a modern file system (ZFS or btrfs). These file systems checksum data, and can fix corruption on the fly by comparing with mirror/parity devices. Raid alone is obsolete (start at 5:05).

ZFS on Pi:

https://github.com/zfsonlinux/pkg-zfs/wiki/HOWTO-install-Raspbian-to-a-Native-ZFS-Root-Filesystem,-or,-How-I-Learned-to-Love-Data-Integrity

Btrfs on Pi:

https://hackmd.io/FP-7sHiPTJGaJvzSa3nw8A

lnd static channel backup

In the end, even local redundancy & integrity measures do not prevent theft, catastrophic loss, fire, flood, etc. Once static channel backups are implemented we can make accurate live backups. https://github.com/lightningnetwork/lnd/pull/2313

raumi75 commented 5 years ago

I have been running a Raspberry Pi 2 with Btrfs RAID for years. Scrubbing takes a long time, but otherwise it works great.

ZFS or Btrfs would provide snapshots without the need to shut down lnd or the bitcoin daemon. This way we could have versioned backups in case of file corruption.

With tools like btrbk, we could manage backups to network or attached storage.

openoms commented 5 years ago

$50 for this 128 GB USB SSD: https://www.sandisk.com/home/usb-flash/extremepro-usb. There is no smaller size available unfortunately. A good alternative to used SSD-s and messy USB adapters.

openoms commented 5 years ago

Or these, which are cheap, small and more error-prone: SanDisk Cruzer Fit 16GB USB 2.0 Flash Drive https://www.amazon.co.uk/dp/B07MDXBT87/ref=cm_sw_r_cp_apa_i_UBIBCbAHGSSEZ. Storing only the LND dir on the flash RAID is nowhere near as heavy a write load as booting from it or using it as swap. They would probably last those 6-18 months until channel state backups are sorted out.

thelwyn commented 5 years ago

Is there any documentation showing that SSDs are significantly more reliable and durable than USB sticks, though? Last time I looked for a comparison between HDD and SSD, I didn't find any reliable data. If the SSD is 3 times more reliable than a USB stick but 5 times more expensive, what's the point? Also, USB sticks take less space.

rootzoll commented 5 years ago

Info on how to make a ZFS raid: https://tutorials.ubuntu.com/tutorial/setup-zfs-storage-pool#0

rootzoll commented 5 years ago

FYI: I am going at the moment with a BTRFS Raid1 with those two small USB thumb drives from the picture on the first post. BTRFS looks good, because it also has self-healing features.

raumi75 commented 5 years ago

Yay! I'd love to experiment with that. I have a little experience with Btrfs. Maybe the setup could detect if there is one or two usb-sticks present and offer to use them. We could make guesses based on the size of the disks. If we have 500 GB or more, that's your /mnt/hdd. If we have less then it is an sd-device.

How would you recommend changing the lnd path? Should /mnt/hdd/lnd be the mount point, so that all the scripts run without modification? Or is there one central setting to announce that e.g. /mnt/ssd/lnd is the new home?

rootzoll commented 5 years ago

@raumi75 the detection already works quite well in my test script: I search for two devices with the same size :) I am brand new to BTRFS, so it would be great if someone who knows how to optimize it (versioning, etc.) could take a look once I have a first prototype.
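A minimal sketch of the "two devices with the same size" idea, not the actual test script. It assumes input in the two-column `NAME SIZE` form that `lsblk -dbno NAME,SIZE` prints (sizes in bytes); the function name `same_size_pairs` and the canned device list are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: read "NAME SIZE" lines on stdin and print each pair
# of devices that share the same size in bytes, as "NAME1 NAME2 SIZE".
same_size_pairs() {
  awk '
    {
      # If a device with this size was seen before, report the pair;
      # otherwise remember this device as the first one of its size.
      if ($2 in seen) { print seen[$2], $1, $2 }
      else            { seen[$2] = $1 }
    }
  '
}

# Example with canned input; on a real system the input would come from
#   lsblk -dbno NAME,SIZE
printf 'sda 1000204886016\nsdb 4004511744\nsdc 4004511744\n' | same_size_pairs
```

Matching on size alone is only a heuristic; a real script would probably also want to confirm that the candidates are USB devices before touching them.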

To begin with, for testing, I want to make it an extra option you run after the setup, and make it possible in the config script to switch between HDD and USB-DATASTORAGE. If testing goes well, this can then be worked into the setup/update process.

I think I will check all scripts & config again to make sure they use the /home/bitcoin/.lnd path for LND. That's already a linked directory to the HDD folder. So when the USB-DATASTORAGE gets switched on/off, it just needs to update that one link and we should be good.
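Updating that one link could look roughly like the sketch below. The helper name `switch_lnd_link` is hypothetical, the demo runs in a throwaway directory rather than on the real paths, and it assumes (not confirmed here) that lnd is stopped before the link is changed.

```shell
#!/bin/sh
# Hypothetical helper: repoint the LND directory link at another data store.
# $1 = link path (e.g. /home/bitcoin/.lnd), $2 = target directory.
switch_lnd_link() {
  link=$1
  target=$2
  # Refuse to point the link at a directory that does not exist.
  [ -d "$target" ] || { echo "missing target: $target" >&2; return 1; }
  # -s symbolic link, -f replace an existing link, -n treat an existing
  # link to a directory as a file instead of descending into it.
  ln -sfn "$target" "$link"
}

# Demo in a temporary directory instead of the real paths.
demo=$(mktemp -d)
mkdir "$demo/hdd-lnd" "$demo/usb-lnd"
switch_lnd_link "$demo/.lnd" "$demo/hdd-lnd"
switch_lnd_link "$demo/.lnd" "$demo/usb-lnd"
readlink "$demo/.lnd"   # now points at the usb-lnd directory
rm -rf "$demo"
```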

Also, I am thinking that it would make sense to move more personal data (tor, config, etc.) over to the USB-DATASTORAGE, so that the HDD is just a large public data store (blockchain, torrents, etc). Then in a worst-case scenario the HDD, the SD card and even the RaspberryPi board itself can break, and you just take your USB-DATASTORAGE, plug it into a fresh setup, and it will recover and pick up where it stopped.

raumi75 commented 5 years ago

I will take a look. Just tell me when you are ready.

Btrfs opens many possibilities. We could take hourly snapshots without stopping lnd. Could even copy them to the hdd in a Btrfs send/receive type format. There are scripts that do that (btrbk for example)

If I understand correctly, old channel-state doesn't do any good. Maybe we could even create snapshots every 10 minutes and delete them soon.
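The "create often, delete soon" bookkeeping could be sketched like this. The helper name `snapshots_to_prune`, the snapshot naming scheme and the paths in the comments are assumptions for illustration; it only decides which snapshots to drop and says nothing about whether lnd can be safely restored from them.

```shell
#!/bin/sh
# Hypothetical helper: read snapshot names on stdin (one per line, named so
# that they sort chronologically, e.g. lnd.20190301T1210) and print the
# ones to delete so that only the newest $1 remain.
snapshots_to_prune() {
  keep=$1
  sort | awk -v keep="$keep" '
    { names[NR] = $0 }
    END { for (i = 1; i <= NR - keep; i++) print names[i] }
  '
}

# Example with canned names, keeping the two newest snapshots.
printf 'lnd.20190301T1220\nlnd.20190301T1200\nlnd.20190301T1210\n' |
  snapshots_to_prune 2

# On a real system (paths are assumptions, /mnt/raid/lnd must be a subvolume):
#   btrfs subvolume snapshot -r /mnt/raid/lnd \
#     "/mnt/raid/snapshots/lnd.$(date +%Y%m%dT%H%M)"
#   ls /mnt/raid/snapshots | snapshots_to_prune 6 |
#     while read -r s; do btrfs subvolume delete "/mnt/raid/snapshots/$s"; done
```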

openoms commented 5 years ago

This is great, my two 16 GB USB drives are waiting to be tested. One comment about the channel states: currently it is not useful to have a snapshot without stopping LND. Just lost my channels due to this problem, could not restore despite having multiple on-the-go scp backups. It is the right thing that the channel.db won't get corrupted due to having it mirrored, but some information is in the RAM and not written until LND is stopped. For snapshots LND must be stopped, wallet.db and channel.db saved and restarted. It could be done after every state change if possible. Found this good collection on the issues: https://gist.github.com/bretton/22f628caffde79390a796e75ea528053

raumi75 commented 5 years ago

Are you sure this is true for zfs and btrfs? I'm not an expert and don't mean to spread half-truths, but my understanding is that you can't back up virtual machine images or database files while the application is running, because the files might change DURING the copying. A (zfs-/btrfs-)snapshot is created in a single instant, because it's a CoW (Copy on Write) filesystem. The snapshot (which is practically frozen in time) can then be copied to a different device.

Not sure how lnd behaves so please please don't be reckless. All I'm saying is this needs more research, but we should not dismiss the possibility that this works. We could stress test this on testnet.

openoms commented 5 years ago

Thanks for explaining this. The problem with the on-the-go SCP backup might be indeed that the data changes while being copied. Restoring a frozen snapshot should not be different than powering up after a power outage, which is usually not a problem. Need to be tested in any case and I am up for that too!

rootzoll commented 5 years ago

The USB-Raid should focus on providing reliable hardware storage first ... the backup is a feature hopefully coming from LND itself soon: https://github.com/lightningnetwork/lnd/pull/2313

openoms commented 5 years ago

Just thinking, possibly too far: once the LND directory and other personal data is moved from the HDD, setting up a mirrored filesystem of the blockchaindata would make it really easy to clone these devices (even without stopping). Or just could leave two HDD-s running mirrored as well adding further redundancy.

rootzoll commented 5 years ago

To investigate: check comments on LND data storage in this thread: https://twitter.com/AllYourBanks/status/1100883762261475328 and research "write barrier" https://twitter.com/AllYourBanks/status/1100882870787338240

openoms commented 5 years ago

Linking a different approach: mirroring an extra partition on the HDD and one on the SD card. https://twitter.com/vindaRd/status/1103327828546859008?s=19 Might be useful for the OdroidHC1/HC2, which only have 1 USB port, to avoid using a USB hub.

rootzoll commented 5 years ago

TODO: Maybe even move the chainstate to the usb-raid: https://github.com/rootzoll/raspiblitz/issues/413

rootzoll commented 5 years ago

TODO: Scrub data with background task every hour:

`btrfs scrub start /mnt/raid/` is better than fsck. It starts a check on a mounted partition, runs in the background, and verifies the checksum of every block. If one is wrong, Btrfs fixes it by copying the block from the good disk to the bad one. You could run this periodically with cron or as a menu item.
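As a rough sketch of the cron variant: the helper below only prints an `/etc/cron.d` style entry (minute 0 of every hour, run as root). The helper name `scrub_cron_entry` and the file name in the comment are hypothetical, and hourly may be too frequent if a scrub takes a long time on the hardware in question.

```shell
#!/bin/sh
# Hypothetical helper: print an /etc/cron.d style entry that starts a
# scrub of the given mount point at minute 0 of every hour, as root.
scrub_cron_entry() {
  printf '0 * * * * root btrfs scrub start %s\n' "$1"
}

scrub_cron_entry /mnt/raid

# To install it (file name is an assumption):
#   scrub_cron_entry /mnt/raid | sudo tee /etc/cron.d/btrfs-scrub
# Results of the last run can be checked with:
#   sudo btrfs scrub status /mnt/raid
```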

openoms commented 5 years ago

@rootzoll I have built from the 1.1 dev branch to my Odroid XU4 on DietPi and had some errors installing btrfs at

# prepare for BTRFS data drive raid   
sudo apt-get install -y btrfs-tools

Unpacking btrfs-tools (4.7.3-1) ... Processing triggers for initramfs-tools (0.130) ... ln: failed to create hard link '/boot/initrd.img-4.14.66+.dpkg-bak' => '/boot/initrd.img-4.14.66+': Operation not permitted

and

Processing triggers for initramfs-tools (0.130) ... ln: failed to create hard link '/boot/initrd.img-4.14.66+.dpkg-bak' => '/boot/initrd.img-4.14.66+': Operation not permitted

See the full output of the build_sdcard.sh: https://gist.github.com/openoms/a0f4fd750e3c24c123b8eea0a7b0dbb0 Not sure if it is causing any problems, but will just leave this here for future reference.

fluidvoice commented 5 years ago

Related to this and "fsck" of the data drive which was recently added - ie, the bigger subject of data corruption it might help to add additional checks for inadequate power noted here: https://github.com/rootzoll/raspiblitz/issues/474#issue-427212699

ThomasKaiser commented 5 years ago

Just a quick note about btrfs, ZFS, journaled filesystems and USB storage in general. What you need is correct write/flush barrier semantics, otherwise you'll get data corruption or lose your btrfs filesystem or ZFS pools in case of crashes or power losses. Some information (garnished with a lot of FUD/BS) here

ThomasKaiser commented 5 years ago

ln: failed to create hard link '/boot/initrd.img-4.14.66+.dpkg-bak' => '/boot/initrd.img-4.14.66+': Operation not permitted

DietPi folks (well, Daniel -- it's more or less a one man show) are ignorant. They put /boot/ on a FAT partition for no reason which breaks Debian package management in general.

RPi Trading employees, since they have to deal with a FAT partition (the VideoCore IV can't cope with Linux filesystems), therefore use a tool called dpkg-divert to manage their bootloader and kernel package updates on the RPi, which makes all those updates insanely slow (though they prefer to blame dpkg instead).

There's no reason to put /boot/ on a FAT partition (not even on the RPi, but there the RPi folks decided to combine the primary OS -- ThreadX -- and the settings for the secondary OS in one single partition to not overwhelm their user base, and so ended up with all the stuff on a FAT partition), and as such what you see is as expected. It's not related to btrfs but the result of anything calling update-initramfs, since /boot on non-POSIX-compliant filesystems is broken by design with Debian and derivatives.

rootzoll commented 5 years ago

Thanks for all the input. I will need some more time to process the details. To not delay the release of v1.2 I will move the issue to a future release. But it's still high prio.

normandmickey commented 5 years ago

Creating a RAID1 array is relatively easy with a package called "mdadm". But now that LND 0.6 has its own channel state backup this may not be as important. https://github.com/lightningnetwork/lnd/releases

  1. apt install mdadm
  2. mdadm --create /dev/md0 --level=mirror --raid-devices=2 /dev/sda1 /dev/sdb1
  3. mkdir -p /mnt/raid1
  4. mkfs.ext4 /dev/md0
  5. mount /dev/md0 /mnt/raid1/
  6. echo "/dev/md0 /mnt/raid1/ ext4 defaults,noatime 0 1" | sudo tee -a /etc/fstab
  7. mdadm --detail --scan | sudo tee -a /etc/mdadm/mdadm.conf
  8. cd /mnt/raid1
  9. ls

openoms commented 5 years ago

Restoring the SCB in lnd0.6 still means force closing all the channels and paying the on-chain transaction fees. It is still important to avoid data loss.

Creating a BTRFS RAID is equally simple and possibly superior due to the ability to create versioned atomic snapshots (https://github.com/digint/btrbk thanks @raumi75 ).
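For comparison with the mdadm steps, a Btrfs RAID1 could be set up roughly as sketched below. The `run` wrapper and the function name `make_btrfs_raid1` are hypothetical, as are the device names and mount point; by default the script only prints the commands, so nothing is formatted unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Hypothetical wrapper: print commands unless DRY_RUN=0 is set, so the
# sketch can be read and tried without touching any disk.
run() {
  if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; else echo "$*"; fi
}

# Create a two-device Btrfs filesystem with data (-d) and metadata (-m)
# both mirrored, then mount it via one of its member devices.
make_btrfs_raid1() {
  dev1=$1
  dev2=$2
  mnt=$3
  run mkfs.btrfs -d raid1 -m raid1 "$dev1" "$dev2"
  run mkdir -p "$mnt"
  run mount "$dev1" "$mnt"
}

# Device names and mount point are assumptions.
make_btrfs_raid1 /dev/sda1 /dev/sdb1 /mnt/raid
```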

I am currently running the 1.1 RaspiBlitz on an Odroid XU4 Cloudshell2 with two 500gb SATA HDD-s in BTRFS RAID1. Aiming to create a routing grade node.

I have no experience with ZFS and have not found a similarly easy way to set it up yet, but it could be interesting.

openoms commented 5 years ago

Looking at these lately: https://www.ebay.co.uk/itm/Maiwo-K25682-portable-USB3-0-Raid-Enclosure-for-2-5Inch-SATA-HDD-Windows-Linux/123657290594

An option to connect two 2.5" HDDs/SSDs through one USB3 connector. In PM mode the disks appear as two individual disks and can be run as a software BTRFS RAID1. It needs its own extra power supply: no SBC can power two HDDs from USB only.

fluidvoice commented 5 years ago

Nice find! That is a very cool product for the $. Other links for it: www.gearbest.com/hdd-enclosure/pp_1042363.html https://www.amazon.com/MAIWO-USB3-0-2-5inch-Enclosure-K25682/dp/B01M7PPCRH

rootzoll commented 5 years ago

I still think we can keep the HDD single for blockchain storage when we add two USB thumb drives as RAID for LND and other critical data. First of all this is the cheapest option and does not need additional power supply - as two HDD would need.

The only problem left with one HDD is that if we run into data corruption of the blockchain, a redownload/sync of the blockchain is needed, which means a long downtime for the node. But this could be fixed with the "Background Torrent Seed" feature, which helps seeding on the one hand and on the other keeps a backup version of the blockchain ready to swap in, minimizing downtime in case of blockchain data corruption. A first experimental version of the "Background Torrent Seed" feature will be part of v1.2, but it needs to be optimized to run in the background. If we can get that feature working for the v1.3 release, it would be the perfect companion for this 2-Thumb-Drive-Raid feature, which is also high prio for the v1.3 release.

fluidvoice commented 5 years ago

I'm wondering if there might be a better, more resource-efficient way to create a blockchain backup. Basically, if there is enough space, let's say 1TB or more, couldn't a background script/process just rsync to a backup partition, say once per hour or once per day? rsync is efficient in that it sees and copies only changed files, and it can retain all file/directory names, permissions, timestamps, etc. I'm fairly sure that rsync also does not require the source to be unmounted/unused. Ideally, if corruption is detected and the script can see there is a backup partition, then it can automate switching over to run RaspiBlitz on the other/backup partition, and notify the user of this. This does not preclude possibly torrenting the blockchain, say from the backup partition/copy. But using rsync seems less "heavy" and does not require running a torrent server/daemon.

rootzoll commented 5 years ago

@fluidvoice such a blockchain backup with rsync (stopping bitcoind, syncing, restarting bitcoind) is a good idea. Will keep that in mind on this issue.
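A sketch of that stop, sync, restart sequence, under assumptions: the `bitcoind` systemd unit name, both paths, and the helper names `run` and `backup_blockchain` are illustrative only. By default the script just prints the commands; set `DRY_RUN=0` to execute them.

```shell
#!/bin/sh
# Hypothetical wrapper: print commands unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; else echo "$*"; fi
}

# Stop bitcoind, mirror the blockchain directory, then restart bitcoind
# even if the copy failed. Returns the exit status of rsync.
backup_blockchain() {
  src=$1
  dst=$2
  run systemctl stop bitcoind
  # -a keeps permissions, owners and timestamps; --delete removes files
  # from the copy that no longer exist in the source.
  run rsync -a --delete "$src/" "$dst/"
  status=$?
  run systemctl start bitcoind
  return "$status"
}

# Paths are assumptions.
backup_blockchain /mnt/hdd/bitcoin /mnt/backup/bitcoin
```

After the first full copy, later runs only transfer changed files, so the downtime per run should mostly depend on how much the blockchain data changed since the last sync.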

openoms commented 5 years ago

IDEA: Switch to BTRFS as the default file system for the HDD.

BTRFS could achieve easy and cheap data redundancy for LND needing only one extra USB drive.

The LND RAID1 setup would never rely completely on flash drives as it would be the case when mirroring between 2 USB-s.

Having the blockchaindata (+ optional databases like electrs) on BTRFS would make it possible to clone a RaspiBlitz HDD on-the-fly by adding a new HDD as a mirrored drive. This could be an alternative to scp/rsync copy of the blockchain from one RaspiBlitz to an other.

openoms commented 5 years ago

@fluidvoice until the X86 support is merged in by @rootzoll we should continue to discuss about the VM build here: https://github.com/openoms/raspiblitz/issues/46

openoms commented 5 years ago

IDEA: Switch to BTRFS as the default file system for the HDD.

* HDD-s from older RaspiBlitz versions would need to be converted to BTRFS from EXT4 to benefit from the added features.
  This is possible with `btrfs-convert` (https://www.howtoforge.com/how-to-convert-an-ext3-ext4-root-file-system-to-btrfs-on-ubuntu-12.10) although file system operations are risky.
  A reliable complete backup of the LND dir is a must before starting a conversion, we can use:
  `/home/admin/config.scripts/lnd.rescue.sh backup` (https://github.com/rootzoll/raspiblitz/blob/master/FAQ.md#1-recover-lnd-data.)

* once the HDD is converted to BTRFS a resize is quick and straightforward:
  `sudo btrfs filesystem resize -20g /mnt/hdd`

* a second 15GB BTRFS partition could be used for LND. This can be mirrored as a RAID1 to an extra 16GB USB drive.

* a small third EXT4 partition would be needed for SWAP because BTRFS does not support a swap function yet and also the swap should not be mirrored.

BTRFS could achieve easy and cheap data redundancy for LND needing only one extra USB drive.

The LND RAID1 setup would never rely completely on flash drives as it would be the case when mirroring between 2 USB-s.

Having the blockchaindata (+ optional databases like electrs) on BTRFS would make it possible to clone a RaspiBlitz HDD on-the-fly by adding a new HDD as a mirrored drive. This could be an alternative to scp/rsync copy of the blockchain from one RaspiBlitz to an other.

Thinking further: 2 partitions on the HDD would be the most simple and beneficial.

Could keep the EXT4 filesystem for the blockchaindata and optionally have an mdadm RAID1. The mdadm RAID1 would achieve:

A new setup could create two partitions during the 30initHDD.sh, and existing BLOCKCHAIN partitions could be shrunk (http://www.microhowto.info/howto/reduce_the_size_of_an_ext2_ext3_or_ext4_filesystem.html) to make space for a 15 GB BTRFS partition, which can then be mirrored to a cheap USB drive or the second HDD.

The BTRFS RAID1 for the LND dir would achieve:

openoms commented 5 years ago

Converting file systems is not an option: https://askubuntu.com/questions/1073428/where-to-get-btrfs-convert-on-18-04. btrfs-convert is not even shipped with Debian any more, because it was producing broken filesystems.

Converting to either mdadm or btrfs RAID1 would require:

To have the RAID1 option, the HDD needs to be created as a single-disk (degraded) mdadm RAID1 or use BTRFS from the start.

openoms commented 5 years ago

A freshly updated lengthy comparison and test between mdadm, btrfs and zfs: http://www.unixsheikh.com/articles/battle-testing-data-integrity-verification-with-zfs-btrfs-and-mdadm-dm-integrity.html

rootzoll commented 5 years ago

@openoms that's a great testing report. I still lean towards BTRFS because on my first try it seemed easier to set up on Raspbian. Would you agree with that decision, or is there a real downside compared with ZFS that could bite us in the long run?

Also, I don't plan to provide a "conversion" from ext4 to BTRFS when updating your RaspiBlitz. If someone wants to go this way, I think it's cleaner to make a backup (LNDRESCUE LND tar.gz backup file), format the HDD and start a fresh RaspiBlitz (which will then start with BTRFS by default), importing the backup data during the setup.

openoms commented 5 years ago

> @openoms that's a great testing report. I still lean towards BTRFS because on my first try it seemed easier to set up on Raspbian. Would you agree with that decision, or is there a real downside compared with ZFS that could bite us in the long run?

Completely agree that setting up BTRFS is much simpler for the user, as it is included in Debian by default. It seems to be actively developed as well (https://btrfs.wiki.kernel.org/index.php/Status), and I can't see any significant downsides vs ZFS (especially as we are not running on BSD). If we switch to either of those, a separate SWAP partition will be needed on the HDD/SSD. This makes sense anyway, as it would be unnecessary to mirror the swap.

> Also, I don't plan to provide a "conversion" from ext4 to BTRFS when updating your RaspiBlitz. If someone wants to go this way, I think it's cleaner to make a backup (LNDRESCUE LND tar.gz backup file), format the HDD and start a fresh RaspiBlitz (which will then start with BTRFS by default), importing the backup data during the setup.

Sure, forcing the backup is the way to go when playing with the file system. Starting from scratch guarantees a clean disk as well. +1

rootzoll commented 5 years ago

Making multiple partitions on the HDD and combining one of them (the one with the raspiblitz/lnd data) into a RAID with a USB 3.0 stick seems to be the best way to go. Let's try to get this into the next release.

rootzoll commented 4 years ago

TODO: When using BTRFS, don't offer torrent, or move the torrent download to the storage drive.

rootzoll commented 4 years ago

TODO: When BTRFS with RAID1 - check if volume can still be temp mounted (without USB thumb drive) on a fresh setup

rootzoll commented 4 years ago

This feature is still EXPERIMENTAL for the v1.4 release. It can be played with, but just for research - don't use it in production yet.

zanderisrael commented 3 years ago

Why did this get closed? Not possible? Compatibility issues?

rootzoll commented 3 years ago

It's implemented and in an experimental stage (BTRFS and RAID1 with a USB drive) ... but it seemed a bit early and needs more testing. USB thumb drives may not be the best candidates to work with an SSD in a RAID 1. It's not high on the prio list for the coming releases to finish this (other things are more pressing) ... but if anybody wants to pick this up, the experimental implementation is there to test & improve.