
pyFileFixity

|PyPI-Status| |PyPI-Versions| |PyPI-Downloads|

|Build-Status| |Coverage|

|LICENCE|

pyFileFixity provides a suite of tools that are open source, cross-platform, easy to use and easy to maintain (readable code), to protect and manage data for long term storage/archival, and also to test the performance of any data protection algorithm.

The project is written in pure Python to meet those criteria, although cythonized extensions are available for core routines to speed up encoding/decoding, always with a pure Python implementation available as a reference so as to allow long term replication.

Here is an example of what pyFileFixity can do:

|Example|

On the left is the original image.

In the center is the same image with a few corrupted symbols (only 3 in the header and 2 in the rest of the file, i.e., 5 corrupted bytes in total, out of a total file size of 19KB). A few corrupted bytes are enough to make the image look totally unrecoverable, and we are in fact lucky: the image could have been entirely unreadable if any of the "magic bytes" had been corrupted!

On the right, the corrupted image was repaired using the ``pff header`` command of pyFileFixity. This repaired only the image header (ie, the first part of the file), so only the first 3 corrupted bytes were fixed, not the 2 bytes in the rest of the file, and yet the image looks indistinguishable from the untampered original! Best of all, this only cost the generation of an "ecc repair file" for the header, whose size is a constant 3.3KB per file, regardless of the protected file's size!

This works because most file formats store the most important information needed to read them at the very beginning, also called the "file's header", so repairing this part almost always ensures the file can be opened (even if the rest of the file is still corrupted, if the header is safe, you can read it). This works especially well for images, compressed files, formatted documents such as DOCX and ODT, etc.
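
To illustrate why headers matter so much, here is a tiny sketch (not part of pyFileFixity; the magic byte values are the standard format signatures) that identifies a file's format from its first bytes. If these bytes are corrupted, most software will refuse to open the file:

::

  # Illustrative only: a file's first few "magic bytes" identify its format,
  # which is why a corrupted header can make a file unreadable even if the
  # rest of its content is intact.
  MAGIC = {
      b"\x89PNG\r\n\x1a\n": "PNG image",
      b"\xff\xd8\xff": "JPEG image",
      b"PK\x03\x04": "ZIP archive (also DOCX, ODT, ...)",
  }

  def sniff_format(path):
      with open(path, "rb") as f:
          head = f.read(8)
      for magic, name in MAGIC.items():
          if head.startswith(magic):
              return name
      return "unknown (possibly a corrupted header)"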

Of course, you can also protect the whole file, not only the header, using pyFileFixity's ``pff whole`` command. You can also detect any corruption using ``pff hash``.


.. contents:: Table of contents
   :backlinks: top

Quickstart

Runs on Python 3 up to Python 3.12-dev. PyPy 3 is also supported.

-  To install:

``pip install --upgrade pyfilefixity``

-  Or to install a specific version with its pinned dependencies:

``pip install --upgrade pyfilefixity==3.0.2 reedsolo==1.7.0 unireedsolomon==1.0.5``

-  To check that the installation works:

``pff --help``

You should see:

::

usage: pff [-h]
           {hash,rfigc,header,header_ecc,hecc,whole,structural_adaptive_ecc,saecc,protect,repair,recover,repair_ecc,recc,dup,replication_repair,restest,resilience_tester,filetamper,speedtest,ecc_speedtest}
           ...

positional arguments:
  {hash,rfigc,header,header_ecc,hecc,whole,structural_adaptive_ecc,saecc,protect,repair,recover,repair_ecc,recc,dup,replication_repair,restest,resilience_tester,filetamper,speedtest,ecc_speedtest}
    hash (rfigc)        Check files integrity fast by hash, size, modification date or by data structure integrity.
    header (header_ecc, hecc)
                        Protect/repair files headers with error correction codes
    whole (structural_adaptive_ecc, saecc, protect, repair)
                        Protect/repair whole files with error correction codes
    recover (repair_ecc, recc)
                        Utility to try to recover damaged ecc files using a failsafe mechanism, a sort of recovery
                        mode (note: this does NOT recover your files, only the ecc files, which may then be used to
                        recover your files!)
    dup (replication_repair)
                        Repair files from multiple copies of various storage mediums using a majority vote
    restest (resilience_tester)
                        Run tests to quantify robustness of a file protection scheme (can be used on any, not just
                        pyFileFixity)
    filetamper          Tamper files using various schemes
    speedtest (ecc_speedtest)
                        Run error correction encoding and decoding speedtests

options:
  -h, --help            show this help message and exit

-  To get help on a specific command:

``pff hash --help``

-  To generate a database of hashes of all the files inside a folder:

``pff hash -i "your_folder" -d "dbhash.csv" -g -f -l "log.txt"``

Note: this also works for a single file, just replace "your_folder" with "your_file.ext".

-  To update the database by appending new files:

``pff hash -i "your_folder" -d "dbhash.csv" --update --append``

-  To check the files against the database, with image structure checking (-s) and an errors CSV for further processing (-e):

``pff hash -i "your_folder" -d "dbhash.csv" -l log.txt -s -e errors.csv``

-  To restore filenames and the directory structure after a file scraping recovery:

``pff hash -i "your_folder" -d "dbhash.csv" -l "log.txt" -o "output_folder" --filescraping_recovery``

-  To generate an ecc repair file for the files' headers:

``pff header -i "your_folder" -d "hecc.txt" -l "log.txt" -g -f --ecc_algo 3``

-  To check and repair the files' headers using that ecc file:

``pff header -i "your_folder" -d "hecc.txt" -o "output_folder" -l "log.txt" -c -v --ecc_algo 3``

-  To generate an ecc repair file for whole files:

``pff whole -i "your_folder" -d "ecc.txt" -l "log.txt" -g -f -v --ecc_algo 3``

-  To check and repair whole files using that ecc file:

``pff whole -i "your_folder" -d "ecc.txt" -o "output_folder" -l "log.txt" -c -v --ecc_algo 3``

Note that ``pff header`` and ``pff whole`` can also detect corrupted files, and even tell you which blocks inside a file are corrupted, but they are much slower than ``pff hash``.

-  To try to repair a damaged ecc file itself (not your files), using its index file:

``pff recover -i "ecc.txt" --index "ecc.txt.idx" -o "ecc_repaired.txt" -l "log.txt" -v -f``

-  Or without an index file:

``pff recover -i "ecc.txt" -o "ecc_repaired.txt" -l "log.txt" -v -f -t 0.4``

-  To repair files from multiple copies stored on various storage mediums, using a majority vote:

``pff dup -i "path/to/dir1" "path/to/dir2" "path/to/dir3" -o "path/to/output" --report "rlog.csv" -f -v``

-  The same, but also supplying a previously generated hash database:

``pff dup -i "path/to/dir1" "path/to/dir2" "path/to/dir3" -o "path/to/output" -d "dbhash.csv" --report "rlog.csv" -f -v``

-  To test the robustness of a file protection scheme:

``pff restest -i "your_folder" -o "test_folder" -c "resiliency_tester_config.txt" -m 3 -l "testlog.txt" -f``

-  To benchmark error correction encoding/decoding speed:

``pff speedtest``

The problem of long term storage

Why does data get corrupted over time? For one sole reason: entropy. Entropy refers to the universal tendency of systems to become less ordered over time. Data corruption is exactly that: disorder creeping into the order of your bits. In other words: the Universe hates your data.

Long term storage is thus a very difficult topic: it's like fighting against death (here, the death of data). Indeed, because of entropy, data will eventually fade away through various silent errors such as bit rot or cosmic rays. pyFileFixity aims to provide tools to detect any data corruption, but also to fight data corruption by providing repair tools.

The only solution is to use an engineering principle that has long made bridges and planes safe: add some redundancy.

There are only 2 ways to add redundancy:

  1. duplication: keep several full copies of your data (ideally on different storage mediums), so that damaged parts can be repaired by comparing the copies (eg, with a majority vote);

  2. error correcting codes: store a smaller amount of mathematically structured redundant data (an "ecc file") that can detect and repair corrupted parts of the original data.

Error correction can seem a bit magical, but a reasonable intuition is to see it as a way of averaging out the corruption rate: on average, each bit still has the same chance of being corrupted, but since you have more bits representing the same data, you lower the overall chance of losing that data.
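
For a rough illustration of this intuition (a toy calculation, not pyFileFixity's actual code), compare the chance of losing a single unprotected byte with the chance of losing a whole Reed-Solomon protected block:

::

  from math import comb

  p = 0.01        # assumed per-byte corruption probability (hypothetical value)
  n, t = 255, 16  # RS(255, 223): 255-byte block, up to 16 byte errors correctable

  # The block is lost only if more than t of its n bytes are corrupted.
  block_loss = sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(t + 1, n + 1))

  print(f"unprotected byte lost with probability {p:.2%}")
  print(f"RS(255,223) block lost with probability {block_loss:.1e}")  # many orders of magnitude smaller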

The problem is that most theoretical and practical work on error correcting codes has been done almost exclusively for channel transmission (such as 4G, the internet, etc.), not for data storage, which is very different for one reason: a channel is a spatial scheme (the sender and the receiver are different entities in space, working on the same timescale), whereas data storage is a temporal scheme (the sender is you storing the data on your medium at time t, and the receiver is again you, retrieving the data at time t+x). The sender no longer exists, so you cannot ask for data to be sent again if it is too corrupted: in data storage, if data is corrupted, it's lost for good, whereas in channel theory, parts of the data can be transmitted again if necessary.

Some attempts were made to translate channel theory and error correcting code theory to data storage, the first being Reed-Solomon, which spawned the RAID schema. Then CIRC (Cross-interleaved Reed-Solomon coding) was devised for use on optical discs to recover from scratches, which was necessary for the technology to be usable by consumers. Since then, new less-optimal but much faster algorithms such as LDPC, turbo codes and fountain codes such as RaptorQ were invented (or rediscovered), but they are still only marginally researched for data storage.

This project aims, first, to provide easy tools to evaluate protection strategies (filetamper.py) and check file fixity (ie, detect whether there are corruptions), and then to provide an open and easy framework to use different kinds of error correction codes to protect and repair files.

Also, the ecc file specification is made to be simple and resilient to corruption, so that you can process it by your own means if you want to, without having to study for hours how the code works (contrary to PAR2 format).

In practice, both approaches are not exclusive, and the best is to combine them: protect the most precious data with error correction codes, then duplicate it, as well as less sensitive data, across multiple storage mediums. Hence, this suite of data protection tools, just like any other such suite, is not sufficient to guarantee your data is protected: you must have an active (but infrequent, and hence not time consuming) data curation strategy, which includes checking your data every few years and replacing damaged copies.

For a primer on storage mediums and data protection strategies, see this post I wrote <https://web.archive.org/web/20220529125543/https://superuser.com/questions/374609/what-medium-should-be-used-for-long-term-high-volume-data-storage-archival/873260>_.

Why not just use RAID ?

RAID is clearly insufficient for long-term data storage, and in fact it was primarily meant as a cheap way to get more storage (RAID0) or more availability (RAID1) of data, not for archiving data, even on a medium timescale:

By contrast, an ECC can correct up to n-k corrupted disks (or files) out of n. You can configure n and k however you want, so that for example you can set k = n/2, which means that you can recover all your files from only half of them (once they are encoded into an ecc file, of course)!
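
As a toy example of this bookkeeping (hypothetical numbers; any (n, k) erasure code such as Reed-Solomon behaves this way):

::

  n, k = 10, 5              # 10 encoded pieces stored, any 5 suffice to rebuild the data
  tolerated_losses = n - k
  print(f"up to {tolerated_losses} of {n} pieces ({tolerated_losses / n:.0%}) can be lost")
  # -> up to 5 of 10 pieces (50%) can be lost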

There are also new-generation RAID solutions, mainly software based, such as SnapRAID or ZFS, which let you configure a virtual RAID with the n-k value you want. This is just like an ecc file (but a bit less flexible, since it's not a file but a disk mapping, so you can't just copy it around or upload it to a cloud backup host). In addition to recovering n-k disks, they can also be configured to recover from partial sector failures inside a disk, not just whole-disk failures (for a more detailed explanation, see Plank, James S., Mario Blaum, and James L. Hafner, "SD codes: erasure codes designed for how storage systems really fail," FAST, 2013).

The other reason RAID is not adapted to long-term storage is that it assumes you store your data exclusively on hard drives. Hard drives are not a good storage medium for the long term, for two reasons:

1- They need to be plugged in regularly to keep the internal magnetic disks electrified (else the data will just fade away once there is no residual electricity).

2- The reading instrument is directly merged with the data (the green electronic board you see from the outside, plus the internal head). This is fine for quick consumer use (no need to buy another instrument: the HDD can just be plugged in and it works), but it's very bad for long term storage, because the reading instrument is bound to fail, and a lot faster than the data fades away: even if the magnetic disks inside your HDD still hold your data, if the controller board or the head stops working, your data is lost. And a head (or a controller board) is almost impossible to replace, even for professionals, because the parts are VERY hard to find (they differ for each HDD production line) and each HDD has its own small physical defects, so the alignment cannot be reproduced either (the head is so close to the magnetic disk that a manual replacement will almost certainly fail).

In the end, it's much better to keep the storage medium separate from the reading instrument.

We will talk later about what storage mediums can be used instead.

Applications included

The pyFileFixity suite currently includes the following pure-python applications:

Note that all tools are primarily made for command-line usage (type ``pff --help`` to get extended info about the accepted arguments).

IMPORTANT: it is CRITICAL that you use the same parameters in correcting mode as when you generated the database/ecc files (this is true for all scripts in this bundle). Of course, some options must change: -g must become -c to correct, and --update is a particular case. This works this way on purpose, for mainly two reasons: first, because it is very hard to autodetect the parameters from a database file alone and doing so would produce lots of false positives; and second (the primary reason), because storing the parameters inside the database file is not resilient against corruption (if that part of the database is tampered with, the whole file becomes unreadable, whereas if the parameters are stored outside, or in your own memory, the database file remains usable). Thus, it is advised to write down the parameters you used to generate your database directly on the storage medium you will store the database file on (eg: if it's an optical disc, write the parameters on the cover or directly on the disc using a marker), or better, memorize them by heart. If you forget them, don't panic: the parameters are always stored as comments in the header of the generated ecc files, but you should try to store them outside of the ecc files anyway.

For users: what are the advantages of pyFileFixity?

Pros:

Cons:

Note that these tools are meant for data archival (protecting files that you won't modify anymore), not for watching system files nor for protecting all the files on your computer. For that, you can use a filesystem that directly integrates error correction capabilities, such as ZFS.

Recursive/Relative Files Integrity Generator and Checker in Python (aka RFIGC)

Recursively generate or check the integrity of files by MD5 and SHA1 hashes, size, modification date or by data structure integrity (only for images).

This script is originally meant for data archival, by providing an easy way to check for silent file corruption. It uses relative paths so that you can easily compute and check the same redundant data copied on different mediums (hard drives, optical discs, etc.). It is not meant for system file corruption notification, but rather to be used from time to time to check up on the integrity of your data archives (if you need that kind of application, see avpreserve's fixity <https://github.com/avpreserve/fixity>_).
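
For intuition, here is a minimal sketch of the kind of per-file fixity record such a hash database relies on (illustrative only; this is not pyFileFixity's actual CSV format):

::

  import hashlib
  import os

  def fixity_record(path, chunk_size=1 << 20):
      """Compute MD5 + SHA1 hashes, size and modification date of a file in one pass."""
      md5, sha1 = hashlib.md5(), hashlib.sha1()
      with open(path, "rb") as f:
          while chunk := f.read(chunk_size):
              md5.update(chunk)
              sha1.update(chunk)
      st = os.stat(path)
      return {"path": path, "md5": md5.hexdigest(), "sha1": sha1.hexdigest(),
              "size": st.st_size, "mtime": int(st.st_mtime)}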

Example usage


-  To generate the database (only needed once):

``pff hash -i "your_folder" -d "dbhash.csv" -g``

-  To check:

``pff hash -i "your_folder" -d "dbhash.csv" -l log.txt -s``

-  To update your database by appending new files:

``pff hash -i "your_folder" -d "dbhash.csv" -u -a``

-  To update your database by appending new files AND removing
   missing files:

``pff hash -i "your_folder" -d "dbhash.csv" -u -a -r``

Note that by default the script is in check mode, to avoid wrong
manipulations. It will also alert you if you try to generate over an
already existing database file.

Arguments

::

  -h, --help            show a help message and exit
  -i /path/to/root/folder, --input /path/to/root/folder
                        Path to the root folder from where the scanning will occur.
  -d /some/folder/databasefile.csv, --database /some/folder/databasefile.csv
                        Path to the csv file containing the hash informations.
  -l /some/folder/filename.log, --log /some/folder/filename.log
                        Path to the log file. (Output will be piped to both the stdout
                        and the log file)
  -s, --structure_check
                        Check images structures for corruption?
  -e /some/folder/errorsfile.csv, --errors_file /some/folder/errorsfile.csv
                        Path to the error file, where errors at checking will be stored
                        in CSV for further processing by other software (such as file
                        repair software).
  -m, --disable_modification_date_checking
                        Disable modification date checking.
  --skip_missing        Skip missing files when checking (useful if you split your
                        files into several mediums, for example on optical discs with
                        limited capacity).
  -g, --generate        Generate the database? (omit this parameter to check instead of
                        generating).
  -f, --force           Force overwriting the database file even if it already exists
                        (if --generate).
  -u, --update          Update database (you must also specify --append or --remove).
  -a, --append          Append new files (if --update).
  -r, --remove          Remove missing files (if --update).

  --filescraping_recovery
                        Given a folder of unorganized files, compare to the database and
                        restore the filename and directory structure into the output
                        folder.
  -o, --output          Path to the output folder where to output the files reorganized
                        after --filescraping_recovery.

Header Error Correction Code script

This script was made to be used in combination with other, more common file redundancy generators (such as PAR2; I advise MultiPar). It is an additional layer of protection for your files: by using a higher resiliency rate on the headers of your files, you ensure that you will most probably be able to open them in the future, avoiding the "critical spots", also called "fracture-critical" points in redundancy engineering (where modifying just one bit, usually a bit residing in the header, may make your whole file unreadable: a single blow makes the whole thing collapse, just like a non-redundant bridge).

An interesting benefit of this approach is that it has a low storage (and computational) overhead that scales linearly with the number of files, whatever their size: for example, if we have a set of 40k files totalling 60 GB, with a resiliency_rate of 30% and a header_size of 1KB (we limit protection to the first 1K bytes of each file = our file header), then, without counting the hash per block and other meta-data, the final ECC file will be about 2 * resiliency_rate * number_of_files * header_size = 24.5 MB. This can be even lower if many files are smaller than 1KB. This is a pretty low storage overhead for backing up the headers of such a big number of files.
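
The same back-of-the-envelope calculation, spelled out (illustrative only, meta-data and per-block hashes not counted):

::

  resiliency_rate = 0.30
  number_of_files = 40_000
  header_size = 1024            # bytes protected per file (the "header")

  ecc_size = 2 * resiliency_rate * number_of_files * header_size
  print(f"{ecc_size / 1e6:.1f} MB")   # ~24.6 MB, i.e. roughly the 24.5 MB quoted above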

The script is pure Python, as are its dependencies: it is thus completely cross-platform and open source. The default ecc algorithm (``--ecc_algo 3``, which uses reedsolo <https://github.com/tomerfiliba-org/reedsolomon>_) also provides a speed-optimized C-compiled implementation (creedsolo) that will be used if available for the user's platform, so pyFileFixity should be fast by default. Alternatively, it's possible to use a JIT compiler such as PyPy, although this means that creedsolo will not be usable, so PyPy may accelerate other functions but will slow down ecc encoding/decoding.
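
For intuition about what the underlying codec does, here is a minimal sketch using reedsolo directly (this is independent of pyFileFixity's own ecc file format, and assumes a recent reedsolo version, where decode() returns a 3-tuple):

::

  from reedsolo import RSCodec

  rsc = RSCodec(10)                 # appends 10 ecc bytes per block: corrects up to 5 byte errors
  message = b"this is a tiny file header"
  encoded = rsc.encode(message)     # message + ecc bytes

  corrupted = bytearray(encoded)
  corrupted[0] ^= 0xFF              # tamper a few bytes
  corrupted[3] ^= 0xFF

  decoded, _, errata_positions = rsc.decode(corrupted)
  assert bytes(decoded) == message  # the original bytes are recovered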

Structural Adaptive Error Correction Encoder

This script implements a variable error correction rate encoder: each file is ecc encoded with a variable resiliency rate. A high, constant resiliency rate is used for the header part (stage 1, high), then a variable resiliency rate is applied to the rest of the file's content, higher near the beginning of the file (stage 2, medium) and progressively decreasing towards the end of the file (stage 3, the lowest).

The idea is that the critical parts of a file are usually placed at the top, and data becomes less and less critical further along the file. "Critical" here means both critical spots (eg: tampering a single character of a file's header gives you a good chance of losing the entire file, ie, you cannot even open it) and critically encoded information (eg: archive formats usually encode compressed symbols as they go along the file: the first occurrence of a data pattern is encoded in full, and subsequent occurrences are just references to that symbol. Thus the first occurrence, near the top, matters most; as long as the original symbol is correctly encoded and its information preserved, the reference symbols can always be restored later). Moreover, highly redundant data tends to be placed at the top because it can be reused a lot, while data that cannot be compressed much comes later; corruption of this less compressed data is much less critical, because only a few characters will change in the uncompressed output.

This variable error correction rate should allow protecting the critical parts of a file more (the header and the beginning of the file; for example, in compressed file formats such as zip or jpg, this is where the most important strings are encoded) for the same amount of storage as a standard constant error correction rate.

Of course, you can set the resiliency rate for each stage to the values you want, so that you can even do the opposite: setting a higher resiliency rate for stage 3 than stage 2 will produce an ecc that is greater towards the end of the contents of your files.

Furthermore, the currently designed format of the ecc file would allow two things that are not available in all current file ecc generators such as PAR2:

  1. it allows partially repairing a file, even if not all blocks can be corrected (in PAR2, a file is repaired only if all of its blocks can be repaired, which is a shame because repairing the blocks that can be repaired would still produce a less corrupted file);

  2. the ecc file format is quite simple and readable, easy to process by any script, which allows other software to work on it as well (it was also designed this way to be more resilient against corruption: even if one entry is corrupted, the other entries are independent and can still be used, so the ecc file is very error tolerant. This idea is implemented in repair_ecc.py but could be extended, especially if you know the pattern of the corruption).

The script structural_adaptive_ecc.py implements this idea, which can be seen as an extension of header_ecc.py (in fact, the idea went the other way around: structural_adaptive_ecc.py was conceived first but was too complicated, so header_ecc.py was implemented as a working, reduced implementation for headers only, and then structural_adaptive_ecc.py was finished using header_ecc.py's code progress). It works and was quite well tested for my own needs on datasets of hundreds of GB, but it's not foolproof, so make sure you test the script yourself to see if it's robust enough for your needs (any feedback about this would be greatly appreciated!).

ECC Algorithms

You can specify different ecc algorithms using the --ecc_algo switch.

For the moment, only Reed-Solomon is implemented, but it's universal so you can modify its parameters in lib/eccman.py.

Two Reed-Solomon codecs are available; they are functionally equivalent and thoroughly unit tested.

Note about speed: use a smaller --max_block_size to greatly speed up the operations! That's the trick used to compute RS ECC very quickly on optical discs. You give up a bit of resiliency, of course (because blocks are smaller, you protect a smaller number of characters per ECC block; in the end this should not change much in terms of real resiliency, but if you get a big burst of bit errors on a contiguous region, you may lose a whole block at once. That's why using RS255 is better, but it's very time consuming. The resiliency ratios still hold, so for any other case of bit flipping with average-sized bursts this should not be a problem, as long as the bursts are smaller than an ecc block).
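
As a rough illustration of the trade-off (illustrative numbers, not pyFileFixity's defaults):

::

  # A Reed-Solomon block of n bytes with nsym ecc bytes can correct up to
  # nsym // 2 byte errors located anywhere inside that block.
  def correctable_per_block(n, resiliency_ratio):
      nsym = int(n * resiliency_ratio)
      return nsym // 2

  print(correctable_per_block(255, 0.125))  # 15 errors correctable per 255-byte block
  print(correctable_per_block(64, 0.125))   # 4 errors per 64-byte block: same ratio overall,
                                            # but a burst of 5+ bytes inside one small block
                                            # makes that block unrecoverable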

In case of a catastrophic event

TODO: write more here

In case of a catastrophic event affecting your data due to the failure of your storage medium (eg: your hard drive crashed), follow these steps:

1- Use dd_rescue to make a full bit-per-bit verbatim copy of your drive before it dies. The nice thing with dd_rescue is that the copy is exact, and it can retry or skip bad sectors (it won't suddenly crash on you halfway through the process).

2- Use testdisk to restore the partition or to copy files based on the partition's filesystem information.

3- If you could not recover your files, you can try file scraping using photorec <http://www.cgsecurity.org/wiki/PhotoRec> or plaso <http://plaso.kiddaland.net/> or other similar tools as a last resort, to extract data based only on file contents (no filename, often an incorrect filetype, file boundaries may be wrong so some data may be cut off, etc.).

4- If you used pyFileFixity before the failure of your storage medium, you can then use your pre-computed databases to check that files are intact (rfigc.py) and, if they aren't, recover them (using header_ecc.py and structural_adaptive_ecc.py). It can also help if you recovered your files via data scraping, because your files will be totally unorganized, but you can use a previously generated database file to recover the full names and directory tree structure using rfigc.py --filescraping_recovery.

Also, you can try to fix some of your files using specialized repair tools (but remember that such tools cannot guarantee the same recovery capability as an error correction code, and in addition, an error correction code can tell you when it has recovered successfully). For example:

Protecting directory tree meta-data

One main current limitation of pyFileFixity is that it cannot protect the directory tree meta-data. This means that in the worst case, if a silent error happens on the inode pointing to the root directory you protected with an ecc file, the whole directory will vanish, and all the files inside with it. In less severe cases, sub-directories can vanish, which is still pretty bad, and since the ecc file doesn't store any information about inodes, you can't recover the full path.

The inability to store this meta-data stems from two design choices:

  1. portability: we want the ecc file to work even if we move the root directory to another place or another storage medium (and of course, the inode would change),

  2. cross-platform compatibility: there's no way to get and store directory meta-data for all platforms, but of course we could implement specific instructions for each main platform, so this point is not really a problem.

To work around this issue (directory meta-data are critical spots), other software uses a one-time storage medium (ie, writing your data along with generating and writing the ecc). This way, they can access the inode info at the bit level, and they are guaranteed that the inodes won't ever change. This is the approach taken by DVDisaster: by using optical mediums, it can compute inodes that will be permanent, and thus also encode that info in the ecc file. Another approach is to create a virtual filesystem specifically to store just your files, so that you manage the inodes yourself, and you can then copy the whole filesystem around (it is really just a file, just like a zip file, which can also be considered a mini virtual filesystem), like rsbep <http://users.softlab.ntua.gr/~ttsiod/rsbep.html>_.

Here, the portability principle of pyFileFixity prevents this approach. But you can mimic this workaround on your hard drive: you just need to package all your files into one file. This way, you essentially create a virtual filesystem: inside the archive, files and directories have meta-data just like in a filesystem, but from the outside it's just one file, composed of bytes that we can encode to generate an ecc file. In other words, we remove the inode portability problem, since this meta-data is stored relatively inside the archive, the archive manages it, and we can encode this info like any other stream of data! The usual way to make an archive from several files is TAR, but this generates a solid archive, which prevents partial recovery. An alternative is DAR, which is a non-solid version of TAR with lots of other features. If you also want to compress, you can just ZIP your files with the DEFLATE algorithm (this also produces a non-solid archive). You can then use pyFileFixity to generate an ecc file on your DAR or ZIP archive, which will protect both your files, as before, and the directory meta-data too.
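
Here is a minimal sketch of this workaround using Python's standard zipfile module (each member is compressed independently with DEFLATE, so the archive is non-solid; the folder and file names are of course placeholders):

::

  import os
  import zipfile

  def pack_dir(root, archive_path):
      """Pack a directory tree into one non-solid ZIP file, preserving relative paths."""
      with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
          for dirpath, dirnames, filenames in os.walk(root):
              for name in filenames:
                  full = os.path.join(dirpath, name)
                  zf.write(full, arcname=os.path.relpath(full, root))

  pack_dir("your_folder", "your_folder.zip")
  # then protect the single archive file, eg: pff whole -i "your_folder.zip" -d "ecc.txt" -g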

Which storage medium to use

Since hard drives have a relatively short timespan (5-10 years, often less) and require regular plugging to an electrical outlet to keep the magnetic plates from decaying, other solutions are more advisable.

The medium I used to advise is optical discs (BluRay or DVD, not CDs!), because the reading instrument is distinct from the storage medium, and the technology (a laser reflecting on bumps and pits) is kind of universal, so that even if the technology is lost one day (deprecated by newer technologies, so that you can't find the reading instrument anymore because it's no longer sold), you could probably emulate a laser in software to read your optical disc, just like the CAMiLEON project did to recover data from the LaserDiscs of the BBC Domesday Project (see Wikipedia). BluRays have an estimated lifespan of 20-50 years depending on whether they are "gold archival grade", whereas DVDs should last 10-30 years. CDs are only required to last from a minimum of 1 year up to 10 years, and hence are not fit for archival. Archival-grade optical discs such as M-Discs boast a lifespan of up to 100 years, but there is currently no independent scientific backing for these claims. For more details, you can read a longer explanation I wrote with references on StackOverflow <https://web.archive.org/web/20230424112000/https://superuser.com/questions/374609/what-medium-should-be-used-for-long-term-high-volume-data-storage-archival/873260>__.

However, the limitations of optical discs include their limited storage space, low transfer speed, and limited rewritability.

A more convenient solution is magnetic tape, especially with an open standard such as Linear Tape Open (LTO) <https://en.wikipedia.org/wiki/Linear_Tape-Open>__, which ensures interoperability between manufacturers and hence also reduces cost through competition. LTO works as a two-component system: the tape drive and the cartridges (containing the magnetic tape). There are many generations of LTO, each improving on the previous one. LTO cartridges have a shorter lifespan than optical discs, about 15-30 years on average, but they are much more convenient to use:

Sounds perfect, right? Well, nothing is, LTO also has several disadvantages:

Given all the above characteristics, LTO>=5 appears to be the best practical solution for long term archival, if coupled with an active (but infrequent) curation process.

There is however one exception: if you need to cold store the medium in a non-temperate environment (outside of 10-40°C), optical discs may be more resilient. LTO cartridges should also be able to sustain a wider range of temperatures, but you then need to let them "warm up" in the environment where the reader is before reading, so that the magnetic elements have time to stabilize at normal temperature.

How to get a LTO tape drive and system running

To get started with LTO tape drives, which one to choose, and how to make your own rig, Matthew Millman wrote an excellent tutorial <https://www.mattmillman.com/attaching-lto-tape-drives-via-usb-or-thunderbolt/>__ on which we build below, so you should read that tutorial first and then the instructions below.

The process is as follows: first find a second-hand/refurbished LTO drive of the highest generation your budget allows, then find a server of a similar generation, or build an eGPU enclosure + SAS card setup of the highest speed the tape drive can support. Generally, you can aim for an LTO drive 3-4 generations older than the latest one (eg, if the current generation is LTO9, you can expect a cheap LTO5 or LTO6 drive for 150-300 dollars). Aim only for LTO5+, because LTFS did not exist before LTO5; keep in mind that some LTO5 drives need a firmware update to support LTFS, whereas all LTO6 drives support it out of the box.

Once you find a second-hand LTO drive, consult its user manual beforehand to see whether you need a SAS or Fibre Channel (FC) connection (if SAS, any version should work, even newer ones, but older versions will just limit the read/write speed). For example, here is the manual for the HP LTO6 drive <https://docs.oracle.com/cd/E38452_01/en/LTO6_Vol1_E1_D7/LTO6_Vol1_E1_D7.pdf>__. All LTO drives are compatible with all computers, provided you have the adequate connectivity (a SAS or FC adapter).

Once you have an LTO drive, you can look for a computer to plug it into. Essentially, you just need a computer that supports SAS, or at least a free PCIe or mini-PCIe slot to connect a SAS adapter.

The general outline is that you just need a computer with a PCIe slot, plus a SAS or FC adapter (depending on whether your LTO drive is SAS or FC) so that you can plug in your LTO drive. There is currently no SAS-to-USB adapter, and only one manufacturer makes LTO drives with USB ports, but they are very expensive, so just stick with internal SAS or FC drives (usually you want SAS: FC is better for long-range connections, whereas SAS is compatible with SATA and SCSI drives, so you can also plug all your other hard drives plus the LTO tape drive into the same SAS adapter).

In practice, there are 2 different available cost-effective approaches:

The consumables, the tapes, can also easily be found second-hand and are usually very cheap: eg, LTO6 tapes are sold at 10-20 euros/dollars each, for a storage space of 2.5TB native (6.25TB compressed) per tape.

With both approaches, expect at the cheapest a total cost of about 500 euros/dollars for the tape drive and attachment system (eGPU enclosure or dedicated server) as of 2023, which is very good and amortized very quickly with just a few tapes, even compared to the cheapest hard drives!

If you are just starting with professional server setups, the YouTube channel Art of Server <https://www.youtube.com/channel/UCKHE9DEep52XlmwLbZUKvyw> is highly recommended, providing very helpful tutorials such as 13 Reasons Why your drives are not showing up in your LSI HBA <https://www.youtube.com/watch?v=1dCd6IepB5s>. To identify SAS cables and connectors, see this guide <https://www.tape-drive-repair.com/sas-connector-guide/>__.

Note that there is an ltfs software package available on FreeBSD, and it may be better to flash the SAS HBA controller firmware to the latest version and to IT mode to allow the OS to manage the drive more easily; see this guide <https://github.com/lrq3000/lsi_sas_hba_crossflash_guide>__ for more info on how to crossflash HBA firmwares. Flashing a recent firmware may also be necessary if drives bigger than 2TB are not detected. Indeed, SAS drives are usually inexpensive on the second-hand market, and they are easily swappable using server racks, so they can also be a nice additional online backup method (you can even make a simple home cloud that auto-backups your phone's data using Syncthing, for example, and sync multiple backup copies using FreeFileSync).

A modern data curation strategy for individuals

Here is an example curation strategy, which is accessible to individuals and not just big data centers:

With the above strategy, you should be able to preserve your data for as long as you can actively curate it. If you want more robustness against accidents, or against the risk that 2 copies get corrupted within 5 years, you can make more copies, preferably on LTO cartridges, but additional hard drives work too.

For more information on how to cold store LTO cartridges, read pp. 32-33, "Caring for Cartridges", of this user manual <https://docs.oracle.com/cd/E38452_01/en/LTO6_Vol1_E1_D7/LTO6_Vol1_E1_D7.pdf>. For HP LTO6 drives, Matthew Millman made an open-source command-line tool to do advanced LTO manipulations on Windows: ltfscmd <https://github.com/inaxeon/ltfscmd>.

In case you cannot afford an LTO drive, you can use external hard drives instead, as they are less expensive to start with, but then your curation should happen more frequently (ie, a small checkup every 2-3 years, and a big checkup every 5 years).

Tools like pyFileFixity (or which can be used as complements)

Here are some tools with a similar philosophy to pyFileFixity, which you can use if they better fit your needs, either as a replacement of pyFileFixity or as a complement (pyFileFixity can always be used to generate an ecc file):

FAQ

As a rule of thumb, you should ALWAYS keep your ecc file in clear text, i.e., with no compression nor encryption. This is because if the ecc file gets corrupted while compressed/encrypted, decompressing/decrypting the corrupted parts may completely break the whole structure of the ecc file.

Your data files (the ones you want to protect) should also remain in clear text, but you may choose to compress them if it drastically reduces their size and if you raise the resiliency rate of your ecc file accordingly (so compression can be a good option if you trade the file size reduction for more ecc resilience). Also, make sure to choose a non-solid compression format like DEFLATE (zip), so that you can still decode the correct parts even if some are corrupted (with a solid archive, if one byte is corrupted, the whole archive may become unreadable).

However, if you do compress your files, generate the ecc file only after compression, so that it applies to the compressed archive instead of the uncompressed files; otherwise you risk being unable to correct your files, because decompressing corrupted parts may output gibberish of a different length (and if the size is different, Reed-Solomon will just fail).

NEVER encrypt your ecc file: this is totally useless and counterproductive.

You can encrypt your data files, but choose a non-solid algorithm (like AES, if I'm not mistaken) so that corrupted parts do not prevent the decoding of subsequent correct parts. Of course, you lower your chances of recovering your data files a bit by encrypting them (the best way to keep data for the long term is to keep it in clear text), but if it's really necessary, a non-solid encryption scheme is a good compromise.

You can generate an ecc file on your encrypted data files (thus after encryption) and keep the ecc file in clear text (never encrypt nor compress it). This is not a security risk, since the ecc file does not give any information about the content of your encrypted files, but only redundant info to correct corrupted bytes (however, if you generate the ecc file on the data files before encryption, then it is clearly a security risk, and someone could recover your data without your permission).

The details are long and a bit complicated (I may write a complete article about it in the future), but the tl;dr answer is that you should use optical discs, because they decouple the storage medium from the reading hardware (by contrast, hard drives contain both the reading hardware and the storage medium, so if one fails, you lose both) and because they are most likely future-proof (you only need a laser, which is universal; the laser's parameters can always be tweaked).

From scientific studies, it seems that, at the time of writing (2015), BluRay HTL discs are the most resilient against environmental degradation. To extend their lifespan, you can also keep optical discs in completely opaque boxes (to avoid light degradation), and in addition you can put any storage medium (not only optical discs, but also hard drives and anything else really) in completely air-tight and water-tight bags or boxes and store them in a fridge or a freezer. This is a law of nature: the lower the temperature, the lower the entropy, in other words the lower the degradation over time. It works the same with digital data.

It's difficult to advise one specific format. What we can do is list the characteristics of a good file format:

There are a few studies about the most resilient file formats, such as:

If you have any question about Reed-Solomon codes, the best place to ask is probably here (with the incredible Dilip Sarwate): http://www.dsprelated.com/groups/comp.dsp/1.php?searchfor=reed%20solomon

Also, you may want to read the following resources:

.. |Example| image:: https://raw.githubusercontent.com/lrq3000/pyFileFixity/master/tux-example.jpg
   :scale: 60 %
   :alt: Image corruption and repair example
.. |PyPI-Status| image:: https://img.shields.io/pypi/v/pyfilefixity.svg
   :target: https://pypi.org/project/pyfilefixity
.. |PyPI-Versions| image:: https://img.shields.io/pypi/pyversions/pyfilefixity.svg?logo=python&logoColor=white
   :target: https://pypi.org/project/pyfilefixity
.. |PyPI-Downloads| image:: https://img.shields.io/pypi/dm/pyfilefixity.svg?label=pypi%20downloads&logo=python&logoColor=white
   :target: https://pypi.org/project/pyfilefixity
.. |Build-Status| image:: https://github.com/lrq3000/pyFileFixity/actions/workflows/ci-build.yml/badge.svg?event=push
   :target: https://github.com/lrq3000/pyFileFixity/actions/workflows/ci-build.yml
.. |Coverage| image:: https://codecov.io/github/lrq3000/pyFileFixity/coverage.svg?branch=master
   :target: https://codecov.io/github/lrq3000/pyFileFixity?branch=master
.. |LICENCE| image:: https://img.shields.io/pypi/l/pyfilefixity.svg
   :target: https://raw.githubusercontent.com/lrq3000/pyfilefixity/master/LICENCE