kdave / btrfs-progs

Development of userspace BTRFS tools
GNU General Public License v2.0

Make xxHash64 default; Add Blake3 and xxHash128; Deprecate crc32 #548

Open EldarAgalarov opened 1 year ago

EldarAgalarov commented 1 year ago

Hi btrfs devs. Some suggestions related to hashing algorithms:

Make xxHash 64-bit the default hashing algorithm when creating new btrfs partitions.

Add support for Blake3 and xxHash 128-bit.

Deprecate crc32 for new kernels - it is weak, the worst of the available hashing algorithms. xxHash has comparable performance with far better hashing quality.

Forza-tng commented 1 year ago

Hi @EldarAgalarov,

Do you have some data to back up those claims? Also, different architectures may behave differently. Intel CPUs have hardware instructions for crc32c, which makes it very fast and cheap. On newer AMD CPUs, xxhash64 could be faster.

Another question is the definition of "better".

EldarAgalarov commented 1 year ago

A filesystem must care about data integrity first; performance comes second. crc32 is not good enough for that: it's weak, not collision-resistant, and has a low avalanche effect, unlike xxHash for example. xxHash is fast enough and performs quite well on relatively new hardware. There is no point in using crc32 just because it is a "cheap" hashing algorithm. Maybe crc32 is fine for systems with a huge number of write operations, but most data storage systems see far more reads than writes.

kdave commented 1 year ago

You can already use xxhash for new filesystems. Changing defaults has vast consequences and is rarely done; it must be backed by an evaluation of what can go wrong and of backward compatibility (what's the oldest kernel that supports the new defaults?).

Adding a new checksum algorithm is sort of planned and expected, as there's still research going on in that area. Blake3 and XXH3 will be considered, but this will have to be evaluated. The timeframe for that is 'in a few years'.

Regarding crc32c vs xxhash: crc32c is still sufficient to detect errors on terabyte-sized filesystems, while a stronger hash is suitable for larger filesystems (say 10T+). Cryptographic hashes guarantee collision resistance by design but also consume more metadata space for the checksums, so this is up to the user to choose, and the mkfs options are there.
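
For example, xxhash can already be selected at mkfs time (device name hypothetical):

$ mkfs.btrfs --csum xxhash /dev/sdX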

It is also possible to switch the checksum on an existing filesystem for "late decisions", but at this point it's not safe as it drops the original checksums and calculates new ones. This could lead to silent errors.

Zesko commented 1 year ago

> xxHash have same performance

I do not think xxHash64 and crc32c (SSE4.2) have the same performance. xxHash64 is about twice as slow as crc32c (SSE4.2) on Intel 11th-gen and AMD Ryzen 5000 series CPUs.

I checked with SMHasher.
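
Its speed test is run per hash function, for example:

$ ./SMHasher --test=Speed xxh3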

My result:

| Hash | Intel i7 11800H @ 3 GHz | AMD Ryzen 5800X @ 3 GHz |
| --- | --- | --- |
| CRC32 (Software) | 710.95 MiB/sec | 451.69 MiB/sec |
| CRC32 (SSE4.2 for hardware) | 43749.78 MiB/sec | 27406.36 MiB/sec |
| wyhash (Software) | 40202.08 MiB/sec | 24205.78 MiB/sec |
| fletcher4 (Software) | 23537.79 MiB/sec | 25614.70 MiB/sec |
| xxHash64 (Software) | 22756.69 MiB/sec | 10319.78 MiB/sec |
| xxH3 (SSE2 for hardware) | 68520.63 MiB/sec | 46489.89 MiB/sec |
| xxH128 (SSE2 for hardware) | 67780.69 MiB/sec | 46063.38 MiB/sec |

xxH3, xxH128, and wyhash have good collision resistance and good performance on many x86 CPUs.

Note: not all older x86 CPUs support SSE4.2 for CRC32 hardware acceleration (SSE4.2 is newer than SSE2), so they fall back to the slow software CRC32.

xxH3 with SSE2 looks like the clear winner across both old and new x86 CPUs.


Note: the Apple M1 and other ARM-based CPUs do not support SSE for hardware acceleration. wyhash would be interesting for any CPU without SSE.

EldarAgalarov commented 1 year ago

@Zesko As I can see, xxH3 and xxH128 perform better than crc32 while needing only the SSE2 instruction set. So in general crc32 has no performance benefit on relatively modern CPUs for most users. xxH3 is a great replacement for crc32, and paranoid users can use 256-bit cryptographic hash functions like Blake 2/3.

Syderitic commented 3 months ago

Just my 5 cents on this discussion. I would really like to see a way to change the checksum on existing filesystems and support for better hash functions.

Here are my AMD Ryzen 9 7945HX results using SMHasher:

| Hash Function | Description | MiB/sec (Mean ± Std) | Cycles/Hash (Mean ± Std) |
| --- | --- | --- | --- |
| crc32_hw | SSE4.2 crc32 in HW | 16316.42 ± 1.16 | 13.78 ± 2.14 |
| crc32_hw1 | Faster Adler SSE4.2 crc32 on Intel HW | 47949.31 ± 57.33 | 16.67 ± 2.14 |
| xxhash64 | xxHash, 64-bit | 18362.94 ± 9.31 | 21.15 ± 4.66 |
| xxh3 | xxHash v3, 64-bit | 80824.93 ± 6050.65 | 10.52 ± 0.48 |
| xxh128 | xxHash v3, 128-bit | 77392.25 ± 5925.11 | 12.39 ± 0.99 |
| blake2b-256 | blake2b-256 | 1167.05 ± 2.71 | 305.76 ± 5.73 |
| sha2-256 | SHA2-256 | 511.29 ± 8.98 | 384.40 ± 4.65 |
| sha2ni-256 | SHA2_NI-256 (amd64 HW SHA ext) | 3066.11 ± 44.96 | 94.17 ± 0.21 |
| sha2ni-256_64 | hardened SHA2_NI-256 (amd64 HW SHA ext), low 64 bits | 3125.24 ± 4.19 | 81.48 ± 0.29 |

Code to test:

import subprocess
import re
import statistics

def run_smhasher(hash_function):
    # Construct the command to run SMHasher
    command = f"./SMHasher --test=Speed {hash_function}"

    # Run the command and capture the output
    result = subprocess.run(command, shell=True, text=True, capture_output=True)
    output = result.stdout

    # Regex to extract the descriptive name
    name_pattern = r'--- Testing ' + re.escape(hash_function) + r' "(.*?)"'
    name_match = re.search(name_pattern, output, flags=re.IGNORECASE)
    hash_name = name_match.group(1) if name_match else None

    # Regex to find MiB/sec and cycles/hash values
    speed_pattern = r'Alignment\s+\d+\s+-\s+\d+\.\d+\s+bytes/cycle\s+-\s+(\d+\.\d+)\s+MiB/sec'
    cycle_pattern = r'Small key speed test\s+-\s+\d+-byte keys\s+-\s+(\d+\.\d+)\s+cycles/hash'

    # Find all matches
    speeds = [float(match) for match in re.findall(speed_pattern, output)]
    cycles = [float(match) for match in re.findall(cycle_pattern, output)]

    # Calculate statistics (statistics.stdev needs at least two samples)
    if speeds:
        avg_speed = statistics.mean(speeds)
        std_speed = statistics.stdev(speeds) if len(speeds) > 1 else 0.0
    else:
        avg_speed = std_speed = "No data"

    if cycles:
        avg_cycles = statistics.mean(cycles)
        std_cycles = statistics.stdev(cycles) if len(cycles) > 1 else 0.0
    else:
        avg_cycles = std_cycles = "No data"

    return hash_name, avg_speed, std_speed, avg_cycles, std_cycles

def main():
    hash_functions = ["crc32_hw", "crc32_hw1", "xxhash64", "xxh3", "xxh128", "blake2b-256", "sha2-256", "sha2ni-256", "sha2ni-256_64"]
    results = {}

    print(f"{'Hash Function':<14} | {'Description':<70} | {'MiB/sec (Mean ± Std)':<20} | {'Cycles/Hash (Mean ± Std)':<25}")
    print('-' * 137)  # Adjust the total length to fit your terminal if necessary

    for func in hash_functions:
        hash_name, avg_speed, std_speed, avg_cycles, std_cycles = run_smhasher(func)
        results[func] = (avg_speed, std_speed, avg_cycles, std_cycles, hash_name)

    # Print the results
    for func, metrics in results.items():
        avg_speed, std_speed, avg_cycles, std_cycles, hash_name = metrics
        speed_str = f"{avg_speed:>8.2f} ± {std_speed:<5.2f}" if avg_speed != "No data" else "No data"
        cycles_str = f"{avg_cycles:>8.2f} ± {std_cycles:<5.2f}" if avg_cycles != "No data" else "No data"
        print(f"{func:<14} | {hash_name:<70} | {speed_str:<20} | {cycles_str:<25}")

if __name__ == "__main__":
    main()

kdave commented 3 months ago

Changing the checksum on an existing filesystem has been implemented, but it's still risky and not enabled outside of the experimental build.

$ ./configure --experimental
$ make
$ ./btrfstune --help
...
    EXPERIMENTAL FEATURES:
    --csum CSUM               switch checksum for data and metadata to CSUM 
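
With an experimental build, a conversion would then look like (device name hypothetical):

$ ./btrfstune --csum xxhash /dev/sdX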

The XXH3 hashes heavily utilize SSE instructions, which works well in user space but is problematic in the kernel: it needs the FPU, and switching context and saving its state is a bit expensive. Accelerated versions also exist for SHA256 and BLAKE2 (though the latter is not in the kernel yet), so we'd need to compare the base non-accelerated versions along with anything that uses SSE/AVX extensions.

Since progs version 6.5, CRC32C uses an ultra-fast, state-of-the-art implementation (the same as in the kernel) with the PCLMUL and CRC32Q instructions. SHA256 also has NI support. There is probably some room for optimization, but adding new hashes or changing defaults is a different problem. The last comment from 2022 says "in a few years"; so far I don't see a reason to add more.

Syderitic commented 3 months ago

Thank you for your reply.

I guess being able to change to a 64-bit hash is enough for me. I was thinking of having a 128-bit hash, but probably the only "real" benefit would be for dedup at the block level, which is something I'm interested in.

I made a simple table with the probability of a collision, based on the number of unique sectors for a particular storage size and block size, according to the birthday paradox.
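
(The approximation used, as implemented in the script below, is the birthday bound p ≈ 1 − exp(−n² / (2m)), with n the number of blocks and m = 2^bits the size of the hash space.)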

| Storage Size & Block Size | 32-bit | 64-bit | 128-bit | 256-bit |
| --- | --- | --- | --- | --- |
| 0.1 TB & 4 KiB | 1.00000e+00 | 1.95311e-05 | 1.05879e-24 | 3.11151e-63 |
| 0.1 TB & 16 KiB | 1.00000e+00 | 1.22070e-06 | 6.61744e-26 | 1.94469e-64 |
| 0.1 TB & 64 KiB | 1.00000e+00 | 7.62939e-08 | 4.13590e-27 | 1.21543e-65 |
| 1 TB & 4 KiB | 1.00000e+00 | 1.95122e-03 | 1.05879e-22 | 3.11151e-61 |
| 1 TB & 16 KiB | 1.00000e+00 | 1.22063e-04 | 6.61744e-24 | 1.94469e-62 |
| 1 TB & 64 KiB | 1.00000e+00 | 7.62937e-06 | 4.13590e-25 | 1.21543e-63 |
| 10 TB & 4 KiB | 1.00000e+00 | 1.77422e-01 | 1.05879e-20 | 3.11151e-59 |
| 10 TB & 16 KiB | 1.00000e+00 | 1.21328e-02 | 6.61744e-22 | 1.94469e-60 |
| 10 TB & 64 KiB | 1.00000e+00 | 7.62648e-04 | 4.13590e-23 | 1.21543e-61 |
| 100 TB & 4 KiB | 1.00000e+00 | 1.00000e+00 | 1.05879e-18 | 3.11151e-57 |
| 100 TB & 16 KiB | 1.00000e+00 | 7.04977e-01 | 6.61744e-20 | 1.94469e-58 |
| 100 TB & 64 KiB | 1.00000e+00 | 7.34562e-02 | 4.13590e-21 | 1.21543e-59 |

Ex: for 1 TB of storage with a 4 KiB block size, a 64-bit hash drops the probability of a collision to less than 0.2%. Increasing the block size lowers it even further.

Thank you kindly for your time.

You can review the code here

import math

def calculate_collision_probabilities(storage_sizes, block_sizes):
    hash_sizes = [32, 64, 128, 256]

    # Print the header for the markdown table
    print("| Storage Size & Block Size |", end=" ")
    print(" | ".join([f"{size}-bit" for size in hash_sizes]), "|")
    print("|" + " --- |" * (len(hash_sizes) + 1))

    for storage_size in storage_sizes:
        for block_size in block_sizes:
            # Convert storage size to KiB and calculate the number of blocks
            # (1024**3 KiB per unit, so the sizes are effectively TiB)
            data_size_in_KiB = storage_size * 1024**3
            n_blocks = data_size_in_KiB / block_size

            # Calculate the probabilities for each hash size
            probabilities = []
            for bits in hash_sizes:
                m = 2**bits
                # Using log1p to avoid precision issues
                log_p = -n_blocks**2 / (2 * m)
                probability = -math.expm1(log_p)  # more stable for small probabilities
                probabilities.append(f"{probability:.5e}")  # Format as scientific notation

            # Print each row for the markdown table
            row_label = f"{storage_size} TB & {block_size} KiB"
            print("| " + row_label + " | " + " | ".join(probabilities) + " |")

# Example usage
storage_sizes = [0.1, 1, 10, 100]  # List of storage sizes in TB
block_sizes = [4, 16, 64]    # List of block sizes in KiB
calculate_collision_probabilities(storage_sizes, block_sizes)

Zygo commented 3 months ago

For dedupe purposes you don't need a zero collision rate. Dedupe on btrfs enforces a byte-by-byte comparison of the data before removing the duplicate extent, so a few collisions are OK; they just reduce performance. The hash size only needs to be large enough that, say, fewer than one in a million blocks collide.

Start with the log2 of the filesystem size, subtract the number of bits for the block size, then add about 20, and that's the minimum hash size. e.g. 100 TiB at 4K block size needs about 54 bits of hash.
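
A minimal Python sketch of that rule of thumb (the function name is mine, purely illustrative):

import math

def min_hash_bits(fs_size_bytes, block_size_bytes, slack_bits=20):
    # log2 of the number of blocks, plus ~20 bits of headroom so that
    # roughly fewer than one in a million blocks collide
    return math.log2(fs_size_bytes / block_size_bytes) + slack_bits

# 100 TiB filesystem with 4 KiB blocks -> about 54.6 bits of hash needed
print(min_hash_bits(100 * 2**40, 4 * 2**10))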

adam900710 commented 3 months ago

I guess it's time for us to introduce a comprehensive log-replay-like test for csum conversion, so that we can move that conversion out of the experimental build first.

Saxtr0 commented 2 months ago

If it can help, here is a benchmark collected with hash-speedtest.

AMD Ryzen 9 7950X

$ ./hash-speedtest
CPU flags: 0x1ff
CPU features: SSE2 SSSE3 SSE41 PCLMUL SSE42 SHA AVX AVX2
Block size:     4096
Iterations:     100000
Implementation: builtin
Units:          CPU cycles

      NULL-NOP: cycles:     17130735, cycles/i      171
   NULL-MEMCPY: cycles:     43030170, cycles/i      430,    40708.402 MiB/s
    CRC32C-ref: cycles:   2631866400, cycles/i    26318,      665.577 MiB/s
     CRC32C-NI: cycles:     55520820, cycles/i      555,    31550.240 MiB/s
        XXHASH: cycles:     99091305, cycles/i      990,    17677.410 MiB/s
    SHA256-ref: cycles:   7556944725, cycles/i    75569,      231.801 MiB/s
     SHA256-NI: cycles:   3214167795, cycles/i    32141,      544.996 MiB/s
    BLAKE2-ref: cycles:   1548024030, cycles/i    15480,     1131.576 MiB/s
   BLAKE2-SSE2: cycles:   1641883545, cycles/i    16418,     1066.890 MiB/s
  BLAKE2-SSE41: cycles:   1517379705, cycles/i    15173,     1154.430 MiB/s
   BLAKE2-AVX2: cycles:   1117633140, cycles/i    11176,     1567.338 MiB/s