EldarAgalarov opened this issue 1 year ago (status: Open)
Hi @EldarAgalarov,
Do you have some data to back up those claims? Also, different architectures may behave differently. Intel CPUs have hardware instructions for crc32c, which makes it very fast and cheap. On newer AMD CPUs, xxhash64 could be faster.
Another question is the definition of "better".
A filesystem must care about data integrity first and performance second. crc32 is not good enough for that: it is weak, not collision-resistant, and has a poor avalanche effect, unlike xxHash, for example. xxHash is fast enough and performs quite well on relatively new hardware. There is no point in using crc32 on the argument that it is a "cheap" hashing algorithm. Maybe crc32 is good for data systems with a huge number of write operations, but most data storage systems perform far more reads than writes.
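To illustrate the collision-resistance point, here is a small stdlib-only sketch (mine, not from the thread): with any 32-bit checksum, the birthday paradox produces a collision after only a few hundred thousand random inputs. Note that Python's `zlib.crc32` is plain CRC-32, not the crc32c variant btrfs uses, but the output width, and hence the collision behavior, is the same.

```python
import random
import zlib

def find_crc32_collision(max_tries=400_000):
    """Brute-force a crc32 collision via the birthday paradox.

    With ~400k random 16-byte inputs and a 32-bit checksum, the
    expected number of colliding pairs is about n^2 / 2^33 ~ 18,
    so a collision is found with near certainty."""
    random.seed(0)  # deterministic, reproducible run
    seen = {}
    for _ in range(max_tries):
        data = random.getrandbits(128).to_bytes(16, "big")
        h = zlib.crc32(data)
        if h in seen and seen[h] != data:
            return seen[h], data, h
        seen[h] = data
    return None

hit = find_crc32_collision()
if hit:
    a, b, h = hit
    print(f"crc32({a.hex()}) == crc32({b.hex()}) == {h:#010x}")
```

Running the same search against a 64-bit hash would need on the order of billions of inputs, which is the practical difference the table later in this thread quantifies.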
You can already use xxhash for new filesystems. Changing the defaults has vast consequences, is rarely done, and must be backed by an evaluation of what can go wrong and of backward compatibility (what is the oldest kernel that supports the new defaults?).
Adding a new checksum algorithm is sort of planned and expected, as there is still research going on in that area. BLAKE3 and XXH3 will be considered, but this will have to be evaluated. The timeframe for that is 'in a few years'.
Regarding crc32c vs xxhash: crc32c is still sufficient to detect errors on terabyte-sized filesystems; a stronger hash is suitable for larger filesystems (say 10T+). Cryptographic hashes guarantee collision resistance by design but also consume more metadata space for the checksums, so this is up to the user to choose, and the mkfs options are there.
It is also possible to switch the checksum on an existing filesystem for "late decisions", but at this point it's not safe as it drops the original checksums and calculates new ones. This could lead to silent errors.
> xxHash have same performance

I do not think xxHash64 and crc32c (SSE4.2) have the same performance: xxHash64 is about twice as slow as crc32c (SSE4.2) on Intel 11th-gen and AMD Ryzen 5000 series CPUs.
I checked with SMHasher. My results:
Hash function | Intel i7 11800H @ 3 GHz | AMD Ryzen 5800X @ 3 GHz |
---|---|---|
CRC32 (Software) | 710.95 MiB/sec | 451.69 MiB/sec |
CRC32 (SSE4.2 for hardware) | 43749.78 MiB/sec | 27406.36 MiB/sec |
wyhash (Software) | 40202.08 MiB/sec | 24205.78 MiB/sec |
fletcher4 (Software) | 23537.79 MiB/sec | 25614.70 MiB/sec |
xxHash64 (Software) | 22756.69 MiB/sec | 10319.78 MiB/sec |
xxH3 (SSE2 for hardware) | 68520.63 MiB/sec | 46489.89 MiB/sec |
xxH128 (SSE2 for hardware) | 67780.69 MiB/sec | 46063.38 MiB/sec |
xxH3, xxH128, and wyhash have good collision resistance and good performance on many x86 CPUs.
Note: not all older x86 CPUs support SSE4.2 for CRC32 hardware acceleration (SSE4.2 is newer than SSE2), so they fall back to the slow CRC32 software implementation.
I see that xxH3 with SSE2 is the clear winner on both old and new x86 CPUs.
Note: the Apple M1 in MacBooks and other ARM CPUs do not support SSE for hardware acceleration. wyhash would be interesting for any CPU without SSE.
@Zesko As I can see, xxH3 and xxH128 (which only need SSE2) perform better than hardware-accelerated crc32. So, in general, crc32 has no performance benefit on relatively modern x86 CPUs for most users. xxH3 is a great replacement for crc32. Paranoid users can use 256-bit cryptographic hash functions like BLAKE2/3.
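For readers who want a rough feel for these relative speeds without building SMHasher, here is a stdlib-only timing sketch (my addition). It cannot include xxHash, which needs the third-party `xxhash` package, and `zlib.crc32` is plain CRC-32 rather than the SSE4.2 crc32c instruction, so the absolute numbers will not match the tables above; it only shows the general shape of software hashing throughput.

```python
import hashlib
import time
import zlib

def throughput(fn, block=b"\x5a" * 4096, iters=20_000):
    """Rough MiB/s hashing 4 KiB blocks; numbers vary widely by machine."""
    start = time.perf_counter()
    for _ in range(iters):
        fn(block)
    elapsed = time.perf_counter() - start
    return iters * len(block) / elapsed / 2**20

for name, fn in [
    ("crc32 (zlib)", zlib.crc32),
    ("sha256", lambda b: hashlib.sha256(b).digest()),
    ("blake2b", lambda b: hashlib.blake2b(b).digest()),
]:
    print(f"{name:13s} {throughput(fn):10.1f} MiB/s")
```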
Just my five cents to this discussion: I would really like to see a way to change the checksum on already-created filesystems, and support for better hashing functions.
Here are my AMD Ryzen 9 7945HX results using SMHasher:
Hash Function | Description | MiB/sec (Mean ± Std) | Cycles/Hash (Mean ± Std) |
---|---|---|---|
crc32_hw | SSE4.2 crc32 in HW | 16316.42 ± 1.16 | 13.78 ± 2.14 |
crc32_hw1 | Faster Adler SSE4.2 crc32 on Intel HW | 47949.31 ± 57.33 | 16.67 ± 2.14 |
xxhash64 | xxHash, 64-bit | 18362.94 ± 9.31 | 21.15 ± 4.66 |
xxh3 | xxHash v3, 64-bit | 80824.93 ± 6050.65 | 10.52 ± 0.48 |
xxh128 | xxHash v3, 128-bit | 77392.25 ± 5925.11 | 12.39 ± 0.99 |
blake2b-256 | blake2b-256 | 1167.05 ± 2.71 | 305.76 ± 5.73 |
sha2-256 | SHA2-256 | 511.29 ± 8.98 | 384.40 ± 4.65 |
sha2ni-256 | SHA2_NI-256 (amd64 HW SHA ext) | 3066.11 ± 44.96 | 94.17 ± 0.21 |
sha2ni-256_64 | hardened SHA2_NI-256 (amd64 HW SHA ext), low 64 bits | 3125.24 ± 4.19 | 81.48 ± 0.29 |
Code to test:
```python
import subprocess
import re
import statistics


def run_smhasher(hash_function):
    # Construct the command to run SMHasher
    command = f"./SMHasher --test=Speed {hash_function}"
    # Run the command and capture the output
    result = subprocess.run(command, shell=True, text=True, capture_output=True)
    output = result.stdout
    # Regex to extract the descriptive name
    name_pattern = r'--- Testing ' + re.escape(hash_function) + r' "(.*?)"'
    name_match = re.search(name_pattern, output, flags=re.IGNORECASE)
    hash_name = name_match.group(1) if name_match else None
    # Regexes to find MiB/sec and cycles/hash values
    speed_pattern = r'Alignment\s+\d+\s+-\s+\d+\.\d+\s+bytes/cycle\s+-\s+(\d+\.\d+)\s+MiB/sec'
    cycle_pattern = r'Small key speed test\s+-\s+\d+-byte keys\s+-\s+(\d+\.\d+)\s+cycles/hash'
    # Find all matches
    speeds = [float(match) for match in re.findall(speed_pattern, output)]
    cycles = [float(match) for match in re.findall(cycle_pattern, output)]
    # Calculate statistics
    if speeds:
        avg_speed = statistics.mean(speeds)
        std_speed = statistics.stdev(speeds)
    else:
        avg_speed = std_speed = "No data"
    if cycles:
        avg_cycles = statistics.mean(cycles)
        std_cycles = statistics.stdev(cycles)
    else:
        avg_cycles = std_cycles = "No data"
    return hash_name, avg_speed, std_speed, avg_cycles, std_cycles


def main():
    hash_functions = ["crc32_hw", "crc32_hw1", "xxhash64", "xxh3", "xxh128",
                      "blake2b-256", "sha2-256", "sha2ni-256", "sha2ni-256_64"]
    results = {}
    print(f"{'Hash Function':<14} | {'Description':<70} | "
          f"{'MiB/sec (Mean ± Std)':<20} | {'Cycles/Hash (Mean ± Std)':<25}")
    print('-' * 137)  # Adjust the total length to fit your terminal if necessary
    for func in hash_functions:
        hash_name, avg_speed, std_speed, avg_cycles, std_cycles = run_smhasher(func)
        results[func] = (avg_speed, std_speed, avg_cycles, std_cycles, hash_name)
    # Print the results
    for func, metrics in results.items():
        avg_speed, std_speed, avg_cycles, std_cycles, hash_name = metrics
        speed_str = f"{avg_speed:>8.2f} ± {std_speed:<5.2f}" if avg_speed != "No data" else "No data"
        cycles_str = f"{avg_cycles:>8.2f} ± {std_cycles:<5.2f}" if avg_cycles != "No data" else "No data"
        print(f"{func:<14} | {hash_name:<70} | {speed_str:<20} | {cycles_str:<25}")


if __name__ == "__main__":
    main()
```
Changing the checksum on an existing filesystem has been implemented, but it is still risky and not enabled outside of experimental builds.
```
$ ./configure --experimental
$ make
$ ./btrfstune --help
...
EXPERIMENTAL FEATURES:
  --csum CSUM        switch checksum for data and metadata to CSUM
```
The XXH3 hashes make heavy use of SSE instructions, which works well in user space but is problematic in the kernel: they need the FPU, and saving and restoring that state on context switches is somewhat expensive. Accelerated versions also exist for SHA256 and BLAKE2 (though the latter is not in the kernel yet), so we would need to compare the baseline non-accelerated versions alongside anything that uses the SSE/AVX extensions.
Since btrfs-progs version 6.5, CRC32C uses the ultra-fast state-of-the-art implementation (the same as in the kernel) with the PCLMUL and CRC32Q instructions. SHA256 also has NI support. There is probably some room for optimization, but adding new hashes or changing the defaults is a different problem. The last comment from 2022 said 'in a few years'; so far I don't see a reason to add more.
Thank you for your reply.
I guess being able to change to a 64-bit hash is enough for me. I was thinking of a 128-bit hash, but probably the only "real" benefit would be for block-level dedup, which is something I'm interested in.
I made a simple table with the probability of collision based on the number of unique sectors for a particular storage size and block size according to the birthday paradox.
Storage Size & Block Size | 32-bit | 64-bit | 128-bit | 256-bit |
---|---|---|---|---|
0.1 TB & 4 KiB | 1.00000e+00 | 1.95311e-05 | 1.05879e-24 | 3.11151e-63 |
0.1 TB & 16 KiB | 1.00000e+00 | 1.22070e-06 | 6.61744e-26 | 1.94469e-64 |
0.1 TB & 64 KiB | 1.00000e+00 | 7.62939e-08 | 4.13590e-27 | 1.21543e-65 |
1 TB & 4 KiB | 1.00000e+00 | 1.95122e-03 | 1.05879e-22 | 3.11151e-61 |
1 TB & 16 KiB | 1.00000e+00 | 1.22063e-04 | 6.61744e-24 | 1.94469e-62 |
1 TB & 64 KiB | 1.00000e+00 | 7.62937e-06 | 4.13590e-25 | 1.21543e-63 |
10 TB & 4 KiB | 1.00000e+00 | 1.77422e-01 | 1.05879e-20 | 3.11151e-59 |
10 TB & 16 KiB | 1.00000e+00 | 1.21328e-02 | 6.61744e-22 | 1.94469e-60 |
10 TB & 64 KiB | 1.00000e+00 | 7.62648e-04 | 4.13590e-23 | 1.21543e-61 |
100 TB & 4 KiB | 1.00000e+00 | 1.00000e+00 | 1.05879e-18 | 3.11151e-57 |
100 TB & 16 KiB | 1.00000e+00 | 7.04977e-01 | 6.61744e-20 | 1.94469e-58 |
100 TB & 64 KiB | 1.00000e+00 | 7.34562e-02 | 4.13590e-21 | 1.21543e-59 |
Ex: for a 1 TB storage with 4 KiB block size, a 64-bit hash drops the probability of a collision to less than 0.2%. Increasing the block size lowers that even more.
Thank you kindly for your time.
You can review the code here
```python
import math


def calculate_collision_probabilities(storage_sizes, block_sizes):
    hash_sizes = [32, 64, 128, 256]
    # Print the header of the markdown table
    print("| Storage Size & Block Size |", end=" ")
    print(" | ".join([f"{size}-bit" for size in hash_sizes]), "|")
    print("|" + " --- |" * (len(hash_sizes) + 1))
    for storage_size in storage_sizes:
        for block_size in block_sizes:
            # Convert storage size from TiB to KiB and count the blocks
            data_size_in_KiB = storage_size * 1024**3
            n_blocks = data_size_in_KiB / block_size
            # Calculate the probability for each hash size
            probabilities = []
            for bits in hash_sizes:
                m = 2**bits
                # Birthday-paradox approximation p ≈ 1 - exp(-n²/2m);
                # expm1 avoids precision loss for very small probabilities
                log_p = -n_blocks**2 / (2 * m)
                probability = -math.expm1(log_p)
                probabilities.append(f"{probability:.5e}")  # scientific notation
            # Print each row of the markdown table
            row_label = f"{storage_size} TB & {block_size} KiB"
            print("| " + row_label + " | " + " | ".join(probabilities) + " |")


# Example usage
storage_sizes = [0.1, 1, 10, 100]  # storage sizes in TB
block_sizes = [4, 16, 64]          # block sizes in KiB
calculate_collision_probabilities(storage_sizes, block_sizes)
```
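As a quick sanity check (my addition, not part of the original script), the "1 TB & 4 KiB" 64-bit cell can be reproduced directly from the birthday-paradox approximation p ≈ 1 − exp(−n²/2m):

```python
import math

# n = number of 4 KiB blocks in 1 TiB (the script treats "1 TB" as 1 TiB)
n_blocks = 1 * 1024**3 / 4             # TiB expressed in KiB, divided by 4 KiB
p = -math.expm1(-n_blocks**2 / (2 * 2**64))
print(f"{p:.5e}")                      # → 1.95122e-03, matching the table
```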
For dedupe purposes you don't need a zero collision rate. Dedupe on btrfs enforces a byte-by-byte comparison of the data before removing the duplicate extent, so a few collisions are OK, they just reduce performance. The hash size needs to be just large enough that e.g. less than one in a million blocks collide.
Start with the log2 of the filesystem size, subtract the number of bits for the block size, then add about 20; that is the minimum hash size. E.g. 100 TiB at 4K block size needs about 54 bits of hash.
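That rule of thumb can be written as a one-liner (a sketch; `min_hash_bits` is my name for it, not an established formula):

```python
import math

def min_hash_bits(fs_bytes, block_bytes, margin_bits=20):
    """log2(filesystem size) - log2(block size) + ~20 bits of margin."""
    return math.log2(fs_bytes) - math.log2(block_bytes) + margin_bits

bits = min_hash_bits(100 * 2**40, 4096)  # 100 TiB, 4 KiB blocks
print(f"about {bits:.1f} bits")          # → about 54.6 bits
```

The 20-bit margin corresponds to the "less than one in a million blocks collide" target, since 2^20 ≈ 10^6.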
I guess it's time for us to introduce a comprehensive log-replay-like test for csum conversion, so that we can move that conversion out of the experimental build first.
If it helps, here are benchmarks collected by hash-speedtest.
AMD Ryzen 9 7950X
```
$ ./hash-speedtest
CPU flags: 0x1ff
CPU features: SSE2 SSSE3 SSE41 PCLMUL SSE42 SHA AVX AVX2
Block size: 4096
Iterations: 100000
Implementation: builtin
Units: CPU cycles
NULL-NOP: cycles: 17130735, cycles/i 171
NULL-MEMCPY: cycles: 43030170, cycles/i 430, 40708.402 MiB/s
CRC32C-ref: cycles: 2631866400, cycles/i 26318, 665.577 MiB/s
CRC32C-NI: cycles: 55520820, cycles/i 555, 31550.240 MiB/s
XXHASH: cycles: 99091305, cycles/i 990, 17677.410 MiB/s
SHA256-ref: cycles: 7556944725, cycles/i 75569, 231.801 MiB/s
SHA256-NI: cycles: 3214167795, cycles/i 32141, 544.996 MiB/s
BLAKE2-ref: cycles: 1548024030, cycles/i 15480, 1131.576 MiB/s
BLAKE2-SSE2: cycles: 1641883545, cycles/i 16418, 1066.890 MiB/s
BLAKE2-SSE41: cycles: 1517379705, cycles/i 15173, 1154.430 MiB/s
BLAKE2-AVX2: cycles: 1117633140, cycles/i 11176, 1567.338 MiB/s
```
Hi btrfs devs. Some suggestions related to hashing algorithms:

- Make 64-bit xxHash the default hashing algorithm when creating new btrfs partitions.
- Add support for:
- Deprecate crc32 in new kernels: it is a weak hashing algorithm. xxHash has the same performance with much better hashing quality.