sigmf / SigMF

The Signal Metadata Format Specification
Creative Commons Attribution Share Alike 4.0 International

Faster way to compute checksum? #304

Closed abhaysamantni closed 3 weeks ago

abhaysamantni commented 8 months ago

We are finding that it takes about 1.8 s/GB to compute the hash. This translates to 35 seconds of hash crunching per 1 second of streaming per channel for our use case. We tested with 300 MB, 3 GB, 30 GB, and 300 GB; results are pretty linear. In our use case, file sizes have no upper bound. They can easily be terabytes.

Is there a way to speed up the sha512 hash computation? I am assuming that the hash of the entire dataset is a required field. Can we compute the hash on only a subset? @shannonmhair
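For context, the per-GB cost above can be reproduced with a chunked hashlib loop like the sketch below; the file path and chunk size are illustrative, and it assumes a non-empty file.

```python
import hashlib
import time

def hash_file(path, algorithm="sha512", chunk_size=8 * 1024 * 1024):
    """Hash a file in chunks and report throughput in seconds per GB."""
    h = hashlib.new(algorithm)
    total = 0
    start = time.perf_counter()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
            total += len(chunk)
    elapsed = time.perf_counter() - start
    print(f"{algorithm}: {elapsed / (total / 1e9):.2f} s/GB")
    return h.hexdigest()

# e.g. hash_file("capture.sigmf-data")  # hypothetical file name
```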

Teque5 commented 8 months ago

There was a discussion related to this at GrCon a while ago. IIRC the issues raised were:

"sha512 doesn't fit our use case"

It's a great cryptographic checksum, but it is slow to compute. "Why don't we just allow MD5?"

Community Proposals

  1. Change the checksum field to a new key that names the hash function. For instance, replace core:sha512 with "core:checksum":{"algorithm":"sha256", "hash":"deadbeef..."}, where you could just as logically use some other hash function, e.g. "core:checksum":{"algorithm":"sha3-224", "hash":"5ca1ab1e..."} (see the sketch after this list).
  2. Allow any hash function (even crusty MD5), or only a subset such as those in FIPS 180-4 and/or FIPS 202.
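As an illustration only, a small sketch of how such an object could be produced and verified with Python's hashlib; the helper names are made up, and the key names follow proposal 1 above (the algorithm string must be one hashlib recognizes, e.g. "sha256" or "sha3_224").

```python
import hashlib

def make_checksum_field(data: bytes, algorithm: str = "sha256") -> dict:
    """Build the proposed core:checksum object for a dataset's bytes."""
    digest = hashlib.new(algorithm, data).hexdigest()
    return {"core:checksum": {"algorithm": algorithm, "hash": digest}}

def verify_checksum_field(data: bytes, metadata: dict) -> bool:
    """Recompute the named hash and compare against the stored value."""
    cksum = metadata["core:checksum"]
    return hashlib.new(cksum["algorithm"], data).hexdigest() == cksum["hash"]
```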

"we should allow partial checksums"

Some users sourcing files over a network connection would prefer to verify only a subset of the signal, either just the first part or specific parts.

Community Thoughts

Someone in the crowd had implemented an MD5 checksum on just the first 1 MB as a sanity check. Others felt the purpose of the checksum was to verify the integrity of the whole file. It was pointed out that checksums could be calculated at intervals to allow random access to different portions of large signals.

No specific JSON implementation was proposed, but it could be a list of checksums, checksums per file of a collection, or checksums at some arbitrary interval (e.g. every 2**N bytes).
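Purely as a sketch of the interval idea, one digest per fixed-size block could be collected into a list; the block size, key names, and helper name below are all hypothetical, not anything standardized.

```python
import hashlib

def interval_checksums(path, block_exp=28, algorithm="sha256"):
    """Return one digest per 2**block_exp bytes of the dataset file."""
    block_size = 2 ** block_exp  # 2**28 = 256 MiB per block
    digests = []
    with open(path, "rb") as f:
        while block := f.read(block_size):
            digests.append(hashlib.new(algorithm, block).hexdigest())
    # Hypothetical metadata layout; allows verifying a single block
    # before random access into a large recording.
    return {"algorithm": algorithm, "block_size": block_size, "hashes": digests}
```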

My thoughts

Quick benchmarks

On a single core of my AMD 2990WX I get (openssl speed -bytes 8192 -evp sha3-512):

Note that AMD has included hardware SHA acceleration in its chips since ~2017.
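For a rough single-core comparison without openssl, something like the following hashlib loop works (8 KiB updates, loosely mirroring the -bytes 8192 setting above; absolute numbers will not match openssl's).

```python
import hashlib
import os
import time

def throughput(algorithm, block_size=8192, total_mb=256):
    """Hash total_mb of random data in block_size chunks; return MB/s."""
    block = os.urandom(block_size)
    iterations = total_mb * 1024 * 1024 // block_size
    h = hashlib.new(algorithm)
    start = time.perf_counter()
    for _ in range(iterations):
        h.update(block)
    elapsed = time.perf_counter() - start
    return total_mb / elapsed

for alg in ("md5", "sha256", "sha512", "sha3_512"):
    print(f"{alg}: {throughput(alg):.0f} MB/s")
```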

Not sure if @gmabey or @777arc have thoughts on this one.

777arc commented 8 months ago

Just to clarify, are we talking about the optional hash field called sha512 within global, or the hashes that show up in a Collection? There, each entry is the SHA512 hash of the metadata file (xyz.sigmf-meta), with one calculation per item in the collection, which for a lot of us means one per digitized array element, but it covers only the meta file, not the data file.

abhaysamantni commented 8 months ago

My comment was about the hash field sha512 within global. Is it optional? I did not think so, because the sigmf-python library automatically calculates it on the entire dataset.

777arc commented 8 months ago

Oh, very optional. I don't know many folks who actually use it; I think the python library does it out of convenience, but we should be able to find a way to turn it off.

Teque5 commented 8 months ago

The sigmf-python library doesn't have an option to skip_checksum while creating the metadata, but it should. Would be a simple PR.

I think the rest of this discussion should be related to the two issues above.

gmabey commented 8 months ago

I'm content to have (practically) any checksum type field, but my attitude on this got an upgrade recently. One of my coworkers sent about 11 .sigmf files to some of our counterparts in another company through a "large file delivery service" and they immediately complained that 8 of the 11 couldn't be opened (in python). Well, the python exception that was getting raised was that the sha512 field didn't match the contents ... yep 8/11 had been corrupted somehow in the upload step :-O That's the first time I thought to myself, Wow, I am glad that the default is skip_checksum=False both for creating recordings as well as reading them.

The library I'm using for computing the sha512 field also supports SHA256 (and SHA384), so some sort of arrangement where various algorithms are available sounds great to me; I would appreciate the flexibility. For sigmf-python it seems like another parameter could be added alongside skip_checksum, like checksum_alg="sha512", and if checksum_alg=None then that has the same effect as skip_checksum=True.
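A sketch of how those two parameters might interact, assuming a plain hashlib-based helper; this is not the current sigmf-python API, and the function and parameter names are illustrative only.

```python
import hashlib

def compute_checksum(path, checksum_alg="sha512", skip_checksum=False,
                     chunk_size=8 * 1024 * 1024):
    """Return the dataset digest, or None if checksumming is disabled.

    checksum_alg=None is treated the same as skip_checksum=True.
    """
    if skip_checksum or checksum_alg is None:
        return None
    h = hashlib.new(checksum_alg)  # e.g. "sha256", "sha384", "sha512"
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()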

gmabey commented 8 months ago
  • For streaming I do think there is a need to have a checksum at particular file intervals, for streaming or random access. Someone should come up with a proposal for how this should be implemented.

So, it seems to me like the core:sample_start (already allowed for captures) and core:sample_count (could be added to captures) would provide the scope for computing such a thing, but I'm not very excited about the concept. I suppose that such a device would detect whether portions of the data file went AWOL, but I suspect that the scenario where portions of the data file were corrupted-but-still-contiguous would be more common.
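For completeness, a sketch of what a checksum scoped by core:sample_start and a (hypothetical, not yet defined) core:sample_count could compute, assuming the sample size in bytes is known from the datatype; the helper name is made up.

```python
import hashlib

def capture_checksum(path, sample_start, sample_count, sample_size_bytes,
                     algorithm="sha256"):
    """Hash only the byte range covered by one capture segment."""
    offset = sample_start * sample_size_bytes
    length = sample_count * sample_size_bytes
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        f.seek(offset)
        remaining = length
        while remaining > 0:
            chunk = f.read(min(remaining, 8 * 1024 * 1024))
            if not chunk:
                break
            h.update(chunk)
            remaining -= len(chunk)
    return h.hexdigest()
```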

jacobagilbert commented 7 months ago

All of this sounds appropriate for an extension, maybe sigmf-integrity or something. The core namespace checksum is 100% optional, and if sha512 is a problem users are free to define other methods.

The one thing we definitely determined at that GRCON discussion is that there is no one-size-fits-all option. Other hash types offer some nominal improvement but basically have the same issues (a 2-3x improvement doesn't really solve much), so there was no driving motivation to switch the canonical root hash. Providing multiple core options is not unreasonable, but it increases the burden on implementing applications (which is a large negative).

@gmabey did you ever figure out what type of corruption happened?

Teque5 commented 7 months ago

Based on this discussion I believe we have 3 outcomes:

  1. sha512 is still fast and still FIPS approved, so we will keep it. Consider allowing the hash function to be defined in the future (perhaps for implementing suggestion 3).
  2. We need a PR for sigmf-python to allow skip_checksum on creation. That addresses the OP's concern.
  3. Streaming checksums could be implemented in captures to check for partial integrity. Perhaps this should start as an optional extension as you propose.

Feel free to close issue unless there is anything else to discuss.

gmabey commented 7 months ago

@gmabey did you ever figure out what type of corruption happened?

Honestly it didn't occur to me to try to figure that out, but that's a very interesting idea. I'll make a note to look for those corrupted files and try to compare the differences. They (apparently) weren't corrupted enough for tarfile to complain about them, but that's a pretty low bar!

Teque5 commented 3 weeks ago

Closing for now, but a good reference if we want to revisit allowing partial or multiple checksum algorithms later.