There was a discussion related to this at GrCon a while ago. IIRC the issues raised were:
- It's a great cryptographic checksum, but it is slow to compute. "Why don't we just allow MD5?" One suggestion was to replace `core:sha512` with:
  `"core:checksum":{"algorithm":"sha256", "hash":"deadbeef..."}`
  where you may logically use some other hashing function:
  `"core:checksum":{"function":"sha3-224", "hash":"5ca1ab1e..."}`
- Some users sourcing files over a network connection would prefer to verify some subset of the signal: either just the first part, or specific parts.

Someone in the crowd had implemented an MD5 checksum on just the first 1 MB as a sanity check. Others felt the purpose of the checksum was to verify the integrity of the whole file. It was pointed out that the checksums could be calculated at intervals to allow random access to different portions of large signals.

No specific JSON implementation was proposed, but it could be a list of checksums, checksums per file of a collection, or checksums at some arbitrary interval (2**N bytes).
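No such field exists today, so purely as an illustration of the interval idea, here is a hypothetical Python sketch using `hashlib`; the interval size, the function name, and the `algorithm`/`offset`/`length`/`hash` keys are invented for this example and are not part of core SigMF:

```python
import hashlib

def interval_checksums(path, interval=2**20, algorithm="sha256"):
    """Hash a data file in fixed-size intervals (hypothetical, not core SigMF).

    Returns a list of per-interval digests that could be embedded in metadata
    under some yet-to-be-defined key.
    """
    checksums = []
    offset = 0
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(interval)
            if not chunk:
                break
            checksums.append({
                "algorithm": algorithm,
                "offset": offset,       # byte offset of this interval
                "length": len(chunk),   # the last interval may be shorter
                "hash": hashlib.new(algorithm, chunk).hexdigest(),
            })
            offset += len(chunk)
    return checksums
```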
`sha512` (from the SHA-2 family) is pretty fast (700 MB/s on one of my CPU cores, much faster on GPU), and maybe we should just find a way to use the faster implementations. The sigmf-python package uses `hashlib`, but perhaps we don't have a correct implementation. `md5` is undeniably fast, but only up to 30% faster on some platforms.

On a single core of my AMD 2990WX I get (`openssl speed -bytes 8192 -evp sha3-512`):
| Algorithm | Throughput |
|-----------|------------|
| md5       | 796 MB/s   |
| sha1      | 2091 MB/s  |
| sha256    | 1940 MB/s  |
| sha512    | 705 MB/s   |
| sha3-224  | 483 MB/s   |
| sha3-256  | 449 MB/s   |
| sha3-384  | 340 MB/s   |
| sha3-512  | 243 MB/s   |
| shake256  | 433 MB/s   |
| shake128  | 544 MB/s   |

Note AMD has hardware SHA acceleration for chips since ~2017.
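For comparison with those openssl figures, a rough way to see what Python's `hashlib` delivers on the same machine is a loop like the following; the buffer size and the algorithm list are just examples, and real sigmf-python hashing also pays file I/O costs on top of this:

```python
import hashlib
import time

def hashlib_throughput(algorithm="sha512", total_mb=256, block_size=8192):
    """Rough single-core throughput estimate (MB/s) for a hashlib algorithm."""
    block = b"\x00" * block_size
    blocks = (total_mb * 1024 * 1024) // block_size
    hasher = hashlib.new(algorithm)
    start = time.perf_counter()
    for _ in range(blocks):
        hasher.update(block)
    elapsed = time.perf_counter() - start
    return (blocks * block_size) / (1024 * 1024) / elapsed

for name in ("md5", "sha256", "sha512", "sha3_512"):
    print(f"{name}: {hashlib_throughput(name):.0f} MB/s")
```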
Not sure if @gmabey or @777arc have thoughts on this one.
Just to clarify, are we talking about the optional hash field called `sha512` within `global`, or the hashes that show up in a Collection, where each is the SHA512 hash of the metadata file (xyz.sigmf-meta) and there is one calculated per item in the collection? For a lot of us that means one per digitized array element, but it covers only the meta file, not the data file.
My comment was about the hash field `sha512` within `global`. Is it optional? I did not think so, because the sigmf-python library automatically calculates it on the entire dataset.
Oh, it's very optional; I don't know many folks who actually use it. I think the Python library does it out of convenience, but we should be able to find a way to turn it off.
The `sigmf-python` library doesn't have an option to `skip_checksum` while creating the metadata, but it should. Would be a simple PR.
I think the rest of this discussion should be related to the two issues above.
I'm content to have (practically) any checksum type field, but my attitude on this got an upgrade recently.
One of my coworkers sent about 11 .sigmf files to some of our counterparts in another company through a "large file delivery service" and they immediately complained that 8 of the 11 couldn't be opened (in python).
Well, the python exception that was getting raised was that the sha512 field didn't match the contents ... yep 8/11 had been corrupted somehow in the upload step :-O
That's the first time I thought to myself, "Wow, I am glad that the default is `skip_checksum=False`," both for creating recordings as well as reading them.
The library I'm using for computing the `sha512` field also supports SHA256 (and SHA384), so some sort of arrangement where various algorithms are available sounds great to me; I would appreciate the flexibility.
For `sigmf-python` it seems like another parameter could be added alongside `skip_checksum`, like `checksum_alg="sha512"`, and if `checksum_alg=None` then that has the same effect as `skip_checksum=True`.
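This isn't the existing sigmf-python API, just a minimal sketch (assuming plain `hashlib` algorithm names and a made-up helper) of how a `checksum_alg` parameter could behave, with `None` meaning skip:

```python
import hashlib

def dataset_checksum(path, checksum_alg="sha512", block_size=2**20):
    """Hypothetical helper mirroring the proposed checksum_alg parameter.

    checksum_alg=None behaves like skip_checksum=True: no hash is computed.
    """
    if checksum_alg is None:
        return None
    hasher = hashlib.new(checksum_alg)  # e.g. "sha512", "sha384", "sha256"
    with open(path, "rb") as handle:
        while True:
            block = handle.read(block_size)
            if not block:
                break
            hasher.update(block)
    return hasher.hexdigest()
```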
- I do think there is a need to have a checksum at particular file intervals, for streaming or random access. Someone should come up with a proposal for how this should be implemented.
So, it seems to me like `core:sample_start` (already allowed for captures) and `core:sample_count` (which could be added to captures) would provide the scope for computing such a thing, but I'm not very excited about the concept. I suppose that such a device would detect whether portions of the data file went AWOL, but I suspect that the scenario where portions of the data file were corrupted-but-still-contiguous would be more common.
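If that scope were adopted, a per-capture checksum might be computed along these lines. This is only a sketch: `core:sample_count` is not currently standardized, and the `bytes_per_sample` argument and function name are hypothetical.

```python
import hashlib

def capture_checksum(data_path, sample_start, sample_count,
                     bytes_per_sample, algorithm="sha512"):
    """Hash only the byte range covered by one capture segment.

    Hypothetical: assumes core:sample_start plus a (not yet standardized)
    core:sample_count define the segment, and that samples are fixed-size.
    """
    hasher = hashlib.new(algorithm)
    offset = sample_start * bytes_per_sample
    remaining = sample_count * bytes_per_sample
    with open(data_path, "rb") as handle:
        handle.seek(offset)
        while remaining > 0:
            chunk = handle.read(min(remaining, 2**20))
            if not chunk:
                break  # file shorter than expected: part of the segment is missing
            hasher.update(chunk)
            remaining -= len(chunk)
    return hasher.hexdigest()
```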
All of this sounds appropriate for an extension, maybe `sigmf-integrity` or something. The `core` namespace checksum is 100% optional, and if sha512 is a problem users are free to define other methods.
The one thing we definitely determined at that GRCON discussion is that there is no one-size-fits-all option. Other hash types offer some nominal improvement but basically have the same issues (a 2-3x improvement doesn't really solve much), so there was no driving motivation to switch the canonical root hash. Providing multiple `core` options is not unreasonable, but it increases the burden on implementing applications (which is a large negative).
@gmabey did you ever figure out what type of corruption happened?
Based on this discussion I believe we have 3 outcomes:

1. `sha256sum` is super fast and still FIPS approved, so we will keep it. Consider allowing the hash function to be defined in the future (perhaps for implementing suggestion 3).
2. Update `sigmf-python` to allow skipping the checksum on creation. That addresses OP's concern.
3. Add checksums to `captures` to check for partial integrity. Perhaps this should start as an optional extension as you propose.

Feel free to close the issue unless there is anything else to discuss.
> @gmabey did you ever figure out what type of corruption happened?
Honestly it didn't occur to me to try to figure that out, but that's a very interesting idea. I'll make a note to look for those corrupted files and try to compare the differences. They (apparently) weren't corrupted enough for `tarfile` to complain about them, but that's a pretty low bar!
Closing for now, but a good reference if we want to revisit allowing partial or multiple checksum algorithms later.
We are finding that it takes about 1.8 s/Gb to compute the hash. This translates to 35 seconds of hash crunching per 1 second of streaming per channel for our use case. We tested with 300 MB, 3 GB, 30 GB, and 300 GB; the results are pretty linear. In our use case, file sizes have no upper bound; they can easily be terabytes.

Is there a way to speed up the sha512 hash computation? I am assuming that the hash of the entire dataset is a required field. Can we compute the hash only on subsets? @shannonmhair
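Not an authoritative answer, but one thing worth measuring is whether a different `hashlib` algorithm would be fast enough for your rates; for example `blake2b` ships with `hashlib` (though it is not a core SigMF field today) and is often faster than SHA-512 on CPUs without SHA acceleration. A rough timing sketch, with the file name purely hypothetical:

```python
import hashlib
import time

def time_hash(path, algorithm, block_size=2**22):
    """Time hashing an existing file with the given hashlib algorithm."""
    hasher = hashlib.new(algorithm)
    start = time.perf_counter()
    with open(path, "rb") as handle:
        while True:
            block = handle.read(block_size)
            if not block:
                break
            hasher.update(block)
    return time.perf_counter() - start, hasher.hexdigest()

for alg in ("sha512", "sha256", "blake2b"):
    seconds, digest = time_hash("capture.sigmf-data", alg)  # hypothetical file
    print(f"{alg}: {seconds:.1f} s ({digest[:16]}...)")
```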