Open joverlee521 opened 5 months ago
Update hashing to stop going through Python.
This is currently blocked on sha256sum not being available in the conda runtime.
Alternatively, the naive sha256sum implementation in Python could exec into the GNU coreutils version, if found, otherwise fall back to the slow Python version.
sha256sum is now available as of the nextstrain-base 20240612T205814Z Conda package.
Note that we're (I'm) assuming the GNU coreutils sha256sum implementation is faster than our Python one. It likely is! But we don't actually know. Benchmarking might be useful here.
Simple test with a 1.3G fasta file.
$ time sha256sum ./ingest/data/ncbi_dataset_sequences.fasta
f85d4bfc6c9cfc00d10567aed87723c4bf39498b5dc94f81c94f4b31c98fb806 ./ingest/data/ncbi_dataset_sequences.fasta
real 0m4.452s
user 0m3.952s
sys 0m0.483s
$ time ./ingest/vendored/sha256sum ./ingest/data/ncbi_dataset_sequences.fasta
Still running after 10mins...I'll post update with final time after it finishes...
🤦‍♀️ Nope, I was just running the script wrong.
$ time ./ingest/vendored/sha256sum < ./ingest/data/ncbi_dataset_sequences.fasta
f85d4bfc6c9cfc00d10567aed87723c4bf39498b5dc94f81c94f4b31c98fb806
real 0m1.401s
user 0m0.539s
sys 0m0.841s
Wait, is the Python one actually faster? What? I mean, I know the hashlib implementations are in C as is much file i/o, but I still would expect Python overhead to be significant here.
Is your coreutils sha256sum x86_64 or aarch64? file $(type -p sha256sum)
Ah, should have said I was running these in the Nextstrain shell using the Docker runtime.
One thing I noted looking at coreutils sha256sum is it's reading in 32 kiB chunks vs. our 5 MiB chunks.
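For reference, the chunked-read pattern in question looks like the sketch below (a simplified stand-in for the vendored script, not its exact code). The chunk size only affects throughput; any size yields the same digest.

```python
import hashlib
import io

def sha256_hexdigest(fileobj, chunk_size: int) -> str:
    """Hash a stream incrementally, reading chunk_size bytes at a time."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: fileobj.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()

data = b"x" * (1024 * 1024)  # 1 MiB of sample data
small = sha256_hexdigest(io.BytesIO(data), 32 * 1024)        # coreutils-style 32 KiB
large = sha256_hexdigest(io.BytesIO(data), 5 * 1024 * 1024)  # our 5 MiB chunks
assert small == large == hashlib.sha256(data).hexdigest()
```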
Similar results when running in macOS terminal:
KX76YWH7NX:mpox joverlee$ time sha256sum ingest/data/ncbi_dataset_sequences.fasta
f85d4bfc6c9cfc00d10567aed87723c4bf39498b5dc94f81c94f4b31c98fb806 ingest/data/ncbi_dataset_sequences.fasta
real 0m5.059s
user 0m4.725s
sys 0m0.209s
KX76YWH7NX:mpox joverlee$ time ./ingest/vendored/sha256sum < ingest/data/ncbi_dataset_sequences.fasta
f85d4bfc6c9cfc00d10567aed87723c4bf39498b5dc94f81c94f4b31c98fb806
real 0m1.439s
user 0m0.632s
sys 0m0.221s
KX76YWH7NX:mpox joverlee$ file $(type -p sha256sum)
/opt/homebrew/bin/sha256sum: Mach-O 64-bit executable arm64
Just making sure this is true for a larger file, testing with a 70G fasta. Python is much faster than GNU coreutils!
GNU coreutils:
KX76YWH7NX:ncov-ingest joverlee$ time sha256sum data/gisaid/sequences.fasta
3f47c5c48118ec5da9955bffe4346ea6245ad6d6b443c544f85c7f4d377a4b1e data/gisaid/sequences.fasta
real 4m16.720s
user 3m59.214s
sys 0m8.749s
Python:
KX76YWH7NX:ncov-ingest joverlee$ time ./vendored/sha256sum < data/gisaid/sequences.fasta
3f47c5c48118ec5da9955bffe4346ea6245ad6d6b443c544f85c7f4d377a4b1e
real 0m45.221s
user 0m30.535s
sys 0m6.089s
Wow. I'd be really curious what the times are if you drop our read size in Python to 32 kiB.
I'd also wonder if aarch64 is coming into play here: is Python taking advantage of it (and coreutils not) in a way it couldn't on x86_64 hardware we're using on AWS Batch?
On my machine, Python is only slightly faster than coreutils. In fact, alternative non-cryptographic/secure hashing algorithms I've tried (a few impls of MurmurHash3, simple crc32, simple md5) all come out very roughly in the same ballpark (within ~20s of each other on a 3GB file), which leads me to thinking I'm bottlenecking on i/o on my machine. And so I'd wonder if we hit an i/o bottleneck in Batch too. We're not using fast disks on AWS...
It's actually slightly faster when I drop the chunk size:
KX76YWH7NX:ncov-ingest joverlee$ time ./vendored/sha256sum < data/gisaid/sequences.fasta
chunk size: 32768
3f47c5c48118ec5da9955bffe4346ea6245ad6d6b443c544f85c7f4d377a4b1e
real 0m41.200s
user 0m30.406s
sys 0m8.086s
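To separate chunk-size effects from disk i/o, a quick in-memory sweep like the one below might help (my sketch, not part of the vendored script): hash the same buffer at several chunk sizes and compare wall-clock times.

```python
import hashlib
import io
import time

def sha256_hexdigest(fileobj, chunk_size: int) -> str:
    """Hash a stream incrementally, reading chunk_size bytes at a time."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: fileobj.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()

data = b"A" * (64 * 1024 * 1024)  # 64 MiB in memory, so no disk i/o noise
for chunk_size in (32 * 1024, 1024 * 1024, 5 * 1024 * 1024):
    start = time.perf_counter()
    sha256_hexdigest(io.BytesIO(data), chunk_size)
    elapsed = time.perf_counter() - start
    print(f"{chunk_size:>10} bytes/chunk: {elapsed:.3f}s")
```

Because the buffer never touches disk, any remaining spread between chunk sizes is attributable to Python-side loop and copy overhead rather than storage speed.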
Prompted by https://github.com/nextstrain/ncov-ingest/issues/446.
Some ideas for speeding up upload-to-s3 were proposed in a related Slack thread.