nextstrain / ingest

Shared internal tooling for pathogen data ingest. Used by our pathogen build repos.

Speed up `upload-to-s3` #41

Open joverlee521 opened 5 months ago

joverlee521 commented 5 months ago

Prompted by https://github.com/nextstrain/ncov-ingest/issues/446

Some ideas for speeding up `upload-to-s3`, proposed in the related Slack thread:

  1. Configure threads for compression, since compression is most likely the bottleneck: we currently compress on a single thread.
  2. Update hashing to stop going through Python, or compute the hash of the compressed version instead.
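Idea 1 could be sketched along these lines (an illustrative approach, not the actual `upload-to-s3` code): compress independent chunks as separate gzip members in a thread pool and concatenate the results, since concatenated gzip members form a valid gzip stream and CPython's zlib releases the GIL while compressing, so threads give real parallelism here.

```python
# Hedged sketch of idea 1 (illustrative, not the vendored upload-to-s3
# code): compress independent chunks as separate gzip members in a
# thread pool, then concatenate them. Concatenated gzip members are a
# valid gzip stream (this is roughly how pigz parallelizes), and
# CPython's zlib releases the GIL during compression of large buffers.
import gzip
from concurrent.futures import ThreadPoolExecutor

CHUNK = 1024 * 1024  # 1 MiB per gzip member; tune for throughput

def compress_parallel(data: bytes, workers: int = 4) -> bytes:
    chunks = [data[i:i + CHUNK] for i in range(0, len(data), CHUNK)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        members = pool.map(gzip.compress, chunks)
    return b"".join(members)
```

The trade-off is slightly worse compression than one continuous stream, since matches can't cross member boundaries.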
joverlee521 commented 5 months ago

> Update hashing to stop going through Python.

This is currently blocked on sha256sum not being available in the conda runtime.

tsibley commented 5 months ago

Alternatively, the naive sha256sum implementation in Python could exec into the GNU coreutils version, if found, otherwise fall back to the slow Python version.
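That fallback could look something like this (a minimal sketch; function names are illustrative, not the vendored script's actual API):

```python
# Minimal sketch of the proposed fallback (illustrative names, not the
# vendored script's actual API): use the coreutils binary when it is
# on PATH, otherwise hash in pure Python via hashlib.
import hashlib
import shutil
import subprocess

def sha256_python(path: str, chunk_size: int = 5 * 1024 * 1024) -> str:
    """Dependency-free fallback: chunked hashlib digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()

def sha256(path: str) -> str:
    """Prefer GNU coreutils sha256sum, falling back to Python."""
    exe = shutil.which("sha256sum")
    if exe:
        result = subprocess.run([exe, path], check=True,
                                capture_output=True, text=True)
        # Output format is "<hexdigest>  <filename>"
        return result.stdout.split()[0]
    return sha256_python(path)
```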

tsibley commented 5 months ago

> Update hashing to stop going through Python.
>
> This is currently blocked on sha256sum not being available in the conda runtime.

sha256sum is now available as of the nextstrain-base 20240612T205814Z Conda package.

Note that we're (I'm) assuming the GNU coreutils sha256sum implementation is faster than our Python one. It likely is! But we don't actually know. Benchmarking might be useful here.

joverlee521 commented 5 months ago

> Note that we're (I'm) assuming the GNU coreutils sha256sum implementation is faster than our Python one. It likely is! But we don't actually know. Benchmarking might be useful here.

A simple test with a 1.3G fasta file:

```
$ time sha256sum ./ingest/data/ncbi_dataset_sequences.fasta
f85d4bfc6c9cfc00d10567aed87723c4bf39498b5dc94f81c94f4b31c98fb806  ./ingest/data/ncbi_dataset_sequences.fasta

real    0m4.452s
user    0m3.952s
sys     0m0.483s

$ time ./ingest/vendored/sha256sum ./ingest/data/ncbi_dataset_sequences.fasta
```

Still running after 10mins...I'll post update with final time after it finishes...

joverlee521 commented 5 months ago

> Still running after 10mins...I'll post update with final time after it finishes...

🤦‍♀️ Nope, I was just running the script wrong

```
$ time ./ingest/vendored/sha256sum < ./ingest/data/ncbi_dataset_sequences.fasta
f85d4bfc6c9cfc00d10567aed87723c4bf39498b5dc94f81c94f4b31c98fb806

real    0m1.401s
user    0m0.539s
sys     0m0.841s
```

tsibley commented 5 months ago

Wait, is the Python one actually faster? What? I mean, I know the hashlib implementations are in C as is much file i/o, but I still would expect Python overhead to be significant here.
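One plausible explanation (an assumption, not something established in this thread): CPython's `hashlib` is usually backed by OpenSSL, which ships hand-tuned SHA-256 assembly (SHA-NI on x86_64, the ARMv8 crypto extensions on aarch64), while coreutils typically builds with its own portable C implementation unless configured against libcrypto. A quick check of which backend your `hashlib` uses:

```python
import hashlib

# 'openssl_sha256' means the OpenSSL-backed implementation;
# plain 'sha256' means CPython's portable C fallback (_sha256 module).
print(hashlib.sha256.__name__)
```

If the Python build is OpenSSL-backed and coreutils is not, the per-byte hashing cost in "slow" Python can genuinely beat the C binary.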

tsibley commented 5 months ago

Is your coreutils sha256sum x86_64 or aarch64? `file $(type -p sha256sum)`

joverlee521 commented 5 months ago

> Is your coreutils sha256sum x86_64 or aarch64? `file $(type -p sha256sum)`

Ah, should have said I was running these in the Nextstrain shell using the Docker runtime.

tsibley commented 5 months ago

One thing I noted looking at coreutils sha256sum is it's reading in 32 kiB chunks vs. our 5 MiB chunks.
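A quick way to test whether the read-chunk size matters (a sketch; the path and sizes here are illustrative, and the real vendored script reads 5 MiB chunks from stdin):

```python
# Sketch for timing different read-chunk sizes when hashing a file
# (illustrative; not the vendored script itself).
import hashlib
import time

def sha256_chunked(path: str, chunk_size: int) -> str:
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()

def time_hash(path: str, chunk_size: int) -> float:
    start = time.perf_counter()
    sha256_chunked(path, chunk_size)
    return time.perf_counter() - start

# e.g. compare coreutils-style 32 kiB reads against our 5 MiB reads:
# for size in (32 * 1024, 5 * 1024 * 1024):
#     print(size, time_hash("ncbi_dataset_sequences.fasta", size))
```

(On Python 3.11+, `hashlib.file_digest(f, "sha256")` does the chunked reading internally.)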

joverlee521 commented 5 months ago

Similar results when running in the macOS terminal:

```
KX76YWH7NX:mpox joverlee$ time sha256sum ingest/data/ncbi_dataset_sequences.fasta
f85d4bfc6c9cfc00d10567aed87723c4bf39498b5dc94f81c94f4b31c98fb806  ingest/data/ncbi_dataset_sequences.fasta

real    0m5.059s
user    0m4.725s
sys     0m0.209s

KX76YWH7NX:mpox joverlee$ time ./ingest/vendored/sha256sum < ingest/data/ncbi_dataset_sequences.fasta
f85d4bfc6c9cfc00d10567aed87723c4bf39498b5dc94f81c94f4b31c98fb806

real    0m1.439s
user    0m0.632s
sys     0m0.221s

KX76YWH7NX:mpox joverlee$ file $(type -p sha256sum)
/opt/homebrew/bin/sha256sum: Mach-O 64-bit executable arm64
```

joverlee521 commented 5 months ago

Just making sure this holds for a larger file, I tested with a 70G fasta. Python is much faster than GNU coreutils!

GNU coreutils:

```
KX76YWH7NX:ncov-ingest joverlee$ time sha256sum data/gisaid/sequences.fasta
3f47c5c48118ec5da9955bffe4346ea6245ad6d6b443c544f85c7f4d377a4b1e  data/gisaid/sequences.fasta

real    4m16.720s
user    3m59.214s
sys     0m8.749s
```

Python:

```
KX76YWH7NX:ncov-ingest joverlee$ time ./vendored/sha256sum < data/gisaid/sequences.fasta
3f47c5c48118ec5da9955bffe4346ea6245ad6d6b443c544f85c7f4d377a4b1e

real    0m45.221s
user    0m30.535s
sys     0m6.089s
```

tsibley commented 5 months ago

Wow. I'd be really curious what the times are if you drop our read size in Python to 32 kiB.

I'd also wonder if aarch64 is coming into play here: is Python taking advantage of it (and coreutils not) in a way it couldn't on x86_64 hardware we're using on AWS Batch?

On my machine, Python is only slightly faster than coreutils. In fact, the alternative non-cryptographically-secure hashing algorithms I've tried (a few impls of MurmurHash3, simple crc32, simple md5) all come out very roughly in the same ballpark (within ~20s of each other on a 3GB file), which leads me to think I'm bottlenecked on i/o on my machine. And so I'd wonder if we hit an i/o bottleneck in Batch too. We're not using fast disks on AWS...
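One way to test the i/o-bottleneck hypothesis (a sketch; path and chunk size are illustrative): time a plain chunked read of the same file with no hashing at all. If that alone takes about as long as hashing, the disk, not the hash implementation, is the limit.

```python
# Sketch: measure raw read throughput with no hashing, to compare
# against the sha256 wall time for the same file (illustrative).
import time

def read_throughput(path: str, chunk_size: int = 5 * 1024 * 1024):
    """Time a hash-free chunked read; returns (bytes_read, seconds)."""
    total = 0
    start = time.perf_counter()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            total += len(chunk)
    return total, time.perf_counter() - start

# If this wall time is close to the sha256 wall time, the bottleneck
# is i/o and a faster hash implementation won't help much.
```

Beware the OS page cache when benchmarking this: a second read of the same file can come from memory rather than disk.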

joverlee521 commented 5 months ago

> I'd be really curious what the times are if you drop our read size in Python to 32 kiB.

It's actually slightly faster when I drop the chunk size:

```
KX76YWH7NX:ncov-ingest joverlee$ time ./vendored/sha256sum < data/gisaid/sequences.fasta
chunk size: 32768
3f47c5c48118ec5da9955bffe4346ea6245ad6d6b443c544f85c7f4d377a4b1e

real    0m41.200s
user    0m30.406s
sys     0m8.086s
```