trvrb closed this issue 2 years ago.
@trvrb Separately from the issue of speeding up the underlying augur filter, why can't you run these calls in parallel on your laptop too?
The slowest part of augur filter is reading/writing sequence data. If you have multiple subsampling calls, the faster way to get the same output is to only output strain names for each subsampling call and then make a final call to augur filter to extract sequences for the lists of strains you want. One example of this is in the data prep guide for “select contextual data for your region of interest”. Each metadata-only (or metadata + sequence index) call should only take a couple of minutes. We can rewrite the ncov workflow to optionally extract sequences for subsampled data only when priorities need to be calculated. Faster still would be to use metadata-based priorities where we calculate Jaccard distances between strains using pre-calculated mutations (from Nextclade or GISAID). I have implemented a prototype of this in ncov as "jaccard-priority". With a little more testing, we could get to metadata-only subsampling soon.
We could speed up the sequence extraction by using something other than (Bio)Python like samtools faidx or UCSC’s faSomeRecords.
There are some times when we could short-circuit the loop through the metadata or the sequences, but this approach won't currently work when we have force-included strains that have to be found in metadata and sequences.
The slowest part of augur filter is reading/writing sequence data.
After doing a bit of testing using a dataset with 160k sequences I found on S3, I am seeing some evidence that metadata IO is slower than sequence IO.
This metadata-only call takes ~45s:
time augur filter \
--metadata data/metadata.tsv \
--group-by country year month \
--subsample-max-sequences 10 \
--output-strains filter_strains.txt
Sampling at 0.018930879516601565 per group.
160278 strains were dropped during filtering
1979 were dropped during grouping due to ambiguous month information
158299 of these were dropped because of subsampling criteria
11 strains passed all filters
44.56s user 0.43s system 99% cpu 45.000 total
While the comparable metadata + sequence call takes ~52s:
time augur filter \
--sequences data/sequences.fasta \
--sequence-index data/sequence_index.tsv \
--metadata data/metadata.tsv \
--group-by country year month \
--subsample-max-sequences 10 \
--output-sequences data/filter_sequences.fasta
160293 strains were dropped during filtering
16 had no metadata
16 had no sequence data
1979 were dropped during grouping due to ambiguous month information
158282 of these were dropped because of subsampling criteria
12 strains passed all filters
51.15s user 1.20s system 99% cpu 52.485 total
This means sequence IO only accounts for ~13% (7/52) of runtime.
I'm going to dig into where this slowness is coming from and see if it's possible to speed up.
Let me know if there's a better dataset I can test with.
@victorlin Not sure that comparison is telling the whole story because it's only outputting ~10 sequences (due to --subsample-max-sequences) instead of a more realistic number. Look at it this way, for example: the second command took an extra 7 seconds to output just 12 sequences. If that's steady, then as you ask for more sequences to be output, the sequence IO will account for a greater and greater portion of the total runtime.
@tsibley good point, it can be broken down to: (1) metadata read, (2) metadata write, (3) sequence read, (4) sequence write.

My example above does a poor job of testing (2) and (4) since output is small. But it highlights the differences between (1) and (3).
For @trvrb's example which also has small output, I'm positing that (1) is the bottleneck instead of (3). There are probably improvements to be made across the board though. @huddlej's suggestion of replacing sequence extraction with samtools/faSomeRecords focuses on (3) but I don't think that's the issue here.
Thank you for digging into this, @victorlin! You and @tsibley both make good points. Benchmarking this command is tricky since apparently minor changes to the input arguments or data formats can have a large effect on runtime.
I just reviewed my notes from testing augur filter back in September with a large ncov dataset and noticed that my conclusion about metadata vs. sequence time was based on filter commands that don't do any grouping. The group-by logic requires a second pass through the metadata, so it takes longer. I'll re-run my analysis from September and post the corresponding code here, so you can verify my results.
@victorlin Here are my notes from September with updated runtimes for the latest data as of today.
# Create an environment with the tools we need.
mamba create -n ncov -c conda-forge -c bioconda \
augur==13.0.2 \
samtools \
tabix \
awscli
# Activate the environment.
conda activate ncov
# Download complete set of open aligned, filtered sequences.
curl -O https://data.nextstrain.org/files/ncov/open/filtered.fasta.xz
# Download complete open metadata.
curl -O https://data.nextstrain.org/files/ncov/open/metadata.tsv.gz
# Index sequences.
# This took 34 minutes.
augur index \
--sequences filtered.fasta.xz \
--output sequence_index.tsv
# Compress index (forgot to do this above).
# This took 7 seconds.
gzip sequence_index.tsv
# Select all metadata records matching a specific query.
# This took 2 minutes.
augur filter \
--metadata metadata.tsv.gz \
--query "division == 'Nebraska'" \
--output-metadata nebraska_metadata.tsv \
--output-strains nebraska_strains.txt
2490241 strains were dropped during filtering
2490241 of these were filtered out by the query: "division == 'Nebraska'"
3530 strains passed all filters
# Select metadata and sequences matching a specific query.
# This took ~16 minutes.
augur filter \
--metadata metadata.tsv.gz \
--sequences filtered.fasta.xz \
--sequence-index sequence_index.tsv.gz \
--query "division == 'Nebraska'" \
--output-metadata nebraska_metadata.tsv \
--output-strains nebraska_strains.txt \
--output-sequences nebraska_sequences.fasta
2490317 strains were dropped during filtering
227031 had no sequence data
2263286 of these were filtered out by the query: "division == 'Nebraska'"
3454 strains passed all filters
# Select 10 random metadata records matching a specific query.
# This took 2 minutes.
augur filter \
--metadata metadata.tsv.gz \
--query "division == 'Nebraska'" \
--subsample-max-sequences 10 \
--output-metadata nebraska_random_metadata.tsv \
--output-strains nebraska_random_strains.txt
2493761 strains were dropped during filtering
2490241 of these were filtered out by the query: "division == 'Nebraska'"
3520 of these were dropped because of subsampling criteria
10 strains passed all filters
# Select 10 random metadata and sequences for a query.
# This took 15 minutes.
augur filter \
--metadata metadata.tsv.gz \
--sequences filtered.fasta.xz \
--sequence-index sequence_index.tsv.gz \
--query "division == 'Nebraska'" \
--subsample-max-sequences 10 \
--output-metadata nebraska_random_metadata.tsv \
--output-strains nebraska_random_strains.txt \
--output-sequences nebraska_random_sequences.fasta
2493761 strains were dropped during filtering
227031 had no sequence data
2263286 of these were filtered out by the query: "division == 'Nebraska'"
3444 of these were dropped because of subsampling criteria
10 strains passed all filters
# Select specific sequences with prefiltered metadata
# and strain list.
# This took 16 minutes.
augur filter \
--metadata nebraska_metadata.tsv \
--sequences filtered.fasta.xz \
--exclude-all \
--include nebraska_strains.txt \
--output-sequences nebraska_sequences.fasta
2263286 strains were dropped during filtering
2263286 had no metadata
3454 of these were dropped by `--exclude-all`
3454 strains were added back because they were in nebraska_strains.txt
3454 strains passed all filters
Edit: I forgot to note that we've discussed changing back from using engine="python" as the pandas parser to engine="c". When I change to the C parser, the initial metadata-only filters (both one- and two-pass) run in 1 minute instead of 2.
@victorlin Good way to break it down.
For @trvrb's example which also has small output, I'm positing that (1) is the bottleneck instead of (3). There are probably improvements to be made across the board though. @huddlej's suggestion of replacing sequence extraction with samtools/faSomeRecords focuses on (3) but I don't think that's the issue here.
The reasoning for (1) vs (3) dominating the runtime isn't clear to me. Could you explain?
And ultimately, the best thing to do here is actually profile the code.
Thanks so much for digging in here @victorlin, @huddlej and @tsibley. This doesn't need to be the only use case, but you can quickly replicate the use case that I had via the filter-test branch I just pushed up here: https://github.com/nextstrain/ncov/tree/filter-test

From this branch you can run locally as:

snakemake -j 1 -p --profile nextstrain_profiles/nextstrain-gisaid auspice/ncov_gisaid_global.json

It will take a few minutes to "sanitize" sequences but you'll get to the 12 augur filter calls via the subsample rule pretty rapidly.
The reasoning for (1) vs (3) dominating the runtime isn't clear to me. Could you explain?
@tsibley The metadata-only call took 45s while the metadata+sequence call took 52s. So the additional sequence IO there had little effect on run time.

However, @huddlej's output says otherwise (2m metadata vs. 15m metadata+sequences), probably since it's a different dataset and not using --group-by. I'm running it locally to see if I can reproduce similar numbers.
And ultimately, the best thing to do here is actually profile the code.
Agreed. A few years back I used cProfile for this. If anyone has other suggestions I'd love to see what's being used for profiling nowadays.
EDIT: I'm having fun with cProfile + SnakeViz 🙂
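For anyone who wants to try the same workflow, here is a minimal self-contained sketch of the cProfile approach (the `parse_metadata` function is a made-up stand-in for a hot path; a real run would profile augur itself and open the dump with `snakeviz out.prof`):

```python
import cProfile
import io
import pstats

def parse_metadata(lines):
    # Toy stand-in for the metadata-parsing hot path.
    return [line.split("\t") for line in lines]

profiler = cProfile.Profile()
profiler.enable()
parse_metadata(["strain\t2021-07-28\tUSA"] * 100_000)
profiler.disable()

# Print the top functions by cumulative time. profiler.dump_stats("out.prof")
# would write a file that SnakeViz can visualize interactively.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
```

The same `.prof` dump also works with other viewers, so the visualization tool choice is independent of how the profile is collected.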
I just ran the identical command as in the original issue, but dropping all sequence references, so:
# 25 minutes on my laptop
time augur filter \
--metadata results/sanitized_metadata_gisaid.tsv.xz \
--include defaults/include.txt \
--exclude defaults/exclude.txt \
--min-date 2021-07-28 \
--exclude-where 'region!=Europe' \
--group-by country year month \
--subsample-max-sequences 10 \
--probabilistic-sampling \
--output-strains results/global/sample-europe_late.txt
And again confirming the timing of the original command
# 40 minutes on my laptop
time augur filter \
--sequences nextstrain-ncov-private/filtered.fasta.xz \
--metadata results/sanitized_metadata_gisaid.tsv.xz \
--sequence-index results/combined_sequence_index.tsv.xz \
--include defaults/include.txt \
--exclude defaults/exclude.txt \
--min-date 2021-07-28 \
--exclude-where 'region!=Europe' \
--group-by country year month \
--subsample-max-sequences 10 \
--probabilistic-sampling \
--output results/global/sample-europe_late.fasta \
--output-strains results/global/sample-europe_late.txt
So 25 minutes with just metadata and 40 minutes with sequences. I would assume that --group-by or other logic is causing some poor scaling behavior (in addition to the overhead of sequence IO).
@trvrb the current implementation of --group-by is computationally expensive, notably augur.filter.get_groups_for_subsampling, which loops through every valid sequence in the metadata file (post-filter) and determines the groups.

Mainly, line 897 is inefficient usage of a pandas DataFrame (confirmed with profiler). I'm looking to see if we can leverage builtin pandas utilities (e.g. pandas.DataFrame.groupby) to achieve what's being done here.
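To illustrate the kind of change I have in mind (a hypothetical sketch, not augur's actual code), the per-row loop that builds each strain's group key can be replaced with vectorized column operations:

```python
import pandas as pd

def get_groups_vectorized(metadata, group_by):
    """Map each strain to its group tuple without a per-row Python loop."""
    df = metadata.copy()
    # Derive year/month columns from "YYYY-MM-DD" dates in one vectorized pass.
    parts = df["date"].str.split("-", expand=True)
    df["year"], df["month"] = parts[0], parts[1]
    # Zip the requested columns into one group tuple per strain.
    keys = list(zip(*(df[col] for col in group_by)))
    return dict(zip(df["strain"], keys))

metadata = pd.DataFrame({
    "strain": ["A", "B", "C"],
    "country": ["USA", "UK", "USA"],
    "date": ["2021-07-28", "2021-06-01", "2020-01-15"],
})
groups = get_groups_vectorized(metadata, ["country", "year", "month"])
# groups["A"] == ("USA", "2021", "07")
```

The output shape (strain -> group tuple) is what the downstream sampling code would need; only the construction strategy changes.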
Nice find, @victorlin! I'd love to see how you used SnakeViz with cProfile to track this down. I'm also curious how much we can reduce runtime from 25 minutes by switching to Pandas-based logic.
This is the necessary change for profiling: https://github.com/nextstrain/augur/commit/0f357be2ca83a737aba0aea518a85639894f196e
I ran this command for 160k sequences and uploaded the profile result file to gist:
# 83.76s user 0.52s system 99% cpu 1:24.29 total
time augur filter \
--metadata data/160k/metadata.tsv \
--group-by country year month \
--subsample-max-sequences 10 \
--output-strains filter_strains.txt
If you want to look at profile results for yourself:
You will see something like this. Most of it is pandas internal methods, but it's clear that pandas __getitem__ in augur.filter.get_groups_for_subsampling and somewhere directly in augur.filter.run accounts for most of the run time. Unfortunately line 897 isn't in the call stack – I did some line removals locally to determine that one.
Note: I also experimented with a couple other ideas to improve metadata read performance:

- pd.read_csv(sep="\t", engine="c"), as @huddlej mentioned above. This improved things slightly: 45s -> 42s in my example.
- pd.read_csv(usecols=["strain", "date", "country"]). This improved things significantly: 45s -> 27s in my example.

Both are separate from the refactor to use pandas functions, so I'll look into them later.
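For concreteness, both parameter changes together look like this (a self-contained sketch with an in-memory TSV standing in for the real metadata file):

```python
import io
import pandas as pd

# Hypothetical stand-in for a large metadata.tsv with many columns.
tsv = (
    "strain\tvirus\tdate\tcountry\tdivision\n"
    "A\tncov\t2021-07-28\tUSA\tNebraska\n"
    "B\tncov\t2021-06-01\tUK\tEngland\n"
)

metadata = pd.read_csv(
    io.StringIO(tsv),
    sep="\t",
    engine="c",                             # C parser instead of engine="python"
    usecols=["strain", "date", "country"],  # columns the filter never touches are skipped
)
```

With usecols, the unused columns are not materialized at all, which is where the bigger saving comes from.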
First performance improvement PR is ready for review. See PR comment for detailed profiling results, but in summary it reduces the run time of augur.filter.get_groups_for_subsampling from 55s down to 5s.
I just ran @trvrb's example with/without the PR and profiling. Everything took much less than 25m each, maybe due to environment/machine differences, but still surprising.

- Without the PR, get_groups_for_subsampling: 6m 19s
- With the PR, get_groups_for_subsampling: 41s

Thanks for doing this before/after analysis, @victorlin. I'm trying to recreate the same results locally (the slower of the two is running right now).
In addition to the slow inner loop issue you've sped up in #794, there is another potentially easy optimization: filter currently calculates groups for strains twice, once in the first pass and again in the second pass. We should think about an efficient way to cache the results from the first pass to use in the second. For example, running Trevor's example command with Augur 13.0.3 now, it took 5+ minutes to calculate groups the first time and now it's doing it again.
We should think about an efficient way to cache the results from the first pass to use in the second
I'm going to tackle this next, since it seems like another inefficient usage of pandas DataFrame (line 1522). And this is the only reference for the second pass of get_groups_for_subsampling.
@victorlin Rad! I've also used py-spy in the past for some Python profiling.
Amazing progress here @victorlin. Really appreciate the speed ups.
I got similar run times to Trevor with the same command:
Updated numbers on my machine for @trvrb's example of 3m sequences with the new PR.
With these PRs plus the two smaller improvements mentioned in https://github.com/nextstrain/augur/issues/789#issuecomment-985129622, I think this covers all the low-hanging fruit for metadata read performance improvement without any major changes to how we process the data.
Based on profiling results (prof file here), the bottleneck is now reading the metadata file in its entirety. I'm not sure if it's even possible to do --probabilistic-sampling without processing the entire file, per @trvrb's original suggestion:
We shouldn't have to find every entry with region=Asia and date<2021-07-24 to collect a random 10 distributed across "country year month" that fit the bill. You should be able to process just a small fraction of the file rather than needing to process the entire file.
The current code must go through the file entirely to determine the number of distinct observed groups (e.g. (Asia, 2020-01), (Asia, 2020-03), (Asia, 2021-07), ...) then use Poisson sampling to determine the ~10 sequences to output. Is there a way to guarantee a randomly distributed sample by reading lines in random order?
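To make the constraint concrete, here is a rough sketch (my own illustration, not augur's implementation) of why this sampling scheme needs a full pass: the per-group Poisson rate depends on the total number of distinct groups, which isn't known until every record has been read.

```python
import random
from collections import defaultdict

import numpy as np

def probabilistic_sample(records, group_by, target, seed=0):
    rng = np.random.default_rng(seed)
    # Full pass: the set of distinct groups is unknown until the end.
    groups = defaultdict(list)
    for rec in records:
        groups[tuple(rec[k] for k in group_by)].append(rec["strain"])
    rate = target / len(groups)  # expected picks per group
    sampled = []
    for strains in groups.values():
        # Poisson draw per group, capped at the group's size.
        k = int(min(len(strains), rng.poisson(rate)))
        sampled.extend(random.Random(seed).sample(strains, k))
    return sampled

records = [
    {"strain": f"s{i}", "country": c, "month": m}
    for i, (c, m) in enumerate([("Asia", "2020-01")] * 5 + [("Asia", "2021-07")] * 5)
]
picked = probabilistic_sample(records, ["country", "month"], target=4)
```

Stopping early would change `len(groups)` and therefore the rate, which is why a streaming shortcut isn't straightforward.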
Is there a way to guarantee a randomly distributed sample by reading lines in random order?
Don't know, but it's worth noting that "reading lines in a random order" is not an entirely straightforward process either (and not that commonly done) and comes with its own perf cost of seek time if the disk isn't an SSD (less and less common, but still out there, even in the cloud).
A few more ideas, borrowed from this blog post:
- dtypes: low-hanging fruit, similar to other parameter changes to pd.read_csv().
- multiprocessing: this has potential for substantial speedup, though we need to experiment to see whether we can compute the same data structures using multiprocessing shared state.

"reading lines in a random order" is not an entirely straightforward process either
Right, this is an inherent limit of processing data from a text file.
Has a database-oriented model been considered? For example:
augur load metadata.tsv sequences.fasta
The initial loading+indexing will be computationally expensive, but it's a one-time step which should improve performance of subsequent operations.
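As a rough sketch of the pattern (the `augur load` command above is a proposal, not an existing command; here I'm just using Python's stdlib sqlite3 to show the idea): pay the load+index cost once, and later filters become index lookups instead of full-file scans.

```python
import sqlite3

# One-time "load" step: ingest metadata rows and build an index.
rows = [
    ("A", "Nebraska", "2021-07-28"),
    ("B", "England", "2021-06-01"),
    ("C", "Nebraska", "2020-01-15"),
]
con = sqlite3.connect(":memory:")  # a real tool would use a file on disk
con.execute("CREATE TABLE metadata (strain TEXT PRIMARY KEY, division TEXT, date TEXT)")
con.executemany("INSERT INTO metadata VALUES (?, ?, ?)", rows)
con.execute("CREATE INDEX idx_division ON metadata (division)")

# Subsequent "filter" calls hit the index instead of re-reading a TSV.
strains = [s for (s,) in con.execute(
    "SELECT strain FROM metadata WHERE division = ? ORDER BY strain", ("Nebraska",))]
# strains == ["A", "C"]
```

Random sampling within groups also becomes cheap in this model, since the database can answer group-membership queries without a full scan.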
Has a database-oriented model been considered?
At least @corneliusroemer has considered this 😃 https://github.com/corneliusroemer/fasta_zstd_sqlite
Will work with him to prototype this into Augur.
@victorlin
Has a database-oriented model been considered?
Yes, of course. :-) We've had previous conversations around this on Slack, GitHub, and in person, including at least @huddlej, @joverlee521, @corneliusroemer, and myself (and probably others). Searching for "sqlite" on Slack is probably the best way to find the myriad of threads. We haven't summarized any of those conversations, AFAIK.
So my gentle suggestion before diving into prototyping something into Augur is to catch up on the relevant prior context here and to also sketch out a design doc for how this would work within all the parts of Augur (to identify the hard spots and decision points earlier than later, solicit and incorporate design feedback, etc).
@tsibley Nice! I'll start with going through all the context on Slack and writing up a separate GitHub issue so we can continue centralized discussion there.
Closing this issue since we have #1575 to keep track of augur filter speed improvements while processing the entire file. We also have #866 to continue the larger database discussion.
To close out the original idea:
Would there be a way to read lines in random order trying to fill subsampling buckets and just stop once 10 entries have been used?
As mentioned in https://github.com/nextstrain/augur/issues/789#issuecomment-993035159, this isn't trivial. The closest option I know of is to use seek() on a file object to start reading from a specific byte position. However, that isn't practical in our case where each of our lines has a different byte length, so it isn't possible to find the exact byte position of arbitrary lines without actually reading them in first, even if just to get the byte length.
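A small stdlib illustration of the point: seek() only helps once you already know each line's starting byte offset, and building that offset index itself requires one full read of the file.

```python
import io

# In-memory stand-in for a metadata file with variable-length lines.
data = io.BytesIO(b"strain_A\t2021-07-28\nstrain_B\t2020-06\nstrain_C\t2019\n")

# One full pass to record where each line starts.
offsets = []
while True:
    pos = data.tell()
    if not data.readline():
        break
    offsets.append(pos)

# Only now can seek() jump straight to an arbitrary line.
data.seek(offsets[2])
third_line = data.readline()
# third_line == b"strain_C\t2019\n"
```

An offset index like this could be precomputed and stored alongside the file, but that amounts to the same one-time indexing cost discussed for the database approach.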
Context
I've been running builds locally to get context surrounding 21K viruses and I'd like to again flag augur filter subsampling as a major performance bottleneck. Here, I'm starting from already aligned and filtered data: https://github.com/nextstrain/ncov/blob/master/nextstrain_profiles/nextstrain-gisaid/builds.yaml#L23. I just wanted a context of ~200 background viruses from across the pandemic to include in a 21K focal build. I thinned down max_sequences for africa_early, etc... Doing this results in 12 calls that look like:

Each of these calls takes 25 minutes on my laptop. It's not so bad on the cluster where they can be parallelized, but laptop is rough with ~5 hour runtime for subsampling (and many users will be in this situation).
Feels especially crazy given that I just want a random 10 (!) entries from this table that fall into appropriate buckets.
Description
Would there be a way to read lines in random order trying to fill subsampling buckets and just stop once 10 entries have been used? We shouldn't have to find every entry with region=Asia and date<2021-07-24 to collect a random 10 distributed across "country year month" that fit the bill. You should be able to process just a small fraction of the file rather than needing to process the entire file.
cc @huddlej @corneliusroemer