nextstrain / augur

Pipeline components for real-time phylodynamic analysis
https://docs.nextstrain.org/projects/augur/
GNU Affero General Public License v3.0

Speed up filtering by not processing entire file #789

Closed trvrb closed 2 years ago

trvrb commented 2 years ago

Context

I've been running builds locally to get context surrounding 21K viruses and I'd like to again flag augur filter subsampling as a major performance bottleneck. Here, I'm starting from already aligned and filtered data: https://github.com/nextstrain/ncov/blob/master/nextstrain_profiles/nextstrain-gisaid/builds.yaml#L23. I just wanted a context of ~200 background viruses from across the pandemic to include in a 21K focal build. I thinned down max_sequences for africa_early, etc... Doing this results in 12 calls that look like:

augur filter \
  --sequences nextstrain-ncov-private/filtered.fasta.xz \
  --metadata results/sanitized_metadata_gisaid.tsv.xz \
  --sequence-index results/combined_sequence_index.tsv.xz \
  --include defaults/include.txt \
  --exclude defaults/exclude.txt \
  --max-date 2021-07-24 \
  --exclude-where 'region!=Asia' \
  --group-by country year month \
  --subsample-max-sequences 10 \
  --probabilistic-sampling \
  --output results/global/sample-asia_early.fasta \
  --output-strains results/global/sample-asia_early.txt

Each of these calls takes 25 minutes on my laptop. It's not so bad on the cluster where they can be parallelized, but laptop is rough with ~5 hour runtime for subsampling (and many users will be in this situation).

Feels especially crazy given that I just want a random 10 (!) entries from this table that fall into appropriate buckets.

Description

Would there be a way to read lines in random order trying to fill subsampling buckets and just stop once 10 entries have been used? We shouldn't have to find every entry with region=Asia and date<2021-07-24 to collect a random 10 distributed across "country year month" that fit the bill. You should be able to process just a small fraction of the file rather than needing to process the entire file.
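A minimal sketch of the bucket-filling idea (hypothetical helper, not Augur code): stream metadata rows, fill per-group buckets, and stop as soon as every expected bucket is full. Note two assumptions baked in: the number of distinct groups must be known up front, and the sample is biased toward early rows unless the input order is already random.

```python
import csv
import lzma
import tempfile
from collections import defaultdict

def sample_streaming(path, group_cols, max_per_group, n_groups_expected):
    """Stream metadata and short-circuit once every expected bucket has
    its quota, instead of reading the whole file."""
    buckets = defaultdict(list)
    opener = lzma.open if path.endswith(".xz") else open
    with opener(path, "rt") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            key = tuple(row[col] for col in group_cols)
            if len(buckets[key]) < max_per_group:
                buckets[key].append(row["strain"])
            if (len(buckets) >= n_groups_expected
                    and all(len(v) >= max_per_group for v in buckets.values())):
                break  # stop early: no need to read the rest of the file
    return buckets

# Tiny demo file: 100 rows across two countries.
with tempfile.NamedTemporaryFile("w", suffix=".tsv", delete=False) as tmp:
    tmp.write("strain\tcountry\n")
    for i in range(100):
        tmp.write(f"s{i}\t{'A' if i % 2 else 'B'}\n")

buckets = sample_streaming(tmp.name, ["country"], 2, 2)
```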

cc @huddlej @corneliusroemer

tsibley commented 2 years ago

@trvrb Separately from the issue of speeding up the underlying augur filter, why can't you run these calls in parallel on your laptop too?

huddlej commented 2 years ago

The slowest part of augur filter is reading/writing sequence data. If you have multiple subsampling calls, the faster way to get the same output is to only output strain names for each subsampling call and then make a final call to augur filter to extract sequences for the lists of strains you want. One example of this is in the data prep guide for “select contextual data for your region of interest”. Each metadata-only (or metadata + sequence index) call should only take a couple of minutes.

We can rewrite the ncov workflow to optionally extract sequences for subsampled data only when priorities need to be calculated. Faster still would be to use metadata-based priorities, where we calculate Jaccard distances between strains using pre-calculated mutations (from Nextclade or GISAID). I have implemented a prototype of this in ncov as "jaccard-priority". With a little more testing, we could get to metadata-only subsampling soon.

We could speed up the sequence extraction by using something other than (Bio)Python like samtools faidx or UCSC’s faSomeRecords.

There are some times when we could short-circuit the loop through the metadata or the sequences, but this approach won't currently work when we have force-included strains that have to be found in metadata and sequences.
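The strain-names-first pattern described above can be sketched as follows (file names and regions here are illustrative placeholders, not the actual ncov workflow paths):

```shell
# Fast metadata-only subsampling: one call per region, strain names only.
augur filter \
  --metadata results/sanitized_metadata.tsv.xz \
  --exclude-where 'region!=Asia' \
  --group-by country year month \
  --subsample-max-sequences 10 \
  --output-strains results/sample-asia.txt

# ...repeat for the other regions, then extract sequences exactly once.
cat results/sample-*.txt > results/all-samples.txt
augur filter \
  --metadata results/sanitized_metadata.tsv.xz \
  --sequences filtered.fasta.xz \
  --exclude-all \
  --include results/all-samples.txt \
  --output-sequences results/subsampled.fasta
```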

victorlin commented 2 years ago

The slowest part of augur filter is reading/writing sequence data.

After doing a bit of testing using a dataset with 160k sequences I found on S3, I am seeing some evidence that metadata IO is slower than sequence IO.

This metadata-only call takes ~45s:

time augur filter \
  --metadata data/metadata.tsv \
  --group-by country year month \
  --subsample-max-sequences 10 \
  --output-strains filter_strains.txt

Sampling at 0.018930879516601565 per group.
160278 strains were dropped during filtering
    1979 were dropped during grouping due to ambiguous month information
    158299 of these were dropped because of subsampling criteria
11 strains passed all filters
augur filter --metadata data/metadata.tsv --group-by country year month  10    44.56s user 0.43s system 99% cpu 45.000 total

While the comparable metadata + sequence call takes ~52s:

time augur filter \
  --sequences data/sequences.fasta \
  --sequence-index data/sequence_index.tsv \
  --metadata data/metadata.tsv \
  --group-by country year month \
  --subsample-max-sequences 10 \
  --output-sequences data/filter_sequences.fasta

160293 strains were dropped during filtering
    16 had no metadata
    16 had no sequence data
    1979 were dropped during grouping due to ambiguous month information
    158282 of these were dropped because of subsampling criteria
12 strains passed all filters
augur filter --sequences data/sequences.fasta --sequence-index  --metadata     51.15s user 1.20s system 99% cpu 52.485 total

This means sequence IO only accounts for ~13% (7/52) of runtime.

I'm going to dig into where this slowness is coming from and see if it's possible to speed up.

Let me know if there's a better dataset I can test with.

tsibley commented 2 years ago

@victorlin Not sure that comparison is telling the whole story because it's only outputting ~10 sequences (due to --subsample-max-sequences) instead of a more realistic number. Look at it this way, for example: the second command took an extra 7 seconds to output just 12 sequences. If that's steady, then as you ask for more sequences to be output, the sequence IO will account for a greater and greater portion of the total runtime.

victorlin commented 2 years ago

@tsibley good point, it can be broken down to:

  1. metadata read
  2. metadata write
  3. sequence read
  4. sequence write

My example above does a poor job of testing (2) and (4) since output is small. But it highlights the differences between (1) and (3).

For @trvrb's example which also has small output, I'm positing that (1) is the bottleneck instead of (3). There are probably improvements to be made across the board though. @huddlej's suggestion of replacing sequence extraction with samtools/faSomeRecords focuses on (3) but I don't think that's the issue here.

huddlej commented 2 years ago

Thank you for digging into this, @victorlin! You and @tsibley both make good points. Benchmarking this command is tricky since apparently minor changes to the input arguments or data formats can have a large effect on runtime.

I just reviewed my notes from testing augur filter back in September with a large ncov dataset and noticed that my conclusion about metadata vs. sequence time was based on filter commands that don't do any grouping. The group-by logic requires a second pass through the metadata, so it takes longer. I'll re-run my analysis from September and post the corresponding code here, so you can verify my results.

huddlej commented 2 years ago

@victorlin Here are my notes from September with updated runtimes for the latest data as of today.

Setup environment and data

# Create an environment with the tools we need.
mamba create -n ncov -c conda-forge -c bioconda \
  augur==13.0.2 \
  samtools \
  tabix \
  awscli

# Activate the environment.
conda activate ncov

# Download complete set of open aligned, filtered sequences.
curl -O https://data.nextstrain.org/files/ncov/open/filtered.fasta.xz

# Download complete open metadata.
curl -O https://data.nextstrain.org/files/ncov/open/metadata.tsv.gz

# Index sequences.
# This took 34 minutes.
augur index \
  --sequences filtered.fasta.xz \
  --output sequence_index.tsv

# Compress index (forgot to do this above).
# This took 7 seconds.
gzip sequence_index.tsv

Compare subsampling performance for augur filter

# Select all metadata records matching a specific query.
# This took 2 minutes.
augur filter \
  --metadata metadata.tsv.gz \
  --query "division == 'Nebraska'" \
  --output-metadata nebraska_metadata.tsv \
  --output-strains nebraska_strains.txt

2490241 strains were dropped during filtering
    2490241 of these were filtered out by the query: "division == 'Nebraska'"
3530 strains passed all filters

# Select metadata and sequences matching a specific query.
# This took ~16 minutes.
augur filter \
  --metadata metadata.tsv.gz \
  --sequences filtered.fasta.xz \
  --sequence-index sequence_index.tsv.gz \
  --query "division == 'Nebraska'" \
  --output-metadata nebraska_metadata.tsv \
  --output-strains nebraska_strains.txt \
  --output-sequences nebraska_sequences.fasta

2490317 strains were dropped during filtering
    227031 had no sequence data
    2263286 of these were filtered out by the query: "division == 'Nebraska'"
3454 strains passed all filters

# Select 10 random metadata records matching a specific query.
# This took 2 minutes.
augur filter \
  --metadata metadata.tsv.gz \
  --query "division == 'Nebraska'" \
  --subsample-max-sequences 10 \
  --output-metadata nebraska_random_metadata.tsv \
  --output-strains nebraska_random_strains.txt

2493761 strains were dropped during filtering
    2490241 of these were filtered out by the query: "division == 'Nebraska'"
    3520 of these were dropped because of subsampling criteria
10 strains passed all filters

# Select 10 random metadata and sequences for a query.
# This took 15 minutes.
augur filter \
  --metadata metadata.tsv.gz \
  --sequences filtered.fasta.xz \
  --sequence-index sequence_index.tsv.gz \
  --query "division == 'Nebraska'" \
  --subsample-max-sequences 10 \
  --output-metadata nebraska_random_metadata.tsv \
  --output-strains nebraska_random_strains.txt \
  --output-sequences nebraska_random_sequences.fasta

2493761 strains were dropped during filtering
    227031 had no sequence data
    2263286 of these were filtered out by the query: "division == 'Nebraska'"
    3444 of these were dropped because of subsampling criteria
10 strains passed all filters

# Select specific sequences with prefiltered metadata
# and strain list.
# This took 16 minutes.
augur filter \
  --metadata nebraska_metadata.tsv \
  --sequences filtered.fasta.xz \
  --exclude-all \
  --include nebraska_strains.txt \
  --output-sequences nebraska_sequences.fasta

2263286 strains were dropped during filtering
    2263286 had no metadata
    3454 of these were dropped by `--exclude-all`
    3454 strains were added back because they were in nebraska_strains.txt
3454 strains passed all filters

Edit: I forgot to note that we've discussed changing back from using engine="python" as the pandas parser to engine="c". When I change to the C parser, the initial metadata-only filters (both one- and two-pass) run in 1 minute instead of 2.

tsibley commented 2 years ago

@victorlin Good way to break it down.

For @trvrb's example which also has small output, I'm positing that (1) is the bottleneck instead of (3). There are probably improvements to be made across the board though. @huddlej's suggestion of replacing sequence extraction with samtools/faSomeRecords focuses on (3) but I don't think that's the issue here.

The reasoning for (1) vs (3) dominating the runtime isn't clear to me. Could you explain?

And ultimately, the best thing to do here is actually profile the code.

trvrb commented 2 years ago

Thanks so much for digging in here @victorlin, @huddlej and @tsibley. This doesn't need to be the only use case, but you can quickly replicate the use case that I had via the filter-test branch I just pushed up here: https://github.com/nextstrain/ncov/tree/filter-test

From this branch can run locally as:

snakemake -j 1 -p --profile nextstrain_profiles/nextstrain-gisaid auspice/ncov_gisaid_global.json

It will take a few minutes to "sanitize" sequences, but you'll get to the 12 augur filter calls via the subsample rule pretty rapidly.

victorlin commented 2 years ago

The reasoning for (1) vs (3) dominating the runtime isn't clear to me. Could you explain?

@tsibley The metadata-only call took 45s while metadata+sequence call took 52s. So the additional sequence IO there had little effect on run time.

However, @huddlej's output says otherwise (2m metadata vs. 15m metadata+sequences), probably since it's a different dataset and not using --group-by. I'm running it locally to see if I can reproduce similar numbers.

And ultimately, the best thing to do here is actually profile the code.

Agreed. A few years back I used cProfile for this. If anyone has other suggestions I'd love to see what's being used for profiling nowadays.

EDIT: I'm having fun with cProfile + SnakeViz 🙂
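For reference, the cProfile + SnakeViz workflow looks roughly like this (the `hot_loop` function is a stand-in for an expensive call such as the filter run; the output filename is arbitrary):

```python
import cProfile
import io
import pstats

def hot_loop():
    """Stand-in for an expensive call like the filter run."""
    return sum(i * i for i in range(200_000))

profiler = cProfile.Profile()
profiler.enable()
hot_loop()
profiler.disable()
profiler.dump_stats("filter.prof")  # visualize later with: snakeviz filter.prof

# Quick text summary without leaving the terminal.
out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(5)
```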

trvrb commented 2 years ago

I just ran the identical command as in the original issue, but dropping all sequence references, so:

# 25 minutes on my laptop
time augur filter \
  --metadata results/sanitized_metadata_gisaid.tsv.xz \
  --include defaults/include.txt \
  --exclude defaults/exclude.txt \
  --min-date 2021-07-28 \
  --exclude-where 'region!=Europe' \
  --group-by country year month \
  --subsample-max-sequences 10 \
  --probabilistic-sampling \
  --output-strains results/global/sample-europe_late.txt

And again confirming the timing of the original command

# 40 minutes on my laptop
time augur filter \
  --sequences nextstrain-ncov-private/filtered.fasta.xz \
  --metadata results/sanitized_metadata_gisaid.tsv.xz \
  --sequence-index results/combined_sequence_index.tsv.xz \
  --include defaults/include.txt \
  --exclude defaults/exclude.txt \
  --min-date 2021-07-28 \
  --exclude-where 'region!=Europe' \
  --group-by country year month \
  --subsample-max-sequences 10 \
  --probabilistic-sampling \
  --output results/global/sample-europe_late.fasta \
  --output-strains results/global/sample-europe_late.txt

So 25 minutes with just metadata and 40 minutes with sequences. I would assume that --group-by or other logic is causing some poor scaling behavior (in addition to the overhead of sequence IO).

victorlin commented 2 years ago

@trvrb the current implementation of --group-by is computationally expensive, notably augur.filter.get_groups_for_subsampling, which loops through every valid sequence in the metadata file (post-filter) to determine its group:

https://github.com/nextstrain/augur/blob/d553f5ea01bbf99637f6faa30f26ee9cdeee96c6/augur/filter.py#L894-L934

Mainly, line 897 is an inefficient usage of a pandas DataFrame (confirmed with the profiler). I'm looking to see if we can leverage built-in pandas utilities (e.g. pandas.DataFrame.groupby) to achieve what's being done here.
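For illustration, a hedged sketch of the vectorized direction (toy data; not the actual change): build all group keys in one pass over the columns instead of per-row DataFrame `__getitem__` calls.

```python
import pandas as pd

# Toy metadata indexed by strain name.
metadata = pd.DataFrame(
    {"country": ["USA", "USA", "Kenya"], "year": [2020, 2021, 2021]},
    index=["s1", "s2", "s3"],
)
group_by = ["country", "year"]

# One vectorized pass: zip the group-by columns together, then map
# strain -> group tuple, avoiding a Python-level loop over rows.
group_keys = list(zip(*(metadata[col] for col in group_by)))
groups = dict(zip(metadata.index, group_keys))
```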

huddlej commented 2 years ago

Nice find, @victorlin! I'd love to see how you used SnakeViz with cProfile to track this down. I'm also curious how much we can reduce runtime from 25 minutes by switching to Pandas-based logic.

victorlin commented 2 years ago

This is the necessary change for profiling: https://github.com/nextstrain/augur/commit/0f357be2ca83a737aba0aea518a85639894f196e

I ran this command for 160k sequences and uploaded the profile result file to gist:

# 83.76s user 0.52s system 99% cpu 1:24.29 total
time augur filter \
  --metadata data/160k/metadata.tsv \
  --group-by country year month \
  --subsample-max-sequences 10 \
  --output-strains filter_strains.txt

If you want to look at profile results for yourself:

  1. Download 160k-sequences-current.prof.
  2. Go to https://nejc.saje.info/pstats-viewer.html. (SnakeViz web server hosted by a nice person on the internet)
  3. Select Python 3.
  4. Open downloaded file.

You will see something like this. Most of it is pandas internal methods, but it's clear that pandas __getitem__ in augur.filter.get_groups_for_subsampling and somewhere directly in augur.filter.run accounts for most of the run time. Unfortunately line 897 isn't in the call stack – I did some line removals locally to determine that one.

snakeviz 160k-sequences-current.prof

victorlin commented 2 years ago

Note: I also experimented with a couple other ideas to improve metadata read performance:

  1. Use pd.read_csv(sep = "\t", engine="c") as @huddlej mentioned above. This improved things slightly: 45s -> 42s in my example.
  2. Read a subset of the columns with pd.read_csv(usecols=["strain", "date", "country"]). This improved things significantly: 45s -> 27s in my example.

Both are separate from the refactor to use pandas functions, so I'll look into them later.
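The two `pd.read_csv` tweaks above look roughly like this (column names and data are illustrative; a real call would point at the metadata TSV path):

```python
import io

import pandas as pd

tsv = (
    "strain\tdate\tcountry\tgenbank_accession\tauthors\n"
    "s1\t2021-01-01\tUSA\tAB123\tSmith et al.\n"
    "s2\t2021-02-01\tKenya\tCD456\tDoe et al.\n"
)

# C parser + only the columns that filtering/grouping actually needs;
# unused columns are never parsed or allocated.
metadata = pd.read_csv(
    io.StringIO(tsv),
    sep="\t",
    engine="c",
    usecols=["strain", "date", "country"],
)
```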

victorlin commented 2 years ago

First performance improvement PR is ready for review. See PR comment for detailed profiling results, but in summary it reduces the run time of augur.filter.get_groups_for_subsampling from 55s down to 5s.

victorlin commented 2 years ago

I just ran @trvrb's example with and without the PR, under profiling. Everything took much less than 25m each, maybe due to environment/machine differences, but the gap is still surprising.

huddlej commented 2 years ago

Thanks for doing this before/after analysis, @victorlin. I'm trying to recreate the same results locally (the slower of the two is running right now).

In addition to the slow inner loop issue you've sped up in #794, there is another potentially easy optimization: filter currently calculates groups for strains twice; once in the first pass and again in the second pass. We should think about an efficient way to cache the results from the first pass to use in the second. For example, running Trevor's example command with Augur 13.0.3 now, it took 5+ minutes to calculate groups the first time and now its doing it again.

victorlin commented 2 years ago

We should think about an efficient way to cache the results from the first pass to use in the second

I'm going to tackle this next, since it seems like another inefficient usage of pandas DataFrame (line 1522):

https://github.com/nextstrain/augur/blob/af23a7bcb7542d5ae61cc76841661dddecd43d39/augur/filter.py#L1519-L1524

And this is the only reference for the second pass of get_groups_for_subsampling.

tsibley commented 2 years ago

@victorlin Rad! I've also used py-spy in the past for some Python profiling.

trvrb commented 2 years ago

Amazing progress here @victorlin. Really appreciate the speed ups.

huddlej commented 2 years ago

I got similar run times to Trevor with the same command:

victorlin commented 2 years ago

Updated numbers on my machine for @trvrb's example of 3m sequences with the new PR.

With these PRs plus the two smaller improvements mentioned in https://github.com/nextstrain/augur/issues/789#issuecomment-985129622, I think this covers all the low-hanging fruit for metadata read performance improvement without any major changes to how we process the data.

Based on profiling results (prof file here), the bottleneck is now reading the metadata file in its entirety. I'm not sure if it's even possible to do --probabilistic-sampling without processing the entire file, per @trvrb's original suggestion:

We shouldn't have to find every entry with region=Asia and date<2021-07-24 to collect a random 10 distributed across "country year month" that fit the bill. You should be able to process just a small fraction of the file rather than needing to process the entire file.

The current code must go through the file entirely to determine the number of distinct observed groups (e.g. (Asia, 2020-01), (Asia, 2020-03), (Asia, 2021-07), ...) then use Poisson sampling to determine the ~10 sequences to output. Is there a way to guarantee a randomly distributed sample by reading lines in random order?
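The Poisson step described above can be sketched like this (a toy illustration of the idea with assumed names, not Augur's exact implementation): once the distinct groups are known, each group's quota is drawn from a Poisson distribution whose mean is the fractional sequences-per-group.

```python
import numpy as np

def probabilistic_group_sizes(groups, max_sequences, seed=314):
    """Per-group quotas from a Poisson whose mean is the (possibly
    fractional) average sequences per group, e.g. 10 / 600 groups."""
    rng = np.random.default_rng(seed)
    mean = max_sequences / len(groups)
    return dict(zip(groups, rng.poisson(mean, size=len(groups))))

groups = [("Asia", 2020, 1), ("Asia", 2020, 3), ("Asia", 2021, 7)]
sizes = probabilistic_group_sizes(groups, 10)
```

The catch, as noted, is that `groups` is only known after a full pass over the metadata.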

tsibley commented 2 years ago

Is there a way to guarantee a randomly distributed sample by reading lines in random order?

Don't know, but it's worth noting that "reading lines in a random order" is not an entirely straightforward process either (and not that commonly done), and it comes with its own performance cost of seek time if the disk isn't an SSD (spinning disks are less and less common, but still out there, even in the cloud).

victorlin commented 2 years ago

A few more ideas, borrowed from this blog post:

  1. Specify dtypes: low hanging fruit similar to other parameter changes to pd.read_csv().
  2. Process chunks in parallel using multiprocessing: this has potential for substantial speedup, though we'd need to experiment to see whether we can compute the same data structures using multiprocessing shared state.

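
Idea 2 builds on pandas' chunked reader, roughly like this (toy in-memory data; the aggregation shown is a stand-in for the real per-chunk work):

```python
import io

import pandas as pd

tsv = "strain\tcountry\n" + "".join(
    f"s{i}\t{'A' if i % 2 else 'B'}\n" for i in range(10)
)

# Stream the metadata in fixed-size chunks instead of loading it all at
# once; each chunk could also be handed off to a multiprocessing worker.
counts = {}
for chunk in pd.read_csv(io.StringIO(tsv), sep="\t", chunksize=4):
    for country, n in chunk["country"].value_counts().items():
        counts[country] = counts.get(country, 0) + n
```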
victorlin commented 2 years ago

"reading lines in a random order" is not an entirely straightforward process either

Right, this is an inherent limit of processing data from a text file.

Has a database-oriented model been considered? For example:

  1. Store metadata and sequences into a database (e.g. SQLite) using some new command augur load metadata.tsv sequences.fasta
  2. Compute some indexes.
  3. All augur subcommands connect to and update the database accordingly.

The initial loading+indexing will be computationally expensive, but it's a one-time step which should improve performance of subsequent operations.
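As a rough sketch of what the database model buys us (hypothetical schema; the `augur load` command above doesn't exist yet): once metadata lives in an indexed SQLite table, a filtered random sample becomes a query instead of a full TSV scan.

```python
import sqlite3

# One-time load + index step (the expensive part).
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE metadata ("
    "strain TEXT PRIMARY KEY, region TEXT, country TEXT, date TEXT)"
)
rows = [
    ("s1", "Asia", "Japan", "2021-05-01"),
    ("s2", "Europe", "France", "2021-06-01"),
]
con.executemany("INSERT INTO metadata VALUES (?, ?, ?, ?)", rows)
con.execute("CREATE INDEX idx_region_date ON metadata (region, date)")

# Subsequent operations query rather than rescan.
sample = con.execute(
    "SELECT strain FROM metadata "
    "WHERE region = 'Asia' AND date < '2021-07-24' "
    "ORDER BY RANDOM() LIMIT 10"
).fetchall()
```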

victorlin commented 2 years ago

Has a database-oriented model been considered?

At least @corneliusroemer has considered this 😃 https://github.com/corneliusroemer/fasta_zstd_sqlite

Will work with him to prototype this into Augur.

tsibley commented 2 years ago

@victorlin

Has a database-oriented model been considered?

Yes, of course. :-) We've had previous conversations around this on Slack, GitHub, and in person, including at least @huddlej, @joverlee521, @corneliusroemer, and myself (and probably others). Searching for "sqlite" on Slack is probably the best way to find the myriad of threads. We haven't summarized any of those conversations, AFAIK.

So my gentle suggestion before diving into prototyping something into Augur is to catch up on the relevant prior context here and to also sketch out a design doc for how this would work within all the parts of Augur (to identify the hard spots and decision points earlier than later, solicit and incorporate design feedback, etc).

victorlin commented 2 years ago

@tsibley Nice! I'll start with going through all the context on Slack and writing up a separate GitHub issue so we can continue centralized discussion there.

victorlin commented 2 years ago

Closing this issue since we have #1575 to keep track of augur filter speed improvements while processing the entire file. We also have #866 to continue the larger database discussion.

To close out the original idea:

Would there be a way to read lines in random order trying to fill subsampling buckets and just stop once 10 entries have been used?

As mentioned in https://github.com/nextstrain/augur/issues/789#issuecomment-993035159, this isn't trivial. The closest option I know of is to use seek() on a file object to start reading from a specific byte position. However, that isn't practical in our case, where each line has a different byte length: it isn't possible to find the exact byte position of an arbitrary line without first reading through the file, even if only to record the byte lengths.
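To make the seek() limitation concrete, here's a small sketch: you can read lines in random order with seek(), but only after one full pass to build a byte-offset index, which is exactly the whole-file read we were trying to avoid.

```python
import random
import tempfile

def build_line_offsets(path):
    """One full pass recording each line's starting byte offset; only
    afterwards can arbitrary lines be read in any order with seek()."""
    offsets = []
    pos = 0
    with open(path, "rb") as f:
        for line in f:
            offsets.append(pos)
            pos += len(line)
    return offsets

def read_random_lines(path, offsets, k, seed=42):
    """Read k lines in random order using the precomputed offsets."""
    sampled = random.Random(seed).sample(offsets, k)
    lines = []
    with open(path, "rb") as f:
        for offset in sampled:
            f.seek(offset)
            lines.append(f.readline().decode().rstrip("\n"))
    return lines

with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
    tmp.write(b"alpha\nbeta\tgamma\ndelta\n")

offsets = build_line_offsets(tmp.name)
picked = read_random_lines(tmp.name, offsets, 2)
```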