Open wsdewitt opened 4 years ago
This continues to cause problems, so I suggest raising a warning with a link to this issue if a .gz file is supplied.
I ran into this issue using mutyper variants, good to know there's a workaround! Thanks to Luke for helping me troubleshoot!
Running into this again with a new dataset, and the difference in performance is wild. I ran the job with the bgzipped fasta for more than 2 days and only got 400MB through a VCF file; with the unzipped fasta I'm already at 1.5GB after an hour. Maybe consider throwing a warning or error if someone tries to input a compressed ancestral fasta? It's virtually unusable when it's that slow.
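The warning suggested in the comments above could look roughly like the sketch below. This is only an illustration, not mutyper's actual code: the function name `warn_if_compressed`, the suffix set, and the message text are all hypothetical, and the check looks only at the file extension rather than the file contents.

```python
import warnings
from pathlib import Path

# Hypothetical set of suffixes treated as "compressed" for this sketch.
COMPRESSED_SUFFIXES = {".gz", ".bgz"}


def warn_if_compressed(fasta_path):
    """Warn if the FASTA path looks compressed; return True if a warning was issued."""
    suffix = Path(fasta_path).suffix.lower()
    if suffix in COMPRESSED_SUFFIXES:
        warnings.warn(
            f"{fasta_path} appears to be compressed; access to a compressed "
            "FASTA can be very slow. Consider decompressing it first, "
            "e.g. with `bgzip -d`.",
            UserWarning,
            stacklevel=2,  # attribute the warning to the caller
        )
        return True
    return False
```

A caller would run this check on the ancestral fasta path before opening the file. Switching from a warning to a hard error would only require replacing `warnings.warn` with a raised exception.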
Accessing later regions of a fasta via a mutyper.Ancestor object (child class of pyfaidx.Fasta) is not performant, likely stemming from this issue in pyfaidx: mdshw5/pyfaidx#153.
This is particularly problematic for the mutyper targets subcommand, since it scans through all sites in a fasta record, or a sequence of bed regions.
The current workaround is to work with decompressed fasta data. A bgzipped fasta, e.g. named ancestor.fa.gz, can be decompressed with bgzip -d ancestor.fa.gz to produce an uncompressed fasta ancestor.fa.
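If bgzip is not at hand, the same decompression can be sketched with Python's standard library, since a bgzipped file is a valid multi-member gzip file. The helper name `decompress_fasta` is hypothetical. Unlike `bgzip -d`, this version leaves the original .gz file in place.

```python
import gzip
import shutil
from pathlib import Path


def decompress_fasta(gz_path):
    """Decompress e.g. ancestor.fa.gz to ancestor.fa and return the new path."""
    gz_path = Path(gz_path)
    if gz_path.suffix != ".gz":
        raise ValueError(f"expected a .gz file, got {gz_path}")
    out_path = gz_path.with_suffix("")  # strips only the trailing .gz
    # Stream in chunks so large fasta files are never held fully in memory.
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return out_path
```

For example, `decompress_fasta("ancestor.fa.gz")` writes `ancestor.fa` next to the input file, and that uncompressed file can then be passed to mutyper.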