performance issues with accessing bgzipped ancestral fastas

harrispopgen / mutyper

Ancestral k-mer mutation types for SNP data

https://harrispopgen.github.io/mutyper/

MIT License


performance issues with accessing bgzipped ancestral fastas #11

Open wsdewitt opened 4 years ago

wsdewitt commented 4 years ago

Accessing later regions of a bgzipped fasta via a mutyper.Ancestor object (a child class of pyfaidx.Fasta) is very slow, likely because of this issue in pyfaidx: mdshw5/pyfaidx#153.

This is particularly problematic for the mutyper targets subcommand, since it scans through all sites in a fasta record, or a sequence of bed regions.

The current workaround is to work with decompressed fasta data. For example, a bgzipped fasta named ancestor.fa.gz can be decompressed with bgzip -d ancestor.fa.gz, which produces the uncompressed fasta ancestor.fa.
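If bgzip is not at hand, the same decompression can be sketched in Python with the standard library. This relies on bgzip output being a series of concatenated gzip members, which the gzip module reads as one stream. The function name decompress_fasta is hypothetical and not part of mutyper or pyfaidx; unlike bgzip -d, this sketch leaves the compressed file in place.

```python
import gzip
import shutil
from pathlib import Path


def decompress_fasta(gz_path, out_path=None):
    """Write an uncompressed copy of a gzip/bgzip-compressed fasta.

    Hypothetical helper for illustration; not a mutyper function.
    """
    gz_path = Path(gz_path)
    if out_path is None:
        # Drop the final suffix: ancestor.fa.gz -> ancestor.fa
        out_path = gz_path.with_suffix("")
    out_path = Path(out_path)
    # Stream in chunks so a whole-genome fasta is never held in memory.
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return out_path


# Example usage (assumes ancestor.fa.gz exists in the working directory):
# decompress_fasta("ancestor.fa.gz")
```

The resulting ancestor.fa can then be passed to mutyper in place of the compressed file.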

wsdewitt commented 3 years ago

This continues to cause problems, so I suggest raising a warning with a link to this issue if a .gz file is supplied.
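A minimal sketch of what such a check could look like is below. It is not mutyper's implementation: the function name warn_if_compressed, the set of suffixes, and the warning text are all assumptions made for illustration, and the check only looks at the file name rather than the file contents.

```python
import warnings
from pathlib import Path

# Assumed set of suffixes that indicate a compressed fasta.
COMPRESSED_SUFFIXES = {".gz", ".bgz"}


def warn_if_compressed(fasta_path):
    """Warn if the fasta path looks compressed; return True if it does.

    Hypothetical helper for illustration; not a mutyper function.
    """
    compressed = Path(fasta_path).suffix.lower() in COMPRESSED_SUFFIXES
    if compressed:
        warnings.warn(
            f"{fasta_path} appears to be compressed; access to later "
            "regions may be very slow (see harrispopgen/mutyper issue #11). "
            "Consider decompressing it first, e.g. with bgzip -d.",
            UserWarning,
            stacklevel=2,
        )
    return compressed
```

Calling warn_if_compressed("ancestor.fa.gz") before opening the file would emit the warning, while warn_if_compressed("ancestor.fa") would stay silent.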

ab08028 commented 1 year ago

I ran into this issue using mutyper variants, good to know there's a workaround! Thanks to Luke for helping me troubleshoot!

ab08028 commented 1 year ago

Running into this again with a new dataset, and the difference in performance is wild. The job ran with the bgzipped fasta for >2 days and only got 400MB through a vcf file; now, with the unzipped fasta, it is already at 1.5GB after an hour. Maybe consider throwing a warning or error if someone tries to input a compressed ancestral fasta? It's virtually unusable when it's that slow.