virajbdeshpande / AmpliconArchitect

AmpliconArchitect (AA) is a tool to identify one or more connected genomic regions which have simultaneous copy number amplification and elucidates the architecture of the amplicon. In the current version, AA takes as input next generation sequencing reads (paired-end Illumina reads) mapped to the hg19/GRCh37 reference sequence and one or more regions of interest. Please "watch" this repository for improvements in runtime, accuracy and annotations for GRCh38 human reference genome coming up soon.

Would a pre-filtered BAM file accelerate the runtime? #100

Closed shwong-tw closed 3 years ago

shwong-tw commented 3 years ago

Dear developer,

I've been thinking about how to further accelerate the jobs and came up with two questions.

The first one is about the input BAM file. For high-coverage WGS, it is generally suggested to downsample the coverage to e.g. 10x. If I instead provided a filtered (sorted) BAM file in which only discordant reads and split reads were kept, would that help accelerate the runtime? Or would it bias the results?

The second one is about the number of intervals in the input BED file. I imagine that the more intervals we put in the input BED file, the longer the runtime will be. Also, we probably should not split the intervals into different runs, because we cannot be sure which intervals belong to the same amplicon. However, from your experience, would it be possible to roughly estimate the order of the runtime increase if we provide e.g. 1 interval vs. 10 intervals?

Thank you very much :)

jluebeck commented 3 years ago

Hi,

Thank you for your suggestions.

Regarding downsampling: AA already applies a default amount of downsampling while reading the BAM, so this step is already optimized. Removing reads that are not discordant would ruin AA's internal computation of copy number and would likely result in a crash or uninterpretable results.
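To see why a BAM containing only discordant and split reads would break a depth-based copy number estimate, consider the toy calculation below. It rests on an illustrative assumption (copy number as twice the ratio of a window's read count to the median read count) and is not AA's actual internal method; `depth_based_cn` and all read counts are made up.

```python
from statistics import median
from typing import List


def depth_based_cn(window_counts: List[int], ploidy: float = 2.0) -> List[float]:
    """Toy estimate: scale each window's read count by the median count.

    Hypothetical helper for illustration only, not part of AA.
    """
    baseline = median(window_counts)
    if baseline == 0:
        raise ValueError("median depth is zero; cannot normalize")
    return [ploidy * count / baseline for count in window_counts]


# All reads: the third window is clearly amplified relative to its neighbors.
all_reads = [30, 30, 120, 30]

# Only discordant/split reads kept: counts now track breakpoints, not depth.
discordant_only = [1, 0, 6, 1]

cn_all = depth_based_cn(all_reads)
cn_filtered = depth_based_cn(discordant_only)
```

With the full read set the baseline is stable and the amplified window stands out; with the filtered set the counts are tiny and dominated by where breakpoints happen to fall, so the same formula yields values that no longer reflect copy number.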

Regarding splitting the input intervals across separate runs: this is generally inadvisable. If two intervals are connected by discordant edges, AA will end up analyzing the same amplicon twice, which defeats the purpose of the speedup. AA has also not been rigorously tested or benchmarked for performance when intervals are segregated into different runs.

The best performance speedups in my experience come from storing the BAM file on a drive with fast IO (SSD if possible) and, more importantly, from careful selection of seed regions. We suggest default cutoffs of CN > 4.5 or 5 and a size larger than 50 kbp for interval selection, since otherwise many unfiltered low-complexity regions can be added as seeds. Secondly, we highly recommend our best practices, which include PrepareAA and CNVKit (wrapped inside PrepareAA).
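The seed-selection cutoffs above can be sketched as a simple filter. This is a minimal illustration, not PrepareAA's actual implementation: the function names `parse_cn_bed` and `select_seed_intervals`, the assumed four-column layout (chrom, start, end, copy number), and the example segments are all hypothetical.

```python
from typing import Iterable, List, Tuple

# (chrom, start, end, copy_number); this column layout is an assumption.
Segment = Tuple[str, int, int, float]


def parse_cn_bed(lines: Iterable[str]) -> List[Segment]:
    """Parse whitespace-separated BED-like lines with CN in the 4th column."""
    segments: List[Segment] = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        chrom, start, end, cn = line.split()[:4]
        segments.append((chrom, int(start), int(end), float(cn)))
    return segments


def select_seed_intervals(
    segments: Iterable[Segment],
    min_cn: float = 4.5,       # suggested default cutoff: CN > 4.5 (or 5)
    min_size: int = 50_000,    # suggested default cutoff: larger than 50 kbp
) -> List[Segment]:
    """Keep segments that exceed both the CN and the size cutoff."""
    return [
        seg for seg in segments
        if seg[3] > min_cn and (seg[2] - seg[1]) > min_size
    ]


example_lines = [
    "chr8\t127000000\t129000000\t12.3",  # amplified and large: kept
    "chr1\t1000000\t1020000\t9.0",       # amplified but only 20 kbp: dropped
    "chr2\t5000000\t5200000\t3.1",       # large but CN too low: dropped
]
seeds = select_seed_intervals(parse_cn_bed(example_lines))
```

Raising `min_cn` to 5 gives the stricter variant of the suggested cutoff; in practice the recommended route is still to let PrepareAA handle this step.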

Best regards, Jens

shwong-tw commented 3 years ago

Hi Jens,

Many thanks for the prompt reply and the explanations :)

Yes, I was aware of the PrepareAA page and the --downsample parameter, and today I got my pilot run result (it looks successful) on one interval in one sample.

I just learnt that AmpliconClassifier is probably the next step, and I also want to use the modular integration of prior results... (I'm glad that you offer this flexibility :))

Still some functions to try out and will get back to you in case of any further questions.

Thank you very much and have a good day!

Cheers, Siao-Han