Downsampling only on AA step or PrepareAA step

virajbdeshpande / AmpliconArchitect

AmpliconArchitect (AA) is a tool to identify one or more connected genomic regions which have simultaneous copy number amplification and elucidates the architecture of the amplicon. In the current version, AA takes as input next generation sequencing reads (paired-end Illumina reads) mapped to the hg19/GRCh37 reference sequence and one or more regions of interest. Please "watch" this repository for improvements in runtime, accuracy and annotations for GRCh38 human reference genome coming up soon.

Other

135 stars 43 forks source link

Downsampling only on AA step or PrepareAA step #105

Open ysbioinfo opened 3 years ago

ysbioinfo commented 3 years ago

Hi, I have multiple 30X WGS data and want to try AA on it. As mentioned in other issues, 30X is too large for AA and downsampling is recommended. But I'm not sure which one is the right way for downsampling:

Use other tools (like picard) to generate a 10X BAM file, then run PrepareAA and AA sequentially on this BAM file.
Still run PrepareAA on the original 30X BAM file, and then run PrepareAA -downsample 1 on this 30X BAM file and the CNV seed given by PrepareAA. Could you please give me some kind advice?

Thank you!

Yang

jluebeck commented 3 years ago

Hi,

Downsampling is performed by AA on the fly, and thus you do not need to worry about running any custom downsampling/creating a new BAM file on your own. Setting --downsample 10 will select a copy number of 10 to downsample to while searching the relevant regions.

I hope this helps! Jens

ysbioinfo commented 3 years ago

Hi Jens, Many thanks for your reply. If I understand correctly, I can run PrepareAA on my initial BAM file and then downsampling at AA step. Am I right? Another question is, as AA does not support multi-thread now, I want to allocate the resource (CPU, memory) on our HPC more efficiently. So I'm wondering how much memory does AA need if running on a 10X WGS BAM file?

Thanks a lot!

Yang

jluebeck commented 3 years ago

Hi Yang,

That sounds very reasonable regarding the downsampling strategy you proposed.

Are you using PrepareAA + CNVKit to generate the CNV seeds (highly recommended)? If so, to optimize your compute resource usage, you may consider running PrepareAA with multiple threads set (for the CNV calling stage), but do not set --run_AA, so that it generates all required files needed by AA, but does not then invoke AA. If original bam is 30X, I suggest maybe setting 15 Gb RAM allocation for this step.

Then, you can set a single threaded job which calls AA (not PAA), and specifies your bam and the *_AA_CNV_SEEDS.bed file generated by PAA, as well as whatever dowsampling level you would like to use while searching the bam. Leaving the dowsampling argument blank will default to 10X coverage. I would suggest you provide 8 Gb (maybe 10 Gb) RAM for AA. The more important number is the walltime allocation, which you may want to set very high. Most AA jobs will finish within 12 hours, however there are certain extremely difficult cases that can take several days to finish.

You can always run the CNV calling stage with 1 thread specified per job if you want to simplify all this, and which would allow you to then just set --run_AA at the end of each task and you do not need to worry about running a separate standalone AA job. Just keep in mind that the CNVKit runtime will take longer this way, and you could be looking at ~3hrs of to run CNVKit in serial for the 30x bam (rough estimate).

Jens

ysbioinfo commented 3 years ago

Thanks for your recommendation. It's very helpful for me! I've tried downsampling the BAM file to 10X but the run time is still so long (~2 days). I noticed that the main reason for such a long time is that I got so many CNV seeds (1148 lines in the CNV_SEEDS.bed file). I think this is strange according to your guideline in PrepareAA. I use CNVkit for PrepareAA and this is my command: python /home/yangshi/software/PrepareAA/PrepareAA.py -s A2780_PTX -t 24 --cnvkit_dir /home/yangshi/software/anaconda3/envs/cnvkit/bin/cnvkit.py --sorted_bam /shared/yangshi/ABCB1_WGS/bam_wgs_rename/A2780_PTX.realigned.bam

All parameters are set to default, so I'm wondering which step calls such strange result.

ysbioinfo commented 3 years ago

Oh, I forget one important thing. My data is WGS data of a cancer cell line. Considering that the "tumor purity" of cell line is very high (≈1), could it affect (or amplify) the log ratio estimated by CNVkit and thus generate so many CNV seeds?

And I think this cell line should be more "abnormal" than common clinical tumor sample (e.g. those from PCAWG). May be its genome is heavily re-arranged...... I don't know if this could also be a reasonable explanation.

Looking forward to your professional advice! Thanks!

Yang

jluebeck commented 3 years ago

Hi Yang,

That is strange that you got so many CNV seeds. What is the range of CNs estimated for these seeds? If you would like please feel free to email me the list of seed regions and I can provide some feedback about the possible issues.

I will note that if sequencing coverage is extremely uneven for some reason, then this will cause many issues with accurate CNV identification, and actually AA cannot be used. The high tumor purity should not affect anything, AA was originally developed with cell line analysis in mind.

It is possible that extreme rearrangements can generate a very large number of seeds, however we have analyzed many rearranged genomes, and having such an extreme number of CNV seeds is highly unusual.

If you suspect there is some issue with your installation of the tools, you may consider trying a sample with a simple ecDNA structure, and which should produce a very small number of seeds: https://www.ncbi.nlm.nih.gov/sra/SRX5055022[accn]

Thanks, Jens

ysbioinfo commented 3 years ago

Hi Jens, Thanks for your reply! I'll first check my pipeline on the sample you recommended and get back to you ASAP.

Yang

ysbioinfo commented 3 years ago

Hi Jens, I sent an email to you. Could you please take a look at my output files and see if there is some hint for the strange result?

Thank you very much! Yang

jluebeck commented 3 years ago

Hi Yang,

I don't see any email from you in my inbox for some reason. Can you please try jluebeck[a t] eng.ucsd.edu.

Thank you! Jens

ysbioinfo commented 3 years ago

cellline.tar.gz test.tar.gz Hi jens, I sent a email to the new mailbox. In case the *tar.gz files are blocked by your server, I also upload the result files here. cellline.tar.gz is the result of my cellline (too many cnv seeds), and the test.tar.gz is the result of the SRA file you recommended (it seems that my pipeline is OK). Thanks!

Yang

jluebeck commented 3 years ago

Hi Yang,

It appears Mosek was incorrectly configured. The license file was not found.

Traceback (most recent call last): File "/home/yangshi/software/AmpliconArchitect/src/AmpliconArchitect.py", line 286, in <module> bamFileb2b.interval_filter_vertices(ilist, amplicon_name=amplicon_name, runmode=args.runmode) File "/home/yangshi/software/AmpliconArchitect/src/bam_to_breakpoint.py", line 2160, in interval_filter_vertices task.optimize() File "/home/yangshi/software/anaconda3/envs/amplicon/lib/python2.7/site-packages/mosek/__init__.py", line 145, in accept return fun(*[ t(a) for (t,a) in zip(argtlst,args) ]) File "/home/yangshi/software/anaconda3/envs/amplicon/lib/python2.7/site-packages/mosek/__init__.py", line 37, in sync return f(self,*args,**kwds) File "/home/yangshi/software/anaconda3/envs/amplicon/lib/python2.7/site-packages/mosek/__init__.py", line 8015, in optimize raise Error(rescode(res),msg) mosek.Error: (1008) License cannot be located. The default search path is ':/home/yangshi/mosek/8/licenses:'.

Please see the instructions for installing Mosek 8, and obtaining the Mosek license file in the AA README.

Thanks, Jens

ysbioinfo commented 3 years ago

Hi Jens, Sorry I forgot to change the PATH to Mosek 8. These days I tried another 20 WGS samples (FFPE) from a clinical study, and AA works well, 0-10 amplicons are found in each sample; the genes in these amplicons are also identified as hyper-amplified by another panel-seq assay. So I think there should be some issue in my cell line data and it's not worth taking up more of your time. Thanks a lot for all your kind help these days!

Best Yang