running time and usage of AmpliconArchitect

lwlive commented 2 years ago

Hi,
I ran AmpliconArchitect with the following command: "python2 AmpliconArchitect.py --ref None --downsample 10 --bed WGS000100002tumor_AACNVbed.bed --bam WGS000100002tumor.dup.realign.bam --runmode FULL --out WGS000100002tumor.Amplicon", but I have encountered some problems. Could you please give me a hand? (1) Although I used "--downsample" to reduce the data, the run has been going for a week and has not finished yet. Is there something wrong with my command? (2) I am wondering how to build an AA_DATA_REPO data set, because I need to add HPV. Looking forward to your kind response. Thank you!

jluebeck commented 2 years ago

Hi,

(1) Some very complex samples may take a long time to finish (possibly up to 2 weeks), but that is very rare. Typical reasons for long runtimes involve how the seed regions (--bed file) were selected, or the presence of artifactual discordant reads arising from a poorly controlled insert size distribution during library prep.

Some questions to help determine if the seed file is the issue: Which CNV caller, copy number cutoff, and size cutoff were used to create your bed file? How many distinct intervals does it contain, and how much of the genome do they cover? Did the bed file undergo filtering using amplified_intervals.py? You may consider using the PrepareAA wrapper to standardize the selection of seed regions using our current best practices.
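To answer the interval-count and genome-coverage questions quickly, something like the following could be used. It is only a minimal sketch and not part of AA or PrepareAA: the function name summarize_bed is made up for illustration, and it assumes a plain whitespace-separated bed file with chromosome, start, and end in the first three columns. Overlapping intervals are counted twice.

```python
def summarize_bed(path):
    """Return (interval_count, total_bp) for a BED-like seed file.

    Hypothetical helper for illustration only; assumes columns 1-3 are
    chromosome, start, end, separated by whitespace.
    """
    count = 0
    total_bp = 0
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            # Skip blank lines, comments, and common BED header lines.
            if len(fields) < 3 or fields[0].startswith("#") or fields[0] in ("track", "browser"):
                continue
            start, end = int(fields[1]), int(fields[2])
            count += 1
            total_bp += end - start  # overlaps are not merged
    return count, total_bp
```

Calling summarize_bed on the seed file passed to --bed gives the two numbers asked about above. A very large interval count or total size is a hint that the seed selection is the cause of the long runtime.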

(2) Assuming you are starting with a bam file aligned to a reference that included HPV, then you can simply add the viral genome to your seed --bed file. For example, if your viral genome had the name hpv16ref_1, then you would add the following entry to your bed file before running AA: hpv16ref_1 1 7906
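As a rough sketch of that step, the viral contig length can be read from the reference .fai index (tab-separated, with the sequence name in column 1 and its length in column 2) and appended to the seed bed file. The helper names below are hypothetical and not part of AA. The start coordinate of 1 simply mirrors the example entry above, and writing tab-separated columns is an assumption about how your bed file is formatted.

```python
def contig_length_from_fai(fai_path, contig):
    """Look up a contig's length in a samtools-style .fai index."""
    with open(fai_path) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if fields[0] == contig:
                return int(fields[1])  # column 2 of .fai is the sequence length
    raise KeyError("contig not found in index: " + contig)


def append_contig_to_bed(bed_path, contig, length, start=1):
    """Append a whole-contig entry to a bed file unless it is already there.

    Returns True if a line was added, False if the same entry already existed.
    """
    entry = (contig, str(start), str(length))
    with open(bed_path) as handle:
        content = handle.read()
    for line in content.splitlines():
        if tuple(line.split()[:3]) == entry:
            return False
    with open(bed_path, "a") as handle:
        # Make sure the new entry starts on its own line.
        if content and not content.endswith("\n"):
            handle.write("\n")
        handle.write("\t".join(entry) + "\n")
    return True
```

For the example above, contig_length_from_fai would be called with the .fai of the human + viral reference and the name hpv16ref_1, and the result passed to append_contig_to_bed together with the seed bed file.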

Please let me know if you have other questions or run into other issues! Jens

lwlive commented 2 years ago

Hi jluebeck, thanks for your kind and quick response, and sorry for my late reply. Your answers gave me some ideas. (1) The long running time may be caused by the bed file. I produced a bed file (genome.bed) for the hg38 and HPV genomes, split it into short regions ( python3 /cnvkit.py target --split --avg-size 5000 -o genome.splited.bed ), and produced a .cnn file for running cnvkit. Then I obtained the gained CNV regions with cnvkit ( cnvkit.py batch ) and filtered them with amplified_intervals.py ( --gain 4 --cnsize_min 10000 ). I used the PrepareAA wrapper as a reference, but I could not run PrepareAA itself because I cannot produce an AA_DATA_REPO directory with HPV in it.
Should I filter the genome.bed file and remove the centromeres? And what has been removed in *_cnvkit_filtered_ref.cnn?

(2) Maybe I should add "hpv16ref_1 1 7906" to GRCh38_cnvkit_filtered_ref.cnn and then run cnvkit and AA?

Your responses do help me! Thanks again!

jluebeck commented 2 years ago

Hi,

Thanks for the clarification. To produce your own data repo with hg38 + virus, take the original hg38 data repo, then replace the reference fasta and .fai files (plus any BWA index files you need) with the hg38 version you used for human + viral alignment. Then, in the data repo file file_list.txt, update the file names for the reference fasta and .fai. Releasing a viral version of the data repo is on my to-do list but I haven't gotten to it just yet, apologies. I would also recommend using --cngain 4.5, since there are possible false-positive amplification areas between 4 and 4.5. You can of course try both and compare results.
I would also recommend simply adding a bed entry for the entire viral sequence to the *_AA_CNV_SEEDS.bed file generated from PrepareAA before running AA.
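For the file_list.txt step, a minimal sketch could look like the following. The layout of file_list.txt is not described in this thread, so the sketch treats it as plain text and only swaps the old fasta file name for the new one wherever it appears. That also covers the .fai entry, provided the index is named after the fasta with .fai appended. The function name and the file names in the usage note are hypothetical, and a backup copy is written first.

```python
import shutil


def update_file_list(file_list_path, old_fasta_name, new_fasta_name):
    """Replace the old reference file name in file_list.txt with the new one.

    Hypothetical helper; makes no assumption about the file's column layout.
    Returns the number of occurrences that were replaced.
    """
    with open(file_list_path) as handle:
        text = handle.read()
    hits = text.count(old_fasta_name)
    if hits:
        # Keep a copy of the original so the change is easy to undo.
        shutil.copy2(file_list_path, file_list_path + ".bak")
        with open(file_list_path, "w") as handle:
            handle.write(text.replace(old_fasta_name, new_fasta_name))
    return hits
```

For example, update_file_list("file_list.txt", "old_reference.fa", "hg38_plus_hpv.fa") would rewrite both the fasta and the matching .fai entry. It is worth opening the file afterwards to confirm that only the intended lines changed.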

Thanks, Jens

lwlive commented 2 years ago

Hi, thanks for your kind response! I will follow your suggestions! Best wishes! Wei Liu

virajbdeshpande / AmpliconArchitect

running time and usage of AmpliconArchitect #118