Closed tamaraprieto closed 7 years ago
Hey Tamara.
I have seen your request but have not had time to check in detail what is going on. The example output was not generated by the same script, so differences are expected. The example data are the output from the primer-trimming that is done in the cloud-based analysis, which is more sophisticated (checks primer sequence, allows for read-start wobbling) than the qgetcounts. However, counts should be similar, and if not, the same bias would apply to both, sample and control.
Many counts are quite similar, but the sorting order of lines presented is different (which is not problem at all), so that a thorough comparison here is not easy.
I will run the qgetcounts on the data as well and check the differences - and then update here.
Sorry to keep you waiting.
Frank
Hi Frank,
Have you had time to look at this?I understand it is just a different order. However, the number of amplicons in the .csv file seems to be the double than the number of amplicons present in the bed file. I also get double counts in my own data. Is there any reason. This is the example data:
head -4 NA13019.csv
chromosome,start,end,locus,amplicons,outliers,usable,expected,log2,copies,min,max,qp,sd,score,p.val,filter
chr1,5923229,6046380,NPHP4,*164*,3,161,2,0,2,1.92,2.09,35,0.35,0,9.7e-01,q25
chr2,29416015,30143141,ALK,*164*,6,158,2,0,2,1.92,2.09,35,0.34,0,9.9e-01,q25
chr2,47600571,47613837,EPCAM,*36*,12,24,2,-0.04,1.95,1.77,2.14,27,0.28,1,5.7e-01,q25
cut -f 6 CNA902Y.bed | uniq -c
82 NPHP4
97 ERCC6
66 MYPN
31 PSAP
39 KCNQ1
59 FLT3
54 RB1
73 ATP7B
35 MEFV
43 CDH1
25 MAP2K2
82 ALK
18 EPCAM
55 ITCH
66 USP25
49 USP16
23 RUNX1
27 CBS
21 TFG
38 SLC9A9
55 PIK3CA
28 SLC26A2
35 POLH
34 POU6F2
40 RUNX1T1
209 CSMD3
35 RAD21
62 JAK2
25 MTAP
42 FANCB
64 UBA1
21 GATA1
75 ATP7A
37 CHM
38 BTK
Thanks in advance,
Tamara
Hi Tamara,
The reason for the number of counts is 2x the number of amplicons is that we do use counts per primer and amplicons have two primers. That does make sense, e.g. in some cases, one half of a certain amplicon can have some homology with other regions in the genome, so that some reads are distributed, while the other half of the amplicon can be mapped uniquely. However, in most cases, counts per amplicon should be fairly similar – especially if the reads are long and the mapping algorithm does some mate rescue.
For the newest enrichment product, the V3 chemistry, there is only one single primer used, which will eliminate this issue. There will be only one count per amplicon. The new chemistry and protocol generate much more robust counts, which is obvious because the differences in neighboring regions are a lot less than with two primer (PCR) amplification.
I hope that answers your question? Please do not hesitate to ask again if something remains unclear.
Best, Frank
Hi Frank,
Thanks a lot for the explanation. In my case, the NGS libraries were constructed using the AmpliSeq technology so I do not have primer sequences and I cannot get that kind of counts. Besides, amplicons are ~120bp and reads (Ion Torrent) ~140bp so I should cover the whole amplicon almost completely.
Do you think I can use the output of Quandico directly or should tune some parameters? This is my command line:
quandico \
-s map=Tumor.bam -s x=1 -s y=1 \
-r map=Normal.bam -r x=1 -r y=1 \
--extract \
-X minmapq=5 \
--assembly data=hs37d5.fa \
--cluster --engine qcluster \
-C m=20 \
--amplicons Target.bed \
-d ${WORKDIR} -b Tumor_Normal \
--cp names=refGene.txt
Do you think I should increase the minimal absolute regions per cluster for instance?Or should I modify the parameter primerlen within qgetcounts?
Hi Tamara,
We have - for obvious reasons - optimized the code to work within our own pipeline, running Qiagen products. And we have found that counts from primer trimming provide the most robust measure to quantify the amount of template DNA that was present at the locus. That procedure circumvents problems in regions that have overlapping amplicons (where raw coverage is a mixture of two counts, actually - or even more). And, also for obvious reasons, we have not done any parameter optimizations for AmpliSeq data. My recommendation are:
Sorry for not being able to provide some out-of-the-box ideal solution for your task.
Best, Frank
Hi Frank, Thank you very much. That was very helpful. I will follow your recommendations. Best, Tamara
Hi,
I am running the first part of the analysis on the example bam file (downloaded from drive) and I get different results. This is the command line:
An this is what I get:
However, the output file you provide is like this:
Should I introduce the reference base afterwards?I don't even get the same counts (It could be because of filters) but coordinates are not the same. Do you have any idea?
Thank you in advance
Tamara