reineckef / quandico

Quantitative analysis of differences in copy numbers using read depth obtained from PCR-enriched samples and controls
GNU General Public License v3.0
3 stars 2 forks source link

qgetcounts example output #6

Closed tamaraprieto closed 7 years ago

tamaraprieto commented 7 years ago

Hi,

I am running the first part of the analysis on the example bam file (downloaded from drive) and I get different results. This is the command line:

qgetcounts -i M62_NA13019.bam -a CNA902Y.bed --samtools /opt/samtools/1.3/intel/2016/bin/samtools | head

An this is what I get:

# Creating the BAM index as the file 'M62_NA13019.bam.bai' is not there...

chr1    5923229 0   N   NPHP400001  484 NPHP4
chr1    5923339 1   N   NPHP400001  1119    NPHP4
chr1    5923335 0   N   NPHP400002  548 NPHP4
chr1    5923438 1   N   NPHP400002  553 NPHP4
chr1    5923429 0   N   NPHP400003  849 NPHP4
chr1    5923547 1   N   NPHP400003  1004    NPHP4
chr1    5923905 0   N   NPHP400004  1487    NPHP4
chr1    5924025 1   N   NPHP400004  1428    NPHP4
chr1    5923969 0   N   NPHP400005  849 NPHP4
chr1    5924081 1   N   NPHP400005  861 NPHP4

However, the output file you provide is like this:

head M62_NA13019.tsv
chr1    5923229 0   A   NPHP4   1129    NPHP4
chr1    5923335 0   A   NPHP4   586 NPHP4
chr1    5923341 1   T   NPHP4   1126    NPHP4
chr1    5923429 0   A   NPHP4   939 NPHP4
chr1    5923440 1   A   NPHP4   619 NPHP4
chr1    5923549 1   A   NPHP4   1014    NPHP4
chr1    5923905 0   T   NPHP4   1500    NPHP4
chr1    5923969 0   A   NPHP4   883 NPHP4
chr1    5924027 1   G   NPHP4   1435    NPHP4
chr1    5924046 0   C   NPHP4   970 NPHP4

Should I introduce the reference base afterwards?I don't even get the same counts (It could be because of filters) but coordinates are not the same. Do you have any idea?

Thank you in advance

Tamara

reineckef commented 7 years ago

Hey Tamara.

I have seen your request but have not had time to check in detail what is going on. The example output was not generated by the same script, so differences are expected. The example data are the output from the primer-trimming that is done in the cloud-based analysis, which is more sophisticated (checks primer sequence, allows for read-start wobbling) than the qgetcounts. However, counts should be similar, and if not, the same bias would apply to both, sample and control.

Many counts are quite similar, but the sorting order of lines presented is different (which is not problem at all), so that a thorough comparison here is not easy.

I will run the qgetcounts on the data as well and check the differences - and then update here.

Sorry to keep you waiting.

Frank

tamaraprieto commented 7 years ago

Hi Frank,

Have you had time to look at this?I understand it is just a different order. However, the number of amplicons in the .csv file seems to be the double than the number of amplicons present in the bed file. I also get double counts in my own data. Is there any reason. This is the example data:

head -4 NA13019.csv

chromosome,start,end,locus,amplicons,outliers,usable,expected,log2,copies,min,max,qp,sd,score,p.val,filter
chr1,5923229,6046380,NPHP4,*164*,3,161,2,0,2,1.92,2.09,35,0.35,0,9.7e-01,q25
chr2,29416015,30143141,ALK,*164*,6,158,2,0,2,1.92,2.09,35,0.34,0,9.9e-01,q25
chr2,47600571,47613837,EPCAM,*36*,12,24,2,-0.04,1.95,1.77,2.14,27,0.28,1,5.7e-01,q25

cut -f 6 CNA902Y.bed | uniq -c

     82 NPHP4
     97 ERCC6
     66 MYPN
     31 PSAP
     39 KCNQ1
     59 FLT3
     54 RB1
     73 ATP7B
     35 MEFV
     43 CDH1
     25 MAP2K2
     82 ALK
     18 EPCAM
     55 ITCH
     66 USP25
     49 USP16
     23 RUNX1
     27 CBS
     21 TFG
     38 SLC9A9
     55 PIK3CA
     28 SLC26A2
     35 POLH
     34 POU6F2
     40 RUNX1T1
    209 CSMD3
     35 RAD21
     62 JAK2
     25 MTAP
     42 FANCB
     64 UBA1
     21 GATA1
     75 ATP7A
     37 CHM
     38 BTK

Thanks in advance,

Tamara

reineckef commented 7 years ago

Hi Tamara,

The reason for the number of counts is 2x the number of amplicons is that we do use counts per primer and amplicons have two primers. That does make sense, e.g. in some cases, one half of a certain amplicon can have some homology with other regions in the genome, so that some reads are distributed, while the other half of the amplicon can be mapped uniquely. However, in most cases, counts per amplicon should be fairly similar – especially if the reads are long and the mapping algorithm does some mate rescue.

For the newest enrichment product, the V3 chemistry, there is only one single primer used, which will eliminate this issue. There will be only one count per amplicon. The new chemistry and protocol generate much more robust counts, which is obvious because the differences in neighboring regions are a lot less than with two primer (PCR) amplification.

I hope that answers your question? Please do not hesitate to ask again if something remains unclear.

Best, Frank

tamaraprieto commented 7 years ago

Hi Frank,

Thanks a lot for the explanation. In my case, the NGS libraries were constructed using the AmpliSeq technology so I do not have primer sequences and I cannot get that kind of counts. Besides, amplicons are ~120bp and reads (Ion Torrent) ~140bp so I should cover the whole amplicon almost completely.

Do you think I can use the output of Quandico directly or should tune some parameters? This is my command line:

quandico \
   -s map=Tumor.bam -s x=1 -s y=1 \
   -r map=Normal.bam -r x=1 -r y=1 \
   --extract \
   -X minmapq=5 \
   --assembly data=hs37d5.fa \
   --cluster --engine qcluster \
   -C m=20 \
   --amplicons Target.bed \
   -d ${WORKDIR} -b Tumor_Normal \
   --cp names=refGene.txt

Do you think I should increase the minimal absolute regions per cluster for instance?Or should I modify the parameter primerlen within qgetcounts?

reineckef commented 7 years ago

Hi Tamara,

We have - for obvious reasons - optimized the code to work within our own pipeline, running Qiagen products. And we have found that counts from primer trimming provide the most robust measure to quantify the amount of template DNA that was present at the locus. That procedure circumvents problems in regions that have overlapping amplicons (where raw coverage is a mixture of two counts, actually - or even more). And, also for obvious reasons, we have not done any parameter optimizations for AmpliSeq data. My recommendation are:

  1. Try to get counts that are exclusively made by a single amplicon, e.g. create a bed file with regions that have single amplicon coverage; then get the counts for these (bedtools can do that efficiently). Experiment on including or excluding regions where amplicons overlap - these might add noise, but some might work well.
  2. Experiment with numbers for clustering to get your desired balance of resolution (smaller clusters) and robustness (larger clusters).

Sorry for not being able to provide some out-of-the-box ideal solution for your task.

Best, Frank

tamaraprieto commented 7 years ago

Hi Frank, Thank you very much. That was very helpful. I will follow your recommendations. Best, Tamara