vibansal / crisp

Code for multi-sample variant calling from sequence data of pooled or unpooled DNA samples
MIT License
19 stars 8 forks source link

Allele Balance #14

Open mullerbsf opened 4 years ago

mullerbsf commented 4 years ago

Dear Vikas Bansal,

I am trying to estimate the allele balance for each sample genotyped with CRISP. Is there a way to obtain that information from the output?

For example, from a total of 100 reads supporting the call, a heterozygous sample might have 80 reads to the reference allele and 20 reads to the alternative allele.

From the FORMAT field, I see the following options: GT:GQ:DP:ADf:ADr:ADb. The total depth seems to be included, but it is not clear to me how to obtain how many of these reads were relative to each allele.

Thank you!

vibansal commented 4 years ago

ADf is a comma separated list of the read counts for each allele on the forward strand. You can sum up the ADf, ADr and ADb values for each allele to get the allele balance.

mullerbsf commented 4 years ago

Thank you.

I am still confused about some of the results, as often they seem to not add up. Take the examples below that I retrieved from a representative marker:

FORMAT: GT:GQ:DP:ADf:ADr:ADb

SAMPLE 1: 0/0/0/1/1:0:18:2,3:4,1:0,0 SAMPLE 2: 0/0/1/1/1:3:25:2,9:4,3:0,0

It seems like the total depth is 18 and 25, respectively. The sums of the ADf, ADr and ADb fields, however, are not equal to the total depth.

I appreciate your attention. Regards

vibansal commented 4 years ago

The total read depth (DP) includes all reads. Some of these may be filtered out for the calculation of ADf, ADr and ADb which are used for variant calling.

Update: this is not correct

mullerbsf commented 4 years ago

Dear Bansal,

Then, I did not understand why the DP definition from CRISP VCF output says: "filtered reads only"? DP: "Total number of reads (+strand,-strand,bidirectional) across all pools (filtered reads only)".

This definition generates all the misunderstandings about CRISP output.

Thank you!

vibansal commented 4 years ago

I have checked the code and the above definition is correct.

DP includes counts for all alleles at the site so it can be greater than the counts for the two alleles output in the VCF. For example, a site with a ref allele = 'A' and variant allele = 'C' can have reads supporting 'G' or 'T'. These reads are included in the DP count.

mullerbsf commented 4 years ago

Dear Bansal,

I checked the BAM files, and your argument about other reads supporting other possible alleles is not valid (see attached the screenshot from IGV). This a clear SNP with only reads supporting REF=C and ALT=T, no reads are supporting the A or G. Therefore, the DP field from CRISP output is showing the total reads depth without filtering. For the ADf, ADr and ADb there are some filters applied in the counts.

SAMPLE 1

SAMPLE1_D710_D501

SAMPLE 2

SAMPLE2_D701_D501

When compare the CRISP and FreeBayes outputs, the FreeBayes is giving the correct number of DP and AD fields without filtering.

CRISP output

1 19545 Chla_rei_1_19545 C T 3470 PASS
FORMAT: GT:GQ:DP:ADf:ADr:ADb
SAMPLE 1: 0/0/0/0/1:9:40:10,5:11,2:1,0
SAMPLE 2: 0/0/0/0/1:14:51:17,2:9,4:0,1

FreeBayes output

1 19545 Chla_rei_1_19545 C T 12110.9 .
FORMAT: GT:DP:AD:RO:QR:AO:QA
SAMPLE 1: 0/0/0/1/1:41:23,18:23:893:18:552 SAMPLE 2: 0/0/1/1/1:52:26,26:26:1057:26:909

I want to use the CRISP output to run SNPGenie software. To be able to do this, I need to double-check with you about the DP field and all the ADf, ADr and ADb.

Based on my interpretation of CRISP output, I will need to create a code to sum all these fields to replace the DP and AD fields.

Then, for the first sample, it will be something like this: FORMAT: GT:GQ:DP:ADf:ADr:ADb
SAMPLE 1: 0/0/0/0/1:9:40:10,5:11,2:1,0

The update file will be: FORMAT: GT:GQ:DP:AD SAMPLE 1: 0/0/0/0/1:9:29:22,7

Can you confirm if this is correct?

Thank you very much for this help.

vibansal commented 4 years ago

Yes, you are correct. I will have to update the code to output the correct DP value.