samtools / bcftools

This is the official development repository for BCFtools. See installation instructions and other documentation here http://samtools.github.io/bcftools/howtos/install.html
http://samtools.github.io/bcftools/
Other
672 stars 240 forks source link

RNAseq variant calling using bcftools #579

Closed prasundutta87 closed 7 years ago

prasundutta87 commented 7 years ago

Hi,

Is there any special recommendation that should be followed while calling variants from RNAseq BAM files using bcftools? For example, limiting oneself only to primary alignment (-q 60 option) or using --incl-flags 0x400 option if duplicates are marked in your BAM file?

chahatupreti commented 4 years ago

Hi Prasun!

I see that this issue has been closed by you. I have the same question. Did you find an answer?

prasundutta87 commented 4 years ago

Hi Chahat,

Do you really have to call variants using RNA-seq data? Don't you have DNA-seq data for the same sample/organism? Calling variants from RNA-seq data is always questionable due to the simple fact that you will miss a lot of variants in places where a gene is not expressed.

For primary alignment query, please see this post-https://github.com/samtools/bcftools/issues/710

Marking duplicates in RNA-seq data is also questionable because a duplicate marking tool cannot differentiate if the read is actually a PCR duplicate or it is due to the fact that a particular gene is highly expressed. However, to get some perspective, you can go through this post-https://www.biostars.org/p/66831/.

Regards, Prasun

chahatupreti commented 4 years ago

Hi Prasun,

Thanks for the reply. Yes, we don't have DNA-seq data for these samples, we only have RNA-seq data. And although, as you said, we will only find expressed variants, but we are fine with it. We are also interested in variants in enhancers that get expressed, so hopefully we get something useful there.

Thanks for the link about the supplementary reads.

Regarding marking duplicates, I ran MarkDuplicates by Picard on a few of the samples, and the duplicate rates were around 60% which is a lot, and its clear that a lot of this could be resulting from highly expressed genes. So, I will probably not be marking duplicates at all.

So ya, RNA-seq data is all we have, and we are hoping to find expressed variants in genes and enhancers. Any thoughts/recommendations?

prasundutta87 commented 4 years ago

If you are only interested in expressed variants, go ahead with what you are doing. Additionally, if you have done total-RNA-seq, you can include introns as well, but if it is mRNA-seq, make sure you remove all intronic variants, those will be present due to mapping errors. Regarding enhancers, they are regulatory DNA sequences, you will miss finding variants in them as they are not part of the actual coding DNA region. Unless, you are referring to transcription factor genes..

chahatupreti commented 4 years ago

Thanks! And in your experience, are the number of expressed variants tiny in comparison to the total meaningful variants in DNA? Because if that is the case, then yes we would actually be losing a lot!

On Fri, Feb 7, 2020 at 5:12 AM prasundutta87 notifications@github.com wrote:

If you are only interested in expressed variants, go ahead with what you are doing. Additionally, if you have done total-RNA-seq, you can include introns as well, but if it is mRNA-seq, make sure you remove all intronic variants, those will be present due to mapping errors. Regarding enhancers, they are regulatory DNA sequences, you will miss finding variants in them as they are not part of the actual coding DNA region. Unless, you are referring to transcription factor genes..

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/samtools/bcftools/issues/579?email_source=notifications&email_token=ACJRBKPEH6Z6EXUJWVZHT6TRBU6YVA5CNFSM4DFTVUZKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOELCSNMY#issuecomment-583345843, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACJRBKN5RO7LCO47NIY5CDTRBU6YVANCNFSM4DFTVUZA .

--

Chahat Upreti Third Year PhD student, Molecular and Cell Biology University of Texas at Dallas

prasundutta87 commented 4 years ago

It depends what is meaningful to you..if you are interested in the variation of the regulatory region, you will obviously not get them using RNA-seq variants (unless you wanna do an allele-specific expression analysis which signposts genes showing regulatory variation in cis). If you are only interested in coding variants and want to find synonymous or non-synonymous variants, you can do what you are doing..what exactly are you looking for? Just remember.. anything outside exonic regions in a mRNA-seq experiment is noise..if you want to look beyond that, go for DNA-seq or at least total RNA-seq.

Wulei26 commented 1 year ago

Hi Prasun! I have the same problem!did you figure out the problem ? I'm trying to use bcftools to call variants through the following parameters using RNA-Seq data:

bcftools mpileup -Ou -R $BED -f $reference $bam_in | bcftools call -mv > ${oudir}/bcftools.default.vcf

I use the -R parameters to specifies exon regions, the bam file was generated by STAR and i didn't remove duplicates. Is there a problem with my workflow? How should I filter my varinats? I would appreciate your reply! wulei