piquelab / QuASAR

Quantitative Allele Specific Analysis of Reads. Joint genotyping & ASE inference for RNA-seq data
MIT License

Removal of legit variants in awk filter during pileup pre-processing step #17

Closed VitorAguiar closed 8 months ago

VitorAguiar commented 1 year ago

During the pre-processing of the pileup file, variants like the one below are removed:

chr11 35208126 T 49 ccc,CCCCc,C,><CC<<><<>><ccc.Ccc.CcCCcc,cC.cCcccc.

That happens because of the awk filter below, which removes any site where at least one read has a reference skip at that position (e.g., an exonic variant that is spliced out in some RNA molecules). Reference skips are indicated by the characters ">" and "<" in the 5th column of the pileup file.

awk -v OFS='\t' '{ if ($4>0 && $5 !~ /[^\^][<>]/...
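To make the effect concrete, here is a minimal sketch that applies only the visible part of the condition above to the example line. The rest of the command is elided in this issue and is not reproduced. The helper name `count_kept` and the two made-up control lines are assumptions for illustration, not part of QuASAR.

```shell
# count_kept is a hypothetical helper: it prints how many of the lines on stdin
# pass the visible part of the filter condition quoted above.
count_kept() {
  awk -F'\t' '$4 > 0 && $5 !~ /[^\^][<>]/ { n++ } END { print n + 0 }'
}

# Line from this issue: it contains reference skips, so it is dropped (prints 0).
printf 'chr11\t35208126\tT\t49\t%s\n' \
  'ccc,CCCCc,C,><CC<<><<>><ccc.Ccc.CcCCcc,cC.cCcccc.' | count_kept

# Made-up line with no skips: kept (prints 1).
printf 'chr1\t100\tA\t3\t%s\n' '.,c' | count_kept

# Made-up line where "<" is the mapping-quality character after "^": kept (prints 1).
printf 'chr1\t100\tA\t3\t%s\n' '..^<.' | count_kept
```

The `[^\^]` part of the pattern appears intended to avoid treating a mapping-quality character that follows a read-start marker `^` as a skip. A single real skip anywhere in the column is enough to drop the line.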

I believe the variant should not be removed since, although it is skipped by 10 reads, it is documented by 39 reads (8 matching the REF allele, and 31 matching the ALT allele).
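The counts above can be checked with a small sketch that tallies the symbols in the 5th pileup column. The function name `count_pileup_bases` is hypothetical, and the parsing is simplified: it treats any non-reference base letter as ALT and ignores deletion placeholders and read-end markers.

```shell
# count_pileup_bases is a hypothetical helper: given the read-bases string
# (5th pileup column), it prints "REF ALT SKIP" counts.
count_pileup_bases() {
  printf '%s\n' "$1" | awk '{
    s = $0; n = length(s); i = 1
    ref = alt = skip = 0
    while (i <= n) {
      c = substr(s, i, 1)
      if (c == "^") { i += 2; continue }   # read start marker plus mapping-quality char
      if (c == "+" || c == "-") {          # indel: sign, length digits, then that many bases
        len = 0; i++
        while (substr(s, i, 1) ~ /[0-9]/) { len = len * 10 + substr(s, i, 1); i++ }
        i += len; continue
      }
      if (c == "." || c == ",") ref++                # match to the reference
      else if (c == "<" || c == ">") skip++          # reference skip (e.g. spliced read)
      else if (c ~ /[ACGTNacgtn]/) alt++             # mismatch, counted as ALT here
      i++                                            # anything else ("$", "*") is ignored
    }
    print ref, alt, skip
  }'
}

count_pileup_bases 'ccc,CCCCc,C,><CC<<><<>><ccc.Ccc.CcCCcc,cC.cCcccc.'
# prints: 8 31 10
```

The three numbers sum to 49, which matches the depth in the 4th column of the example line.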

For example, GATK's ASEReadCounter keeps the variant.

Could you please clarify the justification for removing variants such as the one in my example?

rpique commented 8 months ago

We haven't investigated this enough. I think we wanted to focus on variants that are unlikely to be affected by splicing, but it should be possible to keep them when using an aligner that does a good job of mapping reads in this scenario.