ratschlab / spladder

Tool for the detection and quantification of alternative splicing events from RNA-Seq data.
Other
103 stars 33 forks source link

Is_annotated always 3 on negative strand #188

Closed perllb closed 11 months ago

perllb commented 1 year ago

Description

I observe in several samples that all exon-skip events on negative strand are "is_annotated" = 3. While events on positive strand has all "is_annotated" values. This seems a bit suspicious?

What I Did

Counter(list(spladder[spladder["strand"] == "-"]["is_annotated"])) Counter({3: 3192})


Same pattern for 5 different samples.
riasc commented 1 year ago

Was wondering about the same thing. Have you found any explanation for this?

perllb commented 1 year ago

I have not.

Taking a look at the code.

spladder/classes/event.py:

def set_annotation_flag(self, anno_introns):

        self.sort_exons()

        ### check annotation status of isoform 1
        self.annotated = 3
        for i in range(self.exons1.shape[0] - 1):
            if not (self.exons1[i, 1], self.exons1[i + 1, 0]) in anno_introns:
                self.annotated -= 1
                break

        ### check annotation status of isoform 2
        if self.exons2.shape[0] < 2:
            return
        for i in range(self.exons2.shape[0] - 1):
            if not (self.exons2[i, 1], self.exons2[i + 1, 0]) in anno_introns:
                self.annotated -= 2
                break

So it sets is_annotated = 3 by default. Then if the given intron (defined by the neighbouring exon-junctions) is NOT in "anno_introns", then it will alter the is_annotated accordingly. That is, each exon-exon junction of the isoform must be in the annotation gtf for the is_annotated to stay "3".

I do not see how this can fail for only genes on negative strand?

Other options:

perllb commented 1 year ago

I suspect the issue is "more serious" than the assignment of the is_annotated value itself.

The total number of exon-skip events detected on negative strand is much lower than on positive, suggesting that the problem is not in assigning the is_annotated value, but rather that the actual detection of negative strand events have issues:


> spladder = pd.read_csv(f"merge_graphs_exon_skip_C3.confirmed.txt.gz",sep="\t")
> Counter(list(spladder[spladder["strand"] == "+"]["is_annotated"]))
Counter({1: 1245, 3: 5038, 2: 2195, 0: 727})

> Counter(list(spladder[spladder["strand"] == "-"]["is_annotated"]))
Counter({3: 3192})
perllb commented 1 year ago

Seems like running with STAR-mapped bam causes this issue. While running with HISAT2-mapped bam does not.

perllb commented 11 months ago

This issue seems to be solved by adding the following parameters to STAR:

   --outSAMstrandField intronMotif \
   --outSAMattributes NH HI NM MD AS XS \

STAR manual: https://physiology.med.cornell.edu/faculty/skrabanek/lab/angsd/lecture_notes/STARmanual.pdf

For unstranded RNA-seq data, Cufflinks/Cuffdiff require spliced alignments with XS strand attribute,
which STAR will generate with --outSAMstrandField intronMotif option. As required, the XS
strand attribute will be generated for all alignments that contain splice junctions. The spliced
alignments that have undefined strand (i.e. containing only non-canonical unannotated junctions)
will be suppressed.