Open GrantHov opened 5 years ago
Any update on this issue?
By "distant" do you mean ~450 kb apart? I guess that may be distant for Candida albicans, but not for a mammalian genome. It's hard to see what went wrong by looking at the "merged" output of many samples: a few spurious read alignments in one sample can ruin a "locus" for the rest of them. I see quite a few transcripts there with a relatively large "intron", e.g. MSTRG.11.1, MSTRG.11.13, MSTRG.11.18. Those are likely coming from read alignments in one or more samples where the aligner decided that a 450 kb intron was acceptable and the best way to align those reads. If that intron is too large for your organism, I think you should limit the maximum intron size allowed during alignment (hopefully the aligner you used has that option) or, less preferably, filter such alignments out of your BAM file.
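For the fallback of filtering the alignments, here is a minimal sketch of one way it could be done on SAM text, assuming that the spurious "introns" show up as N operations in the CIGAR field (column 6). The function names and the idea of streaming SAM lines are illustrative assumptions, not part of StringTie or of any aligner.

```python
import re

# One CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def max_intron_length(cigar):
    """Return the longest N (skipped region) in a CIGAR string, or 0 if none."""
    if cigar == "*":
        return 0
    return max((int(n) for n, op in CIGAR_OP.findall(cigar) if op == "N"), default=0)


def filter_sam(lines, max_intron):
    """Yield header lines and alignments whose longest intron is <= max_intron."""
    for line in lines:
        if line.startswith("@"):  # header lines pass through unchanged
            yield line
            continue
        fields = line.split("\t")
        # Column 6 (index 5) of a SAM record is the CIGAR string.
        if len(fields) > 5 and max_intron_length(fields[5]) > max_intron:
            continue  # drop alignments spanning an implausibly large intron
        yield line
```

In practice the lines could come from `samtools view -h` and the result be piped back into `samtools view -b`. Note that this sketch drops reads one at a time, so the mate of a discarded read is kept with its paired flags unchanged; whether that matters for the downstream assembly is something to check.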
Although not recommended, you could also try the -j option of StringTie (when you assemble each sample) in an attempt to filter out low-coverage introns (assuming those are just rare, bad alignments). That has the side effect of reduced sensitivity (low-expression isoforms might be lost), and it might not help at all if the aligner consistently aligned multiple reads with the same large "introns" in a region (e.g. due to some short local repeats, preferring the large-intron alignments over the shorter or ungapped ones).
After running stringtie merge, some transcripts from several distant genomic locations appear under the same gene_id. Below is an example (sorry if it's too long). The gene_id MSTRG.11 is repeated many times across different transcripts.
I use v1.3.6. My command is: