Issue of "-e" causing duplication in abundance estimation needs to be revisited

JohnHadish commented 6 years ago

On the stringtie homepage not github , v1.3.1 supposedly fixed this issue:

"fixed a duplication of some output transcripts in -e mode (abundance estimation only)"

I am currently running stringtie on Arabidopsis rna-seq data sets and I am encountering duplication events that vary from sample to sample. The command I am running is:

stringtie -v -p 1 -e -o test.gtf -G TAIR10_Araport11.gtf -A test.ga -l test OutputFromHisat-Samtools_sort-Samtoools_Index.bam

I am using the arabidopsis gtf downloaded from TAIR, and the reads I am running are downloaded from NCBI. My workflow is trimmomatic, hisat2, samtools_sort, samtools_index, stringtie. I am running stringtie v1.3.4d. My workflow was run on a llinux slurm cluster, and I have also verified the same results on my local machine (Ubuntu 18.04).

I typically have 1 or 2 genes per run that have duplication events.

A colleague running the same workflow on bovine data is having the same issue.

Any help would be greatly appreciated. Thank you for your time!

gpertea commented 6 years ago

Is this easily reproducible ? That is, for a specific sample it always happens for the same transcripts (same genomic region)? Anyway I'd like to take a look at the data triggering this problem. It would be great if you could share a sample BAM and the -G file you used, TAIR10_Araport11.gtf -- is this directly downloadable from anywhere or you prepared/converted that GTF with other tools? I only see the GFF3 file at https://www.arabidopsis.org/download/index-auto.jsp?dir=%2Fdownload_files%2FGenes%2FAraport11_genome_release. I suppose you eliminated the possibility that the duplicates are coming from that gtf file itself..

If the whole sample BAM file is too large to conveniently share perhaps you can determine a subset of the alignments - i.e. an example region where the duplication occurs and if you extract the alignments overlapping that region, the smaller BAM file can also trigger the duplication in that region. Let me know if I should provide more assistance with this (there is a wiki page here where I documented a way of extracting a smaller BAM just for debugging stringtie issues in a particular genomic region: https://github.com/gpertea/stringtie/wiki/Extracting-bundle-data-for-debugging)

JohnHadish commented 6 years ago

Hi gpertea, Thank you for your quick response.

Sorry, the name of the gtf was confusing, it makes more sense in the context of our workflow. Yes, the gtf was produced from the file on the TAIR page named

Araport11_GFF3_genes_transposons.201606.gff.gz

using the command from Cufflinks

gffread -E Araport11_GFF3_genes_transposons.201606.gff -T -o- > TAIR10_Araport11.gtf

I cannot attach my file due to size constraints, but by downloading the above file and using that command you should be able to get the same reference file to work with. The file that results from the above command does not appear to have duplicates.

In the output files from stringtie on my arabidopsis data, each file seems to have a random duplication of 1 of three genes. Those genes are:

ATCG09900
AT1G79790
AT1G58520

Here is a subset of my list of my output files (whole list is very long), and below each file name is what gene is duplicated (bash uniq command). The name of the file indicates the srx ID from NCBI. Each of these samples had multiple runs associated with it (sra). Before running through the pipeline these “sra” were combined into one large fastq file corresponding to their respective “srx”. You will notice that each file had different combinations of genes that were duplicated, but that they were always 1 of the 3 genes listed above.

SRX2248274/SRX2248274_vs_TAIR10_Araport11.ga
SRX2248275/SRX2248275_vs_TAIR10_Araport11.ga
AT1G58520
SRX2248276/SRX2248276_vs_TAIR10_Araport11.ga
AT1G58520
SRX2248277/SRX2248277_vs_TAIR10_Araport11.ga
AT1G58520
SRX2248278/SRX2248278_vs_TAIR10_Araport11.ga
AT1G58520
SRX2248279/SRX2248279_vs_TAIR10_Araport11.ga
AT1G58520
ATCG09900
SRX2248280/SRX2248280_vs_TAIR10_Araport11.ga
AT1G58520
SRX2248281/SRX2248281_vs_TAIR10_Araport11.ga
AT1G58520
SRX2248282/SRX2248282_vs_TAIR10_Araport11.ga
AT1G58520
SRX2248283/SRX2248283_vs_TAIR10_Araport11.ga
AT1G58520
ATCG09900
SRX2248284/SRX2248284_vs_TAIR10_Araport11.ga
AT1G58520
ATCG09900
SRX2248285/SRX2248285_vs_TAIR10_Araport11.ga
AT1G58520
SRX2248286/SRX2248286_vs_TAIR10_Araport11.ga
AT1G58520
SRX2248287/SRX2248287_vs_TAIR10_Araport11.ga
AT1G58520

As I was thinking about it, I got paranoid that this combining of sra’s into srx may have caused the issue, so I redid it with a sample without combining (used sample SRR4426362). I got the same results. The attached files are a subset of the resulting *.bam that will cause a duplication event when used with this command and the above mentioned reference file:

stringtie -v -p 1 -e -o test1.gtf -G TAIR10_Araport11.gtf -A test1.ga -l catz bundle.bam

I used this command to see the duplicated gene in the gene abundance file:

awk '{print $1}' test1.ga|sort |uniq -d
# AT1G58520

I have also included the output gene abundance file from this subset bam.

I went through all of this again to verify that the results are still happening. Our workflow is using the latest editions of all of the other software as well.

I had to rename the attached files with a ".txt" extension to be able to upload them to git hub, so I hope that the bam still work when you get it.

Thanks

test1.ga.txt

bundle.bam.txt

gpertea commented 6 years ago

Ah, I thought you somehow suggested that transcripts were duplicated (the fix mentioned in the release notes for v1.3.1 was about that), but now I see that your problem is that genes are duplicated in the "gene abundance" output file. That's actually not quite true; each line in that output file is rather about estimating abundance for "gene" regions (loci), formed by overlapping transcripts, and in some cases such gene regions are not uniquely identified by their ID/name, the coordinates (genomic location) of each such "gene region" (locus) sometimes make a very important distinction.

Look at the transcripts for that gene (AT1G58520) in the annotation file: there are 7 transcripts there, but one of them (AT1G58520.3) is actually not overlapping any of the others, so StringTie treats it as a separate locus ("gene"), and thus it creates a separate entry in the gene abundance file for that particular locus (AT1G58520|Chr1(+)21729913-21731344) which is different from the other, "main" locus which has all the other overlapping transcripts: AT1G58520|Chr1(+)21732566-21738808
StringTie cannot "trust" the reference annotation, as sometimes the gene ID can be just a duplicated string providing no true locus identity.. So you can think of it as StringTie splitting that "gene" into two non-overlapping gene regions and assessing the expression for each gene region independently. Again, the identity of a "gene" (actually "locus") in that file is given by gene ID and the genomic location of the underlying cluster of transcripts which define that locus. As you can see, that single transcript gene region (locus Chr1(+):21729913-21731344) has zero coverage so I don't know if that gene region is even real (or perhaps just an annotation artifact).

If you really do not care about these situations (where gene definitions are not quite consistent with the transcripts so StringTie separates them like this), you could just add up the coverage values for all these lines with the same gene ID in the gene abundance file and call that the "total" abundance of that "gene" -- but this could be really misleading if the gene ID is not uniquely identifying a locus, i.e. if it's duplicated in other places on the genome (perhaps even on another chromosome).

JohnHadish commented 6 years ago

Ok, that makes a lot of sense, I like that stringtie identifies genes this way so that it does not combine only on a gene ID. What I am seeing though is that stringtie will "duplicate" some genes in only some of the output files.

It would make sense if it did it all the time, but I only see it happening in some of them. I guess what I am concerned about is that stringtie is Identifying different amounts of genes each time. So far for arabidopsis I have seen it identify either 37363 or 37364 genes.

From your last response, it seems that it should be identifying scenarios like the AT1G58520 and the AT1G58520.3 every single time it is run, not just occasionally. I was looking back at my data, and it appears that yes, this is true for AT1G58520, but it is not for ATCG09900. At the bottom of this response I have a grep of my gene abundance files for gene ATCG09900, and it appears that in some samples stringtie identifies 2 copies, but sometimes only 1.

I looked at the reference.gtf, and it looks like it should have 2 copies every time like AT1G58520 because even though it has the same gene ID, it has different genomic location. Your response above makes a lot of sense to me, stringtie should require both gene ID and genomic location to avoid bad reference.gtf files. My problem now is that I don't understand why in the case of ATCG09900 it is not being consistent across samples? I would think that the part of stringtie that identifies the gene from the reference.gtf would be independent of the sample.

Here is the reference.gtf for gene ATCG09900, I would expect that stringtie would always identify it as 2 genes from your last response:

ChrC    Araport11   exon    7967    7996    241.00  -   .   transcript_id "ATCG09900.1"; gene_id "ATCG09900"; gene_name "ATCG09900";
ChrC    Araport11   exon    8176    8207    241.00  -   .   transcript_id "ATCG09900.2"; gene_id "ATCG09900"; gene_name "ATCG09900";

Here is a section of a grep of my output gene abundance files that shows that ATCG09900 is sometimes represented by 2 copies, and sometimes only 1. It seems that its representation as 2 copies is independent of if it is actually expressed or not.

SRX2248582/SRX2248582_vs_TAIR10_Araport11.ga
ATCG09900   ATCG09900   ChrC    -   7967    7996    0.000000    0.000000    0.000000
ATCG09900   ATCG09900   ChrC    -   8176    8207    0.0 0.0 0.0
SRX2248583/SRX2248583_vs_TAIR10_Araport11.ga
ATCG09900   ATCG09900   ChrC    -   7967    8207    0.0 0.0 0.0
SRX2248584/SRX2248584_vs_TAIR10_Araport11.ga
ATCG09900   ATCG09900   ChrC    -   7967    7996    1.533333    0.790024    1.578678
ATCG09900   ATCG09900   ChrC    -   8176    8207    0.0 0.0 0.0
SRX2248585/SRX2248585_vs_TAIR10_Araport11.ga
ATCG09900   ATCG09900   ChrC    -   7967    8207    0.0 0.0 0.0
SRX2267752/SRX2267752_vs_TAIR10_Araport11.ga
ATCG09900   ATCG09900   ChrC    -   7967    8207    0.0 0.0 0.0
SRX2267753/SRX2267753_vs_TAIR10_Araport11.ga
ATCG09900   ATCG09900   ChrC    -   7967    8207    0.0 0.0 0.0
SRX2267754/SRX2267754_vs_TAIR10_Araport11.ga
ATCG09900   ATCG09900   ChrC    -   7967    8207    0.0 0.0 0.0
SRX2267755/SRX2267755_vs_TAIR10_Araport11.ga
ATCG09900   ATCG09900   ChrC    -   7967    7996    0.333333    0.276271    0.562470
ATCG09900   ATCG09900   ChrC    -   8176    8207    0.875000    0.725212    1.476486
SRX2267756/SRX2267756_vs_TAIR10_Araport11.ga
ATCG09900   ATCG09900   ChrC    -   7967    7996    0.333333    0.177870    0.363462
ATCG09900   ATCG09900   ChrC    -   8176    8207    0.281250    0.150078    0.306671
SRX2267757/SRX2267757_vs_TAIR10_Araport11.ga
ATCG09900   ATCG09900   ChrC    -   7967    8207    0.0 0.0 0.0
SRX2267758/SRX2267758_vs_TAIR10_Araport11.ga
ATCG09900   ATCG09900   ChrC    -   7967    8207    0.0 0.0 0.0

Thanks

gpertea commented 6 years ago

Good question -- apologies for leaving out an important piece of information in my previous explanation: the fact that overlaps with read alignments from the BAM file are also taken into account when determining a "gene region" (locus), so they can act as a "glue", or "bridge" which could put together "gene regions" otherwise separated when looking only at overlaps between reference transcripts. So it is likely that in some samples the two gene regions of ATCG09900 get "bridged" by read alignments overlapping both those regions.. Thus only one gene region is reported for such samples.

It is all related to the concept of "bundle" as used by StringTie. Read alignments and reference transcripts are binned together in a "bundle" defined as a transitive closure of the exon overlap relationship between these objects (read alignments or reference transcripts (guides)). StringTie analyses a "bundle" and unless "weak spots" are found (spurious alignments) to break the bundle into multiple regions, the "bundle" will end up as a "gene region" (locus) reported in the output.

This "clustering by overlap" approach can also have the downside of merging multiple otherwise clearly distinct gene regions together (i.e. with different reference gene IDs), when alignment artifacts and/or transcriptional noise artificially "bridge" over intergenic regions, in some rare cases (especially when the actual genes are very close to each other).

JohnHadish commented 6 years ago

Ahh, very good. So the samples where there is only 1 gene region means that there was either a bridge between the two transcripts, or there was no spurious alignment to break apart the bundle (in my case, this often meant no alignment).

With that info, I think I am able to answer my last question on my own. Above, some of the samples has genes that had fpkm of 0, but they still split, while others had fpkm of 0 and did not split. For example:

SRX2248582/SRX2248582_vs_TAIR10_Araport11.ga
ATCG09900   ATCG09900   ChrC    -   7967    7996    0.000000    0.000000    0.000000
ATCG09900   ATCG09900   ChrC    -   8176    8207    0.0 0.0 0.0
SRX2248583/SRX2248583_vs_TAIR10_Araport11.ga
ATCG09900   ATCG09900   ChrC    -   7967    8207    0.0 0.0 0.0

It would appear that SRX2248582 and SRX2248583 should both fail to break the bundle since no alignment has happened in either that would cause the bundle to break. This is the case in SRX2248583 but not in SRX2248582.

I went and looked at the sam file for these, and found that SRX2248582 had a read in that region, while SRX2248583 did not (I also checked the region 100 before, but none overlapped with the region so for brevity I will omit them):

$ grep ChrC *.sam| awk -F "\t" '{ if(($4 >= 7967 && $4 <= 8207)) { print } }'
SRR4426896.15866700 16  ChrC    8000    60  9M1I90M *   0   0   ATATATATATTTTTTTTTCATTTTCTATATTTTTTTCTATATTTTATTATATTATTATATATATATATATTCTTTTTGATTATTTGATTATATAAATATA    CEEDEEEDDDDDDFFFDDDHHGGHHJJJJIGIJIIJIJJJJJIJIHIIGIGGIGJJJJIJJJJJJJJJJJJJJJJJJJJJJJIJJJJHHHHHFFFFFCCC    AS:i:-8 ZS:i:-10    XN:i:0  XM:i:0  XO:i:1  XG:i:1  NM:i:1  MD:Z:99 YT:Z:UU NH:i:1
$ cd ../SRX2248583
$ grep ChrC *.sam| awk -F "\t" '{ if(($4 >= 7967 && $4 <= 8207)) { print } }'

I suppose that this spurious alignment caused the split, but was not counted towards the fpkm for some other reason.

Thank you for all of your help and quick response times, I appreciate it a lot.

fei0810 commented 5 years ago

@JohnHadish @gpertea I have encountered the same issue, thank you for this discussion. 👍

xlhd19980327 commented 5 years ago

Student from biotrainee came here with worship~

LeileiCui commented 4 years ago

Student from biotrainee +1

fei0810 commented 4 years ago

Student from biotrainee +1

这是什么梗啊 I want to know which blog you know about ISSUE

gpertea / stringtie

Issue of "-e" causing duplication in abundance estimation needs to be revisited #192