gpertea / stringtie

Transcript assembly and quantification for RNA-Seq
MIT License
388 stars 78 forks source link

Low BUSCO result for StringTie assembly #174

Open soungalo opened 6 years ago

soungalo commented 6 years ago

Hello,
I am experimenting with different tools for transcriptome assembly on a data set of RNA-seq reads from tomato I obtained from SRA. I used StringTie to assemble the transcriptome and then ran BUSCO in order to evaluate its completeness. I was surprised to see that only ~40% of the BUSCOs were found in my assembly, whereas running BUSCO on the assembly generated by Trinity assembly (in its genome-guided mode) with the same data resulted in ~80% BUSCOs. Since I know StringTie is a widely-used tool, my guess is that I'm doing something wrong, but couldn't figure out what.
Here's what I've done:

  1. I downloaded the reads (fastq format, consisting of SE and PE libraries) and the reference genome chromosomes, and used Bowtie to index the reference.
  2. I aligned the reads to the reference using TopHat. Here's the output summary:
    
    Left reads:
          Input     :   8211421
           Mapped   :   7733482 (94.2% of input)
            of these:    157283 ( 2.0%) have multiple alignments (433 have >20)
    Right reads:
          Input     :   8211421
           Mapped   :   7688100 (93.6% of input)
            of these:    158876 ( 2.1%) have multiple alignments (434 have >20)
    Unpaired reads:
          Input     :  30151259
           Mapped   :  28489825 (94.5% of input)
            of these:    338135 ( 1.2%) have multiple alignments (6 have >20)
    94.3% overall read mapping rate.

Aligned pairs: 7373119 of these: 150521 ( 2.0%) have multiple alignments 184804 ( 2.5%) are discordant alignments 87.5% concordant pair alignment rate.


3. I downloaded the reference annotation (gff) and ran StringTie v1.3.4d using the sorted bam output of TopHat and the annotation as inputs (with the -G option). Other than that I used default parameters.  
4. I used betools getfasta command to extract transcript sequences from the StringTie gtf output. I only took transcript features and discarded exon features. Finally, I ran BUSCO on the fasta file.  
As I said, I ended up with a somewhat-missing assembly. Any suggestions for calibrating the run in order to improve results? What parameters do you think should be experimented with? Any advice would be welcome.  
Thanks a lot!
soungalo commented 6 years ago

I tried reducing the minimum required coverage from the default (2.5) to 1 (-c 1), which resulted in a very minor improvement - 5 more BUSCOs were detected. Additionally, I've noticed that the minimum transcript length is not enforced, i.e using the default or even explicitly using -m 200 results in transcripts as small as 63bp. This might be related, or it could be a different issue...

gpertea commented 6 years ago

Interesting observations - thank you for the clear presentation of the steps you took here. StringTie was largely calibrated for human transcriptome samples, but it's always interesting to see how it works with RNA-Seq data for other genomes. Just to get the minimum transcript length issue out of the way -- indeed that threshold is in fact ignored whenever the -G option is used (we should have documented this..).

As for the main observation/issue you reported here: I have a suspicion that this low recovery rate is related to the high repetitive content of the tomato genome, with many of the tomato genes seemingly being affected by those complex repeats. So I am afraid that this high repeat content in the tomato transcriptome exposes what is probably the weakest point of the StringTie pipeline: multi-mapped reads. Is it possible to re-run stringtie on that sample with option -M 1 ? Also, just in case there is more going on, could you add the -C cov_t.gtf option (or any other output file name you prefer), and then check how many of the reference transcripts show up in that file? StringTie also performs some additional internal filtering of the alignments and potential isoforms which might eliminate more of the low/imperfectly covered transcripts which Trinity might have recovered otherwise. This could also be affected by the repeat content, it may even be possible that TopHat could've dropped some multi-mapped reads (perhaps --report-secondary-alignments might help in case of complex repeats in this case? I hope that the default value of 20 for the -g option, i.e. the max number of multi-hits allowed for a read, is good enough for this repetitive transcriptome; if not, the -x option might have to be increased as well).

soungalo commented 6 years ago

Thanks for the helpful reply. I reran StringTie with the flags -G -C -c 1 -M 1. All in all, it looks like setting -M 1 didn't have much effect - BUSCO result is the same and the number of assembled transcripts is only slightly larger. Regarding the reference coverage (-C), I'm not sure I am interpreting the results correctly. The output gtf includes 19,543 mRNA features, while the reference gff includes 35,768 mRNA features. Is that the correct comparison to make? If so, it looks rather low too (although I don't really know what to expect). WDYT? I could try rerunning TopHat as yo suggested, if you think that worth a try.

gpertea commented 6 years ago

Ouch, so it's not as clear as I suspected (blaming repeats seemed the easy way out :) ). How did you run genome-guided Trinity on this sample, did you use the same TopHat alignments, or you used a different aligner (I see it can work with GSNAP or STAR) ? I want to see if we can eliminate the genome alignments (the aligner) as the potential cause of the low recovery rate here.

I would be especially curious if you can find examples of whole genes (loci) missed by StringTie but assembled by Trinity, how would the TopHat alignments look like in such regions, e.g. visualized in IGV. They might be too sparse for StringTie to assemble, perhaps.. If you find such regions (especially some that look like they do in fact have a decent amount of alignments in there) perhaps you can extract a couple of such regions from the .bam file and let me have a look at them? Thanks!

Edit: I am not familiar with Trinity, so could you tell me if its resulting assemblies were in any way informed by the reference annotation (known transcripts) ? From its github wiki I couldn't see a way to pass reference annotation data (like the -G option does it for stringtie).

soungalo commented 6 years ago

Hi again, and sorry for the late reply.
To answer your first question - I used exactly the same alignment in Trinity as I used in StringTie. This alignment was obtained using TopHat.
You were correct regarding Trinity - it doesn't take the reference annotation as input. It has a genome-guided mode, but this only takes alignments of reads to the reference genome. However, TopHat does take into account the annotation, if you use the -G mode, which I did. So in a way, reference annotation did have an effect on Trinity results (but the same effect on Stringtie results). I'm not sure I understood what exactly you wanted me to look for in the results, but I prepared a list of commands you can run to reproduce the result. If you're interested it'd be great if you could take a look. Maybe you'll also spot something off with my procedure that way.
I assume you have all required software installed on your system:

# download RNA-seq data
fastq-dump --gzip --defline-seq '@$sn[_$rn]/$ri' --skip-technical  --readids --dumpbase --split-files --clip SRR390336
fastq-dump --gzip --defline-seq '@$sn[_$rn]/$ri' --skip-technical  --readids --dumpbase --split-files --clip SRR390335
# unzip
gzip -d SRR390335_1.fastq.gz
gzip -d SRR390335_2.fastq.gz
gzip -d SRR390336_1.fastq.gz
# concat R1 from SE and PE libs
cat SRR390335_1.fastq SRR390336_1.fastq > combined_R1.fastq
# download reference
wget ftp://ftp.solgenomics.net/tomato_genome/assembly/build_3.00/S_lycopersicum_chromosomes.3.00.fa
wget ftp://ftp.solgenomics.net/tomato_genome/annotation/ITAG3.2_release/ITAG3.2_gene_models.gff
# index reference genome
bowtie-build S_lycopersicum_chromosomes.3.00.fa ./SL300
# align reads to reference genome
tophat -p 20 -i 20 -I 5000 -G ITAG3.2_gene_models.gff -o ./ SL300 combined_R1.fastq  SRR390335_2.fastq
# sort alignment result
samtools sort accepted_hits.bam accepted_hits.bam.sort.bam
# transcriptome assembly
stringtie accepted_hits.bam.sort.bam -o SL_PI114490_trans_cov1_M1.gtf -p 20 -G ITAG3.2_gene_models.gff -C cov_ref_cov1_M1.gtf -c 1 -M 1
# download and extract BUSCO plants set
wget http://busco.ezlab.org/datasets/embryophyta_odb9.tar.gz
tar -zxvf embryophyta_odb9.tar.gz
# extract transcripts from StringTie gff output and convert to fasta
awk '$3 == "transcript"' SL_PI114490_trans_cov1_M1.gtf > SL_PI114490_trans_cov1_M1_transcripts.gtf
bedtools getfasta -fi S_lycopersicum_chromosomes.3.00.fa -bed SL_PI114490_trans_cov1_M1_transcripts.gtf -fo SL_PI114490_trans_cov1_M1_transcripts.fasta
# run BUSCO
python run_BUSCO.py --in SL_PI114490_trans_cov1_M1_transcripts.fasta  --out BUSCO_cov1_M1 --lineage_path embryophyta_odb9/ --mode transcriptome --cpu 20

Hope this can be of help for debugging and possibly improving performance. LMK if you need further info.

gpertea commented 6 years ago

I realize that I was wrong earlier suggesting that we can exclude the aligner if the same Tophat alignments were used by both Stringtie and Trinity if trinity is actually not using the alignments the same way but instead it's taking as input the raw reads -- and only using the TopHat mappings as a "guide" for clustering the reads. I have no experience with Trinity but from the explanation I see on that wiki page you referenced, it looks like it does not use the TopHat alignments to determine the intron-exon structures, but only as a way to group the reads into clusters (bundles) and then it uses its own de-novo assembly method to assemble the raw reads into transcripts, in each bundle. So it looks like Trinity might be able to recover some isoforms, in each bundle, which were missed by StringTie as it relied solely on the TopHat mappings.

I suppose the output of Trinity does not provide exon locations for the assembled transcripts so that's probably why what I asked earlier (to find examples of regions with Trinity transcripts but no StringTie assemblies) did not seem to make sense.. Could you please clarify to me, the only input for Trinity was the same Tophat .bam file that StringTie used? Or did it also make use of the "raw" reads (the *.fastq.gz files) ? (it might help to also know the details of your Trinity run). I guess we could map Trinity's output back to the genome and then have a more direct comparison and see what exactly StringTie missed -- so could you maybe upload the Trinity assemblies somewhere so I can get them? Also the SL_PI114490_trans_cov1_M1.gtf file resulting from StringTie -- so I can effectively compare the output of the two programs.

soungalo commented 6 years ago

Could you please clarify to me, the only input for Trinity was the same Tophat .bam file that StringTie used? Or did it also make use of the "raw" reads (the *.fastq.gz files) ?

Trinity takes both the raw reads (fastq) and their alignments to the reference (bam) as input. In fact, it can run even without the alignments, but I found out that adding the reference greatly improves results.
Attaching the files you requested. LMK if I can aid with anything else. SL_PI114490.zip