gpertea / stringtie

Transcript assembly and quantification for RNA-Seq
MIT License

Merged genes on a de-novo assembly #429

Open SalvadorGJ opened 1 month ago

SalvadorGJ commented 1 month ago

Hello @gpertea

I'm trying to build a de novo transcriptome assembly. I used minimap2 to align long, oriented, full-length reads with the following command:

minimap2 -ax splice -uf -t ${task.cpus} -G 2000000 ${index} ${fastq_file}

I set -G to 2 Mb because the species has huge introns and that value was suggested in the literature. I kept only primary alignments with good mapping quality and then ran StringTie as follows:

stringtie ${bam_file} \\
        -p ${task.cpus} \\
        -o ${assembly_name}.gtf \\
        -l ${params.idPrefix}${sample_info['sampleID']} \\
        -L -m ${params.minReadLength} \\
        -A ${assembly_name}.gene_abund.tab \\
        --conservative

I did this for the reads from multiple samples (tissues), so I used stringtie --merge to build the consensus:

stringtie --merge *.stringtie.gtf \\
        -o stringtie.merge.raw.gtf \\
        -m ${params.minReadLength} \\
        -l ${params.idPrefix}

The problem is that two different genes are merged together because one transcript spans an intron of more than 1 Mb.

image

I checked one of the reads supporting the alignment and it wasn't chimeric. The alignment is supported by multiple reads. Below I show a zoom-in at both ends of the transcript:

image image

Because my species has huge introns, I want to keep the minimap2 parameters. Is there a parameter in StringTie that sets a distance threshold, so that transcripts are not clustered into a single gene when their start coordinates are very far apart?
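
To make the kind of threshold I mean concrete, here is a rough sketch of a check applied after the merge rather than inside StringTie. It lists the transcripts in a GTF whose longest intron exceeds a cutoff. The function name, the 1 Mb default and the file name in the usage line are assumptions for illustration only, and it assumes every exon line carries a transcript_id attribute. It is not a StringTie option.

```shell
# Hypothetical helper (not part of StringTie): print "transcript_id<TAB>longest_intron"
# for every transcript whose longest intron is above a cutoff (default 1 Mb, an assumption).
long_intron_transcripts() {
    gtf="$1"
    max_intron="${2:-1000000}"
    tab="$(printf '\t')"

    # 1) Keep exon lines and reduce them to: transcript_id, start, end.
    awk -F'\t' '
        $3 == "exon" && match($9, /transcript_id "[^"]+"/) {
            # Skip the 15 characters of: transcript_id "  and drop the closing quote.
            id = substr($9, RSTART + 15, RLENGTH - 16)
            print id "\t" $4 "\t" $5
        }' "$gtf" |
    # 2) Order exons by transcript, then by start coordinate.
    sort -t "$tab" -k1,1 -k2,2n |
    # 3) Intron length = next exon start - previous exon end - 1.
    awk -F'\t' -v max="$max_intron" '
        $1 == prev {
            intron = $2 - prev_end - 1
            if (intron > best[$1]) best[$1] = intron
        }
        { prev = $1; prev_end = $3 }
        END {
            for (id in best)
                if (best[id] > max) print id "\t" best[id]
        }' |
    sort
}
```

For example, `long_intron_transcripts stringtie.merge.raw.gtf 1000000` would print the candidates to inspect by hand. This only flags transcripts and does not split or remove anything.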

Best, Salvador