gpertea / stringtie

Transcript assembly and quantification for RNA-Seq
MIT License
361 stars 76 forks source link

Stringtie --mix returns "Error: the input alignment file is not sorted!" while short and long-reads alone are ok #419

Closed eggrandio closed 3 months ago

eggrandio commented 3 months ago

Hello,

I am trying to use the stringtie --mix option with short and long-read sorted bam files. stringtie --mix -p 28 -o CV191_denovo_annotation.gtf CV191_hisat2_sorted.bam cv19_ccs.clustered.aln_sorted.bam This returns:

Error: the input alignment file is not sorted!

Alignments (1) already found for chr2 !

While if I use each of the .bam files separately, they do not return the error and they successfully generate a .gtf file

stringtie -p 28 -o CV191_denovo_annotation.gtf CV191_hisat2_sorted.bam

stringtie -L -p 28 -o CV191_denovo_annotation.gtf cv19_ccs.clustered.aln_sorted.bam

When I tried to confirm that both .bam files have been sorted with samtools sort, I see that the version of the file is different (@HD VN), and the order of the chromosomes is sorted differently too:


samtools view -H CV191_hisat2_sorted.bam | head
@HD     VN:1.0  SO:coordinate
@SQ     SN:chr1 LN:31549113
@SQ     SN:chr10        LN:24987680
@SQ     SN:chr11        LN:30785609

samtools view -H cv19_ccs.clustered.aln_sorted.bam | head
@HD     VN:1.6  SO:coordinate
@SQ     SN:chr1 LN:31549113
@SQ     SN:chr2 LN:42898286
@SQ     SN:chr3 LN:42683706

I have tried to sort them again using the same samtools version, but the header still remains the same (and the order of the chromosomes).

What can I do to resort them and be able to use them with stringtie --mix ?

gpertea commented 3 months ago

This is a serious issue, as StringTie assumes that the chromosomes are always in lexicographic order, like you show in the HISAT2 BAM file (CV191_hisat2_sorted.bam).

StringTie also internally sorts the reference annotation data ("guides" GTF/GFF) lexicographically, so it expects any input alignments to be already sorted following the same lexicographic order for reference sequences (chromosomes).

Checking the samtools sort documentation it looks like samtools currently does not have a way to enforce lexicographic order of the reference sequences (chromosomes) during sorting, and instead it uses the order of chromosomes as found in the unsorted alignment file header (@SQ entries).

I am a bit surprised by this finding, and a bit disappointed that samtools sort lacks any options to enforce the sorting of the chromosomes in lexicographic order.

I think the steps you can take in your case is to use the header from the HISAT2 BAM file (which has the desired order) and put it onto the other file and sort that file again, using the new header. Something like this:

  1. Extract the header from the hisat2 BAM file:
    samtools view -H CV191_hisat2_sorted.bam > sort_header.sam
  2. Reheader the other BAM file with this header and re-sort the resulting BAM file:
    samtools reheader sort_header.sam cv19_ccs.clustered.aln_sorted.bam | samtools sort -o cv19_ccs.clustered.chr_sorted.bam -

    The resulting in cv19_ccs.clustered.chr_sorted.bam which should be now used with StringTie --mix instead of the original long-reads alignment file.

Please let me know if it works for you in this case, -- if not, we'll find another way to make it work.

eggrandio commented 3 months ago

Thank you for the quick reply! That seemed to fix the issue.

What I still don't understand is from where does the non-lexicographic ordering come from. I checked the original FASTA file and the chromosomes are ordered lexicographically (I thought that maybe they were reordered during HISAT2 indexing).

I have repeated the long-read mapping with minimap2 ensuring that I use the same conda environment used for short-read mapping and I am still getting the non-lexicographic order. I see that the header tag "@HD VN:1.6 SO:unknown" is already there on the non-mapped isoseq .bam. I tried to search for information on the sorting depending on the VN, but I could no find anything.

I will repeat all the isoseq processing unsing the same conda environment to ensure that samtools version is the same as the one used for short-reads.

Best,

gpertea commented 3 months ago

From what I'm reading, it seems to be related to the order of chromosomes in the original genome FASTA, at least in the case of HISAT2, its indexing scheme does not change that order.

But if you used the same genome FASTA for minimap2, it's likely that the internal indexing (?) of minimap2 is different -- or maybe simply minimap2 decides to write the header with the chromosomes sorted in "natural" (numeric) order, since it really doesn't matter for the initial unsorted SAM output.. Probably the intention was to make that order of chromosomes in the header look more natural.. i.e. "fix it" for the (imagined) user's benefit somehow, which was totally undesired here for our purposes..

I'm quite sure it has nothing to do with your conda environment, and also unrelated to the VN (version) in the header.

A simple fix for your pipeline/workflow seems to be to just reheader early, before sorting that ouput. Just prepare a header file like above and use something like samtools reheader sort_header.sam unsorted.bam | samtools sort ...