Closed eggrandio closed 3 months ago
This is a serious issue, as StringTie assumes that the chromosomes are always in lexicographic order, like you show in the HISAT2 BAM file (CV191_hisat2_sorted.bam
).
StringTie also internally sorts the reference annotation data ("guides" GTF/GFF) lexicographically, so it expects any input alignments to be already sorted following the same lexicographic order for reference sequences (chromosomes).
Checking the samtools sort
documentation it looks like samtools currently does not have a way to enforce lexicographic order of the reference sequences (chromosomes) during sorting, and instead it uses the order of chromosomes as found in the unsorted alignment file header (@SQ
entries).
I am a bit surprised by this finding, and a bit disappointed that samtools sort
lacks any options to enforce the sorting of the chromosomes in lexicographic order.
I think the steps you can take in your case is to use the header from the HISAT2 BAM file (which has the desired order) and put it onto the other file and sort that file again, using the new header. Something like this:
samtools view -H CV191_hisat2_sorted.bam > sort_header.sam
samtools reheader sort_header.sam cv19_ccs.clustered.aln_sorted.bam | samtools sort -o cv19_ccs.clustered.chr_sorted.bam -
The resulting in cv19_ccs.clustered.chr_sorted.bam
which should be now used with StringTie --mix
instead of the original long-reads alignment file.
Please let me know if it works for you in this case, -- if not, we'll find another way to make it work.
Thank you for the quick reply! That seemed to fix the issue.
What I still don't understand is from where does the non-lexicographic ordering come from. I checked the original FASTA file and the chromosomes are ordered lexicographically (I thought that maybe they were reordered during HISAT2 indexing).
I have repeated the long-read mapping with minimap2 ensuring that I use the same conda environment used for short-read mapping and I am still getting the non-lexicographic order. I see that the header tag "@HD VN:1.6 SO:unknown" is already there on the non-mapped isoseq .bam. I tried to search for information on the sorting depending on the VN, but I could no find anything.
I will repeat all the isoseq processing unsing the same conda environment to ensure that samtools version is the same as the one used for short-reads.
Best,
From what I'm reading, it seems to be related to the order of chromosomes in the original genome FASTA, at least in the case of HISAT2, its indexing scheme does not change that order.
But if you used the same genome FASTA for minimap2, it's likely that the internal indexing (?) of minimap2 is different -- or maybe simply minimap2 decides to write the header with the chromosomes sorted in "natural" (numeric) order, since it really doesn't matter for the initial unsorted SAM output.. Probably the intention was to make that order of chromosomes in the header look more natural.. i.e. "fix it" for the (imagined) user's benefit somehow, which was totally undesired here for our purposes..
I'm quite sure it has nothing to do with your conda environment, and also unrelated to the VN (version) in the header.
A simple fix for your pipeline/workflow seems to be to just reheader early, before sorting that ouput. Just prepare a header file like above and use something like samtools reheader sort_header.sam unsorted.bam | samtools sort ...
Hello,
I am trying to use the stringtie --mix option with short and long-read sorted bam files.
stringtie --mix -p 28 -o CV191_denovo_annotation.gtf CV191_hisat2_sorted.bam cv19_ccs.clustered.aln_sorted.bam
This returns:While if I use each of the .bam files separately, they do not return the error and they successfully generate a .gtf file
When I tried to confirm that both .bam files have been sorted with samtools sort, I see that the version of the file is different (@HD VN), and the order of the chromosomes is sorted differently too:
I have tried to sort them again using the same samtools version, but the header still remains the same (and the order of the chromosomes).
What can I do to resort them and be able to use them with stringtie --mix ?