Open vkkodali opened 1 year ago
I had a look at the alignment AAZ99031.1. Miniprot found the longer alignment because the shorter alignment is not as good. I don't know how to solve this. How does "compart logic" work?
Also, are there correct alignments around this locus? Maybe you can use those to filter out poor hits?
Thank you @lh3 for taking a look at this. The "compart logic" used in Splign is described here: https://pubmed.ncbi.nlm.nih.gov/18495041/
Binaries for Splign and Compart, as well as the source code, can be found on the Splign website: https://www.ncbi.nlm.nih.gov/sutils/splign/splign.cgi
> Also, are there correct alignments around this locus? Maybe you can use those to filter out poor hits?
I should be able to get "correct" alignments, but I am not sure how to use them to filter out poor hits. More importantly, using AAZ99031.1 as an example, I would like to get the shorter alignment even if it is of relatively poor quality compared to the longer ones. Is that possible?
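One possible way to use trusted alignments as a filter, sketched here under assumptions: treat the genomic spans of the "correct" alignments as gene loci and flag any hit whose target span overlaps two or more of them. The sketch reads the standard PAF columns (target start and end are columns 8 and 9). The locus names, coordinates, `min_overlap` threshold, and helper names are all hypothetical, and every locus is assumed to lie on the same contig as the hit.

```python
from typing import List, NamedTuple, Tuple


class Interval(NamedTuple):
    name: str
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive


def parse_paf_line(line: str) -> Tuple[str, str, int, int]:
    """Return (query name, target name, target start, target end) from a PAF record."""
    f = line.rstrip("\n").split("\t")
    return f[0], f[5], int(f[7]), int(f[8])


def genes_spanned(tstart: int, tend: int, genes: List[Interval],
                  min_overlap: int = 1) -> List[str]:
    """Names of loci that overlap [tstart, tend) by at least min_overlap bases."""
    hits = []
    for g in genes:
        overlap = min(tend, g.end) - max(tstart, g.start)
        if overlap >= min_overlap:
            hits.append(g.name)
    return hits


def is_chimeric(tstart: int, tend: int, genes: List[Interval],
                min_overlap: int = 1) -> bool:
    """A hit is flagged as chimeric when it overlaps two or more loci."""
    return len(genes_spanned(tstart, tend, genes, min_overlap)) >= 2


# Made-up loci and PAF records, for illustration only.
example_genes = [Interval("paralogA", 1000, 5000), Interval("paralogB", 9000, 14000)]
long_hit = "AAZ99031.1\t700\t0\t700\t+\tNC_069144.1\t100000\t1200\t13500\t400\t2100\t0"
short_hit = "AAZ99031.1\t700\t300\t700\t+\tNC_069144.1\t100000\t9100\t13900\t250\t1200\t0"

for record in (long_hit, short_hit):
    query, target, tstart, tend = parse_paf_line(record)
    print(query, tstart, tend, is_chimeric(tstart, tend, example_genes))
```

This only flags hits after the fact. It does not make the aligner report the shorter alignment, so it addresses the filtering question rather than the second one.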
@lh3 -- just checking if you have had a chance to look at the compart code...
When aligning proteins with paralogs located in a cluster, `miniprot` produces alignments encompassing two or more distinct genes. For example, aligning mouse protein NP_001036176.1 to human chromosome 1 (NC_000001.11) returns 3 alignments, one of which encompasses 3 distinct genes.
These alignments can be seen in the following screenshot. Numbers adjacent to the alignments represent their ranking based on the alignment score. Appropriately, the chimeric alignment spanning multiple genes is ranked the lowest.
Although the chimeric alignment in the previous case was ranked lowest, below the best alignment, that is not always the case. As an example, I aligned 3 proteins (CAD87763.1, AAZ99031.1, AUO15579.1) to the genome sequence NC_069144.1, and all 3 alignments returned by `miniprot` are chimeric, as shown in the following screenshot:

The genes in blue boxes on the left- and right-hand sides are both "phenoloxidase-8-like" genes and encode paralogous proteins. All 3 sequences align such that a portion of the sequence aligns to one paralog and the rest aligns to the other paralog.

The alignments are riddled with mismatches and indels; understandably, when we align distant proteins to a genome, it becomes increasingly difficult to arrive at perfect alignments. So, what's an aligner to do? Perhaps an appropriate outcome in this scenario would be to generate, say, two (subpar) alignments for each protein, with unaligned tails, and identify the better one as primary. Splign and ProSplign offer some inspiration by using "compart" logic that avoids generating alignments spanning multiple distant locations. Could a similar approach be applied to `miniprot`?

I have a dataset of more than 165k proteins aligned to the Anopheles cruzii genome, and I see chimeric alignments such as the one described above quite often. After playing with the `-J` and `-E` parameters individually and in combination, I have made little progress in systematically avoiding them.
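To make the requested outcome concrete, here is a rough post-processing sketch of a compartment-style split. It is not the Compart algorithm from the Splign paper and not anything `miniprot` does; it is a simplified heuristic under stated assumptions. Aligned blocks (for example, exons) on the forward strand are grouped along the genome, and a new compartment starts when the genomic gap exceeds a hypothetical `max_gap` or when the protein coordinate jumps backwards. The compartment with the highest query coverage is then treated as primary. All names, thresholds, and coordinates are made up for illustration.

```python
from typing import List, NamedTuple


class Block(NamedTuple):
    qstart: int  # protein coordinates (aa), 0-based half-open
    qend: int
    tstart: int  # genomic coordinates (bp), 0-based half-open
    tend: int


def split_into_compartments(blocks: List[Block], max_gap: int = 20000,
                            max_query_backtrack: int = 10) -> List[List[Block]]:
    """Group forward-strand blocks into compartments along the target."""
    compartments: List[List[Block]] = []
    current: List[Block] = []
    for b in sorted(blocks, key=lambda blk: blk.tstart):
        if current:
            prev = current[-1]
            too_far = b.tstart - prev.tend > max_gap                  # likely a different locus
            restarts = b.qstart < prev.qend - max_query_backtrack     # protein starts over
            if too_far or restarts:
                compartments.append(current)
                current = []
        current.append(b)
    if current:
        compartments.append(current)
    return compartments


def query_coverage(compartment: List[Block], qlen: int) -> float:
    """Fraction of the protein covered by the union of the blocks' query ranges."""
    covered = 0
    last_end = 0
    for b in sorted(compartment, key=lambda blk: blk.qstart):
        start = max(b.qstart, last_end)
        if b.qend > start:
            covered += b.qend - start
            last_end = b.qend
    return covered / qlen


# Made-up chimeric alignment: a 700-aa protein split across two distant loci.
example_blocks = [
    Block(0, 100, 1000, 1300),
    Block(100, 250, 2000, 2450),
    Block(250, 400, 40000, 40450),
    Block(400, 700, 41000, 41900),
]
example_compartments = split_into_compartments(example_blocks)
primary = max(example_compartments, key=lambda c: query_coverage(c, 700))
```

With these made-up coordinates the split falls at the large genomic gap, giving two partial alignments with unaligned tails. Note that when a protein aligns partly to one paralog and partly to the other, the query coordinates stay collinear, so only the gap rule would separate them, and that depends on `max_gap` being smaller than the distance between the paralogs.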