Chimeric alignments spanning multiple adjacent genes

vkkodali commented 1 year ago

When aligning proteins with paralogs located in a cluster, miniprot produces alignments encompassing two or more distinct genes.

For example, aligning mouse protein NP_001036176.1 to human chromosome 1 (NC_000001.11) returns 3 alignments, one of which aligns such that it encompasses 3 distinct genes.

$ miniprot --version
0.8-r220
$ efetch -db nucleotide -id NC_000001.11 -format fasta > genome.fa
$ efetch -db protein -id NP_001036176.1 -format fasta > prot.fa
## PAF output truncated for brevity
$ miniprot genome.fa prot.fa 2>/dev/null
NP_001036176.1  508  0  508  +  NC_000001.11  248956422  103617440  103625743  1305  1533  0  AS:i:2173  ms:i:2468  np:i:470  da:i:0  do:i:0  cg:Z:56M345N49M810N51M3D12M445N77M766N44M895V40M97V33M1972N39M114V41M1338V62M
NP_001036176.1  508  0  508  +  NC_000001.11  248956422  103571602  103579497  1299  1533  0  AS:i:2159  ms:i:2454  np:i:467  da:i:0  do:i:0  cg:Z:56M339N49M806N51M3D12M447N77M321N44M832V40M98V33M1949N39M114V41M1468V62M
NP_001036176.1  508  0  508  +  NC_000001.11  248956422  103617440  103719868  1302  1533  0  AS:i:2155  ms:i:2450  np:i:468  da:i:0  do:i:0  cg:Z:56M39235N49M803N51M3D12M55731N77M801N44M816V40M97V33M1972N39M114V41M1338V62M

These alignments can be seen in the following screenshot. Numbers adjacent to the alignment represent their ranking based on the alignment score. Appropriately, the chimeric alignment spanning multiple genes is ranked the lowest.

While the chimeric alignment in the previous case was ranked the lowest, and secondary to the best alignment it is not always the case. As an example, I have aligned 3 proteins (CAD87763.1, AAZ99031.1, AUO15579.1) to the genome sequence NC_069144.1 and all 3 alignments returned by miniprot are chimeric as shown in the following screenshot: The genes in blue boxes on the left and right hand sides are both "pheloloxidase-8-like" genes and encode proteins that are paralogous. All 3 sequences align such that a portion of the sequence aligns to one paralog and the rest aligns to the other paralog.

The alignments are riddled with mismatches and indels, and understandably, when we align distant proteins to a genome it becomes increasingly difficult to arrive at perfect alignments. So, what's an aligner to do? Perhaps an appropriate outcome to this scenario would be to, say, generate two (subpar) alignments for each protein, with unaligned tails and identify the better one as primary. Splign and Prosplign offer some inspiration by using "compart" logic that avoids generating alignments spanning multiple distant locations. Could a similar approach be applied to miniprot?

I have a dataset of over >165k proteins aligned to the Anopheles cruzii genome and I see chimeric alignments such as the one described above quite often. After playing with -J and -E parameters individually and in combination, I have made little progress in systematically avoiding them.

lh3 commented 1 year ago

I had a look at the alignment AAZ99031.1. Miniprot found the longer alignment because the shorter alignment is not as good. I don't know how to solve this. How does "compart logic" work?

Also, are there correct alignments around this locus? Maybe you can use those to filter out poor hits?

vkkodali commented 1 year ago

Thank you @lh3 for taking a look at this. The "compart logic" used in Splign is described here: https://pubmed.ncbi.nlm.nih.gov/18495041/ Binaries for Splign and Compart as well as the source code can be found on the splign website: https://www.ncbi.nlm.nih.gov/sutils/splign/splign.cgi

Also, are there correct alignments around this locus? Maybe you can use those to filter out poor hits?

I should be able to get "correct" alignments but I am not sure how to use those alignments to filter out poor hits? More importantly though, using AAZ99031.1 as an example, I would like to get the shorter alignment even if it is relatively poor quality compared to the longer ones. Is that possible?

vkkodali commented 1 year ago

@lh3 -- just checking if you have had a chance to look at the compart code...

lh3 / miniprot

Chimeric alignments spanning multiple adjacent genes #35