legumeinfo / azulejo

Tiling phylogenetic space with subtrees
BSD 3-Clause "New" or "Revised" License

minimap2-based prototype implementation evaluation #109

Open adf-ncgr opened 4 years ago

adf-ncgr commented 4 years ago

I hacked together a method based on the minimap2 alignment strategy. The basic idea is that, after producing whole-genome alignments with minimap2, one can use the paftools liftover command to project gene coordinates from the query genome onto the reference genome. An intervaltree-based script then pairs each original query gene with whichever reference gene best overlaps the projected coordinates (currently using a simple heuristic to sort out cases where multiple genes are overlapped). So far, the method seems to produce results on cowpea that are fairly consistent with those of the current DAGchainer-based method and with the Phytozome assignments (the exact method they use is unknown, but it seems to be based on correspondences to a single reference genome).
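To make the overlap step concrete, here is a minimal sketch of how projected query genes could be paired with reference genes. It is not the actual script: it uses a plain per-chromosome scan instead of an interval tree, and the "largest overlap wins" rule is only an assumed stand-in for the simple heuristic mentioned above. The function names and the tuple layout are hypothetical.

```python
from collections import defaultdict


def overlap_len(a_start, a_end, b_start, b_end):
    """Length of the overlap between two half-open intervals (0 if disjoint)."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))


def assign_reference_genes(projected, ref_genes):
    """Pair each projected query gene with the best-overlapping reference gene.

    projected: iterable of (chrom, start, end, query_gene), already lifted
               over into reference coordinates.
    ref_genes: iterable of (chrom, start, end, ref_gene).
    Assumption: "best" means the largest overlap in bases; ties keep the
    first reference gene encountered.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end, name in ref_genes:
        by_chrom[chrom].append((start, end, name))

    assignments = {}
    for chrom, start, end, query in projected:
        best_name, best_len = None, 0
        # Linear scan for clarity; an interval tree would speed up this lookup.
        for r_start, r_end, r_name in by_chrom.get(chrom, []):
            n = overlap_len(start, end, r_start, r_end)
            if n > best_len:
                best_name, best_len = r_name, n
        if best_name is not None:
            assignments[query] = best_name
    return assignments


# Toy data (made up): q1 overlaps refA by 100 bases and refB by 350 bases;
# q2 overlaps nothing and is left unassigned.
ref_genes = [("chr1", 100, 500, "refA"), ("chr1", 450, 900, "refB")]
projected = [("chr1", 400, 800, "q1"), ("chr1", 1000, 1100, "q2")]
result = assign_reference_genes(projected, ref_genes)
```

Query genes whose projected coordinates hit no reference gene are simply dropped here; whether they should be reported separately is a design choice this sketch leaves open.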

A couple of things that seem to me to be possible advantages of the whole genome-based method over the current implementation:

My current implementation is pretty simple and has so far only been evaluated on cowpea and glycine, but it seems to be doing a competitive job. Worth further discussion with @joelb123 and @cann0010 (who may or may not get this message, since the repo still hasn't been moved to the legumeinfo organization per #108).

adf-ncgr commented 3 years ago

I've put the 33 glycine lines through this protocol, which is far from perfect but seems to give generally reasonable results (at least for genes present in the line chosen as the reference). Comparing these results to azulejo-0.9.19/glycine33_i-0.96_k-2/synteny_anchors.tsv yields 4368 clusters with perfect correspondence (i.e. ~10% of what we might reasonably expect). Most of the differences, however, seem to be cases where azulejo omits members that the minimap2 approach recognized, so it's possible that the run you plan with less stringent homology parameters will address some of them. The couple of spot-checks I've done using GCV suggest that the minimap2 results are non-controversial.
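For reference, a count of "perfect correspondence" clusters can be computed roughly as below. This is only a sketch under assumptions: it treats each method's output as (cluster_id, gene_id) pairs, which may not match the real column layout of synteny_anchors.tsv, and the function names are hypothetical.

```python
from collections import defaultdict


def clusters_from_pairs(pairs):
    """Group gene IDs by cluster ID and return the clusters as a set of frozensets."""
    members = defaultdict(set)
    for cluster_id, gene_id in pairs:
        members[cluster_id].add(gene_id)
    return {frozenset(genes) for genes in members.values()}


def count_perfect_matches(pairs_a, pairs_b):
    """Count clusters whose membership is identical in both clusterings.

    Cluster IDs themselves are ignored; only the member sets are compared.
    """
    return len(clusters_from_pairs(pairs_a) & clusters_from_pairs(pairs_b))


# Toy data (made up): the {g1, g2} cluster matches exactly, while the second
# clustering adds g4 to the cluster containing g3, so that one does not match.
minimap2_pairs = [("c1", "g1"), ("c1", "g2"), ("c2", "g3")]
azulejo_pairs = [("x", "g1"), ("x", "g2"), ("y", "g3"), ("y", "g4")]
n_perfect = count_perfect_matches(minimap2_pairs, azulejo_pairs)
```

An exact-match count like this is strict: a cluster missing a single member counts as a mismatch, which fits the observation that most differences are missing members rather than conflicting assignments.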