legumeinfo / gcv

Federating genomes with love (and synteny derived from functional annotations)
https://gcv.legumeinfo.org/
Apache License 2.0

CNV-induced inversions #340

Open adf-ncgr opened 4 years ago

adf-ncgr commented 4 years ago

despite #262 we are still seeing some false inversions in regions of CNV, e.g.: https://legumeinfo.org/gcv2/gene;lis=phavu.Phvul.003G002400?algorithm=repeat&match=10&mismatch=-1&gap=-1&score=30&threshold=25&bmatched=20&bintermediate=10&bmask=10&linkage=average&cthreshold=20&neighbors=10&matched=4&intermediate=5&sources=lis&bregexp=&border=chromosome&regexp=&order=distance

per @alancleary :

> Yeah, it's an inversion issue. Unfortunately the inversion algorithm is optimizing score, so that inversion is hard to avoid since the CNV makes the inversion's score higher than the non-inverted segment.

> Yes, I already attempted to fix this. There are some heuristics in place to prevent inversions of tandem duplications. This may be an edge case caused by the orphans breaking up the duplications, but I can't say for sure. Feel free to open an issue if you want me to attempt another fix.
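To make the score argument concrete, here is a toy sketch (not GCV's actual repeat algorithm): a plain Smith-Waterman local score over lists of gene family IDs, using the match/mismatch/gap values from the URL above. The `local_score` helper and the family arrangement are made up for illustration. The two tracks differ only in how many tandem copies of family `B` sit on either side of `C`, yet the reversed target scores higher than the forward one.

```python
def local_score(query, target, match=10, mismatch=-1, gap=-1):
    """Best Smith-Waterman local alignment score between two lists of
    gene family IDs (linear gap penalty). Hypothetical helper, not GCV code."""
    prev = [0] * (len(target) + 1)
    best = 0
    for q in query:
        cur = [0] * (len(target) + 1)
        for j, t in enumerate(target, start=1):
            sub = match if q == t else mismatch
            cur[j] = max(0, prev[j - 1] + sub, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best


# Made-up CNV: one vs. two copies of B before C, two vs. one after it.
query = ["B", "C", "B", "B"]
target = ["B", "B", "C", "B"]

forward = local_score(query, target)
reverse = local_score(query, target[::-1])
print(forward, reverse)  # 30 40
```

With only family IDs to go on, a score-maximizing aligner has no reason to prefer the forward orientation here.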

A couple of things to note: in this case the orientation of the genes seems like it would be helpful (see #140 for some discussion of this in another context, though I'm not exactly sure it applies here). We could also consider something akin to "homopolymer compression" to minimize the impact of CNV in such situations. If nothing else, perhaps introducing an inversion penalty would make sense.
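A rough sketch of what those two ideas could look like, assuming tracks are plain lists of gene family IDs. The names `compress_runs` and `call_orientation` and the penalty value are hypothetical, not anything that exists in GCV.

```python
from itertools import groupby


def compress_runs(families):
    """Collapse each run of consecutive identical family IDs to one entry,
    by analogy with homopolymer compression of nucleotide sequences."""
    return [family for family, _ in groupby(families)]


def call_orientation(forward_score, reverse_score, inversion_penalty=5):
    """Call 'reverse' only if the reversed segment still beats the forward
    one after paying a fixed penalty (the default of 5 is arbitrary)."""
    if reverse_score - inversion_penalty > forward_score:
        return "reverse"
    return "forward"


print(compress_runs(["B", "B", "C", "B"]))  # ['B', 'C', 'B']
print(compress_runs(["B", "C", "B", "B"]))  # ['B', 'C', 'B']
print(call_orientation(30, 30))             # forward
```

After compression the two toy tracks are identical, so copy-number differences alone no longer favor a flip, and the penalty breaks ties in favor of the forward orientation.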

alancleary commented 3 years ago

@adf-ncgr I'm sure you've seen this but I just noticed a very nice example of the inversion algorithm doing a great job: https://legumeinfo.org/gcv2/gene;lis=phavu.Phvul.002G100400?q=Phvul.002G100400&sources=lis&algorithm=repeat&match=10&mismatch=-1&gap=-1&score=30&threshold=25&bmatched=20&bintermediate=10&bmask=10&linkage=average&cthreshold=20&neighbors=10&matched=4&intermediate=5&bregexp=&border=chromosome&regexp=medsa.chr4.4&order=distance

adf-ncgr commented 3 years ago

thanks @alancleary that's an interesting example, although technically I think it is not an inversion but rather a segmental duplication. This is perhaps more clear when looking at the dotplot although it is somewhat puzzling to me why we see more copies of some of the genes in the dotplot than appear to exist in the aligned track. Maybe some segments are getting ignored due to their scores (trying to fiddle with the params a bit and having trouble getting it to bend to my will- must be getting old!)

[image]

alancleary commented 3 years ago

Well shoot. I think you're right; it's just a segmental duplication. I may need to start taking gene orientation into account again when aligning. That would probably make these false inversions way less common (including the CNV case).

Regarding the difference between copies of genes in the dot plots vs the micro-synteny view, dot plots generate a circle for every pair of genes that share the same family. That's why you get grids of dots when there are tandem duplications:

[Screenshot 2021-09-17 at 14-43-47, Genome Context Viewer]
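As a minimal illustration of that rule (a sketch with a hypothetical `dot_plot_points` helper and made-up tracks, not the viewer's code), one point is emitted for every query/target gene pair with the same family, so a 3-copy array against a 2-copy array yields a 3 x 2 grid:

```python
def dot_plot_points(query, target):
    """Return (query_index, target_index) for every pair of genes that
    share a family ID. Tracks are lists of family IDs (an assumption)."""
    return [
        (i, j)
        for i, q_family in enumerate(query)
        for j, t_family in enumerate(target)
        if q_family == t_family
    ]


query = ["A", "B", "B", "B", "C"]  # three tandem copies of B
target = ["A", "B", "B", "C"]      # two tandem copies of B

points = dot_plot_points(query, target)
b_grid = [(i, j) for i, j in points if query[i] == "B"]
print(len(points), len(b_grid))  # 8 6
```

An alignment, by contrast, pairs each gene with at most one partner, which is one reason the dot plot can show more copies than the aligned track.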

adf-ncgr commented 3 years ago

not "just" a segmental duplication, but a very nice demonstration of a less common type of SV that you've handled nicely (IIRC it is the reason we brought in the repeat algorithm in the first place!)

I know about the CNV grid effect, but in the case of the segmental duplication I'd expect to see (for example) three copies of the "gold" gene in the aligned medsa.chr4.4 track for every one in the query phavu.Chr02 track. In fact I see only 2 to 1 in the alignment, as seen here: [image]

my guess is that the middle copy in the dotplot is getting hidden in the alignment for some technical reason but I could certainly be wrong!

adf-ncgr commented 3 years ago

BTW, in the dotplot when you hover over a circle you only get info about the query track gene. would be nice to know what the pair represents! let me know if this is issue-worthy.

alancleary commented 3 years ago

> my guess is that the middle copy in the dotplot is getting hidden in the alignment for some technical reason but I could certainly be wrong!

Oh, I see. I guess I'll take a look at it while I'm working on this issue.

> BTW, in the dotplot when you hover over a circle you only get info about the query track gene. would be nice to know what the pair represents! let me know if this is issue-worthy.

I actually had that same thought when playing with the dot plots just now. Definitely issue-worthy!

adf-ncgr commented 10 months ago

Just want to bump this issue slightly (however symbolic an act that may be) after having had to convince myself (and a collaborator) that the "inversion" seen in this example: https://medicago.legumeinfo.org/tools/gcv/gene;medicago=medtr.A17.gnm5.ann1_6.MtrunA17Chr4g0064841?q=medtr.A17.gnm5.MtrunA17Chr4:55872871-56368598&sources=medicago&algorithm=repeat&match=10&mismatch=-1&gap=-1&score=30&threshold=25&bmatched=20&bintermediate=10&bmask=10&bchrgenes=1&bchrlength=100000&linkage=average&cthreshold=20&neighbors=35&matched=4&intermediate=5&bregexp=&border=distance&regexp=&order=distance is in fact a case of this issue and not something to include in a manuscript. What tipped me off was that the orientation of the arrows in the flipped region did not match the query track, and I confirmed it by inspecting a whole genome sequence alignment/dotplot. I think this further confirms that alignment scoring should take gene orientation into account in some way.
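One way the arrow check could be automated, sketched under assumptions: each matched gene pair in a segment that the aligner called inverted is given as `(query_strand, target_strand)` in raw genomic coordinates, with strands encoded as `+1`/`-1`. In a genuine inversion the target genes should mostly sit on the opposite strand from their query partners. The function names and the 0.8 threshold are hypothetical.

```python
def strand_support(pairs):
    """Fraction of matched gene pairs whose raw strands are opposite,
    i.e. consistent with a real inversion. Returns 0.0 for no pairs."""
    if not pairs:
        return 0.0
    opposite = sum(1 for q_strand, t_strand in pairs if q_strand == -t_strand)
    return opposite / len(pairs)


def is_plausible_inversion(pairs, min_support=0.8):
    """Accept an inversion call only if enough pairs have flipped strands.
    The 0.8 cutoff is an arbitrary illustrative choice."""
    return strand_support(pairs) >= min_support


true_inversion = [(1, -1), (1, -1), (-1, 1)]
cnv_artifact = [(1, 1), (1, 1), (-1, -1), (1, -1)]

print(is_plausible_inversion(true_inversion))  # True
print(is_plausible_inversion(cnv_artifact))    # False
```

A filter like this runs after alignment; folding orientation into the scoring itself, as discussed earlier in the thread, would be a separate and more invasive change.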