Inversion and alignment scores

alancleary commented 4 years ago

Previously, inversions were only included in micro-synteny alignments if they contained at least 2 genes and their score was at least as large as the threshold parameter. Neither of these constraints are being enforced in the refactored smith-waterman and repeat alignment algorithms, making for some concerning alignments. Additionally, the threshold parameter is actually being used by the repeat algorithm when computing its score matrix. It may be worth revising this since repeated segments are being treated as distinct tracks., thus their scores should be independent from one another. Even if we decide to keep using the threshold parameter in the repeat algorithm, it should be distinct from the inversion threshold since these functions are unrelated. Consider issue #258 when addressing this issue.

Edit: Also, consider that an alignment with a large inversion or many small inversions may not meet the score parameter until after its inverted segments have been reversed, thus increasing its score.

alancleary commented 4 years ago

This is not a blocking issue on the GoldenLayout epic so I'm going to orphan it.

adf-ncgr commented 4 years ago

thanks for redirecting me to the existing issue. part of my puzzlement is that the smith-waterman never used to produce inversions at all, no matter the size. Another part is that it isn't clear to me how segments are being handled with respect to the threshold parameter. Compare these three alignments based on the same set of tracks, the first being smith-waterman, the second repeat and the third repeat with threshold set "prohibitively high" " http://dev.lis.ncgr.org:50011/epic_search/gcv/gene;vigna=vigun.IT97K-499-35.gnm1.ann2.Vigun08g065800.v1.2?algorithm=smith-waterman&match=10&mismatch=-1&gap=-1&score=30&threshold=50&bmatched=20&bintermediate=10&bmask=10&linkage=average&cthreshold=20&neighbors=50&matched=0.2&intermediate=15&sources=vigna&bregexp=&border=chromosome&regexp=&order=chromosome

http://dev.lis.ncgr.org:50011/epic_search/gcv/gene;vigna=vigun.IT97K-499-35.gnm1.ann2.Vigun08g065800.v1.2?algorithm=repeat&match=10&mismatch=-1&gap=-1&score=30&threshold=50&bmatched=20&bintermediate=10&bmask=10&linkage=average&cthreshold=20&neighbors=50&matched=0.2&intermediate=15&sources=vigna&bregexp=&border=chromosome&regexp=&order=chromosome

http://dev.lis.ncgr.org:50011/epic_search/gcv/gene;vigna=vigun.IT97K-499-35.gnm1.ann2.Vigun08g065800.v1.2?algorithm=repeat&match=10&mismatch=-1&gap=-1&score=30&threshold=10000&bmatched=20&bintermediate=10&bmask=10&linkage=average&cthreshold=20&neighbors=50&matched=0.2&intermediate=15&sources=vigna&bregexp=&border=chromosome&regexp=&order=chromosome

adf-ncgr commented 4 years ago

PS. although I agree it isn't a GoldenLayout issue, I am not sure I'd want to make an official 2.0 release before this is addressed

alancleary commented 4 years ago

part of my puzzlement is that the smith-waterman never used to produce inversions at all, no matter the size. Another part is that it isn't clear to me how segments are being handled with respect to the threshold parameter.

I'll have to double check the code, but I'm pretty sure the threshold parameter is now only being used by the repeat algorithm as described in Biological sequence analysis, which, as I noted in my original post, I don't think should be the case. Instead it should be used to control the minimum score an inversion has to meet, as we were doing before. This will probably help remove some of the "stitching" that's occurring.

Regarding Smith-Waterman having inversions; you are correct, it didn't have inversions before. When I was revisiting the alignment algorithms I couldn't come up with a good reason why Smith-Waterman shouldn't have inversions, so I gave it inversions. Once the threshold parameter is filtering inverted segments again, it should be pretty easy to effectively disable inversions by setting a very high threshold. Alternatively, we could add a boolean to the alignment parameters that simply toggles whether or not inversions are computed.

adf-ncgr commented 4 years ago

zut! I knew I shouldn't have left Biological sequence analysis at the office during this pandemic. I vaguely recall that the reason we introduced the repeat algorithm in the first place was not for inversions but for segmental duplications (which seem to be much less common). We definitely want to retain an ability to handle inversions in the manner to which I have become accustomed. Anyway we should probably just chat some more about it when you've had a chance to review the code and feel ready for diving back into algorithmic issues- I didn't mean to distract you too much from your other heroics...

alancleary commented 4 years ago

Update: I've modified the repeat algorithm to optionally carryover the score from the previous alignment and told the GCV not to carryover, i.e. the algorithm still finds non-overlapping alignments but their scores are now completely independent. I've also constrained inversions to meet a min-size and score threshold (as previously discussed). All of this has helped, but the "stitching" you've described (@adf-ncgr) is still very common.

I looked into it and found that the true cause of the stitching is how forward and reverse oriented alignments are being combined to make tracks with inversions. Long story short, a new algorithm needs to be used, specifically, the weighted interval scheduling dynamic program. Fortunately the algorithm it's replacing can be repurposed to perform the additional processing of the forward and reverse oriented alignments needed to get them ready for the dynamic program.

It's hard to say how long this will take but I think it's safe to bump the estimate on this issue a little higher.

alancleary commented 4 years ago

Commit 1d1ab9fc57b46132291d610d6400a459ce96784e revised the alignment merging algorithm used to create alignments with inversions. It also fixed a few alignment related bugs and added a couple minor features to improve the quality of the alignments.

The alignments seem to be well behaved now, though some unexpected behavior still occurs if the threshold parameter is low. I've only seen this when it is (what I believe to be) too low, that is, lower than anyone would reasonably have it considering the other alignment parameters.

I'll wait for feedback from @adf-ncgr before closing the issue.

adf-ncgr commented 4 years ago

initial impressions are good, though I've done exactly one test so far. I'll note that the one test did suggest that something we've discussed is still an issue though I'm not sure offhand if we already have an issue for it or not. It's not a show-stopper but at some point it would be nice to fix whatever results in these wide stretches suggesting indels but apparently not corresponding to any. I vaguely remember a recent conversation we had about the HMM iterative surgery procedure maybe being responsible but don't recall whether we made an entry for it (not have I bothered to look yet...)

adf-ncgr commented 4 years ago

looked for relevant issue but did not find one; likely @alancleary could describe it better, so will let him do the honors (again, I don't think this needs to hold up the release)

alancleary commented 4 years ago

OK, I'll leave this issue open until you think it's good to go.

The empty column issue you're referring to is #198.

alancleary commented 4 years ago

Apparently there's an issue where if a gene is duplicated one or more times in tandem then the segment containing the duplications may be needlessly inverted, that is, inverting the segment doesn't improve the alignment score and it actually obfuscates the alignment.

As @adf-ncgr and I discussed, this shouldn't be a blocking issue for the version 2 release but it should be addressed.

alancleary commented 4 years ago

The tandem duplication issue was fixed in commit 478e0d3281b3ff31f1900919d5540766c984bea1.

Unfortunately there's still a couple issue that need to be resolved:

The weighted intervals that are pieced together into an alignment allow for the optimal alignment to not necessarily cover all of the genes it should, that is, some genes in the middle of the alignment may not be given a position or score! (This can occur when a run of mismatched/indel genes occurs in both the forward and reverse alignments) A proper fix would update the weighted interval generation code to only generate sets of intervals that guarantee the weighted interval scheduling dynamic program will always select a covering of the genes, but this will be tedious to devise and implement. A temporary fix would be to scan the final alignment for genes that aren't covered and use a heuristic to decide which alignment the segment should come from, i.e. if both flanking segments are from the reverse alignment, then fill it in from the reverse alignment, otherwise fill it in from the forward alignment. Either way, this should be fixed prior to the version 2 release.
Some alignments appear to have segments that aren't allowed, i.e. they have fewer genes than they should and/or their score doesn't meet the threshold parameter. A cursory inspection suggests that this may actually be an issue with the alignment drawing algorithm. Whatever the cause, I don't think it should be a blocker for the version 2 release.

alancleary commented 4 years ago

Commit 3eca963e7dc76787eb8feab6938d039601a72bc1 fixed both points from the previous comment.

There's still some oddities in the alignments, but I would consider them nuances of the algorithms than bugs. As such, I'll consider this issue closed. Undesirable alignment oddities can be pursued in other issues.

legumeinfo / gcv

Inversion and alignment scores #262