rlorigro / GFAse

Tool for globally phasing diploid assembly graphs with orthogonal data
Mozilla Public License 2.0

Where to get detailed description of GFAse's computational methods? #28

Open xujialupaoli opened 1 month ago

xujialupaoli commented 1 month ago

Hi,

I am very interested in the methods behind GFAse. I have carefully read your paper ‘Phased nanopore assembly with Shasta and modular graph phasing with GFAse’, but I don’t fully understand GFAse’s computational methods and would like to know more details. I see that Shasta has a page describing its methods at https://paoloshasta.github.io/shasta/ComputationalMethods.html, but I can’t find an equivalent for GFAse. Can you tell me where to find it?

Looking forward to your reply!

rlorigro commented 1 month ago

Hi Rebecca, unfortunately there is no existing description more thorough than the cited papers. However, I am happy to answer any questions you have. In summary, there is a preprocessing step (to find bubbles and estimate the contact map), an optimization step, and then a chaining step. Is there anything in particular you want to know about?

xujialupaoli commented 1 month ago

Thank you very much for your reply!

I saw in your article that "To phase the graph, GFAse first identifies diploid, haplotypic bubbles. Two methods are available in GFAse: assembler annotation and sequence similarity search", but the article did not introduce how GFAse implements the phase process through assembler annotation. Could you tell me about this process?

In addition, in the introduction to GFAse, your article mentioned that "Phases are optimized using a stochastic method that approximates a solution to the optimization variant of the max-cut problem (Selvaraj et al. 2013; Edge et al. 2017; Cheng et al. 2021). The method depends on an objective function that penalizes inconsistent contacts and rewards consistent contacts." Can I know the specific judgment method of the objective function you are talking about?

rlorigro commented 1 month ago

> the article did not introduce how GFAse implements the phase process through assembler annotation. Could you tell me about this process?

This one is very simple: it just means that Shasta (Mode 2) already labels the bubbles in its GFA through a naming convention. For example, two node names in the GFA (S lines) that are members of the same bubble have identical prefixes, but their suffixes are .0 and .1 respectively. You can see the naming convention in the figure below, for the nodes starting with PR. However, using the node names is no longer recommended with Shasta Mode 3, so we have reverted to using homology search.
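As a rough illustration of the naming convention described above, bubble members could be paired by stripping the trailing .0 or .1 and grouping on the shared prefix. This is only a sketch and not GFAse's actual code: the function name `pair_bubbles_by_name` and the node names in the example are made up for illustration, and the only thing taken from the thread is that the two sides of a bubble share a prefix and differ in a .0/.1 suffix.

```python
from collections import defaultdict


def pair_bubbles_by_name(node_names):
    """Group node names that share a prefix and end in .0 or .1.

    Hypothetical helper: returns {prefix: (name_side_0, name_side_1)}
    for every prefix that has both sides present.
    """
    sides = defaultdict(dict)
    for name in node_names:
        # Split on the last dot: "PR.12.0" -> ("PR.12", ".", "0")
        prefix, sep, suffix = name.rpartition(".")
        if sep and suffix in ("0", "1"):
            sides[prefix][int(suffix)] = name

    # Keep only complete bubbles, i.e. prefixes with both a .0 and a .1 node
    return {
        prefix: (found[0], found[1])
        for prefix, found in sides.items()
        if 0 in found and 1 in found
    }


# Made-up node names, for illustration only
example_names = ["PR.12.0", "PR.12.1", "PR.13.0", "other_node"]
example_bubbles = pair_bubbles_by_name(example_names)
print(example_bubbles)
```

In this sketch, "PR.13.0" is dropped because it has no matching .1 partner, and names without a .0/.1 suffix are ignored.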

> Can I know the specific judgment method of the objective function you are talking about?

The objective that we maximize is the "consistency score", which we have defined as simply the sum of the consistent contacts minus the sum of the inconsistent contacts. The code is here. In the figure below, the bubble orientations are described with the red/blue node fill colors. The consistent weights are along the curved edges in green and the inconsistent weights are in red. In this example the objective/score would be (68+104)-(6+9) = 157. If you flipped one of the bubbles, the consistent and inconsistent contacts would swap and the score would be the opposite.

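[image: example graph with bubble nodes named PR..., red/blue node fills showing bubble orientation, and contact weights on curved edges (consistent in green, inconsistent in red)]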
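The consistency score described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the GFAse implementation: the function name `consistency_score`, the bubble labels "A" and "B", and the way the four weights from the example (68, 104, 6, 9) are attached to specific node pairs are all assumptions, since the figure itself is not reproduced here. A contact counts as consistent when the two nodes it joins end up in the same phase under the current bubble orientations.

```python
def consistency_score(contacts, flipped):
    """Sum of consistent contact weights minus sum of inconsistent ones.

    contacts: list of ((bubble_a, side_a), (bubble_b, side_b), weight),
              where side is 0 or 1 (the two nodes of a bubble).
    flipped:  dict mapping bubble -> 0 or 1; a value of 1 swaps which
              phase each side of that bubble is assigned to.
    """
    score = 0
    for (bubble_a, side_a), (bubble_b, side_b), weight in contacts:
        # XOR the side with the flip state to get the node's current phase
        phase_a = side_a ^ flipped[bubble_a]
        phase_b = side_b ^ flipped[bubble_b]
        score += weight if phase_a == phase_b else -weight
    return score


# Assumed layout: two bubbles, A and B, with the weights from the example
example_contacts = [
    (("A", 0), ("B", 0), 68),   # consistent when neither bubble is flipped
    (("A", 1), ("B", 1), 104),  # consistent when neither bubble is flipped
    (("A", 0), ("B", 1), 6),    # inconsistent when neither bubble is flipped
    (("A", 1), ("B", 0), 9),    # inconsistent when neither bubble is flipped
]

score_as_drawn = consistency_score(example_contacts, {"A": 0, "B": 0})
score_b_flipped = consistency_score(example_contacts, {"A": 0, "B": 1})
print(score_as_drawn, score_b_flipped)
```

With this assumed layout, the unflipped orientation gives (68+104)-(6+9) = 157, and flipping bubble B negates the score. An optimizer for this objective would search over the `flipped` assignments for the one with the highest score.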