google / deeppolisher

Transformer-based sequence correction method for genome assembly polishing
BSD 3-Clause "New" or "Revised" License

Q: Model similarity between DeepVariant and DeepPolisher? #1

Closed jkalleberg closed 7 months ago

jkalleberg commented 8 months ago

Hello,

Am I correctly interpreting DeepPolisher as a blend of DeepVariant + DeepConsensus? Are there any docs, existing or in progress, that illustrate the differences and similarities between the tensors/channels used in the DeepVariant model(s) and the DeepPolisher model? For example, describing how the model listed in the case study here (/opt/models/pacbio model/checkpoint) differs from the one listed under the PacBio case study for DeepVariant (--model_type PACBIO).

I have a custom DeepVariant model (WGS & WGS.AF) for my species that we plan to use for polishing Verkko assemblies through traditional variant calling. I assume an Illumina-based DeepVariant model is incompatible as a replacement for the listed DeepPolisher model, but I'd like to understand why. Thanks!

kishwarshafin commented 8 months ago

Hi @jkalleberg ,

DeepPolisher is an extension of DeepConsensus. However, it uses a different set of features to predict the errors you see in the assembly. A brief description is provided here.

DeepVariant and DeepPolisher are fundamentally different: DeepVariant uses a candidate-based approach with a convolutional neural network (CNN) to find variants, whereas DeepPolisher uses a transformer-based model to predict a corrected sequence. Unfortunately, DeepVariant models will not work with DeepPolisher.

jkalleberg commented 7 months ago

@kishwarshafin Thanks. I understand the incompatibility between the candidate-based and transformer architectures. But my question could have been worded better. I was referring specifically to the make_examples/make_images steps of the two models, not the underlying predictions. My confusion comes partly from the similarity between the visuals. The one included in your link breaks down the features in almost the same way the DeepVariant pileup examples are visualized in the docs (for example, here). The features in DeepPolisher seem more similar to DeepVariant's, as opposed to what's included in the workflow diagram for DeepConsensus.

Another point of confusion is that the case study for DeepPolisher has a lot of similarities with running DeepTrio. For example, DP runs maternal vs paternal inference separately, sort of like how DT stacks the maternal/paternal models in Fig 1 here.

I want to understand the evolution of the newer DeepXXX models against the ones I'm already using, but I'm getting confused about how the inputs are pre-processed differently. TIA!

kishwarshafin commented 7 months ago

Hi @jkalleberg ,

Yes, it can be a bit confusing. DeepVariant's make_examples creates "channels" that are stacked on top of each other, because DeepVariant uses a CNN model: convolution sums over these multiple stacked "channels", which together form one tensor. A good illustration of how the channels are stacked can be seen in Figure 1 of the DeepVariant manuscript.
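As a rough illustration of that layout (this is a hypothetical sketch, not the actual make_examples code; the sizes and channel names are assumptions for demonstration), each feature is its own 2-D image, and the images are stacked depth-wise into one tensor that a CNN convolves over:

```python
import numpy as np

# Illustrative dimensions: reads (height) x window of bases (width)
height, width = 100, 221
channel_names = ["base", "base_quality", "mapping_quality",
                 "strand", "supports_variant", "differs_from_ref"]

# One 2-D matrix per feature, stacked along a third axis into
# a (height, width, channels) pileup tensor, as a CNN expects.
channels = [np.zeros((height, width), dtype=np.uint8) for _ in channel_names]
pileup = np.stack(channels, axis=-1)
print(pileup.shape)  # (100, 221, 6)
```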

DeepPolisher's make_images instead creates a single matrix of size "bases x features", where we append the features one after another. The model processes this sequentially, seeing the features as rows. So all of the features live in a single tensor, rather than in channels stacked on top of each other, which is what you correctly noticed in the DeepPolisher description.
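By contrast, the transformer-style input can be sketched like this (again a hypothetical illustration, not the actual make_images code; the window length and feature names are made up): every feature becomes a column of one 2-D "bases x features" matrix, so each base position is a single feature vector the model reads in sequence:

```python
import numpy as np

num_bases = 1000  # illustrative window length along the assembly
feature_names = ["count_A", "count_C", "count_G",
                 "count_T", "count_insert", "count_delete"]

# One column per feature, concatenated side by side into a single
# (bases, features) matrix -- no channel axis, just one 2-D tensor.
columns = [np.zeros((num_bases, 1), dtype=np.float32) for _ in feature_names]
example = np.concatenate(columns, axis=1)
print(example.shape)  # (1000, 6)
```

The design difference follows from the architectures: a CNN wants a spatial image with a channel axis, while a transformer wants a sequence of per-position feature vectors.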

In terms of pre-processing:

Fundamentally, DeepVariant is trained to accurately genotype variants observed against the reference, while DeepPolisher is trained to correct errors introduced during the genome assembly process. Although the two tasks may seem equivalent, our previous work showed that we need to be very careful while polishing genome assemblies. So DeepPolisher simplifies the problem by using a sequential model trained specifically to solve the assembly polishing problem.

I hope this gives you some detail on the differences.

kishwarshafin commented 7 months ago

One more clarifying comment, in case you are asking about the "types of features", like base and match/mismatch: DeepConsensus is designed to work with PacBio subreads, which carry unique features like pulse width and duration that are specific to that sequencing technology. DeepPolisher uses reads aligned against the assembly, so the features used to train it come from the read-alignment space, which is also what DeepVariant works in. That is why DeepPolisher and DeepVariant may seem very similar to you: they share similar features.

jkalleberg commented 7 months ago

@kishwarshafin thank you for clarifying! It makes much more sense now. I appreciate the explanation.