google / deepconsensus

DeepConsensus uses gap-aware sequence transformers to correct errors in Pacific Biosciences (PacBio) Circular Consensus Sequencing (CCS) data.

question about the permutation invariance in the MSA input to the DeepConsensus model #38

Closed: rainwala closed this issue 2 years ago

rainwala commented 2 years ago

Hello,

Thank you for this interesting work. I have a question regarding the input to the model. From what I can tell, you convert subread sequences from a multiple sequence alignment into input tensors (adding additional information as well). What I cannot understand is whether your model is invariant to permutations of the order of subread sequences in the MSA. When you trained the model, how did you address this issue? Did you use some sort of default order of the subread sequences in the MSA?

rainwala commented 2 years ago

Actually, Andrew answered this for me as follows:

"We experimented with inputs that randomized the subread order as well as those which preserved the order of subreads to be ordered chronologically by the time that they were generated. In practice, we observed a noticeable, though not enormous, improvement in accuracy when the chronological information of when subreads are generated was preserved. It is likely the case that the properties of a sequencing run change from the beginning to the end of the run and there is some useful information encoded in this order.

In theory, the method will not be invariant to sequence order, and the extent to which this will produce noticeable differences will be a function of what information is implicit in ordering and the quality of the model."
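
For anyone else reading this thread, here is a toy sketch of what "not invariant to sequence order" means in practice. It is not DeepConsensus code: the encoding, the function names (`encode_msa`, `order_sensitive_score`, `order_invariant_score`), and the per-row weights are all made up for illustration. The assumption is only that subreads are stacked as rows, so any computation that treats row positions differently will change its output when the rows are permuted, whereas a plain sum over rows will not.

```python
# Toy illustration only: hypothetical encoding and scoring, not the DeepConsensus model.

# Hypothetical integer vocabulary for aligned bases; "-" marks an alignment gap.
VOCAB = {"-": 0, "A": 1, "C": 2, "G": 3, "T": 4}


def encode_msa(subreads):
    """Stack equal-length aligned subreads into a list of integer rows."""
    return [[VOCAB[base] for base in read] for read in subreads]


def order_sensitive_score(rows, row_weights):
    """Per-column weighted sum where the weight depends on the row position.

    Because each row index has its own weight, permuting rows can change the result.
    """
    n_cols = len(rows[0])
    return [
        sum(weight * row[col] for weight, row in zip(row_weights, rows))
        for col in range(n_cols)
    ]


def order_invariant_score(rows):
    """Per-column unweighted sum over rows; permuting rows cannot change the result."""
    n_cols = len(rows[0])
    return [sum(row[col] for row in rows) for col in range(n_cols)]


# Three aligned subreads, assumed here to be listed in chronological order.
chronological = ["ACGT", "AC-T", "ACGA"]
# The same subreads in a different order (reversed, to keep the example deterministic).
permuted = list(reversed(chronological))

# Hypothetical position-dependent weights (earlier rows count more).
weights = [3, 2, 1]

sensitive_a = order_sensitive_score(encode_msa(chronological), weights)
sensitive_b = order_sensitive_score(encode_msa(permuted), weights)
invariant_a = order_invariant_score(encode_msa(chronological))
invariant_b = order_invariant_score(encode_msa(permuted))

print("order-sensitive outputs differ:", sensitive_a != sensitive_b)
print("order-invariant outputs match:", invariant_a == invariant_b)
```

A real model is far more complex than a weighted sum, but the same reasoning applies: once row position carries meaning (here through `weights`), the order of subreads becomes part of the input, which is consistent with the answer quoted above that chronological ordering can carry useful signal.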