google / deepconsensus

DeepConsensus uses gap-aware sequence transformers to correct errors in Pacific Biosciences (PacBio) Circular Consensus Sequencing (CCS) data.
BSD 3-Clause "New" or "Revised" License

the label without alignment #65

Closed · one-matrix closed this issue 1 year ago

one-matrix commented 1 year ago

(screenshot: Snipaste_2023-05-23_17-46-17)

I have a question. When a subread " TGACA" and the label "TGACA" are aligned, the subread's first character is a gap, so that position has no alignment against the label. Will this affect the training accuracy?

danielecook commented 1 year ago

@one-matrix No, this will not affect training accuracy.

After the sequence is predicted, we strip all gaps from both the prediction and the label before passing it to an alignment-based loss function to calculate loss.
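For illustration, here is a minimal sketch of that gap-stripping step (the gap token id and the `alignment_loss` name are assumptions for the example, not the actual DeepConsensus code):

```python
import tensorflow as tf

GAP_TOKEN = 0  # assumed id of the internal gap token; illustrative only


def strip_gaps(seq: tf.Tensor) -> tf.Tensor:
  """Removes gap tokens from a 1-D integer sequence."""
  return tf.boolean_mask(seq, tf.not_equal(seq, GAP_TOKEN))


# A label with a leading gap (as in the screenshot) and a prediction with a
# trailing gap reduce to the same base sequence once gaps are stripped.
label = tf.constant([0, 4, 3, 1, 2, 1])  # " TGACA"-style encoding, leading gap
pred = tf.constant([4, 3, 1, 2, 1, 0])   # "TGACA "-style encoding, trailing gap
label_no_gaps = strip_gaps(label)        # [4, 3, 1, 2, 1]
pred_no_gaps = strip_gaps(pred)          # [4, 3, 1, 2, 1]
# loss = alignment_loss(label_no_gaps, pred_no_gaps)  # hypothetical loss call
```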

one-matrix commented 1 year ago

@danielecook Hi danielecook. The strand feature is different from pw and base: it uses 0 for forward and 1 for reverse, but the ModifiedOnDeviceEmbedding class masks id 0 the same as for any other feature. That might not be such a good idea.

```python
if params.use_strand:
  strand_vocab_size = params.STRAND_MAX + 1
  self.strand_embedding_layer = ModifiedOnDeviceEmbedding(
      vocab_size=strand_vocab_size,
      embedding_width=params['strand_hidden_size'],
      name='strand_embedding',
  )


class ModifiedOnDeviceEmbedding(layers.OnDeviceEmbedding):
  """Subclass of OnDeviceEmbedding, init similar to EmbeddingSharedWeights."""

  def __init__(self, vocab_size, embedding_width, **kwargs):
    # Set initializer and scale_factor to match the original implementation in
    # tensorflow_models/official/legacy/transformer/embedding_layer.py
    super().__init__(
        vocab_size,
        embedding_width,
        initializer=tf.random_normal_initializer(
            mean=0.0, stddev=embedding_width**-0.5
        ),
        scale_factor=embedding_width**0.5,
        **kwargs,
    )

  def call(self, inputs):
    # Make sure ids of 0 map to zero embeddings.
    embeddings = super().call(inputs)
    mask = tf.cast(tf.not_equal(inputs, 0), embeddings.dtype)
    embeddings *= tf.expand_dims(mask, -1)
    return embeddings
```
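To see what that mask does in isolation, here is a small standalone sketch (not the DeepConsensus classes, just the same masking trick): any position whose input id is 0 ends up with an all-zero embedding, while other ids keep their learned vectors.

```python
import tensorflow as tf

vocab_size, embedding_width = 3, 4
table = tf.random.normal([vocab_size, embedding_width])  # stand-in embedding table
inputs = tf.constant([[0, 1, 2]])                        # one sequence of ids

embeddings = tf.gather(table, inputs)                    # look up embeddings
mask = tf.cast(tf.not_equal(inputs, 0), embeddings.dtype)
embeddings *= tf.expand_dims(mask, -1)                   # zero out id-0 positions
print(embeddings[0, 0])  # all zeros; embeddings for ids 1 and 2 are untouched
```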

danielecook commented 1 year ago

I'm not sure I fully understand what the issue is. We encode strand as 0=unknown, 1=forward, 2=reverse:

```python
class Strand(int, enum.Enum):
  UNKNOWN = 0
  FORWARD = 1  # read.is_reverse == False
  REVERSE = 2  # read.is_reverse == True
```

These values are then embedded.
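A small sketch of that encoding (`strand_from_read` is a hypothetical helper for this example, not DeepConsensus code): because 0 is reserved for UNKNOWN, forward and reverse reads get ids 1 and 2, so the zero-masking in ModifiedOnDeviceEmbedding never touches real strand information.

```python
import enum
from typing import Optional


class Strand(int, enum.Enum):
  UNKNOWN = 0
  FORWARD = 1  # read.is_reverse == False
  REVERSE = 2  # read.is_reverse == True


def strand_from_read(is_reverse: Optional[bool]) -> Strand:
  """Hypothetical helper mapping a read's orientation to a Strand id."""
  if is_reverse is None:
    return Strand.UNKNOWN
  return Strand.REVERSE if is_reverse else Strand.FORWARD


assert strand_from_read(False) is Strand.FORWARD  # embedded as id 1
assert strand_from_read(True) is Strand.REVERSE   # embedded as id 2
assert strand_from_read(None) is Strand.UNKNOWN   # id 0, zero-masked embedding
```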

one-matrix commented 1 year ago

@danielecook Thanks, danielecook. "UNKNOWN = 0" resolves my doubt. The picture is a little misleading. (screenshot: 22-41-41)

danielecook commented 1 year ago

I see - thanks for pointing this out. I'll see if we can get the figure updated.