Open avilella opened 3 weeks ago
Any bases that are soft-clipped in the read-to-reference bam file will be ignored when generating features that are used for consensus inference. There is not a straightforward way to remove this restriction. To extend the consensus into the FWR4 region, you would need to extend the reference sequence to include the CDR3/FWR4 regions.
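As a rough illustration of the "extend the reference" suggestion, the sketch below reads the trailing soft clip off a CIGAR string and appends the clipped bases of one read to the V-gene reference. The function names `trailing_soft_clip` and `extend_reference` are hypothetical and not part of medaka. The sketch assumes the alignment runs to the last base of the V-gene reference and that the read sequence is given as stored in the BAM record. A tail taken from a single read still carries that read's sequencing errors, so it only serves as a scaffold for re-alignment.

```python
import re

# One CIGAR operation: a length followed by an operation code (SAM spec).
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def trailing_soft_clip(cigar: str) -> int:
    """Return the number of soft-clipped bases at the end of the read (0 if none)."""
    ops = CIGAR_OP.findall(cigar)
    # Hard clips (H) can follow a soft clip and are not present in SEQ; skip them.
    while ops and ops[-1][1] == "H":
        ops.pop()
    if ops and ops[-1][1] == "S":
        return int(ops[-1][0])
    return 0


def extend_reference(v_gene: str, read_seq: str, cigar: str) -> str:
    """Append the read's soft-clipped tail (CDR3/FWR4 candidate) to the V-gene.

    Assumption: the aligned part of the read ends at the last reference base.
    """
    n = trailing_soft_clip(cigar)
    tail = read_seq[len(read_seq) - n:] if n else ""
    return v_gene + tail


if __name__ == "__main__":
    # Toy example: 5 aligned bases followed by 3 soft-clipped bases.
    print(extend_reference("ACGTA", "ACGTACCC", "5M3S"))
```

The reads would then need to be re-aligned to the extended reference before running the consensus step, so that the former soft-clipped bases become aligned bases.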
Alternatively, you could try using medaka smolecule, which first generates a POA of the reads and then performs a consensus of alignments to the POA consensus sequence. This should span the full length of the reads, though the accuracy will be limited by how well the variable CDR3 region can be aligned in the POA.
Thanks, I'll try medaka smolecule.
Is there a way in which I can spike "fake reads" into it so that the indels in the V-gene portion disappear but the SNVs remain there? Maybe do it as a 2-step process: medaka smolecule with spiked-in fake V-gene reads, then take the newly created reference with the CDR3/FWR4, and use it to re-align the reads against it?
This may work but it seems simpler just to use the real reads. Any sequencing errors should be removed by the POA and medaka consensus steps while genuine variants would be retained.
Can medaka generate a consensus that extends from the soft-clipped 3' end of ONT reads mapped to a reference?
E.g. for B-cell repertoire or T-cell repertoire transcript sequencing with ONT, one can map the reads onto the V-gene sequence, which will look as shown below:
The `x` part of the ONT reads maps to the V-gene; there may be mismatches due to hypermutation, which should be dealt with the same way as SNVs in genomic variant calling. The `c` part is the CDR3 region, which is unique to each cell and doesn't have a reference. The `f` part is the FWR4, which continues past the CDR3 region and doesn't align to the V-gene. There could be `i` insertions and `-` deletions, which, when they fall in the V-gene mapping region, are always sequencing errors, as there are no indels in the V-gene part.

Given a .bam file of reads mapping to their corresponding V-gene reference, how do I run medaka to obtain the consensus sequence that includes the CDR3 and FWR4 parts that don't map to the V-gene reference?
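For reference, the read layout described above can be read off each alignment's CIGAR string. The following is a minimal sketch, with the hypothetical helper `summarise_cigar`, that tallies the aligned V-gene bases, the insertions and deletions, and the 3' soft clip that would hold the CDR3 and FWR4. It assumes standard SAM CIGAR operation codes and ignores hard clips.

```python
import re
from collections import Counter

# One CIGAR operation: a length followed by an operation code (SAM spec).
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def summarise_cigar(cigar: str) -> dict:
    """Count aligned, inserted, deleted and trailing soft-clipped bases."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    totals = Counter()
    for n, op in ops:
        totals[op] += n
    # Only a soft clip that is the final operation counts as the 3' tail here.
    tail_clip = ops[-1][0] if ops and ops[-1][1] == "S" else 0
    return {
        "aligned": totals["M"] + totals["="] + totals["X"],  # V-gene portion
        "inserted": totals["I"],
        "deleted": totals["D"],
        "tail_soft_clip": tail_clip,  # candidate CDR3 + FWR4 bases
    }


if __name__ == "__main__":
    print(summarise_cigar("10M1I5M2D4M30S"))
```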
Thanks in advance.