immcantation / presto

pRESTO is part of the Immcantation analysis framework for Adaptive Immune Receptor Repertoire sequencing (AIRR-seq). pRESTO is a bioinformatics toolkit for processing high-throughput lymphocyte receptor sequencing data.
https://presto.readthedocs.io
GNU Affero General Public License v3.0

MaskPrimers-align edge cases are not masking correctly #42

Open ssnn-airr opened 8 years ago

ssnn-airr commented 8 years ago

Original report by Jason Vander Heiden (Bitbucket: javh, GitHub: javh).


When the alignment has a gap in the input sequence at the very end of the alignment, the masking is off by one. For example:

       ID> SRR765688.1679
SEQORIENT> RC
   PRIMER> LR11
 PRORIENT> F
  PRSTART> 33
    INSEQ> TCACCTGCGCTGTCTCTGGTGGCTCCATCAGCAGTAGTAACTGGTGGAGT-TGGGTCCGCAGCCC
    ALIGN> ---------------------------------GGTGCAGCTGGTGGAGTC
   OUTSEQ>                                  NNNNNNNNNNNNNNNNNTTGGGTCCGCAGCCC
   FIXSEQ>                                  NNNNNNNNNNNNNNNNNNTGGGTCCGCAGCCC

    ERROR> 0.2777777777777778

In this case, OUTSEQ has one extra T. Trivial attempts at a fix solve the problem in the right-hand gap case, but introduce it in the left-hand gap case:

       ID> SRR765688.1837
SEQORIENT> RC
   PRIMER> LR3
 PRORIENT> F
  PRSTART> 0
    INSEQ> -GCAATCTGGGTCTGAGTTGAAGACGGCCTGGGGCCTCAGTGAAGATTTCCTGCAAGAC
    ALIGN> TGCAATCTGGGTCTGAGTTG-------------------------------
   OUTSEQ> NNNNNNNNNNNNNNNNNNNNAAGACGGCCTGGGGCCTCAGTGAAGATTTCCTGCAAGAC
   FIXSEQ> NNNNNNNNNNNNNNNNNNNNNAGACGGCCTGGGGCCTCAGTGAAGATTTCCTGCAAGAC
    ERROR> 0.050000000000000044

Will require more detailed parsing of the local alignments to fix both edge cases.
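
A minimal sketch of that kind of parsing, not pRESTO's actual code: instead of deriving the cut point from the alignment length, count the input-sequence bases that fall in the columns covered by the primer. `mask_primer` is a hypothetical helper. It assumes both strings share the same alignment columns (as INSEQ and ALIGN do above), that the mask is one N per primer base, and that everything upstream of the primer is dropped.

```python
GAP = "-"


def mask_primer(seq_aln, primer_aln, mask_char="N"):
    """Mask the primer region of a gapped pairwise alignment.

    Hypothetical helper for illustration only.
    seq_aln    -- input sequence with alignment gaps (like INSEQ)
    primer_aln -- primer in the same columns, padded with gaps (like ALIGN)
    """
    # Columns in which the primer contributes a base.
    primer_cols = [i for i, c in enumerate(primer_aln) if c != GAP]
    if not primer_cols:
        raise ValueError("primer alignment contains no bases")
    last_col = primer_cols[-1]

    # Ungapped end coordinate: the number of sequence bases in columns
    # 0..last_col. Gaps in the sequence do not advance this coordinate,
    # which is what both edge cases depend on.
    end = sum(1 for c in seq_aln[:last_col + 1] if c != GAP)

    ungapped = seq_aln.replace(GAP, "")
    return mask_char * len(primer_cols) + ungapped[end:]


if __name__ == "__main__":
    # Right-hand gap case (SRR765688.1679)
    seq = "TCACCTGCGCTGTCTCTGGTGGCTCCATCAGCAGTAGTAACTGGTGGAGT-TGGGTCCGCAGCCC"
    aln = "-" * 33 + "GGTGCAGCTGGTGGAGTC"
    print(mask_primer(seq, aln))

    # Left-hand gap case (SRR765688.1837)
    seq = "-GCAATCTGGGTCTGAGTTGAAGACGGCCTGGGGCCTCAGTGAAGATTTCCTGCAAGAC"
    aln = "TGCAATCTGGGTCTGAGTTG"
    print(mask_primer(seq, aln))
```

Under those assumptions the right-hand gap case gives 18 Ns followed by TGGGTCCGCAGCCC, and the left-hand gap case gives 20 Ns followed by AAGACGGCC..., so neither side gains or loses a base.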

ssnn-airr commented 4 years ago

Original comment by Jason Vander Heiden (Bitbucket: javh, GitHub: javh).


Cool. There are a few more implementations out there. The developers of the first one are less than responsive:

https://github.com/mengyao/Complete-Striped-Smith-Waterman-Library/issues/55

But there have been multiple versions since I last tried to install it:

https://pypi.org/project/ssw-py/

ssnn-airr commented 4 years ago

Original comment by Edel Aron (Bitbucket: edel.aron).


I’ll definitely look into those, thanks Jason!

ssnn-airr commented 4 years ago

Original comment by Jason Vander Heiden (Bitbucket: javh, GitHub: javh).


Look for the striped Smith-Waterman algorithm. The problem is finding a C implementation of it with a working Python wrapper that is trivial to install (via pip). See the following for a starting point:

  1. https://github.com/mengyao/complete-striped-smith-waterman-library
  2. http://scikit-bio.org/docs/0.5.6/generated/skbio.alignment.StripedSmithWaterman.html#skbio.alignment.StripedSmithWaterman
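
For reference, here is a plain (non-striped) Smith-Waterman in pure Python. This is only a sketch of the underlying algorithm: the striped variant is a SIMD-vectorized way of filling the same scoring matrix, and none of the names below come from either library. The scoring values and the linear gap penalty are arbitrary assumptions. What matters for this issue is that the traceback returns explicit start and end coordinates in the input sequence, so the masking step does not have to infer them from gapped strings.

```python
def smith_waterman(seq, primer, match=2, mismatch=-1, gap=-2):
    """Local alignment of primer against seq (illustrative, linear gap cost).

    Returns a dict with the score, half-open ungapped coordinates in both
    strings, and the two gapped alignment strings.
    """
    rows, cols = len(seq) + 1, len(primer) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the scoring matrix, remembering the highest-scoring cell.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if seq[i - 1] == primer[j - 1] else mismatch
            score[i][j] = max(0,
                              score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until the score drops to zero.
    i, j = best_pos
    seq_end, primer_end = i, j
    seq_aln, primer_aln = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if seq[i - 1] == primer[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            seq_aln.append(seq[i - 1])
            primer_aln.append(primer[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            # Gap in the primer: consumes a sequence base only.
            seq_aln.append(seq[i - 1])
            primer_aln.append("-")
            i -= 1
        else:
            # Gap in the sequence: consumes a primer base only.
            seq_aln.append("-")
            primer_aln.append(primer[j - 1])
            j -= 1

    return {
        "score": best,
        "seq_start": i, "seq_end": seq_end,
        "primer_start": j, "primer_end": primer_end,
        "seq_aln": "".join(reversed(seq_aln)),
        "primer_aln": "".join(reversed(primer_aln)),
    }


if __name__ == "__main__":
    result = smith_waterman("TTTACGTACGTTT", "ACGTACG")
    print(result["seq_start"], result["seq_end"], result["score"])
```

With `seq_end` in hand, the unmasked remainder is simply `seq[seq_end:]`, regardless of where gaps fall inside the alignment.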

ssnn-airr commented 4 years ago

Original comment by Jason Vander Heiden (Bitbucket: javh, GitHub: javh).


I think the reason this issue has sat for so long is that, given the performance problems, it's probably a better use of time to swap out the function used for the local alignment than to hack a solution around the existing one.

ssnn-airr commented 4 years ago

Original comment by Jason Vander Heiden (Bitbucket: javh, GitHub: javh).


Not aside from the example I posted in the issue.

ssnn-airr commented 4 years ago

Any chance that you already have some toy data to reproduce this and fix this bug?