immcantation / presto

pRESTO is part of the Immcantation analysis framework for Adaptive Immune Receptor Repertoire sequencing (AIRR-seq). pRESTO is a bioinformatics toolkit for processing high-throughput lymphocyte receptor sequencing data.
https://presto.readthedocs.io
GNU Affero General Public License v3.0

Add subcommands to MaskPrimers to deal with data for which primers are unavailable #55

ssnn-airr closed 6 years ago

ssnn-airr commented 7 years ago

Original report by Jason Vander Heiden (Bitbucket: javh, GitHub: javh).


We could use a subcommand in MaskPrimers to deal with data that do not have primer sequences, such as masking X bases from a given start position. The same mode should probably be able to extract UMIs both as part of the masking process and without any masking.
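
To make the request concrete, here is a minimal sketch of position-based masking and extraction that needs no primer file. The function names (`mask_region`, `extract_region`) and their parameters are hypothetical illustrations, not part of pRESTO, and the sketch assumes 0-based start positions.

```python
def mask_region(seq, start, length, mask_char="N"):
    """Replace `length` bases beginning at 0-based `start` with `mask_char`.

    Hypothetical helper for illustration only; not pRESTO's implementation.
    """
    # Clamp the end so a region running past the sequence is masked only
    # up to the last base.
    end = min(start + length, len(seq))
    return seq[:start] + mask_char * (end - start) + seq[end:]


def extract_region(seq, start, length):
    """Return the `length` bases beginning at 0-based `start` (e.g. a UMI)."""
    return seq[start:start + length]


if __name__ == "__main__":
    read = "ACGTACGTAC"
    print(mask_region(read, 2, 4))     # region masked, read length unchanged
    print(extract_region(read, 2, 4))  # region returned, read untouched
```

Keeping masking and extraction as separate functions mirrors the request that UMIs be extractable both with and without masking.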

ssnn-airr commented 6 years ago

Original comment by Jason Vander Heiden (Bitbucket: javh, GitHub: javh).


Testing done. No difference in the output of MaskPrimers-align and MaskPrimers-score between tip and v0.5.6.

ssnn-airr commented 6 years ago

Original comment by Jason Vander Heiden (Bitbucket: javh, GitHub: javh).


The extract mode probably needs the --revpr argument as well, so you can extract from the tail of sequences of different lengths.
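
As a rough illustration of why tail extraction needs its own handling, the sketch below counts positions from the 3' end so that reads of different lengths yield the same region. The function `extract_tail` and its `offset` parameter are hypothetical and only approximate what a --revpr style option would have to do.

```python
def extract_tail(seq, length, offset=0):
    """Return `length` bases ending `offset` bases before the 3' end.

    Hypothetical helper for illustration only; positions are measured
    from the tail, so the result does not depend on the read length.
    """
    end = max(len(seq) - offset, 0)
    start = max(end - length, 0)
    return seq[start:end]


if __name__ == "__main__":
    # Two reads of different length share the same 4-base tail.
    print(extract_tail("AAAACCGG", 4))
    print(extract_tail("TTAAAACCGG", 4))
```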

ssnn-airr commented 6 years ago

Original comment by Jason Vander Heiden (Bitbucket: javh, GitHub: javh).


Done in 6f327e0, but MaskPrimers needs a lot of testing now.

ssnn-airr commented 6 years ago

Original comment by Jason Vander Heiden (Bitbucket: javh, GitHub: javh).


Currently, that's how we would do it. I'm cleaning out my email and this is an old user request. It could be accomplished in a single step using the same --barcode approach as in align/score.

ssnn-airr commented 6 years ago

Original comment by Roy Jiang (Bitbucket: ruoyijiangyale).


I think we'd want to do that in 2 stages.... no?

ssnn-airr commented 6 years ago

Original comment by Jason Vander Heiden (Bitbucket: javh, GitHub: javh).


Should also accommodate this case:

>sequence
NNNNNNNNNNNNNNNNXXXXXXXXXXATGTCGATAGCTACGTCACTG

Where N = cell barcode and X = UMI. And what you want is:

>sequence|CELL=NNNNNNNNNNNNNNNN|UMI=XXXXXXXXXX
NNNNNNNNNNNNNNNNXXXXXXXXXXATGTCGATAGCTACGTCACTG
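
The case above can be sketched as follows, assuming the cell barcode and UMI sit back to back at the 5' end with known fixed lengths. The function `annotate_barcodes` and the `(name, length)` field list are hypothetical, chosen only to reproduce the header format shown; the sequence itself is returned unmodified, as in the desired output.

```python
def annotate_barcodes(header, seq, fields):
    """Append NAME=VALUE annotations for consecutive 5' fields.

    `fields` is a list of (name, length) pairs consumed in order from
    the start of `seq`. Hypothetical helper for illustration only.
    """
    pos = 0
    annotations = []
    for name, length in fields:
        annotations.append(f"{name}={seq[pos:pos + length]}")
        pos += length
    # Join with the pipe delimiter used in the example header above.
    return "|".join([header] + annotations), seq


if __name__ == "__main__":
    cell = "ACGTACGTACGTACGT"   # 16 nt cell barcode (made-up bases)
    umi = "TTGGCCAATT"          # 10 nt UMI (made-up bases)
    read = cell + umi + "ATGTCGATAGCTACGTCACTG"
    new_header, new_seq = annotate_barcodes(
        "sequence", read, [("CELL", 16), ("UMI", 10)]
    )
    print(">" + new_header)
    print(new_seq)
```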

ssnn-airr commented 6 years ago

Original comment by Jason Vander Heiden (Bitbucket: javh, GitHub: javh).


I added command line arguments and a skeleton for this subcommand (extract), but didn't do any of the actual implementation. Let's take a look at it whenever you have time. See how we want to handle the task.

ssnn-airr commented 6 years ago

Original comment by Jason Vander Heiden (Bitbucket: javh, GitHub: javh).


Yeah, that's how we've been doing it to date. You could also do it with any sequence using --maxerror 1.

It's unintuitive though. You need a pointless file that could just be replaced by a --length argument.

ssnn-airr commented 6 years ago

Original comment by Roy Jiang (Bitbucket: ruoyijiangyale).


This can be done by creating a primer file like this:


>BARCODE
NNNNNNNNNNN

MaskPrimers.py score \
    -s ${SEQ} -p ${PRIMERS} \
    --mode cut/trim/tag/mask \
    --start 2 \
    --barcode \
    --maxerror 0.2 #irrelevant...

But we would still want another mode where only the primer is removed, as opposed to cut and trim (which remove either the preceding nts or both the preceding nts and the primer). We would also want to change the barcode specification so that the cut-out chunk is placed in the annotation field.