immcantation / presto

pRESTO is part of the Immcantation analysis framework for Adaptive Immune Receptor Repertoire sequencing (AIRR-seq). pRESTO is a bioinformatics toolkit for processing high-throughput lymphocyte receptor sequencing data.
https://presto.readthedocs.io
GNU Affero General Public License v3.0

Add unique identifier to start of read names for downstream analysis #87

Closed: ssnn-airr closed this issue 2 years ago

ssnn-airr commented 3 years ago

Original report by notrando (Bitbucket: notrando).


Hi there,

Thanks for creating pRESTO; it's a great tool and an amazing ecosystem.

I’m trying to analyse the output of pRESTO using the IMGT servers, but they truncate the read names, so information is lost and downstream analysis is affected.

pRESTO provides a few options that partially solve the issue, namely the ParseHeaders.py subcommands add and rename. add appends to the end of the read name, so unfortunately it doesn’t help. rename adds to the start, with some minor issues (like adding NONE| for some odd reason). Neither really provides what is needed: a short unique identifier that can be added to the start of the read name.

I think a simple solution is to add the record number to the start of each read name. For example, 100 reads would have SAMPLE_1, SAMPLE_2, ..., SAMPLE_100 added to the start of the read name. The ideal solution would be a new subcommand that renames the headers to the sample record and then creates a text file with the new and old names for renaming back or referencing.
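As a rough illustration of that idea, here is a minimal standalone sketch in plain Python. It is not part of pRESTO and not a proposed implementation. The function names `rename_headers` and `restore_headers`, the `SAMPLE` prefix default, and the tab-separated mapping layout are all hypothetical choices made for this example. It assumes FASTA input; FASTQ would need its own record handling. It replaces the whole header with the short identifier, as the subcommand idea describes, and keeps a new/old mapping so the original names can be restored afterwards.

```python
import csv
import io


def rename_headers(fasta_in, fasta_out, map_out, prefix="SAMPLE"):
    """Replace each FASTA header with PREFIX_N and log new/old names as TSV.

    All three arguments are open text-file handles. Returns the record count.
    (Hypothetical helper for illustration; not a pRESTO function.)
    """
    writer = csv.writer(map_out, delimiter="\t", lineterminator="\n")
    writer.writerow(["new_id", "old_id"])
    count = 0
    for line in fasta_in:
        if line.startswith(">"):
            count += 1
            old_id = line[1:].rstrip("\r\n")
            new_id = f"{prefix}_{count}"
            writer.writerow([new_id, old_id])
            fasta_out.write(f">{new_id}\n")
        else:
            # Sequence lines pass through unchanged.
            fasta_out.write(line)
    return count


def restore_headers(fasta_in, fasta_out, map_in):
    """Reverse rename_headers using the TSV mapping it produced."""
    reader = csv.reader(map_in, delimiter="\t")
    next(reader)  # skip the "new_id / old_id" header row
    old_by_new = {new_id: old_id for new_id, old_id in reader}
    for line in fasta_in:
        if line.startswith(">"):
            new_id = line[1:].rstrip("\r\n")
            fasta_out.write(f">{old_by_new[new_id]}\n")
        else:
            fasta_out.write(line)


if __name__ == "__main__":
    # Toy records with made-up pRESTO-style headers.
    original = (
        ">READ_A|CONSCOUNT=3|PRCONS=IGHM\nACGT\n"
        ">READ_B|CONSCOUNT=1|PRCONS=IGHG\nGGCC\n"
    )
    renamed, mapping, restored = io.StringIO(), io.StringIO(), io.StringIO()
    rename_headers(io.StringIO(original), renamed, mapping)
    print(renamed.getvalue(), end="")

    # Round trip: the mapping file is enough to recover the original names.
    renamed.seek(0)
    mapping.seek(0)
    restore_headers(renamed, restored, mapping)
    assert restored.getvalue() == original
```

Because the short identifier replaces the whole header here, the names should survive truncation as long as the identifier fits within whatever length IMGT keeps, and the mapping file is what carries the annotations back.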

On a slightly related note, it would be fantastic if the add subcommand could add to the start of the read name.

Thanks!

ssnn-airr commented 3 years ago

Hi! I need more details. Are your pRESTO output files using the pRESTO annotation scheme? Are you using Change-O’s MakeDb.py imgt to parse the IMGT output? (There is an example here and the documentation is here).

ssnn-airr commented 2 years ago

Reopen if needed