Parsing modification positions of peptidoform

michabirklbauer commented 8 months ago

For fraggraph we need to get the positions of the modifications from the peptidoform string.
E.g. get_modification_positions("ARTKQTARKSTGGKAPRKQLATKAARKSAPAT[-79.966331]GGV[+79.966331]KKPHRYRPGTVALRE") should return (32, 35), i.e. the 1-based indices of the modified residues.
There's probably some proteomics package that can already do that.
But we need someone to look into it, so we can integrate it.
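
To make the requested behaviour concrete, here is a minimal dependency-free sketch. It is not a ProForma parser: it assumes the string contains only uppercase one-letter residues, each optionally followed by one or more bracketed tags, and it rejects anything else (terminal modifications, nested brackets, other ProForma features). The regex and the error handling are illustrative assumptions, not part of fraggraph.

```python
import re

# Assumption: one uppercase residue letter, optionally followed by one or more
# non-nested [..] tags. Anything else is treated as unsupported syntax.
_RESIDUE = re.compile(r"[A-Z]((?:\[[^\[\]]*\])*)")


def get_modification_positions(peptidoform):
    """Return the 1-based positions of residues that carry a bracketed tag."""
    positions = []
    residue_index = 0
    offset = 0
    while offset < len(peptidoform):
        match = _RESIDUE.match(peptidoform, offset)
        if match is None:
            raise ValueError(
                f"unsupported syntax at offset {offset}: {peptidoform[offset:offset + 10]!r}"
            )
        residue_index += 1
        if match.group(1):  # at least one tag is attached to this residue
            positions.append(residue_index)
        offset = match.end()
    return tuple(positions)


example = "ARTKQTARKSTGGKAPRKQLATKAARKSAPAT[-79.966331]GGV[+79.966331]KKPHRYRPGTVALRE"
print(get_modification_positions(example))  # (32, 35)
```

A residue with several tags is reported once here. For anything beyond this simple case, a real ProForma parser is the safer choice.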

caetera commented 8 months ago

I did that sort of thing a lot when converting from ProForma to Peprec format (DeepLC input). Since your example input is valid ProForma, that should work. There is probably a bit of overhead when creating the ProForma object from the string representation: timeit reports 36.5 µs ± 220 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) for the complete function.

from pyteomics.proforma import ProForma

# This version returns only positions
def get_peprec_modification_positions(proforma_string):
    result = []
    p = ProForma.parse(proforma_string)

    # N-terminal modifications are reported at position 0
    if p.n_term is not None:
        for _ in p.n_term:
            result.append(0)

    # Residue modifications are reported at their 1-based position
    for i, (_, mods) in enumerate(p.sequence):
        if mods is not None:
            for _ in mods:
                result.append(i + 1)

    # C-terminal modifications are reported at position -1
    if p.c_term is not None:
        for _ in p.c_term:
            result.append(-1)

    return result

# This version returns Peprec modification string, i.e. position1|identity1|position2|identity2|...
def get_peprec_modifications(proforma_string):
    result = []
    p = ProForma.parse(proforma_string)

    # N-terminal modifications are reported at position 0
    if p.n_term is not None:
        for mod in p.n_term:
            result.append(f'0|{mod}')

    # Residue modifications are reported at their 1-based position
    for i, (_, mods) in enumerate(p.sequence):
        if mods is not None:
            for mod in mods:
                result.append(f'{i + 1}|{mod}')

    # C-terminal modifications are reported at position -1
    if p.c_term is not None:
        for mod in p.c_term:
            result.append(f'-1|{mod}')

    return '|'.join(result)
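
To illustrate the position1|identity1|position2|identity2|... layout described above, here is a small sketch that splits such a string back into (position, identity) pairs. The helper name parse_peprec_modifications is hypothetical, and the sketch assumes that identities never contain a '|' character and that an empty string means "no modifications".

```python
def parse_peprec_modifications(modifications):
    """Split 'pos1|name1|pos2|name2|...' into a list of (position, name) pairs."""
    if not modifications:
        return []  # assumption: empty string means no modifications

    fields = modifications.split('|')
    if len(fields) % 2 != 0:
        raise ValueError(f"expected position|identity pairs, got {modifications!r}")

    # Even-indexed fields are positions, odd-indexed fields are identities.
    return [(int(pos), name) for pos, name in zip(fields[0::2], fields[1::2])]


print(parse_peprec_modifications("0|Acetyl|3|Oxidation|-1|Amidated"))
# [(0, 'Acetyl'), (3, 'Oxidation'), (-1, 'Amidated')]
```

Positions 0 and -1 come back unchanged, so callers that only want 1-based residue positions would need to filter those two values out.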

michabirklbauer commented 8 months ago

@caetera Thanks Vladimir! I just tested it and it works perfectly! And I don't think the overhead is going to matter, as we only compute this once or twice. Should I commit this, or would you like to commit it yourself? 😊

caetera commented 8 months ago

Hi @michabirklbauer, happy to help. You are welcome to commit it yourself - you likely know better where it should be in the code.

michabirklbauer commented 8 months ago

Alright, will do! Thank you!

michabirklbauer / internal_ions

Parsing modification positions of peptidoform #19