levitsky / pyteomics

Pyteomics is a collection of lightweight and handy tools for Python that help to handle various sorts of proteomics data. Pyteomics provides a growing set of modules to facilitate the most common tasks in proteomics data analysis.
http://pyteomics.readthedocs.io
Apache License 2.0
105 stars 34 forks source link

fasta .get_by_id method only matching single entry #76

Closed radusuciu closed 2 years ago

radusuciu commented 2 years ago

I work with fasta databases contain reversed decoy sequences in this format:

>sp|P01111|RASN_HUMAN GTPase NRas OS=Homo sapiens (Human) OX=9606 GN=NRAS PE=1 SV=1
MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDILDTAGQEEYSAMRDQYMRTGEGFLC
VFAINNSKSFADINLYREQIKRVKDSDDVPMVLVGNKCDLPTRTVDTKQAHELAKSYGIPFIETSAKTRQGVEDAFYTLV
REIRQYRMKKLNSSDDGTQGCMGLPCVVM
>Reverse_sp|P01111|RASN_HUMAN GTPase NRas OS=Homo sapiens (Human) OX=9606 GN=NRAS PE=1 SV=1
MVVCPLGMCGQTGDDSSNLKKMRYQRIERVLTYFADEVGQRTKASTEIFPIGYSKALEHAQKTDVTRTPLDCKNGVLVMP
VDDSDKVRKIQERYLNIDAFSKSNNIAFVCLFGEGTRMYQDRMASYEEQGATDLIDLLCTEGDIVVQKRYSDEITPDYED
VFHNQILQITLASKGVGGAGVVVLKYETM

These are generated by downloading an appropriate file from uniprot, then decoys are added using OpenMS' DecoyDatabase utility. I was hoping to use pyteomics to manipulate these database (eg. adding/removing sequences), but when I attempt to get an entry using a uniprot accession, I can only see the latest entry:

>>> path = pathlib.Path('human.fasta')
>>> human = fasta.read(str(path), use_index=True, flavor='uniprot')
>>> human.get_by_id('P01111')
Protein(description={'db': 'Reverse_sp', 'id': 'P01111', 'entry': 'RASN_HUMAN', 'name': 'GTPase NRas', 'gene_id': 'RASN', 'taxon': 'HUMAN', 'OS': 'Homo sapiens (Human)', 'OX': '9606', 'GN': 'NRAS', 'PE': 1, 'SV': 1}, sequence='MVVCPLGMCGQTGDDSS
NLKKMRYQRIERVLTYFADEVGQRTKASTEIFPIGYSKALEHAQKTDVTRTPLDCKNGVLVMPVDDSDKVRKIQERYLNIDAFSKSNNIAFVCLFGEGTRMYQDRMASYEEQGATDLIDLLCTEGDIVVQKRYSDEITPDYEDVFHNQILQITLASKGVGGAGVVVLKYETM')

I understand that this is due to the fact that both the normal and reversed sequences are in the same file and are tagged with the same uniprot accession, but I was hoping that the get_by_id call would return a list of entries matching my query instead of just the last one.

levitsky commented 2 years ago

The indexing mechanism assumes that IDs are unique, there is no simple way around that. However, classes such as IndexedUniProt (which is what you are using) and all other subclasses of TwoLayerIndexedFASTA actually keep a mapping between extracted IDs and full descriptions, where the latter are the actual keys used for byte offset indexing. That means that both the target and the decoy entries are accessible by their full description. So, in case of reading a database with decoys, we can take advantage of the fact that full descriptions are identical except for the decoy prefix, and quite easily retrieve both decoy and target entries given the accession. That would require writing a subclass of TwoLayerIndexedFASTA, which I can try tomorrow if you are interested.

A completely alternative suggestion would be to drop the OpenMS utility from the workflow, and start working with the target database, then add or skip some entries, and only then generate the decoys for the final entry set using Pyteomics. There are plenty decoy-related functions in the fasta module, and if you just generate decoys as needed, you won't have the problem of repeated IDs in the file.

If you elaborate on your algorithm a bit, hopefully I can expand on either or both of my suggestions.

radusuciu commented 2 years ago

Thanks for the quick response!

Since posting this issue I've "worked around" this by skipping the indexing and just iterating through the entries, checking the accession against a list of accessions to remove, and a separate list of accessions to "patch" (ie. substitute the sequence). Kinda like this (code incomplete and abbreviated in places):

patched = []

for entry in fasta.read(str(fasta_path)):
    description = entry.description
    sequence = entry.sequence
    header = fasta.parse(description, flavor='uniprot')
    accession = header['id']

    if accession in to_remove:
        continue

    if accession in to_replace.keys():
        description, sequence = _patch_fasta_entry(header, to_replace[accession])

    patched.append((description, sequence))

where to_remove is a list of uniprot accessions and to_replace is a dict that maps uniprot accessions to replacement sequences. _patch_fasta_entry checks if it's a reverse sequence (header['db'].startswith('Reverse')) or not and returns either the replacement sequence or it's reverse using pyteomics.fasta.reverse. Then I take the list of (description, sequence) tuples and concatenate them (adding > to descriptions) into a patched fasta file.

For me this is workable, so I don't think you need to go forward with the TwoLayerIndexedFASTA subclass (unless you see value beyond my use case) and you can close this issue. Thanks again!

levitsky commented 2 years ago

It's great that you have solved your problem! Note that writing a sequence of tuples to a file is also implemented as fasta.write, so you don't necessarily have to do it yourself (adding > and concatenation).

Personally I still feel that it is generally easier to add decoys at the last step, so that it's guaranteed that they correspond to target sequences. However, clearly sometimes it's needed to work with target-decoy databases. I wonder how best to present this data so that it's easy to work with. Perhaps, yield triples of (description, sequence, decoy sequence)?

radusuciu commented 2 years ago

Yes, you're totally right about fasta.write, thanks for pointing it out.

My decoy situation is related to the fact that I have a set of databases that have been generated and used before there was an interest or need to programatically manipulate them. If I were to start from scratch or refactor this chunk of the application I probably would handle it more like how you've described.

I like the your triplet tuple suggestion, though there's probably a few edge cases there