Closed radusuciu closed 2 years ago
The indexing mechanism assumes that IDs are unique, there is no simple way around that. However, classes such as IndexedUniProt
(which is what you are using) and all other subclasses of TwoLayerIndexedFASTA
actually keep a mapping between extracted IDs and full descriptions, where the latter are the actual keys used for byte offset indexing. That means that both the target and the decoy entries are accessible by their full description. So, in case of reading a database with decoys, we can take advantage of the fact that full descriptions are identical except for the decoy prefix, and quite easily retrieve both decoy and target entries given the accession. That would require writing a subclass of TwoLayerIndexedFASTA
, which I can try tomorrow if you are interested.
A completely alternative suggestion would be to drop the OpenMS utility from the workflow, and start working with the target database, then add or skip some entries, and only then generate the decoys for the final entry set using Pyteomics. There are plenty decoy-related functions in the fasta
module, and if you just generate decoys as needed, you won't have the problem of repeated IDs in the file.
If you elaborate on your algorithm a bit, hopefully I can expand on either or both of my suggestions.
Thanks for the quick response!
Since posting this issue I've "worked around" this by skipping the indexing and just iterating through the entries, checking the accession against a list of accessions to remove, and a separate list of accessions to "patch" (ie. substitute the sequence). Kinda like this (code incomplete and abbreviated in places):
patched = []
for entry in fasta.read(str(fasta_path)):
description = entry.description
sequence = entry.sequence
header = fasta.parse(description, flavor='uniprot')
accession = header['id']
if accession in to_remove:
continue
if accession in to_replace.keys():
description, sequence = _patch_fasta_entry(header, to_replace[accession])
patched.append((description, sequence))
where to_remove
is a list of uniprot accessions and to_replace
is a dict
that maps uniprot accessions to replacement sequences. _patch_fasta_entry
checks if it's a reverse sequence (header['db'].startswith('Reverse')
) or not and returns either the replacement sequence or it's reverse using pyteomics.fasta.reverse
. Then I take the list of (description, sequence)
tuples and concatenate them (adding >
to descriptions) into a patched fasta file.
For me this is workable, so I don't think you need to go forward with the TwoLayerIndexedFASTA
subclass (unless you see value beyond my use case) and you can close this issue. Thanks again!
It's great that you have solved your problem!
Note that writing a sequence of tuples to a file is also implemented as fasta.write
, so you don't necessarily have to do it yourself (adding >
and concatenation).
Personally I still feel that it is generally easier to add decoys at the last step, so that it's guaranteed that they correspond to target sequences. However, clearly sometimes it's needed to work with target-decoy databases. I wonder how best to present this data so that it's easy to work with. Perhaps, yield triples of (description, sequence, decoy sequence)?
Yes, you're totally right about fasta.write
, thanks for pointing it out.
My decoy situation is related to the fact that I have a set of databases that have been generated and used before there was an interest or need to programatically manipulate them. If I were to start from scratch or refactor this chunk of the application I probably would handle it more like how you've described.
I like the your triplet tuple suggestion, though there's probably a few edge cases there
I work with fasta databases contain reversed decoy sequences in this format:
These are generated by downloading an appropriate file from uniprot, then decoys are added using OpenMS'
DecoyDatabase
utility. I was hoping to usepyteomics
to manipulate these database (eg. adding/removing sequences), but when I attempt to get an entry using a uniprot accession, I can only see the latest entry:I understand that this is due to the fact that both the normal and reversed sequences are in the same file and are tagged with the same uniprot accession, but I was hoping that the
get_by_id
call would return a list of entries matching my query instead of just the last one.