lazear / sage

Proteomics search & quantification so fast that it feels like magic
https://sage-docs.vercel.app
MIT License

entrapment FASTA #121

Closed KlemensFroehlich closed 4 months ago

KlemensFroehlich commented 4 months ago

hi Michael,

Thanks for your effort and this great tool!

I would love to use your FASTA that you used in the SAGE publication for entrapment searches:

Entrapment Searches PXD001468: To construct an entrapment database, the Pyrococcus furiosus reference proteome was downloaded from UniProt, and each protein sequence was randomly shuffled 100 times. Human and contaminant sequences were then added to the database as target peptides.

I cannot find it on github. Would you be willing to provide the FASTA for me?

Sorry if I overlooked it.

By the way, I have been following the tims discussion, and I think you just merged some branches that add ddaPASEF support?

Is there any way to search DIA PASEF with SAGE now?

There is an utter lack of tools that can do directDIA analyses on diaPASEF data at a reasonable speed, especially with large search spaces. (If I can do anything to support this, I would gladly do so!)

Once a library has been constructed, one could probably go back to DIA-NN, which is fairly fast when a refined library is used.

Best Klemens

lazear commented 4 months ago

Hi Klemens,

Here is a link to the fasta file used for the entrapment search: https://www.dropbox.com/scl/fi/hc20ncqd0qa5lwkuof8zm/pyro_entrap.fasta.gz?rlkey=t5gua5r4cobbp4r0fmuuefuvj&dl=0 I considered any protein group without a string matching "Cont_" (contaminants) or "_HUMAN" to be an entrapment match.
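
The rule above can be written as a small helper. This is only a sketch: it assumes the protein group is available as a plain string (for example, a column in the search results), and the example identifiers below are made up for illustration rather than taken from the actual database.

```python
# Markers named in the comment above: contaminants and human targets.
TARGET_MARKERS = ("Cont_", "_HUMAN")


def is_entrapment(protein_group: str) -> bool:
    """Return True when the group string contains neither target marker."""
    return not any(marker in protein_group for marker in TARGET_MARKERS)


# Hypothetical protein-group strings, for illustration only.
example_groups = [
    "sp|X00001|ABC_HUMAN",
    "Cont_X00002",
    "shuffle_3_tr|X00003|X00003_PYRFU",
    "sp|X00001|ABC_HUMAN;shuffle_7_tr|X00003|X00003_PYRFU",
]
entrapment_flags = [is_entrapment(group) for group in example_groups]
```

Because the check runs on the whole group string, a group that mixes a human protein with a shuffled one is not counted as entrapment, which matches the "without a string matching" wording.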

Here is a python script for generating a scrambled entrapment database which can then be concatenated with your fasta of interest.

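# Requires Biopython (pip install biopython)
from Bio import SeqIO
from random import shuffle

with open("UP000001013_186497_scram.fasta", "w") as f:
    for record in SeqIO.parse("UP000001013_186497.fasta", format="fasta"):
        # Keep the original, unshuffled protein in the output
        SeqIO.write(record, f, format='fasta')

        # Write 100 shuffled copies of the protein; seq is shuffled in place,
        # so every copy is a new permutation of the same residues
        seq = list(str(record.seq))
        for i in range(100):
            shuffle(seq)
            f.write(f">shuffle_{i}_{record.id}\n")
            f.write(''.join(seq) + '\n')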
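
For the concatenation step, something along these lines should work. It is a minimal sketch using only the standard library; the function name `concatenate_fastas` and any file names passed to it are placeholders, not part of Sage or of the script above.

```python
from pathlib import Path
from typing import Iterable, Union

PathLike = Union[str, Path]


def concatenate_fastas(inputs: Iterable[PathLike], output: PathLike) -> int:
    """Copy every input FASTA into one output file and return the header count."""
    n_records = 0
    with open(output, "w") as out:
        for path in inputs:
            with open(path) as handle:
                for line in handle:
                    line = line.rstrip("\r\n")
                    if not line:
                        continue  # drop blank lines between records
                    if line.startswith(">"):
                        n_records += 1
                    # Always terminate lines, so a file without a trailing
                    # newline cannot run into the next file's first header
                    out.write(line + "\n")
    return n_records
```

Calling it with your FASTA of interest first and the scrambled file second, plus an output path of your choice, gives a single database to point the search at. Going line by line instead of doing a raw byte copy avoids fusing the last sequence of one file with the first header of the next.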

Unfortunately, at the moment only ddaPASEF is supported for reading from ".d" folders. DIA data from Thermo, Agilent, or any other vendor whose files can be converted to mzML should work for searching, though. I think diaPASEF might be quite complicated to support in Sage; no promises, but it may happen at some point.

KlemensFroehlich commented 4 months ago

Hi Michael,

Thank you very much for providing the FASTA! I am currently testing different entrapment FASTAs for different applications, and I think this will be a very nice addition!

Regarding diaPASEF: would it be "easily" possible to at least support library generation with SAGE? Currently, I think directDIA for diaPASEF is only possible with proprietary software.

If we could generate a library directly from diaPASEF data and then use, e.g., DIA-NN for the main search, that would already be awesome!

Best Klemens

lazear commented 4 months ago

I'm going to close this issue, since we have a couple of others open re: diaPASEF!