pombase / website

PomBase website v2
MIT License

Advanced search: signal peptides #2115

Closed ValWood closed 5 days ago

ValWood commented 9 months ago

Add a search to retrieve all signal peptides. Not urgent.

kimrutherford commented 9 months ago

Which genes are signal peptides?

ValWood commented 9 months ago

Actually, I thought they came through the InterPro pipeline, but now that I look, they don't.

It seems that we must have run SignalP at some point, and then assigned SO terms to the proteins with signal peptides.

https://www.pombase.org/term/SO:0000418

Are there any signal peptides in the IP-scan file that we ignore? If so, that might be a better way to do it, as we could get the features on the protein display.

ValWood commented 9 months ago

At least here they used to https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1160203/

kimrutherford commented 9 months ago

I haven't been able to find anything in the XML yet.

I've noticed that some InterPro pombe pages have a feature for "phobius: SIGNAL_PEPTIDE": https://www.ebi.ac.uk/interpro/protein/UniProt/O13640/

I can't see anything about phobius in the XML.

ValWood commented 9 months ago

OK, do we get the TMM domains from InterPro? I can't remember.

kimrutherford commented 9 months ago

OK, do we get the TMM domains from InterPro? I can't remember.

We run TMHMM when we process the InterPro XML file to make a file with domains for the load. Unfortunately InterPro don't provide the TMMs.

ValWood commented 9 months ago

Regarding TMM and SignalP, this is already supported. However, the software/models required cannot be distributed with InterProScan because they contain licensed components. The InterProScan documentation includes instructions for activating these analyses: https://interproscan-docs.readthedocs.io/en/latest/ActivatingLicensedAnalyses.html

The InterProScan web service provides TMM/SignalP predictions, e.g. https://www.ebi.ac.uk/interpro/result/InterProScan/iprscan5-R20231127-103348-0178-16749718-p1m/

ValWood commented 9 months ago

I made this low priority, but if it is quick (i.e. largely running the pipeline and configuration) it can be re-prioritized.

kimrutherford commented 6 months ago

Do we just need a "Commonly used query" that returns these genes?: https://www.pombase.org/term/SO:0000418

ValWood commented 6 months ago

We could, but my only worry is that it isn't comprehensive. I must have added these ad hoc as I saw them referred to, or in a protein feature model.

kimrutherford commented 1 month ago

However, the software/models required cannot be distributed with InterProScan because they contain licensed components.

I hate that. Such an unnecessary pain for users. It tends to be older tools that do that. Tool authors seem to be more sensible these days.

The InterProScan web service provides TMM/SignalP predictions, e.g.

The search results page includes a protein feature diagram with the signal peptides marked, but there's no easy way to download that data.

I've had a look at installing and running SignalP-6.0. That doesn't look easy on oliver1 because the version of the operating system is very old. I can have a go though. The other problem is that we'd need to interpret the output. It gives a score rather than a yes/no answer for each protein.

# SignalP-6.0   Organism: Other Timestamp: 20240724114903
# ID    Prediction  OTHER   SP(Sec/SPI) LIPO(Sec/SPII)  TAT(Tat/SPI)    TATLIPO(Tat/SPII)   PILIN(Sec/SPIII)    CS Position
SPCC757.12.1 length_625 SP  0.000225    0.999116    0.000156    0.000169    0.000148    0.000141    CS pos: 22-23. Pr: 0.9807
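
A rough sketch of how that output could be turned into a yes/no call. It assumes the real file is tab-separated, that the ID column holds the "SPCC757.12.1 length_625" string as one field, and that a cutoff on the SP(Sec/SPI) probability is acceptable. The function name and the 0.9 cutoff are placeholders of mine, not anything SignalP recommends.

```python
import csv
import io

# Sample copied from the output above with the tabs restored (assumption:
# the columns in the real file are tab-separated).
SAMPLE = (
    "# SignalP-6.0\tOrganism: Other\tTimestamp: 20240724114903\n"
    "# ID\tPrediction\tOTHER\tSP(Sec/SPI)\tLIPO(Sec/SPII)\tTAT(Tat/SPI)\t"
    "TATLIPO(Tat/SPII)\tPILIN(Sec/SPIII)\tCS Position\n"
    "SPCC757.12.1 length_625\tSP\t0.000225\t0.999116\t0.000156\t0.000169\t"
    "0.000148\t0.000141\tCS pos: 22-23. Pr: 0.9807\n"
)


def signal_peptide_calls(handle, min_sp_prob=0.9):
    """Yield (protein_id, sp_probability, cs_position) for confident SP calls.

    min_sp_prob is a hypothetical cutoff, chosen only for illustration.
    """
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and the two header lines
        protein_id = row[0].split()[0]  # drop the "length_625" suffix
        prediction = row[1]
        sp_prob = float(row[3])  # the SP(Sec/SPI) column
        cs_position = row[8] if len(row) > 8 else ""
        if prediction == "SP" and sp_prob >= min_sp_prob:
            yield protein_id, sp_prob, cs_position


calls = list(signal_peptide_calls(io.StringIO(SAMPLE)))
```

With the single row above, `calls` contains one entry for SPCC757.12.1. Whether a fixed cutoff is good enough for annotation is a curation decision, not something the script can answer.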

image

I've also looked at installing and running Phobius, but the instructions are very minimal and I couldn't work it out from a quick attempt.

kimrutherford commented 1 month ago

It looks like we can get the signal peptide details from the UniProt API with a URL like:

https://rest.uniprot.org/uniprotkb/stream?compressed=true&fields=accession,id,ft_signal&format=tsv&query=((accession:O74922)+OR+(accession:O94565))

which returns very helpful TSV results like this:

Entry   Entry Name      Signal peptide
O74922  AMY1_SCHPO      SIGNAL 1..22; /evidence="ECO:0000255"
O94565  OMH4_SCHPO
O13770  YE98_SCHPO      SIGNAL 1..20; /evidence="ECO:0000255"
...

I just tried submitting 1000 accessions at once, which worked. More than that didn't work but that still means we only need 5-10 API calls.
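
For reference, a minimal sketch of how the batching could look. It only builds the request URLs and doesn't fetch anything. The helper names are made up for illustration, and the three accessions stand in for the full list. The endpoint and parameters are the ones from the URL above.

```python
from urllib.parse import urlencode

UNIPROT_STREAM = "https://rest.uniprot.org/uniprotkb/stream"
BATCH_SIZE = 1000  # the largest batch that worked in the test described above


def batched(items, size):
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def stream_url(accessions):
    """Build one stream URL that asks for the signal peptide column."""
    query = " OR ".join(f"(accession:{acc})" for acc in accessions)
    params = {
        "compressed": "true",
        "fields": "accession,id,ft_signal",
        "format": "tsv",
        "query": f"({query})",
    }
    # urlencode percent-encodes the brackets and colons and turns spaces into "+"
    return f"{UNIPROT_STREAM}?{urlencode(params)}"


# Placeholder input: the real run would use every pombe UniProt accession.
accessions = ["O74922", "O94565", "O13770"]
urls = [stream_url(batch) for batch in batched(accessions, BATCH_SIZE)]
```

Each URL can then be fetched with any HTTP client. With `compressed=true` the response should be gzip-compressed, so it would need decompressing before the TSV is read.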

If this method sounds OK, I can download all the data we need tomorrow for loading on Thursday night.

ValWood commented 1 month ago

Yes, go ahead. Do we have a reference to use for the method (algorithm)? The ECO is "match to sequence model evidence used in manual assertion".

ValWood commented 1 month ago

Maybe they manually review them. In UniProt it says "Manual assertion according to sequence analysis", so maybe we should create a specific reference to say exactly where they are from.

@Antonialock How do you call signal peptides?

ValWood commented 1 month ago

Also can we get cleavage sites @kimrutherford ?

Antonialock commented 1 month ago

We run a pipeline for prediction of various sequence features, this rule is for signal peptides: https://fisheye.sib.swiss/browse/~raw,r=HEAD/SIB/unirules/anarules/ANA00006.uru

Sequence features (metal or substrate binding sites, TM domains, signal peptides...) may be added to or modified based on experimental data as part of manual curation. It looks like only 75 entries have papers associated with the signal peptide feature (none for pombe): https://www.uniprot.org/uniprotkb?query=%28scope%3A%22signal+peptide%22%29

Antonialock commented 1 month ago

there should be 52 proteins with annotated propeptide sequences in pombe

https://www.uniprot.org/uniprotkb?query=%28ft_propep%3A*%29+AND+%28taxonomy_id%3A4896%29

kimrutherford commented 1 month ago

there should be 52 proteins with annotated propeptide sequences in pombe

Hi Antonia!

I got 214 proteins by using the Proteins with: Signal peptide filter in the left hand column?

https://www.pombase.org/results/from/id/cf1af35a-18c5-4d42-9024-16f3e0409433

kimrutherford commented 1 month ago

Also can we get cleavage sites @kimrutherford ?

Hi Val. I couldn't find cleavage sites on the results page. Is there a synonym for cleavage site that I could look for?

kimrutherford commented 1 month ago

I got 214 proteins by using the Proteins with: Signal peptide filter in the left hand column?

I was expecting a better overlap between the genes currently annotated with SO:0000418 and the list from UniProt. This makes me a bit suspicious:

https://www.pombase.org/results/from/id/c0918f7f-d9a2-40b9-9fb6-45f4c3305c4a
https://www.pombase.org/results/from/id/cf1af35a-18c5-4d42-9024-16f3e0409433
https://www.pombase.org/results/from/id/cb5b1538-adb6-443d-a42c-680b8e4c565d

kimrutherford commented 1 month ago

We have some genes annotated with "signal_anchor" which is_a signal_peptide. The list from UniProt doesn't include any of them.

ValWood commented 1 month ago

We have some genes annotated with "signal_anchor" which is_a signal_peptide. The list from UniProt doesn't include any of them.

I guess these are harder to locate if they don't include a cleavage site...

ValWood commented 1 month ago

I was expecting a better overlap between the genes currently annotated with SO:0000418 and the list from UniProt. This makes me a bit suspicious:

I think none of the methods are optimal. All of the current annotations from both look OK to me (or at least probable).

I expect most ER/Golgi/cell surface and most membrane transporters will have a signal peptide and we are nowhere near that. So there are few false positives but a lot of false negatives.

~The obvious true FP that I see is https://www.pombase.org/gene/SPAC3A11.03 (elongation factor 3) Where are these? I will delete it.~ deleted

Antonialock commented 1 month ago

there should be 52 proteins with annotated propeptide sequences in pombe

Hi Antonia!

I got 214 proteins by using the Proteins with: Signal peptide filter in the left hand column?

https://www.pombase.org/results/from/id/cf1af35a-18c5-4d42-9024-16f3e0409433

Yes propeptide is a different filter from signal peptide. I thought Val might have meant propeptide when she asked for cleavage sites. A propeptide is an extra bit of peptide that is cleaved off the protein as part of maturation.

kimrutherford commented 1 month ago

I think none of the methods are optimal. All of the current annotations from both look OK to me (or at least probable).

So do you think we should add the UniProt list to our existing signal peptide annotations? That would give 357 annotations.

kimrutherford commented 1 month ago

So do you think we should add the UniProt list to our existing signal peptide annotations? That would give 357 annotations.

The decision is to add the UniProt annotations and remove any existing annotations that are covered by UniProt.
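
In set terms the rule is roughly the following. This is only a sketch: the function name is made up and the gene IDs are placeholders, not real data.

```python
def merge_signal_peptide_annotations(manual_genes, uniprot_genes):
    """Keep every UniProt annotation, plus manual ones that UniProt lacks."""
    uniprot = set(uniprot_genes)
    manual_only = set(manual_genes) - uniprot  # drop manual ones covered by UniProt
    return {
        "uniprot": sorted(uniprot),
        "manual": sorted(manual_only),
    }


# Hypothetical placeholder IDs, for illustration only.
manual = {"geneA", "geneB", "geneC"}
uniprot = {"geneB", "geneC", "geneD"}

merged = merge_signal_peptide_annotations(manual, uniprot)
total = len(merged["uniprot"]) + len(merged["manual"])
```

Here geneB and geneC keep only their UniProt-sourced annotation, geneA keeps its manual one, and `total` is the size of the union.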

kimrutherford commented 1 month ago

The decision is to add the UniProt annotations and remove any existing annotations that are covered by UniProt.

That's done now and checked in, but not in time for the load. I'll check on Thursday morning. I have a test load on my desktop: https://desktop.kmr.nz/term/SO:0000418

I've moved the existing manual annotations out of the contig files and into pombe-embl/supporting_files/manual_so_term_annotations.tsv.

I've left the signal_anchor annotations in the contig files for now: https://desktop.kmr.nz/term/SO:0001809

kimrutherford commented 1 month ago

Next step: the UniProt data file now gets processed so we can show the signal peptides in the feature viewer:

image

All the code and script changes are in place now to process the other columns with coordinates from the data file if they are useful:

I'll check the load on Thursday morning to make sure the signal peptides are displayed correctly.

ValWood commented 1 month ago

We should do these too:

Transit peptide (I don't know how a transit peptide differs from a signal peptide!)
Binding site
Active site
Modified residue (yes, but we will need to map to PRO and filter redundancy for experimental ones)

ValWood commented 1 month ago

I never knew this!

Transit peptides and signal peptides are both short amino acid sequences that direct the transport of proteins to specific locations within a cell, but they have different roles and target different cellular destinations:

Transit Peptides:

Function: Transit peptides direct proteins to organelles within the cell, such as mitochondria or chloroplasts.
Location of Target: These peptides typically target intracellular organelles.
Example: A protein destined for the mitochondria will have a mitochondrial transit peptide that directs it to the mitochondrion. Similarly, a protein bound for the chloroplast will have a chloroplast transit peptide.
Cleavage: After the protein reaches its destination (e.g., mitochondria or chloroplast), the transit peptide is usually cleaved off by specific peptidases.

Signal Peptides:

Function: Signal peptides direct the nascent protein to the secretory pathway, which includes the endoplasmic reticulum (ER) and, eventually, the extracellular space or plasma membrane.
Location of Target: These peptides target the ER for proteins that are secreted from the cell, inserted into the plasma membrane, or directed to lysosomes.
Example: A protein destined for secretion outside the cell will have an ER signal peptide that directs it to the ER.
Cleavage: The signal peptide is typically cleaved off once the protein enters the ER lumen by signal peptidase enzymes.

Summary of Key Differences:

Target Destination:

Transit Peptides: Direct proteins to mitochondria or chloroplasts.
Signal Peptides: Direct proteins to the endoplasmic reticulum and the secretory pathway.

Function:

Transit Peptides: Ensure proteins are correctly localized within specific organelles.
Signal Peptides: Ensure proteins are processed through the secretory pathway and directed either to the cell membrane, outside the cell, or to lysosomes.

Cleavage:

Both types of peptides are typically cleaved off once the protein reaches its destination.

kimrutherford commented 1 month ago

It sounds like it makes sense to add SO:0000725 annotations for the transit peptides, as well as showing them in the feature viewer?

There are 267 genes with transit peptides in the UniProt data, compared to 214 signal peptides.

kimrutherford commented 1 month ago

The signal peptide annotation is updated: https://www.pombase.org/term/SO:0000418 (357) but I forgot to check in the config change for the protein viewer. Whoops. So the signal peptides don't appear there yet. They will tomorrow.

I'll add the transit peptides today since most of the work is done.

ValWood commented 1 month ago

Yes, this will pick up a lot of the mitochondrial ones hopefully!

kimrutherford commented 1 month ago

There are 267 genes with transit peptides in the UniProt data

I've only just noticed that there are quite a few genes (78 of 267) where the transit peptide is annotated but the location isn't fully specified (so 1..? instead of something like 1..26). In those cases we'll be able to add a SO:0000725 annotation but the transit peptide won't appear in the protein feature viewer.
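
A small sketch of how the two cases could be told apart when reading the UniProt TSV. It assumes the transit peptide cells have the same layout as the SIGNAL cells shown earlier in this thread (for example `TRANSIT 1..?; /evidence=...`). The function names are hypothetical.

```python
import re

# Matches the start of a feature cell such as 'SIGNAL 1..22; /evidence="..."'.
FEATURE_RE = re.compile(r"^(SIGNAL|TRANSIT)\s+([^.\s;]+)\.\.([^.\s;]+)")


def _coord(text):
    """Return the coordinate as an int, or None when it is not a plain number."""
    return int(text) if text.isdigit() else None


def parse_feature(cell):
    """Return (kind, start, end) for a feature cell, or None for an empty cell."""
    match = FEATURE_RE.match(cell.strip())
    if match is None:
        return None
    kind, start, end = match.groups()
    return kind, _coord(start), _coord(end)


def has_full_coordinates(feature):
    """True only when both ends are known, so the feature can be drawn."""
    return feature is not None and None not in feature[1:]


signal = parse_feature('SIGNAL 1..22; /evidence="ECO:0000255"')
transit = parse_feature('TRANSIT 1..?; /evidence="ECO:0000255"')
```

Both features would still get a SO annotation, but only `signal` passes `has_full_coordinates`, so only that one would go to the feature viewer.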

ValWood commented 1 month ago

Yep that's OK, we will be able to see which ones have no coordinates. Hopefully the coordinates will eventually get picked up by another method and we can suppress these. It's useful to know that there is one.

kimrutherford commented 1 month ago

The transit peptides are now added as annotations in pombe-embl/supporting_files/manual_so_term_annotations.tsv and added to the protein feature viewer for tomorrow.

https://desktop.kmr.nz/gene_protein_features/SPAC12G12.04

image

ValWood commented 1 month ago

Fab! Will they also display in the protein domains and properties section of the gene pages, like the signal peptides?

kimrutherford commented 1 month ago

Will they also display in the protein domains and properties section of the gene pages, like the signal peptides?

Yep! Here's an example from my desktop:

https://desktop.kmr.nz/gene/SPAC12G12.04

image

I need to add some configuration for the new ECO evidence codes from UniProt.

PCarme commented 1 month ago

Out of curiosity, I ran the SignalP v.6 software on all the proteins annotated to ER, Golgi and plasma membrane on PomBase. The majority of the positives I got are already annotated on PomBase, but I still got a few predictions for proteins that are not yet covered on PomBase. Here is the list of these genes: New_signal_peptides_SignalP.txt. And here are the complete predictions from the software: prediction_results-2.txt

ValWood commented 1 month ago

Excellent, we should add those. Could you run it on the remainder to see if anything gets picked up that was excluded from the query (i.e. unknowns)?

PCarme commented 1 month ago

I can, the software actually runs surprisingly quickly !

PCarme commented 1 month ago

I got even fewer hits that are not already annotated on PomBase with this list of genes. A couple of them are on unknowns, which might be useful. Here is the list: New_signal_peptides_SignalP6-2.txt. And the full prediction file: prediction_results-3.txt

kimrutherford commented 1 month ago

Thanks Pascal. That's great.

I'll add the new predictions on Monday.

kimrutherford commented 1 month ago

I'll add the new predictions on Monday.

I've done that. They'll be on pombase.org on Tuesday.

I added them here: pombe-embl/supporting_files/manual_so_term_annotations.tsv

PCarme commented 1 month ago

Thanks Kim !

kimrutherford commented 1 month ago

UniProt docs on signal peptides: https://www.uniprot.org/help/signal

kimrutherford commented 5 days ago

Can we close this?