Open kimrutherford opened 2 months ago
We should filter UniProt annotations if everything is the same for a specific gene/publication except the evidence code, because these will all be duplicates.
I have put this as high priority because once that is done I will share with UniProt (and can describe the extent of the overlap).
Just to check:
We should only filter modifications from UniProt if the gene, term ID and publication are the same. Have I got that right?
Should we also filter UniProt annotations where there is a PomBase modification and the UniProt modification doesn't have a publication/reference?
Sometimes there is a PomBase extension (like "present during cellular response to thiabendazole") but otherwise the annotation is identical from UniProt. In those cases the PomBase annotation is more specific so we can remove the UniProt annotation?
Note to self: the summary is that we can ignore extensions when looking for duplicates.
That's correct. If the evidence is different, but the paper and everything else is the same, we will filter it. These are cases where a different evidence code was selected for the same experiment, and I have queried this with UniProt (for example, they have used a manual code for an HTP experiment).
> Sometimes there is a PomBase extension (like "present during cellular response to thiabendazole") but otherwise the annotation is identical from UniProt. In those cases the PomBase annotation is more specific so we can remove the UniProt annotation?
Absolutely!
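The filtering rule agreed above can be sketched roughly as follows. This is only an illustration, not the load script itself: the `Modification` type, the function names and the choice to include the residue in the key are all assumptions made for the example. Evidence codes and extensions are left out of the key, as discussed. Treating a UniProt row without a reference as a duplicate when PomBase has the same gene/term/residue is also an assumption based on the question asked earlier in the thread.

```python
from typing import List, NamedTuple, Optional, Tuple


class Modification(NamedTuple):
    """Hypothetical record type; field names are assumptions for this sketch."""
    gene: str
    term_id: str
    residue: str
    evidence: str
    reference: Optional[str]          # e.g. "PMID:12135491", or None if missing
    extension: Optional[str] = None   # ignored when looking for duplicates


def duplicate_key(mod: Modification) -> Tuple[str, str, str, Optional[str]]:
    # Evidence code and extension are deliberately not part of the key.
    return (mod.gene, mod.term_id, mod.residue, mod.reference)


def filter_uniprot_duplicates(
    pombase: List[Modification], uniprot: List[Modification]
) -> Tuple[List[Modification], List[Modification]]:
    """Return (kept, removed) UniProt annotations; PomBase takes priority."""
    pombase_keys = {duplicate_key(m) for m in pombase}
    # Assumption: a UniProt row with no reference is matched on the site alone.
    pombase_sites = {(m.gene, m.term_id, m.residue) for m in pombase}

    kept: List[Modification] = []
    removed: List[Modification] = []
    for mod in uniprot:
        if mod.reference is None:
            is_duplicate = (mod.gene, mod.term_id, mod.residue) in pombase_sites
        else:
            is_duplicate = duplicate_key(mod) in pombase_keys
        (removed if is_duplicate else kept).append(mod)
    return kept, removed
```

Returning the removed annotations separately makes it easy to write them to a log for checking.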
OK, thanks. My first pass at the code finds only 618 duplicate modifications out of a total of 3983 from UniProt. That's fewer than I thought. I'm checking the results now.
I've found one case where a manual fix might be easiest. For rum1 https://www.pombase.org/gene/SPBC32F12.09 there's a PomBase annotation with the evidence code "Unknown"/UNK:
MOD:00046 O-phospho-L-serine modified residue S19 UNK Matsuoka K et al. (2002)
where UniProt provides this:
MOD:00046 O-phospho-L-serine modified residue S19 experimental evidence used in manual assertion Matsuoka K et al. (2002)
~Could we delete the PomBase annotation? It's in pombe-embl/supporting_files/legacy_modifications_from_contigs.tsv
:
SPBC32F12.09 rum1 MOD:00046 Unknown S13 PMID:12135491 4896 2009-02-13~
> My first pass at the code finds only 618 duplicate modifications from a total from UniProt of 3983.
593 of the duplicates are modifications from PMID:18257517. There is one duplicate from PMID:12135491, which is the one with Unknown evidence in the previous comment. And the remaining 45 duplicates don't have a PMID in the UniProt data.
I'm still checking to make sure that's all correct.
> And the remaining 45 duplicates don't have a PMID in the UniProt data.
I got that wrong. There are 24 that don't have a PMID.
It's weird that UniProt only have 593 from PMID:18257517. (we have 1010). I just checked and we only have one extension. I wonder why UniProt eliminated ~400. Maybe they used some threshold?
@Antonialock can you think of a reason why UniProt might only import a subset of modifications from a publication?
From the abstract: "In total, 2887 distinct phosphorylation sites were identified from 1194 proteins with an estimated false-discovery rate of <0.5% at the peptide level."
I don't know why our input file has only 1194 proteins when there were 2887 unique sites with a low false-positive rate. But I always thought this dataset was larger than 1194...
> It's weird that UniProt only have 593 from PMID:18257517.
UniProt have 1640 in total from PMID:18257517 and 593 are duplicates. That's very odd because it would mean PomBase and UniProt have about 1000 unique annotations each from PMID:18257517. I'll dig into that because that sounds like my code is nonsense. :-)
> UniProt have 1640 in total from PMID:18257517
It's 2233 not 1640. I should go to bed. :-)
> and 593 are duplicates
That bit is correct (I think).
I don't think the dataset can have been fully parsed for Chado ingest. There is a note on the session: "This session has a message to curators: protein phosphorylation done in bulk format only other thing that might be curatable some day is some phosphorylation motifs", but it does not mention any reason why the total dataset was not included.
Yes go to bed!
pombe-embl/supporting_files/legacy_modifications_from_contigs.tsv
:SPBC32F12.09 rum1 MOD:00046 Unknown S13 PMID:12135491 4896 2009-02-13
I added the step to remove duplicate modifications to the load script for last night. The removed UniProt annotations are in this log file: https://curation.pombase.org/dumps/builds/pombase-build-2024-10-12/logs/log.2024-10-12-04-39-42.modification-filter-duplicates
> @Antonialock can you think of a reason why UniProt might only import a subset of modifications from a publication?
No and I’m not sure who to ask - it was probably added by someone at SIB - we don’t usually do HTP. The uniprot helpdesk might be able to answer…
Thanks @Antonialock, once we figure out why we have different data I'll ask on the helpdesk.
I've had a look at the paper and the data table in the supplementary information. I can't work out how the UniProt annotations or the PomBase annotations were extracted from the data.
I wrote a script to process the supplementary information table based on what I could understand from the paper. That gives 1711 modification annotations for 941 genes.
The PomBase dataset has 1006 annotations for 557 genes.
UniProt has 3239 for 1099 genes.
Below is a Venn diagram of the number of genes with modifications from the three datasets. The diagram doesn't make things less confusing. :-)
Meanwhile the publication says:
In total, 2887 distinct phosphorylation sites were identified from 1194 proteins
This is bizarre!
I'm looking at the information with the supp data. It says
All phosphopeptides listed are the most likely peptides reported by SEQUEST. The phosphorylation sites, shown as a (#) and the site number are those determined most likely by the Ascore algorithm. An (*) on methionine denotes oxidation. The Ascore was run for all peptides, and the values can be read from left to right in the case of multiple phosphorylation sites. Sites with Ascore values <19 are considered ambiguous, while sites with Ascore values >19 are considered localized and are presented in green. “N/A” in the Ascore means that there is only one possible phosphorylation site in the amino acid sequence. After removing redundancy, the final data set contains 2489 unique phosphopeptides from 1194 phosphoproteins. An active link to all MS/MS spectra is given on each peptide and a link to the Ascore is available on that page.
So, possibly we should only take the ones with Ascore values >19, or where the Ascore is "N/A" (meaning that there is only one possible phosphorylation site in the amino acid sequence).
"After removing redundancy, the final data set contains 2489 unique phosphopeptides from 1194 phosphoproteins" probably includes all of the phosphosites, even the ones that could not be unambiguously located.
What's in the list of 33 that are found by us and UniProt, but are not in your script?
If we can figure out the differences we can decide which parts of the venn to include.
The PomBase one seems more conservative. Midori may have spoken with the author. Unfortunately due to the EBI we no longer have that archive.
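For working out which parts of the Venn to include, the gene-level comparison between the three datasets can be sketched with plain set operations. This is only a sketch: the function name and region labels are made up for the example, and the inputs are assumed to be collections of systematic gene IDs from each source. The "found by us and UniProt but not the script" group corresponds to the `pombase_and_uniprot_only` region.

```python
from typing import Dict, Iterable, Set


def venn_regions(
    script_genes: Iterable[str],
    pombase_genes: Iterable[str],
    uniprot_genes: Iterable[str],
) -> Dict[str, Set[str]]:
    """Split three gene collections into the seven regions of a 3-way Venn."""
    s, p, u = set(script_genes), set(pombase_genes), set(uniprot_genes)
    return {
        "script_only": s - p - u,
        "pombase_only": p - s - u,
        "uniprot_only": u - s - p,
        "script_and_pombase_only": (s & p) - u,
        "script_and_uniprot_only": (s & u) - p,
        "pombase_and_uniprot_only": (p & u) - s,
        "all_three": s & p & u,
    }
```

Printing `len()` of each region gives the numbers for the diagram, and the sets themselves give the gene lists to inspect.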
> So, possibly we should only take the ones with Ascore values >19, or where the Ascore is "N/A" (meaning that there is only one possible phosphorylation site in the amino acid sequence).
The data file has Ascore1, Ascore2 and Ascore3 columns to make it more challenging. :-)
My script looks at each Ascore separately. If any of the three Ascore values is > 19 that site is included in the output. If the Ascore columns are N/A the site is also included.
The numbers from the script don't match the numbers reported in the manuscript so I think I must have that wrong.
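For reference, the rule described above could look roughly like this. It is a sketch of one reading of the rule, not the actual script: the row is assumed to be a dict keyed by the `Ascore1`, `Ascore2` and `Ascore3` column names, blank cells are assumed to mean "no site in that slot", and a row is kept if any Ascore is above 19 or is "N/A". Given that the counts don't match the manuscript, the real cut-off logic may well need to differ.

```python
from typing import Mapping, Optional, Union

ASCORE_CUTOFF = 19.0  # sites with Ascore > 19 are "localized" per the supp. data note
ASCORE_COLUMNS = ("Ascore1", "Ascore2", "Ascore3")  # assumed column headers


def parse_ascore(raw: Optional[str]) -> Union[float, str, None]:
    """Return a float, the string "N/A", or None for a blank cell."""
    text = (raw or "").strip()
    if not text:
        return None
    if text.upper() == "N/A":
        return "N/A"
    return float(text)


def row_passes(row: Mapping[str, Optional[str]]) -> bool:
    """Keep a row if any Ascore is > 19 or is N/A (only one possible site)."""
    for column in ASCORE_COLUMNS:
        score = parse_ascore(row.get(column))
        if score is None:
            continue  # blank cell: nothing to check in this slot
        if score == "N/A" or score > ASCORE_CUTOFF:
            return True
    return False
```

With this reading, a row whose only value is Ascore1 = 0.01 is excluded.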
> What's in the list of 33 that are found by us and UniProt, but are not in your script?
> If we can figure out the differences we can decide which parts of the venn to include.
I looked at those 33 genes. These are them: https://www.pombase.org/results/from/id/6e05f643-cf3d-42eb-93d5-4cd620ccf7d7
Confusingly, 32 of them aren't in the spreadsheet from the publication at all even though we have data from PomBase and UniProt. I don't know what that means. :-(
The one gene from the 33 that is in the spreadsheet is SPAPB1A10.09 (mod: S537). It's excluded by my script because Ascore1 is 0.01. The S537 modification appears in three other datasets apart from PMID:18257517, so it seems correct?
I'm very confused.
I don't know if it helps but there is a second spreadsheet (EVIN) and most of the missing entries are in there.
Except these: https://www.pombase.org/results/from/id/531ff02d-cc63-490e-98ff-c14494b68cf4 and these seem to be special because they are mainly exact (or close) duplicates of entries that are in the other set...
i.e.
rpn502 = rpn501
rps1501 = rps502
rps1602 = rps1601
rps1802 = rps1801
rpl2401 = rpl2402
rpl401 = rpl402
rpl502 = rpl501
ubi4 = ubi1 = ubi2 = ubi3 (at least, the ubiquitin part will be identical so a mass spec would not be able to differentiate)
tif512 = tif511
In UniProt these might have all been mapped to a single protein entry at this point, and we would split them to both of our identifiers.
This leaves ssa1, spk1 and SPAC750.01 as 'magic-ed from nowhere'. I will dig further into these...
This is how SPAC750.01 aligns to SPAC977.14C so I am guessing these fragments would not map unambiguously
Let's discuss tomorrow.
Actions:
pombe-embl/external_data/modification_files/PMID_18257517_modifications.tsv
don't load annotations for PMID:18257517 from UniProt data file because we suspect they include modifications that are below the score cut-off from the paper (228 genes)
They'll be filtered in Thursday night's load.
add annotations from the script to pombe-embl/external_data/modification_files/PMID_18257517_modifications.tsv
I've generated a new version of that file. I've put it in an in_progress directory for now in SVN:
external_data/modification_files/in_progress/PMID_18257517_modifications.tsv
The modification positions are incorrect for some genes so I'll need to run Manu's code to fix them.
I'm back looking at this again. The easiest way to process the modifications with Manu's code is to include the new annotations in the nightly load and then Manu's pipeline will run automatically.
The annotations will be wrong in Chado and on the website for a day so I'll do that this weekend.
That's done for Friday night's load. I'll check things at the weekend.
Now we get modifications from UniProt there are exact duplicates. We should filter them like we filter GO, with the PomBase annotations taking priority.