Filter duplicate modifications from UniProt

kimrutherford commented 2 months ago

Now that we get modifications from UniProt, there are exact duplicates. We should filter them like we filter GO, with the PomBase annotations taking priority.

ValWood commented 1 month ago

We should filter UniProt annotations if everything is the same for a specific gene/publication except the evidence code, because these will all be duplicates.

ValWood commented 1 month ago

I have put this as high priority because once that is done I will share with UniProt (and can describe the extent of the overlap).

kimrutherford commented 1 month ago

Just to check:

We should only filter modifications from UniProt if the gene, term ID and publication are the same. Have I got that right?

Should we also filter UniProt annotations where there is a PomBase modification and the UniProt modification doesn't have a publication/reference?

kimrutherford commented 1 month ago

Sometimes there is a PomBase extension (like "present during cellular response to thiabendazole") but otherwise the annotation is identical from UniProt. In those cases the PomBase annotation is more specific so we can remove the UniProt annotation?

Note to self: the summary is that we can ignore extensions when looking for duplicates.

ValWood commented 1 month ago

That's correct. If the evidence is different, but the paper and everything else is the same, we will filter it. These are cases where a different evidence code was selected for the same experiment, and I have queried this with UniProt (for example, they have used a manual code for an HTP experiment).

ValWood commented 1 month ago

Sometimes there is a PomBase extension (like "present during cellular response to thiabendazole") but otherwise the annotation is identical from UniProt. In those cases the PomBase annotation is more specific so we can remove the UniProt annotation?

Absolutely!
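A minimal sketch of the rule agreed above, not the actual load-script code. It assumes a hypothetical `Modification` record and treats a UniProt modification as a duplicate when gene, term ID, modified residue and publication match a PomBase annotation, ignoring evidence code and extension. Including the residue in the key is an assumption (the thread names only gene, term ID and publication explicitly), and so is treating a UniProt modification with no publication as a duplicate of any matching PomBase modification.

```python
from typing import NamedTuple, Optional


class Modification(NamedTuple):
    # Hypothetical record type, for illustration only.
    gene: str
    term_id: str
    residue: str
    reference: Optional[str]  # e.g. "PMID:12135491"; None if no publication
    evidence: str
    extension: str = ""


def filter_uniprot_duplicates(pombase_mods, uniprot_mods):
    """Split UniProt modifications into (kept, removed).

    Evidence code and extension are deliberately left out of the
    comparison, so PomBase annotations take priority whenever the
    remaining fields match.
    """
    keys_with_ref = {
        (m.gene, m.term_id, m.residue, m.reference) for m in pombase_mods
    }
    keys_any_ref = {(m.gene, m.term_id, m.residue) for m in pombase_mods}

    kept, removed = [], []
    for m in uniprot_mods:
        if m.reference is None:
            # Assumption: no UniProt publication, so match on the rest.
            is_duplicate = (m.gene, m.term_id, m.residue) in keys_any_ref
        else:
            key = (m.gene, m.term_id, m.residue, m.reference)
            is_duplicate = key in keys_with_ref
        (removed if is_duplicate else kept).append(m)
    return kept, removed
```

Keeping the removed records in a separate list makes it easy to write them to a log so the extent of the overlap can be reported.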

kimrutherford commented 1 month ago

OK, thanks. My first pass at the code finds only 618 duplicate modifications from a total from UniProt of 3983. That's fewer than I thought. I'm checking the results now.

I've found one case where a manual fix might be easiest. For rum1 https://www.pombase.org/gene/SPBC32F12.09 there's a PomBase annotation with the evidence code "Unknown"/UNK:

MOD:00046   O-phospho-L-serine   modified residue S19   UNK  Matsuoka K et al. (2002)

where UniProt provides this:

MOD:00046   O-phospho-L-serine   modified residue S19     experimental evidence used in manual assertion    Matsuoka K et al. (2002)

~Could we delete the PomBase annotation? It's in pombe-embl/supporting_files/legacy_modifications_from_contigs.tsv: SPBC32F12.09 rum1 MOD:00046 Unknown S13 PMID:12135491 4896 2009-02-13~

kimrutherford commented 1 month ago

My first pass at the code finds only 618 duplicate modifications from a total from UniProt of 3983.

593 of the duplicates are modifications from PMID:18257517. There is one duplicate from PMID:12135491, which is the one with Unknown evidence in the previous comment. And the remaining 45 duplicates don't have a PMID in the UniProt data.

I'm still checking to make sure that's all correct.

kimrutherford commented 1 month ago

And the remaining 45 duplicates don't have a PMID in the UniProt data.

I got that wrong. There are 24 that don't have a PMID.
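As a quick consistency check on the corrected figure, the three groups of duplicates mentioned in this thread can be summed and compared with the reported total. The variable names are illustrative only.

```python
# Figures quoted in this thread.
total_uniprot_modifications = 3983
total_duplicates = 618

from_pmid_18257517 = 593
from_pmid_12135491 = 1
without_pmid = 24  # corrected figure (the earlier 45 does not add up)

accounted_for = from_pmid_18257517 + from_pmid_12135491 + without_pmid
assert accounted_for == total_duplicates

# Share of UniProt modifications that are duplicates, as a percentage.
duplicate_percentage = 100 * total_duplicates / total_uniprot_modifications
print(f"{duplicate_percentage:.1f}% of UniProt modifications are duplicates")
```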

ValWood commented 1 month ago

It's weird that UniProt only have 593 from PMID:18257517 (we have 1010). I just checked and we only have one extension. I wonder why UniProt eliminated ~400. Maybe they used some threshold?

@Antonialock can you think of a reason why UniProt might only import a subset of modifications from a publication?

ValWood commented 1 month ago

From the abstract: "In total, 2887 distinct phosphorylation sites were identified from 1194 proteins with an estimated false-discovery rate of <0.5% at the peptide level."

I don't know why our input file has only 1194 proteins when there were 2887 unique sites with a low FP rate. But I always thought this dataset was larger than 1194...

kimrutherford commented 1 month ago

It's weird that UniProt only have 593 from PMID:18257517.

UniProt have 1640 in total from PMID:18257517 and 593 are duplicates. That's very odd because it would mean PomBase and UniProt have about 1000 unique annotations each from PMID:18257517. I'll dig into that because that sounds like my code is nonsense. :-)

kimrutherford commented 1 month ago

UniProt have 1640 in total from PMID:18257517

It's 2233 not 1640. I should go to bed. :-)

and 593 are duplicates

That bit is correct (I think).

ValWood commented 1 month ago

I think the dataset can't have been fully parsed for Chado ingest. There is a note on the session, "This session has a message to curators: protein phosphorylation done in bulk format only other thing that might be curatable some day is some phosphorylation motifs", but it does not mention any reason why the total dataset was not included.

ValWood commented 1 month ago

Yes go to bed!

ValWood commented 1 month ago

[x] Could we delete the PomBase annotation? It's in pombe-embl/supporting_files/legacy_modifications_from_contigs.tsv:

SPBC32F12.09    rum1    MOD:00046       Unknown S13             PMID:12135491   4896    2009-02-13

kimrutherford commented 1 month ago

I added the step to remove duplicate modifications to the load script for last night. The removed UniProt annotations are in this log file: https://curation.pombase.org/dumps/builds/pombase-build-2024-10-12/logs/log.2024-10-12-04-39-42.modification-filter-duplicates

Antonialock commented 1 month ago

@Antonialock can you think of a reason why UniProt might only import a subset of modifications from a publication?

No, and I’m not sure who to ask. It was probably added by someone at SIB, as we don’t usually do HTP. The UniProt helpdesk might be able to answer…

ValWood commented 1 month ago

Thanks @Antonialock, once we figure out why we have different data I'll ask on the helpdesk.

kimrutherford commented 1 month ago

I've had a look at the paper and the data table in the supplementary information. I can't work out how the UniProt annotations or the PomBase annotations were extracted from the data.

kimrutherford commented 1 month ago

I've had a look at the paper and the data table in the supplementary information. I can't work out how the UniProt annotations or the PomBase annotations were extracted from the data.

I wrote a script to process the supplementary information table based on what I could understand from the paper. That gives 1711 modification annotations for 941 genes.

The PomBase dataset has 1006 annotations for 557 genes.

UniProt has 3239 for 1099 genes.

Below is a Venn diagram of the number of genes with modifications from the three datasets. The diagram doesn't make things less confusing. :-)

Meanwhile the publication says:

In total, 2887 distinct phosphorylation sites were identified from 1194 proteins

ValWood commented 1 month ago

This is bizarre!

ValWood commented 1 month ago

I'm looking at the information with the supp data. It says

All phosphopeptides listed are the most likely peptides reported by SEQUEST. The phosphorylation sites, shown as a (#) and the site number are those determined most likely by the Ascore algorithm. An (*) on methionine denotes oxidation. The Ascore was run for all peptides, and the values can be read from left to right in the case of multiple phosphorylation sites. Sites with Ascore values <19 are considered ambiguous, while sites with Ascore values >19 are considered localized and are presented in green. “N/A” in the Ascore means that there is only one possible phosphorylation site in the amino acid sequence. After removing redundancy, the final data set contains 2489 unique phosphopeptides from 1194 phosphoproteins. An active link to all MS/MS spectra is given on each peptide and a link to the Ascore is available on that page.

So possibly we should only take the ones with Ascore values >19, or with “N/A” in the Ascore, which means that there is only one possible phosphorylation site in the amino acid sequence.

“After removing redundancy, the final data set contains 2489 unique phosphopeptides from 1194 phosphoproteins” probably includes all of the phosphosites, even the ones that could not be unambiguously located.

ValWood commented 1 month ago

What's in the list of 33 that are found by us and UniProt, but are not in your script?

If we can figure out the differences we can decide which parts of the venn to include.
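The regions of a three-way gene comparison like this can be computed with plain set operations. This is a sketch with hypothetical inputs: each argument is assumed to be the set of gene identifiers with at least one modification in that dataset. The "pombase+uniprot" region corresponds to genes found by PomBase and UniProt but not by the script.

```python
def venn_regions(script_genes, pombase_genes, uniprot_genes):
    """Return the seven regions of a three-set Venn diagram.

    Each argument is a set of gene identifiers; each value in the
    result is the set of genes falling in exactly that region.
    """
    s, p, u = script_genes, pombase_genes, uniprot_genes
    return {
        "script only": s - p - u,
        "pombase only": p - s - u,
        "uniprot only": u - s - p,
        "script+pombase": (s & p) - u,
        "script+uniprot": (s & u) - p,
        "pombase+uniprot": (p & u) - s,
        "all three": s & p & u,
    }


def region_sizes(regions):
    # Counts per region, i.e. the numbers shown on the diagram.
    return {name: len(genes) for name, genes in regions.items()}
```

Listing the members of each region, rather than only the counts, is what allows the individual genes to be checked against the supplementary tables.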

ValWood commented 1 month ago

The PomBase one seems more conservative. Midori may have spoken with the author. Unfortunately due to the EBI we no longer have that archive.

kimrutherford commented 1 month ago

So possibly we should only take the ones with Ascore values >19, or with “N/A” in the Ascore, which means that there is only one possible phosphorylation site in the amino acid sequence.

The data file has Ascore1, Ascore2 and Ascore3 columns to make it more challenging. :-)

My script looks at each Ascore separately. If any of the three Ascore values is > 19 that site is included in the output. If the Ascore columns are N/A the site is also included.

The numbers from the script don't match the numbers reported in the manuscript so I think I must have that wrong.
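For reference, the per-site rule described above can be sketched as follows. This is an illustration, not the script itself: it assumes the sites of a peptide and its Ascore1/Ascore2/Ascore3 cells have already been parsed into two lists aligned left to right, and that unused Ascore cells are empty.

```python
ASCORE_CUTOFF = 19.0  # sites with Ascore > 19 are considered localized


def localized_sites(sites, ascores):
    """Return the sites that pass the Ascore rule.

    sites:   residue labels for one peptide, e.g. ["S537"]
    ascores: raw cell values from the Ascore columns, in the same
             left-to-right order as the sites
    """
    kept = []
    for site, raw in zip(sites, ascores):
        value = raw.strip()
        if not value:
            # Assumption: an empty cell means no score for this site.
            continue
        if value.upper() == "N/A":
            # Only one possible phosphorylation site in the sequence.
            kept.append(site)
        elif float(value) > ASCORE_CUTOFF:
            kept.append(site)
    return kept
```

With this rule a site whose score is exactly 19 is excluded, since the supplementary text only defines values below and above 19.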

kimrutherford commented 1 month ago

What's in the list of 33 that are found by us and UniProt, but are not in your script?

If we can figure out the differences we can decide which parts of the venn to include.

I looked at those 33 genes. These are them: https://www.pombase.org/results/from/id/6e05f643-cf3d-42eb-93d5-4cd620ccf7d7

Confusingly, 32 of them aren't in the spreadsheet from the publication at all even though we have data from PomBase and UniProt. I don't know what that means. :-(

The one gene from the 33 that is in the spreadsheet is SPAPB1A10.09 (mod: S537). It's excluded by my script because Ascore1 is 0.01. The S537 modification appears in three other datasets apart from PMID:18257517, so it seems correct?

I'm very confused.

ValWood commented 1 month ago

I don't know if it helps but there is a second spreadsheet (EVIN) and most of the missing entries are in there.

Except these, https://www.pombase.org/results/from/id/531ff02d-cc63-490e-98ff-c14494b68cf4 and these seem to be special because they are mainly exact (or close) duplicates of entries that are in the other set...

i.e.

rpn502 = rpn501
rps1501 = rps502
rps1602 = rps1601
rps1802 = rps1801
rpl2401 = rpl2402
rpl401 = rpl402
rpl502 = rpl501
ubi4 = ubi1 = ubi2 = ubi3 (at least, the ubiquitin part will be identical so a mass spec would not be able to differentiate)
tif512 = tif511

In UniProt these might all have been mapped to a single protein entry at this point, and we would split them to both of our identifiers.

This leaves ssa1, spk1 and SPAC750.01 as 'magic-ed from nowhere'. I will dig further into these...

ValWood commented 1 month ago

This is how SPAC750.01 aligns to SPAC977.14C so I am guessing these fragments would not map unambiguously

ValWood commented 1 month ago

Let's discuss tomorrow.

kimrutherford commented 1 month ago

Actions:

[ ] add annotations from the script to pombe-embl/external_data/modification_files/PMID_18257517_modifications.tsv
[x] don't load annotations for PMID:18257517 from UniProt data file because we suspect they include modifications that are below the score cut-off from the paper (228 genes)
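The second action amounts to a reference-based exclusion when reading the UniProt data file. A minimal sketch, assuming each UniProt row has been parsed into a dict with a hypothetical `reference` key; the real load script may represent rows differently.

```python
# References whose UniProt modifications should not be loaded.
SKIPPED_UNIPROT_REFERENCES = {"PMID:18257517"}


def drop_skipped_references(uniprot_rows):
    """Return the rows whose reference is not in the skip list.

    Rows with no reference at all are kept.
    """
    return [
        row
        for row in uniprot_rows
        if row.get("reference") not in SKIPPED_UNIPROT_REFERENCES
    ]
```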

kimrutherford commented 3 weeks ago

don't load annotations for PMID:18257517 from UniProt data file because we suspect they include modifications that are below the score cut-off from the paper (228 genes)

They'll be filtered in Thursday night's load.

kimrutherford commented 3 weeks ago

add annotations from the script to pombe-embl/external_data/modification_files/PMID_18257517_modifications.tsv

I've generated a new version of that file. I've put it in an in_progress directory for now in SVN:

external_data/modification_files/in_progress/PMID_18257517_modifications.tsv

The modification positions are incorrect for some genes so I'll need to run Manu's code to fix them.

kimrutherford commented 1 week ago

I've generated a new version of that file. I've put it in an in_progress directory for now in SVN:

external_data/modification_files/in_progress/PMID_18257517_modifications.tsv

The modification positions are incorrect for some genes so I'll need to run Manu's code to fix them.

I'm back looking at this again. The easiest way to process the modifications with Manu's code is to include the new annotations in the nightly load and then Manu's pipeline will run automatically.

The annotations will be wrong in Chado and on the website for a day so I'll do that this weekend.

kimrutherford commented 4 days ago

The easiest way to process the modifications with Manu's code is to include the new annotations in the nightly load and then Manu's pipeline will run automatically.

The annotations will be wrong in Chado and on the website for a day so I'll do that this weekend.

That's done for Friday night's load. I'll check things at the weekend.

pombase / pombase-chado

Filter duplicate modifications from UniProt #1223