Do not load UniProt "Component"

ValWood commented 2 years ago

I'm not sure what this is? It doesn't seem to be GO cellular component? It only has 37 items, and I can't see what they have in common? It seems to be a random fixed bag of activities.

Where does this come from? (and what does it mean?)

(drop down Model Browser, Select a data type or browse)

rachellyne commented 2 years ago

Component is part of the data we get from Uniprot. If you start your query from Uniprot you will see the component class referenced. The component describes the names of processed products.

ValWood commented 2 years ago

OK, I think there might be some critical information missing to make use of this:

the PRO ID of the processed form and the specific residues:

Chain PRO_0000352836 | 40 – 753 | Rsm22-cox11 tandem protein 1, mitochondrial

Chain PRO_0000352837 | 40 – 568 | 37S ribosomal protein S22-1

The description of the protein: Rsm22-cox11 tandem protein 1, mitochondrial Cleaved into the following 2 chains: [37S ribosomal protein S22-1] Cytochrome c oxidase assembly protein cox11-1

UniProt calls this "molecule processing" which is a much clearer label. "Component" is confusing because it usually refers to cellular component (like a complex) rather than a sub-component of a sequence.

ValWood commented 2 years ago

Also, these do not seem to link to the PomBase 'primary identifiers' (i.e. SPAC*) in any way. I seem to remember we decided to link the UniProt data as features of the species' primary identifiers (so that we would not have multiple representations of the same entity).

I can't remember the discussion precisely, but maybe @kimrutherford does. This is something we can discuss on a future call when Kim is present.

ValWood commented 2 years ago

So, for example, if I select all of the UniProt attributes, they currently look like this:

There is no connection to the primary systematic identifiers

ValWood commented 2 years ago

I think we will need to look at which UniProt queries make sense. Some (like EC, which is now subsumed by GO+Rhea, and SPKW) are subsumed by GO annotation. It makes sense for us to prevent our users from using these directly, because the data they retrieve by using these as the basis for queries will be suboptimal (many false positives and false negatives) compared with the more current collaborative functional curation done by UniProt and the MODs. GO curation weeds out all false positives and also provides a more complete list, so the mappings become largely obsolete for most species. I can try to explain in more detail in person.

For example, SPKW "cell cycle/nucleus/mitochondrial" will retrieve only a fraction of the gene set that should be assigned to this 'term'. These keywords still exist because they feed annotation into GO, but the subset they would provide from an independent query is arbitrary and is already covered in more detail by GO, so they are redundant. I would not advise any user to use this as a query.

rachellyne commented 2 years ago

Component is as it is called in the Uniprot XML. The Component class just shows the names of any processed products. The details from the "molecular processing" section can be found under features - I can show you when we meet.
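To make the distinction concrete, here is a minimal sketch of where the two kinds of information sit in a UniProt-style XML entry. It is not how the mine's loader works (I have not checked that); the XML fragment is hand-trimmed for illustration from the values quoted earlier in this thread, and the helper names `component_names` and `chain_features` are made up for this example.

```python
import xml.etree.ElementTree as ET

# Namespace used by UniProt XML entries.
NS = {"up": "http://uniprot.org/uniprot"}

# Hand-trimmed, illustrative fragment (not a complete UniProt entry).
SAMPLE = """<entry xmlns="http://uniprot.org/uniprot">
  <protein>
    <recommendedName>
      <fullName>Rsm22-cox11 tandem protein 1, mitochondrial</fullName>
    </recommendedName>
    <component>
      <recommendedName>
        <fullName>37S ribosomal protein S22-1</fullName>
      </recommendedName>
    </component>
    <component>
      <recommendedName>
        <fullName>Cytochrome c oxidase assembly protein cox11-1</fullName>
      </recommendedName>
    </component>
  </protein>
  <feature type="chain" id="PRO_0000352836"
           description="Rsm22-cox11 tandem protein 1, mitochondrial">
    <location><begin position="40"/><end position="753"/></location>
  </feature>
  <feature type="chain" id="PRO_0000352837"
           description="37S ribosomal protein S22-1">
    <location><begin position="40"/><end position="568"/></location>
  </feature>
</entry>"""


def component_names(entry):
    """Names only: this is all the 'Component' class carries."""
    path = "up:protein/up:component/up:recommendedName/up:fullName"
    return [el.text for el in entry.findall(path, NS)]


def chain_features(entry):
    """PRO IDs and residue ranges live under 'chain' features instead."""
    chains = []
    for feat in entry.findall("up:feature[@type='chain']", NS):
        begin = feat.find("up:location/up:begin", NS)
        end = feat.find("up:location/up:end", NS)
        chains.append((
            feat.get("id"),
            int(begin.get("position")),
            int(end.get("position")),
            feat.get("description"),
        ))
    return chains


entry = ET.fromstring(SAMPLE)
for name in component_names(entry):
    print("component:", name)
for pro_id, start, stop, desc in chain_features(entry):
    print("chain:", pro_id, start, stop, desc)
```

The component names on their own carry no PRO ID or coordinates, which is why they look like an unrelated bag of names when browsed in isolation.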

rachellyne commented 2 years ago

The bits we load from uniprot are in a configuration file I think, so should be ok to tweak! How does anyone who isn't directly involved in GO know that stuff from Uniprot isn't current?

~Also identifiers - we can link to the gene, but did you mean the pombe protein ID? We may have a couple of references not populating correctly, I'll take a closer look.~

rachellyne commented 2 years ago

Ahh I missed this paper: https://academic.oup.com/bioinformatics/article/36/6/1896/5613180

I don't see any reason why we can't also extract the catalytic activity, Rhea and ChEBI IDs. I can't see where we configure this for UniProt, but will look into it.

Hiding comment as covered in https://github.com/intermine/pombemine/issues/40

ValWood commented 2 years ago

> Also identifiers - we can link to the gene, but did you mean the pombe protein ID?

I think linking to the gene is better, as that is the ID we usually use. The protein ID is the same with a ".p" but nobody uses that.

> How does anyone who isn't directly involved in GO know that stuff from Uniprot isn't current?

It isn't that it isn't current. UniProt still represents EC (but I believe EC will no longer be maintained; I am checking that). It's just that GO will give a more complete and comprehensive answer.

The same goes for RHEA. There is no real need to import it because there is a mapping between RHEA and GO. GO is more powerful for querying because of the hierarchy. If you did include RHEA IDs, they should be associated with the GO IDs they are mapped to, if that makes sense (so that all annotations to a term will be retrieved, rather than only direct annotations). But they aren't critical for now.
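A toy sketch of what "all annotations to a term, rather than only direct annotations" means in practice. Everything here is invented for illustration: the term IDs, the RHEA ID, the gene names and the function names are placeholders, and a real implementation would read the GO ontology and the published RHEA-to-GO mapping instead of hard-coded dictionaries.

```python
# Toy is_a hierarchy: child term -> parent terms (placeholder IDs).
IS_A = {
    "GO:toy_specific_activity": ["GO:toy_general_activity"],
    "GO:toy_general_activity": [],
    "GO:toy_unrelated": [],
}

# Placeholder RHEA-to-GO mapping.
RHEA_TO_GO = {"RHEA:toy_reaction": "GO:toy_general_activity"}

# Direct annotations only: gene -> GO terms.
ANNOTATIONS = {
    "geneA": ["GO:toy_general_activity"],
    "geneB": ["GO:toy_specific_activity"],
    "geneC": ["GO:toy_unrelated"],
}


def ancestors_inclusive(term):
    """Return the term plus every term reachable by following is_a upwards."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(IS_A.get(current, []))
    return seen


def genes_for_term(term, include_descendants=True):
    """Genes annotated to the term, optionally via more specific child terms."""
    hits = []
    for gene, terms in ANNOTATIONS.items():
        if include_descendants:
            matched = any(term in ancestors_inclusive(t) for t in terms)
        else:
            matched = term in terms
        if matched:
            hits.append(gene)
    return sorted(hits)


def genes_for_rhea(rhea_id):
    """Resolve the RHEA ID to its mapped GO term, then query the hierarchy."""
    return genes_for_term(RHEA_TO_GO[rhea_id])


print(genes_for_rhea("RHEA:toy_reaction"))
print(genes_for_term("GO:toy_general_activity", include_descendants=False))
```

Querying through the mapped GO term picks up geneB, which is annotated only to the more specific child term; a lookup on direct annotations alone would miss it.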

We can look what is loaded from UniProt when we speak.

ValWood commented 2 years ago

ACTION

[ ] If possible, do not load Uniprot "component" (people will confuse with GO component but it has only 33 mixed associations)

danielabutano commented 2 years ago

@ValWood already done, I had forgotten to include it in the build

intermine / pombemine

Do not load UniProt "Component" #16