Load complex portal UIDs into Chado and put them in the GPAD file

ValWood commented 5 months ago

I don't see complex portal IDs in Noctua, so I guess we need to add them to the ~GPAD~. GPI?

v

cc @PCarme

kimrutherford commented 5 months ago

SGD complex example: https://www.yeastgenome.org/complex/CPX-2262

Action for Kim: check the SGD GPI/GPAD file for this complex.

kimrutherford commented 5 months ago

@ValWood How do we associate gene IDs with complex portal IDs? Is there a mapping file?

kimrutherford commented 5 months ago

On the call I was wondering what SO term to use in column 5 of the GPI file (DB_Object_Type). I had a look at the GPAD/GPI 2.0 spec and it says:

the entity type in column 5 is captured using an ID from the Sequence Ontology, Protein Ontology, or
Gene Ontology

So probably we can use something like protein-containing complex (GO:0032991) as the type for complexes.

check the SGD GPI/GPAD file for this complex.

SGD are still on GPAD/GPI v1.2 which doesn't need a term ID for the object type.

This is the line in the SGD GPI file for that complex:

SGD     S000218145      CPX-2262                26S Proteasome complex|Proteasome Activator|2f16|2gpl|2zcy|3hye|1g0u|3bdm|3dy3|3dy4|3gpw|3gpt|1z7q|3e47|3d29|1jd2|2fak|4v7o|1g65|1ryp|3gpj|3JCK|3.4.25.1|4cr4|2596|4cr3|2595|3JCO|3JCP|4cr2|2594|3.4.19.12        protein_complex taxon:55929

The type is just protein_complex

kimrutherford commented 5 months ago

As a first step I've added a build step to load a file with a mapping from gene systematic IDs to Complex Portal IDs.

The file is: pombe-embl/supporting_files/protein_complex_id_mapping.tsv

The three tab separated columns are:

gene systematic ID
Complex Portal ID
PubMed ID (maybe a Complex Portal paper?)

The PubMed ID is required by Chado.

It's currently an empty file.
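For illustration, here is a minimal sketch of reading a file with that three-column layout. It is not the actual loader in pombase-chado: the function and class names are hypothetical, the sample row uses placeholder IDs, and skipping `#` comment lines is an assumption.

```python
import csv
import io
from typing import NamedTuple


class ComplexMapping(NamedTuple):
    gene_systematic_id: str
    complex_portal_id: str
    pubmed_id: str


def read_complex_mappings(handle):
    """Yield one ComplexMapping per non-empty, non-comment row."""
    reader = csv.reader(handle, delimiter="\t")
    for line_number, row in enumerate(reader, start=1):
        # Blank lines come back as empty lists; '#' comments are an assumption.
        if not row or row[0].startswith("#"):
            continue
        if len(row) != 3:
            raise ValueError(
                f"line {line_number}: expected 3 columns, got {len(row)}"
            )
        yield ComplexMapping(*(field.strip() for field in row))


# Made-up example row; the IDs are placeholders, not real data.
sample = io.StringIO("SPAC0000.00\tCPX-0000\tPMID:00000000\n")
mappings = list(read_complex_mappings(sample))
```

An empty file simply yields no mappings, which matches the current state of the file.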

kimrutherford commented 5 months ago

After we have added some complexes to the mapping file and successfully loaded them into Chado, I'll change the GPI writer to include the complex details.

kimrutherford commented 5 months ago

So probably we can use something like protein-containing complex (GO:0032991) as the type for complexes.

GO:0032991 is what the GO db-xrefs file says.

ValWood commented 5 months ago

The type is just protein_complex

We should use a broader term if there is one (to cover protein-RNA complexes). I can't even find protein_complex in SO?

ValWood commented 5 months ago

Are you using the GO protein complex ID? If so, use "protein-containing complex (GO:0032991)".

kimrutherford commented 5 months ago

I can't even find protein_complex in SO?

The GPI 1.2 spec allows "protein_complex" as a special case:

DB_Object_Type

A description of the type of the gene or gene product being annotated. This field uses Sequence Ontology labels and may correspond to one of the following: gene, protein_complex; protein; transcript; ncRNA; rRNA; tRNA; snRNA; snoRNA; or any subtype of ncRNA in the Sequence Ontology.

https://geneontology.org/docs/gene-product-information-gpi-format/#db_object_type

kimrutherford commented 5 months ago

Are you using the GO protein complex ID? If so, use "protein-containing complex (GO:0032991)".

Yep, that's what I'm using.

kimrutherford commented 5 months ago

After we have added some complexes to the mapping file and successfully loaded them into Chado, I'll change the GPI writer to include the complex details.

I added some fake protein complex data to my local test Chado database. So I've now implemented and tested writing the complexes to the GPI file.

The complexes will start appearing in the GPI file once we have some complexes in pombe-embl/supporting_files/protein_complex_id_mapping.tsv

ValWood commented 5 months ago

https://www.ebi.ac.uk/complexportal/complex/organisms It might be the "complex tab" file here, but I don't think the download is working?

ValWood commented 5 months ago

We do have some real data in the spreadsheet Sandra shared with us https://docs.google.com/spreadsheets/d/1S4qU55KgNAKLsfXr-4DCb06jKgXcvt3kcAupSNPvl5Y/edit#gid=0

We will need to be careful when mapping using gene names. Complex Portal will likely use the UniProt gene names, and sometimes their names are inferred from S. cerevisiae and are not the official names. We probably need to use UniProt identifiers instead in the 'real' conversion.

ValWood commented 5 months ago

Or, we can use the link from CPX-555 -> GO:0005955 and then use our genes from https://www.pombase.org/data/annotations/Gene_ontology/GO_complexes/

kimrutherford commented 5 months ago

Or, we can use the link from CPX-555 -> GO:0005955

Where is that link?

kimrutherford commented 5 months ago

https://www.ebi.ac.uk/complexportal/complex/organisms It might be the "complex tab" file here, but I don't think the download is working?

It didn't work for me either. I found the TSV files here: http://ftp.ebi.ac.uk/pub/databases/intact/complex/current/complextab/ http://ftp.ebi.ac.uk/pub/databases/intact/complex/current/complextab/284812.tsv

kimrutherford commented 5 months ago

http://ftp.ebi.ac.uk/pub/databases/intact/complex/current/complextab/284812.tsv

Annoyingly the gene IDs are UniProt IDs but we can look them up in Chado when loading.
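A rough sketch of that lookup step, under stated assumptions: the participants field is taken to be pipe-separated accessions each followed by a stoichiometry in parentheses (not verified against the current file), the lookup table stands in for a Chado query, and all IDs below are placeholders.

```python
import re

# Hypothetical lookup table; in practice this would be built from Chado.
UNIPROT_TO_SYSTEMATIC_ID = {
    "P00000": "SPAC0000.01",
    "Q00000": "SPBC0000.02c",
}

# Assumed participant format: accession, then stoichiometry in parentheses.
PARTICIPANT_RE = re.compile(r"^(?P<accession>[^()]+)\((?P<count>\d+)\)$")


def map_participants(participants_field, lookup):
    """Map each participant accession to a gene systematic ID.

    Returns (mapped_ids, unmapped_accessions) so that participants we
    can't resolve (for example non-protein members) are reported
    rather than silently dropped.
    """
    mapped, unmapped = [], []
    for item in participants_field.split("|"):
        item = item.strip()
        if not item:
            continue
        match = PARTICIPANT_RE.match(item)
        # Fall back to the raw text if there is no stoichiometry suffix.
        accession = match.group("accession") if match else item
        gene_id = lookup.get(accession)
        if gene_id is None:
            unmapped.append(accession)
        else:
            mapped.append(gene_id)
    return mapped, unmapped


mapped, unmapped = map_participants(
    "P00000(1)|Q00000(2)|CHEBI:00000(0)", UNIPROT_TO_SYSTEMATIC_ID
)
```

Keeping the unmapped accessions separate makes it easy to log anything that has no PomBase gene.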

kimrutherford commented 4 months ago

The complexes will start appearing in the GPI file once we have some complexes in pombe-embl/supporting_files/protein_complex_id_mapping.tsv

New plan: we now download the data file from Complex Portal when it changes, then load the details into Chado.

I'll check in the morning that it's all OK. We should have the complex IDs in the GPI from tomorrow.

Annoyingly the gene IDs are UniProt IDs but we can look them up in Chado when loading

That is handled by the load script.

kimrutherford commented 4 months ago

I'll check in the morning that it's all OK. We should have the complex IDs in the GPI from tomorrow.

The GPI has the complexes now. But I think I missed the Protein_Containing_Complex_Members field (column 9): https://github.com/geneontology/go-annotation/blob/master/specs/gpad-gpi-2-0.md

I'll change the GPI writer to fill this in.

kimrutherford commented 4 months ago

But I think I missed the Protein_Containing_Complex_Members field (column 9): I'll change the GPI writer to fill this in.

I've done that in time for the nightly load.

The example in the GPI docs shows UniProt IDs but the spec implies that any ID will do. I've used PomBase gene IDs for now. I'll change it if there's a problem.
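To make the layout concrete, a sketch of assembling one GPI 2.0 row for a complex. Columns 5 (object type) and 9 (complex members) are as discussed above; the order of the other columns, the `ComplexPortal:` and `PomBase:` prefixes, the `NCBITaxon:4896` default and the pipe separator are my reading of the spec and should be treated as assumptions. The function name and IDs are hypothetical, and this is not the actual GPI writer.

```python
def format_complex_gpi_row(
    complex_id,
    name,
    member_ids,
    taxon="NCBITaxon:4896",    # assumed taxon CURIE
    object_type="GO:0032991",  # protein-containing complex
):
    """Return one tab-separated, 11-column GPI 2.0 style line."""
    columns = [
        complex_id,                     # 1  DB_Object_ID
        complex_id.split(":", 1)[-1],   # 2  DB_Object_Symbol (ID without prefix)
        name,                           # 3  DB_Object_Name
        "",                             # 4  DB_Object_Synonyms
        object_type,                    # 5  DB_Object_Type
        taxon,                          # 6  DB_Object_Taxon
        "",                             # 7  Encoded_By
        "",                             # 8  Parent_Protein
        "|".join(member_ids),           # 9  Protein_Containing_Complex_Members
        "",                             # 10 DB_Xrefs
        "",                             # 11 Gene_Product_Properties
    ]
    return "\t".join(columns)


# Placeholder IDs, not real data.
row = format_complex_gpi_row(
    "ComplexPortal:CPX-0000",
    "example complex",
    ["PomBase:SPAC0000.01", "PomBase:SPBC0000.02c"],
)
```

If UniProt IDs turn out to be needed in column 9, only the `member_ids` passed in would need to change.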

kimrutherford commented 2 weeks ago

The example in the GPI docs shows UniProt IDs but the spec implies that any ID will do. I've used PomBase gene IDs for now. I'll change it if there's a problem.

We should check this with the Noctua people. Perhaps we need to use UniProt IDs in field 9 ("Protein_Containing_Complex_Members") of the GPI file?

kimrutherford commented 2 weeks ago

We should check this with the Noctua people.

I've commented here:

https://github.com/geneontology/noctua/issues/910#issuecomment-2339382465

pombase / pombase-chado

Load complex portal UIDs into Chado and put them in the GPAD file #1166