Closed ValWood closed 4 months ago
SGD complex example: https://www.yeastgenome.org/complex/CPX-2262
Action for Kim: check the SGD GPI/GPAD file for this complex.
@ValWood How do we associate gene IDs with complex portal IDs? Is there a mapping file?
On the call I was wondering what SO term to use in column 5 of the GPI file (DB_Object_Type). I had a look at the GPAD/GPI 2.0 spec and it says:
the entity type in column 5 is captured using an ID from the Sequence Ontology, Protein Ontology, or Gene Ontology
So probably we can use something like protein-containing complex (GO:0032991) as the type for complexes.
check the SGD GPI/GPAD file for this complex.
SGD are still on GPAD/GPI v1.2 which doesn't need a term ID for the object type.
This is the line in the SGD GPI file for that complex:
SGD S000218145 CPX-2262 26S Proteasome complex|Proteasome Activator|2f16|2gpl|2zcy|3hye|1g0u|3bdm|3dy3|3dy4|3gpw|3gpt|1z7q|3e47|3d29|1jd2|2fak|4v7o|1g65|1ryp|3gpj|3JCK|3.4.25.1|4cr4|2596|4cr3|2595|3JCO|3JCP|4cr2|2594|3.4.19.12 protein_complex taxon:55929
The type is just protein_complex
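For reference, here is a minimal sketch of how a line like that splits into the GPI 1.2 columns. It assumes the real file is tab-separated (the tabs are lost in the paste above) and that the first item after the symbol is the name and the rest are synonyms. The synonym list is shortened, and the function and key names are illustrative only, not taken from any SGD or PomBase code.

```python
# Column names follow the GPI 1.2 format; the parser itself is only a sketch.
GPI_1_2_COLUMNS = [
    "db", "db_object_id", "db_object_symbol", "db_object_name",
    "db_object_synonyms", "db_object_type", "taxon",
    "parent_object_id", "db_xrefs", "properties",
]


def parse_gpi_1_2_line(line):
    """Split one tab-separated GPI 1.2 line into a dict keyed by column name."""
    fields = line.rstrip("\n").split("\t")
    # Trailing optional columns may be absent, so pad with empty strings.
    fields += [""] * (len(GPI_1_2_COLUMNS) - len(fields))
    row = dict(zip(GPI_1_2_COLUMNS, fields))
    # Synonyms are pipe-separated within their column.
    row["db_object_synonyms"] = [
        s for s in row["db_object_synonyms"].split("|") if s
    ]
    return row


# Assumed reconstruction of the SGD line, with the synonym list shortened.
example = "\t".join([
    "SGD", "S000218145", "CPX-2262", "26S Proteasome complex",
    "Proteasome Activator|2f16|2gpl", "protein_complex", "taxon:55929",
])
row = parse_gpi_1_2_line(example)
print(row["db_object_symbol"], row["db_object_type"])
```

In v1.2 the object type (column 6) is a plain label, which is why `protein_complex` appears there rather than a term ID.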
As a first step I've added a build step to load a file with a mapping from gene systematic IDs to Complex Portal IDs.
The file is: pombe-embl/supporting_files/protein_complex_id_mapping.tsv
The three tab separated columns are:
The PubMed ID is required by Chado.
It's currently an empty file.
After we have added some complexes to the mapping file and successfully loaded them into Chado, I'll change the GPI writer to include the complex details.
So probably we can use something like protein-containing complex (GO:0032991) as the type for complexes.
GO:0032991 is what the GO db-xrefs file says.
The type is just protein_complex
We should use a broader term if there is one (to cover protein-RNA complexes). I can't even find protein_complex in SO?
Are you using the GO protein complex ID? If so, use "protein-containing complex (GO:0032991)".
I can't even find protein_complex in SO?
The GPI 1.2 spec allows "protein_complex" as a special case:
DB_Object_Type
A description of the type of the gene or gene product being annotated. This field uses Sequence Ontology labels and may correspond to one of the following: gene, protein_complex; protein; transcript; ncRNA; rRNA; tRNA; snRNA; snoRNA; or any subtype of ncRNA in the Sequence Ontology.
https://geneontology.org/docs/gene-product-information-gpi-format/#db_object_type
Are you using the GO protein complex ID? If so, use "protein-containing complex (GO:0032991)".
Yep, that's what I'm using.
After we have added some complexes to the mapping file and successfully loaded them into Chado, I'll change the GPI writer to include the complex details.
I added some fake protein complex data to my local test Chado database. So I've now implemented and tested writing the complexes to the GPI file.
The complexes will start appearing in the GPI file once we have some complexes in pombe-embl/supporting_files/protein_complex_id_mapping.tsv
https://www.ebi.ac.uk/complexportal/complex/organisms It might be the "complex tab" file here, but I don't think the download is working?
We do have some real data in the spreadsheet Sandra shared with us https://docs.google.com/spreadsheets/d/1S4qU55KgNAKLsfXr-4DCb06jKgXcvt3kcAupSNPvl5Y/edit#gid=0
We will need to be careful mapping using gene names. Complex Portal will likely use the UniProt gene names, and sometimes their names are inferred from S. cerevisiae and are not the official names. We probably need to use UniProt identifiers instead in the 'real' conversion.
Or, we can use the link from CPX-555 -> GO:0005955 and then use our genes from https://www.pombase.org/data/annotations/Gene_ontology/GO_complexes/
Or, we can use the link from CPX-555 -> GO:0005955
Where is that link?
https://www.ebi.ac.uk/complexportal/complex/organisms It might be the "complex tab" file here, but I don't think the download is working?
It didn't work for me either. I found the TSV files here: http://ftp.ebi.ac.uk/pub/databases/intact/complex/current/complextab/ http://ftp.ebi.ac.uk/pub/databases/intact/complex/current/complextab/284812.tsv
Annoyingly the gene IDs are UniProt IDs but we can look them up in Chado when loading.
The complexes will start appearing in the GPI file once we have some complexes in pombe-embl/supporting_files/protein_complex_id_mapping.tsv
New plan: we now download the data file from Complex Portal when it changes, then load the details into Chado.
I'll check in the morning that it's all OK. We should have the complex IDs in the GPI from tomorrow.
Annoyingly the gene IDs are UniProt IDs but we can look them up in Chado when loading
That is handled by the load script.
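Roughly, the lookup amounts to something like the sketch below. This is not the actual load script: the real one looks the UniProt IDs up in Chado, whereas here a plain dict stands in for that lookup. The column headers are what I understand the complextab format to use, so treat them as assumptions, and the complex ID, accessions and gene IDs in the example are made-up placeholders.

```python
import csv
import io
import re

# Column headers as I understand the complextab format; treat as assumptions.
COMPLEX_COLUMN = "#Complex ac"
MEMBERS_COLUMN = "Identifiers (and stoichiometry) of molecules in complex"


def complex_members(tsv_text, uniprot_to_gene):
    """Return {complex_id: [gene IDs]} for members whose UniProt ID is known."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    result = {}
    for row in reader:
        genes = []
        for member in row[MEMBERS_COLUMN].split("|"):
            # Strip a trailing stoichiometry such as "(2)".
            accession = re.sub(r"\(\d+\)$", "", member.strip())
            gene = uniprot_to_gene.get(accession)
            if gene is not None:
                genes.append(gene)
        result[row[COMPLEX_COLUMN]] = genes
    return result


# Placeholder data only: none of these IDs are real.
tsv = (
    f"{COMPLEX_COLUMN}\tRecommended name\t{MEMBERS_COLUMN}\n"
    "CPX-0000\tExample complex\tUNIPROT_A(1)|UNIPROT_B(2)|UNIPROT_C(0)\n"
)
lookup = {"UNIPROT_A": "GENE_A", "UNIPROT_B": "GENE_B"}
members = complex_members(tsv, lookup)
print(members)
```

Members with no match in the lookup (for example non-protein participants) are simply skipped in this sketch; the real loader may handle them differently.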
I'll check in the morning that it's all OK. We should have the complex IDs in the GPI from tomorrow.
The GPI has the complexes now. But I think I missed the Protein_Containing_Complex_Members field (column 9): https://github.com/geneontology/go-annotation/blob/master/specs/gpad-gpi-2-0.md
I'll change the GPI writer to fill this in.
But I think I missed the Protein_Containing_Complex_Members field (column 9): I'll change the GPI writer to fill this in.
I've done that in time for the nightly load.
The example in the GPI docs shows UniProt IDs but the spec implies that any ID will do. I've used PomBase gene IDs for now. I'll change it if there's a problem.
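To make the column layout concrete, here is a minimal sketch of a GPI 2.0 row for a complex with column 9 filled in. It is not the actual GPI writer: the function name, the `ComplexPortal:` and `PomBase:` prefixes, and the placeholder member IDs are assumptions for illustration, and the column order is my reading of the GPAD/GPI 2.0 spec.

```python
def gpi2_complex_row(complex_id, name, taxon_id, member_ids):
    """Build one tab-separated GPI 2.0 row for a protein complex."""
    columns = [
        f"ComplexPortal:{complex_id}",  # 1  DB_Object_ID (prefix assumed)
        complex_id,                     # 2  DB_Object_Symbol
        name,                           # 3  DB_Object_Name
        "",                             # 4  DB_Object_Synonyms
        "GO:0032991",                   # 5  DB_Object_Type
        f"NCBITaxon:{taxon_id}",        # 6  DB_Object_Taxon
        "",                             # 7  Encoded_By
        "",                             # 8  Parent_Protein
        "|".join(member_ids),           # 9  Protein_Containing_Complex_Members
        "",                             # 10 DB_Xrefs
        "",                             # 11 Gene_Product_Properties
    ]
    return "\t".join(columns)


# Placeholder name and member IDs; 284812 is the taxon from the complextab file name.
row = gpi2_complex_row(
    "CPX-555", "example complex", 284812,
    ["PomBase:GENE_A", "PomBase:GENE_B"],
)
print(row)
```

Swapping to UniProt IDs later would only change what is passed as `member_ids`.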
The example in the GPI docs shows UniProt IDs but the spec implies that any ID will do. I've used PomBase gene IDs for now. I'll change it if there's a problem.
We should check this with the Noctua people. Perhaps we need to use UniProt IDs in field 9 ("Protein_Containing_Complex_Members") of the GPI file?
We should check this with the Noctua people.
I've commented here:
I don't see Complex Portal IDs in Noctua, so I guess we need to add them to the ~GPAD~ GPI?
cc @PCarme