pombase / pombase-chado

PomBase code for accessing Chado
MIT License

Query GO for pombe GO-CAMs #1177

Closed kimrutherford closed 2 weeks ago

kimrutherford commented 4 months ago

We want to be able to query for pombe GO-CAMs rather than having to manually curate files in SVN (like pombe-embl/supporting_files/production_gocam_id_mapping.tsv).

See also:

SPARQL query for use here: https://geneontology.org/sparql:

PREFIX metago: <http://model.geneontology.org/>
PREFIX provided_by: <http://purl.org/pav/providedBy>

SELECT distinct ?gocam WHERE {
  GRAPH ?gocam {
    ?gocam provided_by:  "http://www.pombase.org" .
  }
}

(Currently it returns no results)

kimrutherford commented 4 months ago

(Currently it returns no results)

This (fixed) query returns three GO-CAM IDs provided_by PomBase. Progress!

PREFIX gocam: <http://model.geneontology.org/>
PREFIX provided_by: <http://purl.org/pav/providedBy>

SELECT distinct ?gocam WHERE {
  GRAPH ?gocam {
    ?gocam provided_by: "http://www.pombase.org"^^<http://www.w3.org/2001/XMLSchema#string> .
  }
}
ORDER BY ?gocam

The query results are: gocam:66187e4700001573 gocam:66187e4700001781 gocam:66187e4700002284

For comparison, we have these GO-CAMs manually configured:

66187e4700001573 66187e4700001781 66187e4700002284 66187e4700003150 662af8fa00000408 662af8fa00000499
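
The gap between the two lists can be checked mechanically. A minimal sketch that uses only the IDs quoted above (the variable names are illustrative, not taken from any PomBase script):

```python
# GO-CAM IDs returned by the SPARQL query (copied from the comment above).
from_query = {
    "66187e4700001573",
    "66187e4700001781",
    "66187e4700002284",
}

# GO-CAM IDs currently configured by hand (copied from the comment above).
manually_configured = {
    "66187e4700001573",
    "66187e4700001781",
    "66187e4700002284",
    "66187e4700003150",
    "662af8fa00000408",
    "662af8fa00000499",
}

# Models we configure by hand that the query does not (yet) return.
missing_from_query = sorted(manually_configured - from_query)
# Models the query returns that are not in the manual list.
unexpected = sorted(from_query - manually_configured)

print("missing from query:", missing_from_query)
print("not in manual list:", unexpected)
```

So the query currently finds half of the manually configured models and nothing that isn't already configured.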

kimrutherford commented 4 months ago

We can query model IDs and gene IDs in GO-CAM models provided by PomBase with:

PREFIX gocam: <http://model.geneontology.org/>
PREFIX provided_by: <http://purl.org/pav/providedBy>
PREFIX obo: <http://purl.obolibrary.org/obo/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX pombasegeneid: <http://identifiers.org/pombase/>

SELECT distinct ?gocam ?geneid WHERE {
  GRAPH ?gocam {
    ?gocam provided_by: "http://www.pombase.org"^^<http://www.w3.org/2001/XMLSchema#string> .
    ?modelgeneid rdf:type ?geneid
  }
  FILTER(strstarts(str(?geneid), str(pombasegeneid:)))
}
ORDER BY ?gocam ?geneid

Results look something like:

gocam:66187e4700001573 pombasegeneid:SPAC644.04
gocam:66187e4700001573 pombasegeneid:SPBC2F12.08c
gocam:66187e4700001573 pombasegeneid:SPCC330.10
gocam:66187e4700001781 pombasegeneid:SPAC1B3.17
...

I'm hoping we can wrap this in a script that gets run nightly or maybe weekly.
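
A rough sketch of what such a wrapper might look like in Python. This is an illustration, not the script that was eventually written: the endpoint URL is left as a parameter because the exact address of GO's SPARQL service is an assumption here, and the function names are made up. The result parsing relies only on the standard SPARQL JSON results layout (`results` / `bindings` / `value`).

```python
import json
import urllib.parse
import urllib.request

GOCAM_PREFIX = "http://model.geneontology.org/"
GENE_PREFIX = "http://identifiers.org/pombase/"


def run_sparql(endpoint_url, query):
    """POST a query to a SPARQL endpoint and return the parsed JSON results.

    endpoint_url is an assumption: pass whatever URL GO's SPARQL service
    actually accepts queries on.
    """
    data = urllib.parse.urlencode({"query": query}).encode("utf-8")
    request = urllib.request.Request(
        endpoint_url,
        data=data,
        headers={"Accept": "application/sparql-results+json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def results_to_rows(results):
    """Turn SPARQL JSON results into sorted, unique (model_id, gene_id) pairs."""
    rows = set()
    for binding in results["results"]["bindings"]:
        gocam = binding["gocam"]["value"]
        geneid = binding["geneid"]["value"]
        # Keep only rows where both URIs carry the expected prefixes,
        # then strip the prefixes to leave the bare IDs.
        if gocam.startswith(GOCAM_PREFIX) and geneid.startswith(GENE_PREFIX):
            rows.add((gocam[len(GOCAM_PREFIX):], geneid[len(GENE_PREFIX):]))
    return sorted(rows)
```

A nightly or weekly job could call `run_sparql()` with the query above and write the pairs from `results_to_rows()` to a file. The column layout of production_gocam_id_mapping.tsv isn't shown in this thread, so the output step is deliberately left out.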

kimrutherford commented 4 months ago

We can query model IDs and gene IDs in GO-CAM models provided by PomBase with:

I meant to say that this query returns every gene ID used anywhere in the model. I think that's probably what we want but the query could be made more precise if needed later (once I understand the GO-CAM model better).

kimrutherford commented 4 months ago

Here's a slightly more precise query after re-reading the GO SPARQL docs:

PREFIX gocam: <http://model.geneontology.org/>
PREFIX provided_by: <http://purl.org/pav/providedBy>
PREFIX obo: <http://purl.obolibrary.org/obo/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX pombasegeneid: <http://identifiers.org/pombase/>
PREFIX enabled_by: <http://purl.obolibrary.org/obo/RO_0002333>

SELECT distinct ?gocam ?geneid WHERE {
  GRAPH ?gocam {
    ?gocam provided_by: "http://www.pombase.org"^^<http://www.w3.org/2001/XMLSchema#string> .
    ?s enabled_by: ?gpnode .
    ?gpnode rdf:type ?geneid .
  }
  FILTER(strstarts(str(?geneid), str(pombasegeneid:)))
}
ORDER BY ?gocam ?geneid

kimrutherford commented 4 months ago

I've added a script that makes a SPARQL query to get the GO-CAM IDs and corresponding gene IDs. The output is the same format as pombe-embl/supporting_files/production_gocam_id_mapping.tsv

As with the file generated for process terms (pombase/website#2173), there are only three models in the output of the script:

I'm assuming that more will be available after the next GO update.

I hope that eventually we'll be able to automatically update production_gocam_id_mapping.tsv nightly or weekly using the new script.

kimrutherford commented 4 months ago

SPARQL is being deprecated by GO in favour of the GO API

It looks like we can use this API end-point to return gene products given a list of GO-CAM IDs: /api/models/gp

https://geneontology.org/docs/tools-guide/

kimrutherford commented 4 months ago

And this end-point should return a list of pombe models: /api/taxon/{taxon}/models

There's an issue at the moment though:

kimrutherford commented 3 months ago

If necessary, we can get the pombe GO-CAM IDs and genes from: http://snapshot.geneontology.org/products/upstream_and_raw_data/noctua_pombase.gpad.gz

kimrutherford commented 3 months ago

I've written a script that uses the Noctua PomBase GPAD file and the GO model API (https://api.geneontology.org/api/go-cam/{ID}) to update the data files for Chado:

The script also retrieves the GO-CAM model titles from the API for adding to Chado.

I've run the script and committed the updated files to SVN. So on Tuesday morning we should have a bunch of extra models and the models will have titles.
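
For readers who want a feel for the approach, here is a hedged Python sketch of the GPAD step. It is not the actual script. It assumes a GPAD 1.x layout (tab-separated, gene ID in the second column) and assumes the model ID appears in the final properties column as `noctua-model-id=gomodel:<ID>`; both should be checked against the real file. Whether the model API expects the bare ID or a `gomodel:` prefix is also an assumption, and the JSON field holding the title is not guessed at here.

```python
import gzip
from collections import defaultdict

# Assumed property key carrying the GO-CAM model ID in Noctua GPAD output.
MODEL_KEY = "noctua-model-id=gomodel:"
# URL template quoted in the comment above.
API_TEMPLATE = "https://api.geneontology.org/api/go-cam/{ID}"


def genes_by_model(lines):
    """Map GO-CAM model ID -> set of gene IDs found in GPAD lines.

    Assumes GPAD 1.x: tab-separated, gene ID in column 2, and a
    '|'-separated properties column (the last one) holding the model ID.
    """
    models = defaultdict(set)
    for line in lines:
        if line.startswith("!") or not line.strip():
            continue  # header/comment line or blank line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # not a usable annotation row
        for prop in fields[-1].split("|"):
            if prop.startswith(MODEL_KEY):
                models[prop[len(MODEL_KEY):]].add(fields[1])
    return models


def read_gpad(path):
    """Read a gzipped GPAD file that has already been downloaded to path."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return genes_by_model(handle)


def model_url(model_id):
    """Fill the {ID} placeholder of the model API URL (bare ID assumed)."""
    return API_TEMPLATE.replace("{ID}", model_id)
```

Calling `read_gpad()` on a downloaded copy of noctua_pombase.gpad.gz would give the model-to-genes mapping, and `model_url()` gives the address to request for each model's details.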

kimrutherford commented 3 months ago

For now I plan to run the script manually until I'm convinced that it's reliable.

I'm going to close this issue since we can now get what we need from GO.

kimrutherford commented 2 weeks ago

From email, but I wanted to attach it to this issue:


I remembered the discussion with FlyBase yesterday about GO-CAMs. We talked about how the Alliance seem to have their own internal API for GO data. It's used by the GO-CAM widget. I had thought that the widget needed a special Alliance API to work but it occurred to me today that maybe the data from that API could be in the correct format for the GO-CAM update script we use. After trying a few things I was able to get that to work and we'll have much more up-to-date GO-CAM information for tonight's load. There will be 290 genes from this query tomorrow: https://www.pombase.org/results/from/id/f99d8133-3206-4941-b44e-9314e7cae3d2

kimrutherford commented 2 weeks ago

There will be 290 genes from this query tomorrow: https://www.pombase.org/results/from/id/f99d8133-3206-4941-b44e-9314e7cae3d2

These are the 290 genes: https://www.pombase.org/results/from/id/690ab3eb-0db5-4698-82e4-ea006399c40a

ValWood commented 2 weeks ago

It's odd that these aren't in the list:

SPBC3D6.07 | gpi3 | pig-A, phosphatidylinositol N-acetylglucosaminyltransferase subunit Gpi3
SPCC16A11.06c | gpi10 | pig-B
SPAC13G6.03 | gpi7 | Pig-G, CP2 mannose-ethanolamine phosphotransferase GPI anchor biosynthesis protein Gpi7
SPBC27B12.06 | gpi13 | pig-O
SPAC4G8.12c | smp3 | pig-Z, alpha-1,2-mannosyltransferase Smp3

they have been in the production model for quite a while...

kimrutherford commented 2 weeks ago

they have been in the production model for quite a while...

Which model? I can look it up if you let me know the ID.

ValWood commented 2 weeks ago

this one http://noctua.geneontology.org/workbench/noctua-visual-pathway-editor/?model_id=gomodel%3A665912ed00000192

ValWood commented 2 weeks ago

similarly

SPAC3A11.08 cul4 CLRC complex subunit, cullin 4
SPCC970.07c raf2 CLRC ubiquitin ligase complex subunit Raf2
SPCC613.12c raf1 CLRC ubiquitin ligase complex WD repeat subunit Raf1/Dos1
SPCC11E10.08 rik1 CLRC ubiquitin ligase complex WD repeat subunit Rik1

are in
http://noctua.geneontology.org/workbench/noctua-visual-pathway-editor/?model_id=gomodel%3A66187e4700001781
http://noctua.geneontology.org/workbench/noctua-visual-pathway-editor/?model_id=gomodel%3A665912ed00001983
http://noctua.geneontology.org/workbench/noctua-visual-pathway-editor/?model_id=gomodel%3A665912ed00000652

kimrutherford commented 2 weeks ago

OK, thanks.

I'm now testing an even more dodgy hack to get all the details of all the production models. We can talk about what I've done next time we have a chat. It's quite a fragile solution that queries the Noctua server (which Seth recommended / requested that we don't do) and the Alliance GO-CAM API (which seems like a temporary hack on their part). So I don't know how long it will work for us. But it works for now and it seems quite up to date.

With the new hack we get 354 genes in models and all the missing genes from your comment are present. Here's the list on my desktop: https://desktop.kmr.nz/results/from/id/89524fcb-3b06-4943-ae78-6f748e141108

(Lots of data is missing from that version as it was a quick load to test the GO-CAMs)

kimrutherford commented 2 weeks ago

With the new hack we get 354 genes in models

I've committed those changes into SVN so they'll be on pombase.org tomorrow.

ValWood commented 2 weeks ago

Excellent, let's use that for the time being. It's really taking shape: we have done about 10% of what is likely possible in 3 months!

kimrutherford commented 2 weeks ago

they'll be on pombase.org tomorrow

That worked: https://www.pombase.org/results/from/id/f99d8133-3206-4941-b44e-9314e7cae3d2

Let me know if you notice any missing genes. (Or genes that shouldn't be there)

ValWood commented 2 weeks ago

Perfect, the progress is pretty amazing because we have quite a lot in development and things to add to existing pathways.