intermine / pombemine


SPCC548.03c from "GO annotation dataset" (plus question, how to query the "with" field?) #51

Closed ValWood closed 2 years ago

ValWood commented 2 years ago

There are 2 genes represented in pombemine for SPCC548.03c.01 SPCC548.03c.02

Neither @kimrutherford nor I can figure out where these originate (they are isoform IDs, not separate genes). We can't see where we export these. Are they coming from another source?

thanks v

ValWood commented 2 years ago

I should mention that Kim's digging says this comes from the "GO Annotation data set".

But this is not a gene name in our annotation; it's an isoform identifier. And I can't see that we have used it in the PomBase GO annotations.

ValWood commented 2 years ago

There is also another GO-data-derived gene, Q9P3V0. This appears in the gene list even when applying the filter for taxon 4896. But it is a human gene?

ValWood commented 2 years ago

@kimrutherford

kimrutherford commented 2 years ago

There are 2 genes represented in pombemine for SPCC548.03c.01 SPCC548.03c.02

That should be: SPCC548.03c.1 SPCC548.03c.2

One case where we use these IDs in a place that is mostly gene IDs is the "with" column of the GAF file. For example:

PomBase SPCC1906.03     wtf19           GO:0005737      PMID:32032353   ISS     PomBase:SPCC548.03c.1   C       wtf meiotic drive antidote Wtf19                protein taxon:4896      20200914      PomBase part_of(CL:0000607)
PomBase SPCC1906.03     wtf19           GO:0005737      PMID:32032353   ISS     PomBase:SPCC548.03c.2   C       wtf meiotic drive antidote Wtf19                protein taxon:4896      20200914      PomBase part_of(CL:0000607)

Maybe they are being misunderstood as gene identifiers in that context?
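To illustrate how such an ID could be told apart from a gene ID, here is a minimal sketch. It assumes GAF 2.x column order (With/From is column 8) and assumes that an isoform ID is a systematic gene ID followed by `.<number>`. The regex and the helper names `with_from_ids` and `parent_gene` are hypothetical, not part of PomBase or InterMine code.

```python
import re

# GAF 2.x is tab-separated; With/From is column 8 (index 7).
WITH_FROM_INDEX = 7

# Assumption: an isoform ID is a systematic gene ID plus ".<number>",
# e.g. SPCC548.03c.1 is an isoform of gene SPCC548.03c.
ISOFORM_RE = re.compile(r"^(SP[A-Z0-9]+\.\d+c?)\.(\d+)$")


def with_from_ids(gaf_line):
    """Return the DB:ID entries from the With/From column of one GAF line."""
    columns = gaf_line.rstrip("\n").split("\t")
    field = columns[WITH_FROM_INDEX]
    # Entries may be joined with "|" (OR) or "," (AND).
    return [entry for entry in re.split(r"[|,]", field) if entry]


def parent_gene(curie):
    """Return the gene ID if `curie` looks like an isoform ID, else None."""
    local_id = curie.split(":", 1)[-1]
    match = ISOFORM_RE.match(local_id)
    return match.group(1) if match else None


# One annotation line rebuilt with explicit tabs (values from the example above).
line = "\t".join([
    "PomBase", "SPCC1906.03", "wtf19", "", "GO:0005737", "PMID:32032353",
    "ISS", "PomBase:SPCC548.03c.1", "C", "wtf meiotic drive antidote Wtf19",
    "", "protein", "taxon:4896", "20200914", "PomBase",
    "part_of(CL:0000607)", "",
])

for entry in with_from_ids(line):
    print(entry, "->", parent_gene(entry))
```

A loader that ran a check like this could skip creating a gene for any "with" entry whose `parent_gene` is not `None`.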

ValWood commented 2 years ago

Right, that makes sense. Hmm, this is a real edge case. We can infer the location of the different specific versions of this protein (poison and antidote), and in this case we have specified the isoform (alternative transcript) ID in the "with" column.

I checked the docs (http://geneontology.org/docs/go-annotation-file-gaf-format-2.1/#with-or-from-column-8) to see if this field is restricted to "gene". It isn't, but isoform is not documented:

[Image: Screenshot 2022-05-23 at 13 24 05]

I suspect that if we discussed this, the format would be the same as for an allele, so it would be DB:gene_symbol[isoform_symbol].

I will check this with GO

kimrutherford commented 2 years ago

There is also another GO-data-derived gene, Q9P3V0. This appears in the gene list even when applying the filter for taxon 4896. But it is a human gene?

Did you paste the wrong ID? That one (Q9P3V0) is pombe wtf4. Did you mean Q9H9V9?

Now that I've investigated more, this may be a similar problem to SPCC548.03c.1/SPCC548.03c.2.

But this time it's a PomBase bug. In the GAF file we are prefixing everything in the "with" column with "PomBase:" so we have:

PomBase:SPAC25H1.02             RO:0002331      GO:0002184      GO_REF:0000050  ECO:0000266     PomBase:Q9H9V9          2007-05-31      PomBase
PomBase:SPAC25H1.02             RO:0002327      GO:0106156      GO_REF:0000050  ECO:0000266     PomBase:Q9H9V9          2007-05-31      PomBase

It's wrong in the GPAD file too. Whoops.

I've made an issue and I'll get to it this week: pombase/pombase-chado/issues/970
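As a rough illustration of the kind of check that would catch this, the sketch below re-prefixes any `PomBase:` entry whose local ID has the shape of a UniProtKB accession. This is not the fix tracked in the issue above. The function name `fix_prefix` is hypothetical, and using `UniProtKB:` as the corrected prefix is an assumption. The regex is the accession pattern published in the UniProt documentation.

```python
import re

# UniProtKB accession pattern (from the UniProt help pages).
UNIPROT_RE = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)


def fix_prefix(curie):
    """Swap a wrong 'PomBase:' prefix for 'UniProtKB:' (assumed prefix)
    when the local ID looks like a UniProtKB accession."""
    prefix, _, local_id = curie.partition(":")
    if prefix == "PomBase" and UNIPROT_RE.fullmatch(local_id):
        return "UniProtKB:" + local_id
    return curie


for entry in ["PomBase:Q9H9V9", "PomBase:SPAC25H1.02", "PomBase:SPCC548.03c.1"]:
    print(entry, "->", fix_prefix(entry))
```

PomBase systematic IDs contain a dot, so they cannot match the accession pattern and are left unchanged.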

ValWood commented 2 years ago

But this time it's a PomBase bug. In the GAF file we are prefixing everything in the "with" column with "PomBase:" so we have:

It looks as though we are not prefixing everything (because they usually resolve on the web pages). If we omit the prefix, PomBase must be inferred. I can fix this issue in the "legacy GO annotation file".

ValWood commented 2 years ago

GO ticket https://github.com/geneontology/helpdesk/issues/394

@danielabutano I have taken this ticket and I'll report back. It might be possible to improve how InterMine handles this field if IDs can be typed. Note that "protein complex" identifiers can also be used in this field (I am not sure how).

ValWood commented 2 years ago

@danielabutano one thing I did wonder about was the value of adding the genes from the "with" field. The genes of interest will already be loaded via other routes.

ValWood commented 2 years ago

OK, I have a response from GO (https://github.com/geneontology/helpdesk/issues/394). Basically, it isn't safe to assume that the IDs in the "with" field refer to genes.

But I think that is OK; we don't need to use these "with" field entries in any queries. They are really arbitrary sources of support for an annotation, but they aren't useful for querying, and therefore probably shouldn't be loaded as independent genes. As long as the string (prefix plus ID) is visible, people can look up the sources if they want to validate a specific annotation.
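A minimal sketch of that idea, assuming a simplified four-field row rather than a full GAF line. Only the annotated subject becomes a gene, and the "with" entries stay as plain strings attached to the annotation. The `Annotation` class and `load` function are hypothetical and are not how InterMine's loader is written.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    gene_id: str
    go_id: str
    evidence: str
    # Opaque "Prefix:ID" strings; shown to users but never resolved to genes.
    with_from: list = field(default_factory=list)


def load(rows):
    """Build the gene set and annotation list from simplified rows."""
    genes = set()
    annotations = []
    for gene_id, go_id, evidence, with_field in rows:
        genes.add(gene_id)  # only the annotated subject becomes a gene
        supports = [s for s in with_field.split("|") if s]
        annotations.append(Annotation(gene_id, go_id, evidence, supports))
    return genes, annotations


rows = [
    ("SPCC1906.03", "GO:0005737", "ISS", "PomBase:SPCC548.03c.1"),
    ("SPCC1906.03", "GO:0005737", "ISS", "PomBase:SPCC548.03c.2"),
]
genes, annotations = load(rows)
print(sorted(genes))
print([a.with_from for a in annotations])
```

With this shape, SPCC548.03c.1 and SPCC548.03c.2 never appear in the gene list, but remain visible on the annotations they support.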

I wanted to see what the "with" field output looks like, but I can't get a query to output this column. Where are the instructions for this?

danielabutano commented 2 years ago

Hi @ValWood, this query shows the genes created from the "with" column: [image]

danielabutano commented 2 years ago

This query is more precise: [image]

Below is the XML if you want to import it: `

Pombe-gene data set, cerevisiae-orthologs data set, human-orthologs data set, BioGRID interaction data set, PomBase disease data set, PomBase phenotypes data set

`

ValWood commented 2 years ago

Got it. I forgot that I need to switch to "GO evidence".