intermine / pombemine


A small number of rows where gene name, symbol and feature type have "no value" #47

Closed ValWood closed 2 years ago

ValWood commented 2 years ago
Screenshot 2022-05-18 at 09 09 57
ValWood commented 2 years ago

Also, the above includes one random UniProt entry. UniProt accessions are otherwise loaded as features attached to genes (so that we only have one gene set).

ValWood commented 2 years ago

@kimrutherford @manulera

kimrutherford commented 2 years ago

I get the same list with this query:

<query model="genomic" view="Gene.primaryIdentifier Gene.secondaryIdentifier Gene.symbol Gene.name Gene.length Gene.organism.shortName" constraintLogic="(A and B)" sortOrder="">
   <constraint path="Gene.length" op="IS NULL" code="B"/>
   <constraint path="Gene.organism.shortName" value="S. pombe" op="=" code="A"/>
</query>

PombeMine-zero-length-genes-1
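
For anyone who wants to re-run this outside the web interface, here is a minimal Python sketch (not PombeMine's own tooling). It reads the constraints back out of the query XML above and builds a request URL for the InterMine `query/results` web-service endpoint. `BASE_URL` is a made-up placeholder, and the `query`/`format=tab` parameters are my assumption about how the service is called; check both against the PombeMine web-service documentation.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Placeholder only: substitute the real PombeMine web-service root.
BASE_URL = "https://example.org/pombemine/service"

# The query from this comment, unchanged.
QUERY_XML = """<query model="genomic" view="Gene.primaryIdentifier Gene.secondaryIdentifier Gene.symbol Gene.name Gene.length Gene.organism.shortName" constraintLogic="(A and B)" sortOrder="">
   <constraint path="Gene.length" op="IS NULL" code="B"/>
   <constraint path="Gene.organism.shortName" value="S. pombe" op="=" code="A"/>
</query>"""


def describe_constraints(query_xml):
    """Return (code, path, op, value) for each constraint in the query."""
    root = ET.fromstring(query_xml)
    return [
        (c.get("code"), c.get("path"), c.get("op"), c.get("value"))
        for c in root.findall("constraint")
    ]


def results_url(base_url, query_xml, fmt="tab"):
    """Build (but do not fetch) a results URL for the given query XML."""
    params = urlencode({"query": query_xml, "format": fmt})
    return f"{base_url}/query/results?{params}"


for code, path, op, value in describe_constraints(QUERY_XML):
    print(code, path, op, value)
print(results_url(BASE_URL, QUERY_XML))
```

Nothing is fetched here; the URL can be pasted into a browser or passed to any HTTP client once the real base URL is filled in.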

kimrutherford commented 2 years ago

I'm investigating (https://github.com/pombase/pombase-chado/issues/967) why we export the transcript ID "SPAC1556.06.1" as an exact synonym for "SPAC1556.06" in the JSON file for PombeMine. But it might just be a coincidence that it's in this list.

ValWood commented 2 years ago

That's weird, I wonder where they are coming from?

For example:

SPAC1F12.03c | removed; replaced by a nuclear mitochondrial pseudogene (NUMT) feature |   | 2012-07-16

This is a NUMT (a small fragment of mitochondrial DNA that looks like a gene fragment, so I had it annotated as coding for a while), but it was removed from PomBase in 2012.

I thought we only load genes from PomBase?

kimrutherford commented 2 years ago

> I thought we only load genes from PomBase?

They will be loaded from any source that has gene data.

I should have done this earlier. Here is the result of querying PombeMine for the gene identifier and the DataSet that the identifier came from:

identifier | DataSet
Q9H9V9 | GO Annotation data set
SPAC1556.06.1 | BioGRID interaction data set
SPAC1F12.03c | BioGRID interaction data set
SPAC4H3.12c | BioGRID interaction data set
SPBC28F2.11 | cerevisiae-orthologs data set
SPBC8E4.02c | BioGRID interaction data set
SPCC548.03c.1 | GO Annotation data set
SPCC548.03c.2 | GO Annotation data set
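
A quick way to see which source contributes which stray identifier is to group the rows above by data set. This is a rough Python sketch: `ROWS` is just the table above typed in by hand, and `group_by_dataset` is a hypothetical helper name, not anything from InterMine.

```python
from collections import defaultdict

# (identifier, data set) pairs copied from the table above.
ROWS = [
    ("Q9H9V9", "GO Annotation data set"),
    ("SPAC1556.06.1", "BioGRID interaction data set"),
    ("SPAC1F12.03c", "BioGRID interaction data set"),
    ("SPAC4H3.12c", "BioGRID interaction data set"),
    ("SPBC28F2.11", "cerevisiae-orthologs data set"),
    ("SPBC8E4.02c", "BioGRID interaction data set"),
    ("SPCC548.03c.1", "GO Annotation data set"),
    ("SPCC548.03c.2", "GO Annotation data set"),
]


def group_by_dataset(rows):
    """Map each data set name to the list of identifiers it supplied."""
    grouped = defaultdict(list)
    for identifier, dataset in rows:
        grouped[dataset].append(identifier)
    return dict(grouped)


for dataset, identifiers in sorted(group_by_dataset(ROWS).items()):
    print(f"{dataset}: {len(identifiers)} ({', '.join(identifiers)})")
```

Grouped this way, four of the identifiers come from BioGRID, three from the GO Annotation data set and one from the cerevisiae-orthologs data set.
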
ValWood commented 2 years ago

Also asked @kimrutherford not to load into PomBase https://github.com/intermine/pombemine/issues/51

ValWood commented 2 years ago

I don't understand this one (SPBC28F2.11, from the cerevisiae-orthologs data set). The S. cerevisiae orthologs are parsed from the contig files and this isn't mentioned except as a systematic ID?

See query https://github.com/intermine/pombemine/issues/50

ValWood commented 2 years ago

Can you send the GOA GAF so that I can investigate further? (The alternative forms would be in the "gene product form ID" column, column 17.)

Added to https://github.com/intermine/pombemine/issues/51
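
To check column 17 in a GAF file, something like the following Python sketch could work. It assumes the GAF 2.x layout (tab-separated, 17 columns, comment lines starting with `!`) and collects the gene product form IDs seen for each DB Object ID (column 2). The sample rows are invented placeholders, not lines from the real GOA file, and `gene_product_forms` is a hypothetical helper name.

```python
def gene_product_forms(gaf_lines):
    """Map DB Object ID (column 2) to the set of non-empty
    gene product form IDs (column 17) found for it."""
    forms = {}
    for line in gaf_lines:
        # Skip blank lines and GAF header/comment lines.
        if not line.strip() or line.startswith("!"):
            continue
        cols = line.rstrip("\n").split("\t")
        # Rows without a populated column 17 carry no form ID.
        if len(cols) < 17 or not cols[16]:
            continue
        forms.setdefault(cols[1], set()).add(cols[16])
    return forms


def make_row(object_id, form_id):
    """Build an invented 17-column GAF row for illustration only."""
    cols = ["UniProtKB", object_id, "sym", "", "GO:0000000", "PMID:0",
            "IDA", "", "F", "", "", "protein", "taxon:4896",
            "20220401", "UniProt", "", form_id]
    return "\t".join(cols) + "\n"


# Placeholder data: two rows with a form ID, one without.
SAMPLE = [
    "!gaf-version: 2.2\n",
    make_row("P00001", "UniProtKB:P00001-1"),
    make_row("P00001", "UniProtKB:P00001-2"),
    make_row("P00002", ""),
]

for object_id, form_ids in gene_product_forms(SAMPLE).items():
    print(object_id, sorted(form_ids))
```

For a gzipped GAF, the same function can be fed an open text handle, for example from `gzip.open(path, "rt")`.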

kimrutherford commented 2 years ago

~SPBC28F2.11 | cerevisiae-orthologs data set I don't understand this one. The S. c orthologs are parsed from the contig files and this isn't mentioned except as a systematic ID? Yep, I think that's one for InterMine to investigate.~ outdated

> Can you send the GOA GAF so that I can investigate further? (The alternative forms would be in the "gene product form ID" column, column 17.)

Here's the pombe and japonicus lines from the GOA GAF we load: https://curation.pombase.org/kmr44/gene_association.goa_uniprot.pombe+japonicus-2022-04-01.tsv.gz

That's what PomBase uses, but PombeMine might be reading the XML file.

ValWood commented 2 years ago

> That's what PomBase uses, but PombeMine might be reading the XML file.

PombeMine uses our GO data, not the GOA file (so that we do not import the filtered incorrect propagations that have not yet been fixed, and so that we get the non-redundant set). So if they are getting these IDs, it is somehow via UniProt, not via the GO GAF.

ValWood commented 2 years ago
identifier | DataSet
SPCC548.03c.1 | GO Annotation data set
SPCC548.03c.2 | GO Annotation data set

https://github.com/intermine/pombemine/issues/51

ValWood commented 2 years ago

Sorry @danielabutano! I thought this ticket was on our tracker whilst we tracked down the sources. I can close this issue now; more informative tickets have been opened for the individual issues requiring action.

ValWood commented 2 years ago

BioGRID have mailed back. They have fixed the four issues at their end, so these will disappear soon.