A small number of rows where gene name, symbol and feature type have "no value"

ValWood commented 2 years ago

ValWood commented 2 years ago

Also, the list above includes one random UniProt entry. UniProt accessions are otherwise loaded as features attached to genes (so that we only have one gene set).

ValWood commented 2 years ago

@kimrutherford @manulera

kimrutherford commented 2 years ago

I get the same list with this query:

<query model="genomic" view="Gene.primaryIdentifier Gene.secondaryIdentifier Gene.symbol Gene.name Gene.length Gene.organism.shortName" constraintLogic="(A and B)" sortOrder="">
   <constraint path="Gene.length" op="IS NULL" code="B"/>
   <constraint path="Gene.organism.shortName" value="S. pombe" op="=" code="A"/>
</query>

PombeMine-zero-length-genes-1
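To re-run the same check for another organism, the query XML above can be generated rather than edited by hand. This is only a sketch: the function name `build_null_length_query` is made up for illustration, and it only builds the XML string; it does not contact any mine.

```python
import xml.etree.ElementTree as ET

# Columns copied from the query above.
VIEW = [
    "Gene.primaryIdentifier",
    "Gene.secondaryIdentifier",
    "Gene.symbol",
    "Gene.name",
    "Gene.length",
    "Gene.organism.shortName",
]


def build_null_length_query(organism_short_name: str) -> str:
    """Return PathQuery XML selecting genes of one organism with no length."""
    query = ET.Element(
        "query",
        {
            "model": "genomic",
            "view": " ".join(VIEW),
            "constraintLogic": "(A and B)",
            "sortOrder": "",
        },
    )
    # Constraint B: the gene has no length recorded.
    ET.SubElement(
        query, "constraint", {"path": "Gene.length", "op": "IS NULL", "code": "B"}
    )
    # Constraint A: restrict to the requested organism.
    ET.SubElement(
        query,
        "constraint",
        {
            "path": "Gene.organism.shortName",
            "value": organism_short_name,
            "op": "=",
            "code": "A",
        },
    )
    return ET.tostring(query, encoding="unicode")


if __name__ == "__main__":
    print(build_null_length_query("S. pombe"))
```

The output should be usable wherever the mine accepts query XML.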

kimrutherford commented 2 years ago

I'm investigating (https://github.com/pombase/pombase-chado/issues/967) why we export the transcript ID "SPAC1556.06.1" as an exact synonym for "SPAC1556.06" in the JSON file for PombeMine. But it might just be a coincidence that it's in this list.

SPAC1F12.03c and SPAC4H3.12c aren't current PomBase identifiers. Those genes were removed sometime in the past. (Details: https://www.pombase.org/status/new-and-removed-genes)
- Ensembl Genomes has SPAC4H3.12c still: https://fungi.ensembl.org/Schizosaccharomyces_pombe/Gene/Summary?g=SPAC4H3.12c
- SPAC1F12.03c isn't in Ensembl Genomes.
SPBC28F2.11 is a current PomBase gene. There are two genes with that DB identifier in PombeMine. I'm not sure why they haven't merged.
SPBC8E4.02c is a synonym of SPNCRNA.9001 in PomBase because two genes were merged in the past. In PombeMine there is a gene object for SPBC8E4.02c and one for SPNCRNA.9001.
- Ensembl Genomes has SPBC8E4.02c but not SPNCRNA.9001
SPCC548.03c.1 and SPCC548.03c.2 are transcript IDs.
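A rough way to flag identifiers like these automatically is to classify them by shape. The sketch below is a heuristic only: the gene and transcript patterns are inferred from the identifiers quoted in this thread (a transcript ID is taken to be a systematic ID plus a numeric suffix) and are not an official PomBase specification. The UniProt pattern follows the accession format that UniProt documents.

```python
import re

# Assumption: systematic IDs look like SPAC1556.06, SPCC548.03c or SPNCRNA.9001.
GENE_ID = re.compile(r"^SP[A-Z0-9]+\.\d+c?$")
# Assumption: a transcript ID is a systematic ID followed by ".<number>".
TRANSCRIPT_ID = re.compile(r"^SP[A-Z0-9]+\.\d+c?\.\d+$")
# UniProt accession format (6 or 10 characters).
UNIPROT_ACCESSION = re.compile(
    r"^(?:[OPQ][0-9][A-Z0-9]{3}[0-9]"
    r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})$"
)


def classify(identifier: str) -> str:
    """Return 'transcript', 'gene', 'uniprot' or 'unknown' for an identifier."""
    if TRANSCRIPT_ID.match(identifier):
        return "transcript"
    if GENE_ID.match(identifier):
        return "gene"
    if UNIPROT_ACCESSION.match(identifier):
        return "uniprot"
    return "unknown"


if __name__ == "__main__":
    for identifier in ["SPAC1556.06.1", "SPBC28F2.11", "SPCC548.03c.2"]:
        print(identifier, classify(identifier))
```

Note that this only separates gene-shaped from transcript-shaped IDs. It cannot tell whether a gene-shaped ID such as SPAC1F12.03c is still current; that needs a lookup against the PomBase gene list.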

ValWood commented 2 years ago

That's weird, I wonder where they are coming from?

For example:

SPAC1F12.03c | removed; replaced by a nuclear mitochondrial pseudogene (NUMT) feature | | 2012-07-16

This is a NUMT (a small fragment of the mitochondrial genome that looks like a gene fragment, so I had it annotated as coding for a while), but it was removed from PomBase in 2012.

I thought we only load genes from PomBase?

kimrutherford commented 2 years ago

> I thought we only load genes from PomBase?

They will be loaded from any source that has gene data.

I should have done this earlier. Here is the result of querying PombeMine for the gene identifier and the DataSet that the identifier came from:

identifier	DataSet
Q9H9V9	GO Annotation data set
SPAC1556.06.1	BioGRID interaction data set
SPAC1F12.03c	BioGRID interaction data set
SPAC4H3.12c	BioGRID interaction data set
SPBC28F2.11	cerevisiae-orthologs data set
SPBC8E4.02c	BioGRID interaction data set
SPCC548.03c.1	GO Annotation data set
SPCC548.03c.2	GO Annotation data set
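Grouping the table above by DataSet shows which source each follow-up belongs to. This is a small sketch using only the rows listed above; the function name `group_by_dataset` is hypothetical.

```python
from collections import defaultdict

# Rows copied from the table above (identifier <TAB> DataSet).
ROWS = """\
Q9H9V9\tGO Annotation data set
SPAC1556.06.1\tBioGRID interaction data set
SPAC1F12.03c\tBioGRID interaction data set
SPAC4H3.12c\tBioGRID interaction data set
SPBC28F2.11\tcerevisiae-orthologs data set
SPBC8E4.02c\tBioGRID interaction data set
SPCC548.03c.1\tGO Annotation data set
SPCC548.03c.2\tGO Annotation data set
"""


def group_by_dataset(tsv_text: str) -> dict:
    """Map each DataSet name to the identifiers it contributed, in input order."""
    groups = defaultdict(list)
    for line in tsv_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        identifier, dataset = line.split("\t", 1)
        groups[dataset.strip()].append(identifier.strip())
    return dict(groups)


if __name__ == "__main__":
    for dataset, identifiers in group_by_dataset(ROWS).items():
        print(f"{dataset}: {', '.join(identifiers)}")
```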

ValWood commented 2 years ago

[ ] https://beta.uniprot.org/uniprotkb/Q9H9V9/entry is a human entry; determine how it gets into the pombe gene set

ValWood commented 2 years ago

[x] Contact BioGRID about:
SPBC8E4.02c is now a synonym of SPNCRNA.9001 (there is no longer a protein-coding ORF for this ID)
SPAC1F12.03c: removed; replaced by a nuclear mitochondrial pseudogene (NUMT) feature
SPAC4H3.12c not protein-coding (of upstream region of snr62). No corresponding gene feature (but might be part of snr62 transcript)
SPAC1556.06.1 is a transcript ID for an alternative transcript of SPAC1556.06

Also asked @kimrutherford not to load into PomBase https://github.com/intermine/pombemine/issues/51

ValWood commented 2 years ago

[ ] SPBC28F2.11 | cerevisiae-orthologs data set

I don't understand this one. The S. c orthologs are parsed from the contig files and this isn't mentioned except as a systematic ID?

See query https://github.com/intermine/pombemine/issues/50

ValWood commented 2 years ago

[ ] When I search UniProt for these isoforms I only get one entry, Q9P3V0

Can you send the GOA GAF so that I can investigate further? (The alternative forms would be in the "gene product form ID" column, column 17.)

Added to https://github.com/intermine/pombemine/issues/51

kimrutherford commented 2 years ago

~SPBC28F2.11 | cerevisiae-orthologs data set I don't understand this one. The S. c orthologs are parsed from the contig files and this isn't mentioned except as a systematic ID? Yep, I think that's one for InterMine to investigate.~ outdated

> Can you send the GOA GAF so that I can investigate further? (The alternative forms would be in the "gene product form ID" column, column 17.)

Here are the pombe and japonicus lines from the GOA GAF we load: https://curation.pombase.org/kmr44/gene_association.goa_uniprot.pombe+japonicus-2022-04-01.tsv.gz

That's what PomBase uses, but PombeMine might be reading the XML file.
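For checking the GAF for isoform identifiers, something like the sketch below would list every row with a value in column 17 (Gene Product Form ID in GAF 2.x). It assumes tab-separated GAF 2.x lines with `!` comment lines; the function name `gene_product_forms` is made up for illustration.

```python
from typing import Iterable, Iterator, Tuple

DB_OBJECT_ID_COLUMN = 2  # 1-based GAF column: DB Object ID
FORM_ID_COLUMN = 17      # 1-based GAF column: Gene Product Form ID


def gene_product_forms(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (DB Object ID, Gene Product Form ID) for rows where column 17 is set."""
    for line in lines:
        # Skip GAF header/comment lines and blank lines.
        if line.startswith("!") or not line.strip():
            continue
        columns = line.rstrip("\n").split("\t")
        if len(columns) < FORM_ID_COLUMN:
            continue  # row too short to have a column 17
        form_id = columns[FORM_ID_COLUMN - 1].strip()
        if form_id:
            yield columns[DB_OBJECT_ID_COLUMN - 1], form_id
```

For a gzipped file, the lines can come from `gzip.open(path, "rt")`, for example `for obj_id, form_id in gene_product_forms(gzip.open(path, "rt")): print(obj_id, form_id)`, where `path` is wherever the download was saved.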

ValWood commented 2 years ago

> That's what PomBase uses, but PombeMine might be reading the XML file.

PombeMine uses our GO data, not the GOA (so that we do not import the filtered incorrect propagations that have not yet been fixed, and so that we get the non-redundant set). So if they are getting these IDs, it is somehow via UniProt, not via the GO GAF.

ValWood commented 2 years ago

identifier	DataSet
SPCC548.03c.1	GO Annotation data set
SPCC548.03c.2	GO Annotation data set

https://github.com/intermine/pombemine/issues/51

ValWood commented 2 years ago

Sorry @danielabutano! I thought this ticket was on our tracker whilst we tracked down the sources. I can close this issue now; more informative tickets have been opened for the individual issues requiring action.

ValWood commented 2 years ago

BioGRID have mailed back. They have fixed the 4 issues at their end, so these will disappear soon.

intermine / pombemine

A small number of rows where gene name, symbol and feature type have "no value" #47