Closed ValWood closed 2 years ago
Also, the list above includes one random UniProt entry. UniProt accessions are otherwise loaded as features attached to genes (so that we only have one gene set).
@kimrutherford @manulera
I get the same list with this query:
<query model="genomic" view="Gene.primaryIdentifier Gene.secondaryIdentifier Gene.symbol Gene.name Gene.length Gene.organism.shortName" constraintLogic="(A and B)" sortOrder="">
<constraint path="Gene.length" op="IS NULL" code="B"/>
<constraint path="Gene.organism.shortName" value="S. pombe" op="=" code="A"/>
</query>
I'm investigating (https://github.com/pombase/pombase-chado/issues/967) why we export the transcript ID "SPAC1556.06.1" as an exact synonym for "SPAC1556.06" in the JSON file for PombeMine. But it might just be a coincidence that it's in this list.
SPAC1F12.03c and SPAC4H3.12c aren't current PomBase identifiers. Those genes were removed sometime in the past. (Details: https://www.pombase.org/status/new-and-removed-genes)
SPBC28F2.11 is a current PomBase gene. There are two genes with that DB identifier in PombeMine. I'm not sure why they haven't merged.
SPBC8E4.02c is a synonym of SPNCRNA.9001 in PomBase because two genes were merged in the past. In PombeMine there is a gene object for SPBC8E4.02c and one for SPNCRNA.9001.
SPCC548.03c.1 and SPCC548.03c.2 are transcript IDs.
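A quick way to flag transcript-style IDs in a list like this is a pattern check. This is only a minimal sketch: the regex and the helper name `split_transcript_id` are assumptions made for illustration (a systematic ID followed by `.<number>`), not something taken from PomBase or PombeMine code.

```python
import re

# Assumed pattern (illustrative only): a PomBase-style systematic ID such as
# SPAC1556.06 or SPCC548.03c, followed by ".<number>", is treated as a transcript ID.
TRANSCRIPT_ID = re.compile(r"^(SP[A-Z]+\w*\.\d+c?)\.(\d+)$")


def split_transcript_id(identifier):
    """Return (gene_id, transcript_number) if the identifier looks like a
    transcript ID under the assumed pattern, otherwise None."""
    match = TRANSCRIPT_ID.match(identifier)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


# Identifiers mentioned in this thread.
for identifier in ["SPAC1556.06.1", "SPCC548.03c.1", "SPAC1F12.03c", "SPNCRNA.9001"]:
    print(identifier, split_transcript_id(identifier))
```

Anything the helper returns as a pair would then be a candidate for mapping back to its gene rather than being loaded as a gene in its own right.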
That's weird, I wonder where they are coming from?
For example:
SPAC1F12.03c | removed; replaced by a nuclear mitochondrial pseudogene (NUMT) feature | | 2012-07-16
This is a NUMT (a small fragment of mitochondrial DNA that looks like a gene fragment, so I had it annotated as coding for a while), but it was removed from PomBase in 2012.
I thought we only load genes from PomBase?
They will be loaded from any source that has gene data.
I should have done this earlier. Here is the result of querying PombeMine for the gene identifier and the DataSet that the identifier came from:
identifier | DataSet |
---|---|
Q9H9V9 | GO Annotation data set |
SPAC1556.06.1 | BioGRID interaction data set |
SPAC1F12.03c | BioGRID interaction data set |
SPAC4H3.12c | BioGRID interaction data set |
SPBC28F2.11 | cerevisiae-orthologs data set |
SPBC8E4.02c | BioGRID interaction data set |
SPCC548.03c.1 | GO Annotation data set |
SPCC548.03c.2 | GO Annotation data set |
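To see at a glance which source each stray identifier should be chased up with, the rows above can be grouped by DataSet. A minimal sketch, assuming the table is available as pipe-separated text; the function name `group_by_dataset` is made up for this example.

```python
from collections import defaultdict

# The rows from the table above, copied as pipe-separated text.
TABLE = """\
Q9H9V9 | GO Annotation data set |
SPAC1556.06.1 | BioGRID interaction data set |
SPAC1F12.03c | BioGRID interaction data set |
SPAC4H3.12c | BioGRID interaction data set |
SPBC28F2.11 | cerevisiae-orthologs data set |
SPBC8E4.02c | BioGRID interaction data set |
SPCC548.03c.1 | GO Annotation data set |
SPCC548.03c.2 | GO Annotation data set |
"""


def group_by_dataset(table_text):
    """Map each DataSet name to the identifiers attributed to it, in row order."""
    groups = defaultdict(list)
    for line in table_text.splitlines():
        cells = [cell.strip() for cell in line.split("|")]
        # Skip blank lines and rows missing an identifier or a data set name.
        if len(cells) < 2 or not cells[0] or not cells[1]:
            continue
        groups[cells[1]].append(cells[0])
    return dict(groups)


for dataset, identifiers in group_by_dataset(TABLE).items():
    print(f"{dataset}: {', '.join(identifiers)}")
```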
[x] Contact BioGRID about:
SPBC8E4.02c is now a synonym of SPNCRNA.9001 (there is no longer a protein-coding ORF for this ID)
SPAC1F12.03c: removed; replaced by a nuclear mitochondrial pseudogene (NUMT) feature
SPAC4H3.12c: not protein-coding (in the upstream region of snr62). No corresponding gene feature (but might be part of the snr62 transcript)
SPAC1556.06.1 is a transcript ID for an alternative transcript of SPAC1556.06
Also asked @kimrutherford not to load into PomBase https://github.com/intermine/pombemine/issues/51
I don't understand this one. The S. c orthologs are parsed from the contig files and this isn't mentioned except as a systematic ID?
Can you send the GOA GAF so that I can investigate further? (The alternative forms would be in the "gene product form ID" column, column 17.)
~SPBC28F2.11 | cerevisiae-orthologs data set I don't understand this one. The S. c orthologs are parsed from the contig files and this isn't mentioned except as a systematic ID? Yep, I think that's one for InterMine to investigate.~ outdated
Here's the pombe and japonicus lines from the GOA GAF we load: https://curation.pombase.org/kmr44/gene_association.goa_uniprot.pombe+japonicus-2022-04-01.tsv.gz
That's what PomBase uses, but PombeMine might be reading the XML file.
PombeMine uses our GO data, not the GOA (so that we do not import the filtered incorrect propagations that have not yet been fixed, and so that we get the non-redundant set). So if they are getting these IDs, it is somehow via UniProt, not via the GO GAF.
identifier | DataSet |
---|---|
SPCC548.03c.1 | GO Annotation data set |
SPCC548.03c.2 | GO Annotation data set |
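One way to check whether a given GAF file carries alternative-form IDs at all would be to list the rows where column 17 (gene product form ID) is filled in. This is a rough sketch assuming a tab-separated GAF 2.x file with `!` comment lines; the function name `gene_product_forms` is hypothetical, and nothing here reflects how PombeMine or PomBase actually load the data.

```python
def gene_product_forms(gaf_lines):
    """Yield (db_object_id, gene_product_form_id) for every GAF row whose
    column 17 is non-empty. Column 2 is the DB Object ID in GAF 2.x."""
    for line in gaf_lines:
        # Header and comment lines in GAF files start with "!".
        if line.startswith("!") or not line.strip():
            continue
        columns = line.rstrip("\n").split("\t")
        if len(columns) >= 17 and columns[16]:
            yield columns[1], columns[16]


# Usage on a gzipped GAF (the file name is a placeholder, not a real path):
# import gzip
# with gzip.open("gene_association.tsv.gz", "rt") as handle:
#     for object_id, form_id in gene_product_forms(handle):
#         print(object_id, form_id)
```

If this yields nothing for the file that is actually loaded, the transcript-style IDs are presumably arriving by another route.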
Sorry @danielabutano! I thought this ticket was on our tracker whilst we tracked down the sources. I can close this issue; more informative tickets have been opened for the individual issues requiring action.
BioGRID have mailed back. They have fixed the 4 issues at their end, so these will disappear soon.