pombase / pombase-chado

PomBase code for accessing Chado
MIT License

MGI xrefs failing GO checks #1224

Closed ValWood closed 1 week ago

ValWood commented 2 months ago

Our MGI ISO xrefs are failing checks.

WARNING - Invalid identifier:GORULE:0000027: 1298204 does not match any id_syntax patterns for MGI in dbxrefs--PomBase SPBC530.12c pdf1 enables GO:0008474 PMID:15075260 ISO MGI:MGI:1298204 F palmitoyl protein thioesterase/ dolichol pyrophosphate phosphatase fusion protein Pdf1 protein taxon:4896 20040414 PomBase

WARNING - Invalid identifier:GORULE:0000027: 1316717 does not match any id_syntax patterns for MGI in dbxrefs--PomBase SPBC20F10.03 SPBC20F10.03 is_active_in GO:0005634 GO_REF:0000024 ISS MGI:MGI:1316717 C armadillo-type fold protein, human IFRD1 ortholog, implicated in transcription or signaling protein taxon:4896 20170830 PomBase

WARNING - Invalid identifier:GORULE:0000027: 1346084 does not match any id_syntax patterns for MGI in dbxrefs--PomBase SPAC6C3.09 rpp40 part_of GO:0005655 GO_REF:0000024 ISS MGI:MGI:1346084 C RNase P and RNase MRP subunit Rpp40 protein taxon:4896 20061017 PomBase

WARNING - Invalid identifier:GORULE:0000027: 1919005 does not match any id_syntax patterns for MGI in dbxrefs--PomBase SPAC513.06c dhd1 involved_in GO:0042843 GO_REF:0000024 ISS MGI:MGI:1919005 P D-xylose 1-dehydrogenase (NADP+) protein taxon:4896 20150502 PomBase

WARNING - Invalid identifier:GORULE:0000027: 1919005 does not match any id_syntax patterns for MGI in dbxrefs--PomBase SPAC513.06c dhd1 enables GO:0047837 GO_REF:0000024 ISS MGI:MGI:1919005 F D-xylose 1-dehydrogenase (NADP+) protein taxon:4896 20150502 PomBase

But it's a bit weird, because the display and the URL are MGI:1298204 but on the pop-up it says MGI:1298204. Could you have a dig and see if the syntax has been resolved to remove the first MGI: or something?

kimrutherford commented 2 months ago

I can't see anything that's changed. We get the MGI IDs from the GOA GAF and they look like "MGI:MGI:1919005".

It's a bit weird. We have "MGI:MGI:1919005" in the GAF file but the error is:

1919005 does not match any id_syntax patterns for MGI

It's like the "MGI:" prefix is being removed twice.

The id_syntax in the db-xrefs file is "MGI:[0-9]{5,}" and that has been the same for a few years.
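To make the mismatch concrete, here is a minimal sketch that tests the three forms seen in this thread against the quoted id_syntax. It assumes the check requires the whole string to match the pattern (hence `re.fullmatch`); that is an assumption, and the helper name `matches_id_syntax` is made up for illustration and is not part of any GO tooling.

```python
import re

# id_syntax quoted above for MGI in the db-xrefs file.
MGI_ID_SYNTAX = re.compile(r"MGI:[0-9]{5,}")


def matches_id_syntax(local_id: str) -> bool:
    """Return True if the whole string matches the MGI id_syntax.

    Hypothetical helper; assumes a full-string match is what the check wants.
    """
    return MGI_ID_SYNTAX.fullmatch(local_id) is not None


# The three forms that appear in this thread.
candidates = ["MGI:1919005", "1919005", "MGI:MGI:1919005"]
results = {c: matches_id_syntax(c) for c in candidates}

for candidate, ok in results.items():
    print(f"{candidate!r}: {'matches' if ok else 'does not match'}")
```

Under that assumption only "MGI:1919005" passes, so the pattern seems to expect the local part to keep one "MGI:" prefix. A bare "1919005", as reported in the warning, cannot match.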

I think this might be a problem with the GO check. The issue for the check is still in progress:

ValWood commented 2 months ago

@pgaudet @kltm is this a GO checks problem? v

pgaudet commented 2 months ago

maybe @kltm knows why this is happening

kltm commented 2 months ago

Looking at https://github.com/geneontology/go-site/blob/master/metadata/rules/gorule-0000027.md . Okay, "soft" warning, so no data filtering.

The moment of failure is likely here: https://github.com/biolink/ontobio/blob/master/ontobio/io/assocparser.py#L835 Special casing for MGI leading into it is: https://github.com/biolink/ontobio/blob/master/ontobio/io/assocparser.py#L802-L806

So, it looks like MGI:MGI:1919005 would be clipped to MGI and 1919005, the latter of which would fail when checking against the regexp. The options here would be:

Either way, @pgaudet, this is probably best approached as a GO QC bug for the moment (although a "light" one, as no fix or filtering is done) and added to the QC worklist.
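The clipping described above can be sketched roughly as follows. This is not the ontobio code: `split_once` and `split_with_extra_clip` are hypothetical stand-ins that only imitate the behaviour suspected in this thread, namely a normal prefix split followed by a second removal of a repeated prefix.

```python
import re

# id_syntax for MGI as quoted earlier in the thread.
MGI_ID_SYNTAX = re.compile(r"MGI:[0-9]{5,}")


def split_once(curie: str) -> tuple[str, str]:
    """Split on the first colon only: 'MGI:MGI:1919005' -> ('MGI', 'MGI:1919005')."""
    prefix, _, local_id = curie.partition(":")
    return prefix, local_id


def split_with_extra_clip(curie: str) -> tuple[str, str]:
    """Hypothetical special case: also strip a repeated prefix from the local part."""
    prefix, local_id = split_once(curie)
    if local_id.startswith(prefix + ":"):
        local_id = local_id[len(prefix) + 1:]
    return prefix, local_id


xref = "MGI:MGI:1919005"
for splitter in (split_once, split_with_extra_clip):
    prefix, local_id = splitter(xref)
    ok = MGI_ID_SYNTAX.fullmatch(local_id) is not None
    print(f"{splitter.__name__}: prefix={prefix!r} local_id={local_id!r} valid={ok}")
```

In this sketch the single split leaves "MGI:1919005", which satisfies the pattern, while the extra clip leaves "1919005", which is the value reported in the warnings.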

ValWood commented 1 week ago

@pgaudet @kltm I'm closing this on the PomBase tracker. It should be in the GO tracker if it's still an issue?