Closed by manulera 1 year ago
@ValWood some very good results after running the pipeline on the modification data. We have 60,271 modification annotations in total.
1409 errors found, of which 1142 fixed:
- 1082 sequence errors
- 60 syntax errors
- 0 have several possible fixes, for those check the `change_sequence_position_to` field for "|" characters
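A minimal sketch of the "|" check mentioned above, assuming the pipeline output is a tab-separated file with a `change_sequence_position_to` column. The other column names and the example rows are made up for illustration and are not taken from the real output.

```python
import csv
import io


def rows_with_multiple_fixes(tsv_text, column="change_sequence_position_to"):
    """Return the rows whose proposed fix lists several candidates separated by '|'."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [row for row in reader if "|" in (row.get(column) or "")]


# Hypothetical example data; gene names and positions are placeholders.
example = (
    "systematic_id\tsequence_position\tchange_sequence_position_to\n"
    "geneA\tS10\tS12\n"
    "geneB\tT45\tT44|T47\n"
)

ambiguous = rows_with_multiple_fixes(example)
for row in ambiguous:
    print(row["systematic_id"], row["change_sequence_position_to"])
```

Rows returned by this check would need a manual decision, since the pipeline could not pick a single fix.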
Types of errors fixed:
- old_coords_fix: 1004
- histone_fix: 71
- syntax_error: 60
- multi_shift_fix: 7
Using the updated changelog, which covers the entire history of the genome (even prior to the use of systematic_id), I was able to fix 1004 errors. We should discuss whether the multi-shift fixes make sense.
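To illustrate the idea behind an old-coordinates fix, here is a rough sketch and not the pipeline's actual code. It assumes we already have the current protein sequence and a list of older versions of it (in the real pipeline these would come from the genome changelog). All function names are hypothetical, and `difflib.SequenceMatcher` stands in for whatever alignment the pipeline really uses.

```python
import re
from difflib import SequenceMatcher


def parse_position(position_str):
    """Split e.g. 'S346' into ('S', 346); return None if the syntax is unexpected."""
    m = re.fullmatch(r"([A-Z])(\d+)", position_str)
    return (m.group(1), int(m.group(2))) if m else None


def residue_matches(sequence, residue, pos):
    """True if `residue` is found at 1-based position `pos` of `sequence`."""
    return 1 <= pos <= len(sequence) and sequence[pos - 1] == residue


def map_old_position(old_seq, new_seq, old_pos):
    """Map a 1-based position in old_seq onto new_seq via matching blocks, or None."""
    matcher = SequenceMatcher(None, old_seq, new_seq, autojunk=False)
    for block in matcher.get_matching_blocks():
        if block.a <= old_pos - 1 < block.a + block.size:
            return block.b + (old_pos - 1 - block.a) + 1
    return None


def old_coords_fix(position_str, new_seq, old_seqs):
    """Suggest a corrected position if the annotation only fits an older sequence."""
    parsed = parse_position(position_str)
    if parsed is None:
        return None  # a syntax error, handled elsewhere
    residue, pos = parsed
    if residue_matches(new_seq, residue, pos):
        return position_str  # already consistent with the current sequence
    for old_seq in old_seqs:  # assumed order: most recent version first
        if residue_matches(old_seq, residue, pos):
            new_pos = map_old_position(old_seq, new_seq, pos)
            if new_pos is not None:
                return f"{residue}{new_pos}"
    return None


# Toy sequences: two residues were inserted after the initial M.
print(old_coords_fix("S3", "MGGASTK", ["MASTK"]))
```

In the toy example the annotation `S3` does not match the current sequence but does match the older one, so the position is shifted to where that serine now sits.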
Mostly addressed in #21
Excellent news! That's incredible.
@ValWood I downloaded the data that I originally used on the 14th of December. I have now re-run the analysis on the current data and picked up 95 more errors, all fixable using old coordinates. I see that most of them come from automatic assertions. I guess this means that whatever pipeline makes these predictions uses outdated coordinates (from UniProt, I guess). Full list attached.
There are 4 genes in this file:
Systematic ID | Gene name | Product description
---|---|---|
SPAPB1E7.05 | gde1 | glycerophosphoryl diester phosphodiesterase Gde1 |
SPBC16E9.16c | lsd90 | Lsd90 protein |
SPAC3G6.03c | ifs1 | Maf-like protein, nucleoside-triphosphate diphosphatase, human ASMTL ortholog |
SPBP16F5.03c | tra1 | SAGA complex/ASTRA complex, phosphatidylinositol pseudokinase Tra1 |
tra1 update here: https://github.com/pombase/curation/issues/3024
gde1 updated here: https://github.com/pombase/curation/issues/3384
ifs1 and lsd90 have been updated, but not recently as far as I know?
lsd90
FT /controlled_curation="term=warning, gene structure
FT updated; cv=warning; db_xref=PMID:18079165; date=20090505"
FT /controlled_curation="term=warning, sequence error in
FT genomic data; db_xref=PMID:26615217; date=20170404"
FT /controlled_curation="term=warning, frameshifted;
FT cv=warning; db_xref=PMID:26615217; date=20170404"
FT /controlled_curation="term=warning, Chr_II:1948953!GA->G;
FT db_xref=PMID:26615217; date=20170404"
FT /controlled_curation="term=warning, Chr_II:1950050!A->AG;
FT db_xref=PMID:26615217; date=20170404"

ifs1
FT /controlled_curation="term=warning, gene structure
FT updated; db_xref=PMID:21511999; date=20110311"
and the phosphorylation data hasn't been edited at all since December
Hi @ValWood, my mistake. For the first run I had used an outdated version of the genome, in which the modifications to those genes weren't there. Sorry about that.
Anyway, it means that since October we would have introduced that many errors by changing genome coordinates (lsd90 was changed then).
ifs1 was a silly mistake: there was no difference in errors between versions. The primary name had been added, and the lines differed only because of that.
OK these checks are good, and you picked up the 2 new changes OK.
E.g. imp2 S346 > T346 (For kinases a fix could be suggested).
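One possible reading of the suggestion above, sketched under assumptions: if a phosphorylation annotation names S, T or Y at a position where the sequence actually has a different one of those three residues, the pipeline could propose the residue found in the sequence. The function name and the toy sequence are hypothetical, and whether such a suggestion should be applied automatically is still open.

```python
PHOSPHO_RESIDUES = {"S", "T", "Y"}


def suggest_phospho_fix(sequence, annotated_residue, position):
    """Suggest e.g. 'T346' when 'S346' was annotated but the sequence has T there.

    Returns None when the annotation is already correct, the position is out of
    range, or the mismatch is not between two phosphorylatable residues.
    """
    if not 1 <= position <= len(sequence):
        return None
    actual = sequence[position - 1]
    if actual == annotated_residue:
        return None
    if annotated_residue in PHOSPHO_RESIDUES and actual in PHOSPHO_RESIDUES:
        return f"{actual}{position}"
    return None


# Toy sequence (not the real imp2 protein): position 3 is a threonine.
print(suggest_phospho_fix("MATK", "S", 3))
```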