pombase / allele_qc

Quality control for PomBase alleles
MIT License

Fix also modifications #10

Closed manulera closed 1 year ago

manulera commented 1 year ago

E.g. imp2 S346 > T346 (For kinases a fix could be suggested).

manulera commented 1 year ago

@ValWood some very good results after running the pipeline on the modification data. We have 60,271 modification annotations in total.

1409 errors found, of which 1142 fixed:
  - 1082 sequence errors
  - 60 syntax errors
  - 0 have several possible fixes, for those check the `change_sequence_position_to` field for "|" characters
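The check for ambiguous fixes can be sketched as below, assuming the pipeline output is a tab-separated file with a `change_sequence_position_to` column. The other column names and the example rows are hypothetical placeholders.

```python
import csv
import io


def rows_with_multiple_fixes(tsv_text: str, field: str = "change_sequence_position_to") -> list[dict]:
    """Return the rows whose fix field lists several candidates separated by "|"."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [row for row in reader if "|" in (row.get(field) or "")]


# Hypothetical example rows; only the last column name comes from the thread.
EXAMPLE_TSV = (
    "systematic_id\tsequence_position\tchange_sequence_position_to\n"
    "GENE_A\tS10\tS12\n"
    "GENE_B\tT20\tT22|T25\n"
)

for row in rows_with_multiple_fixes(EXAMPLE_TSV):
    print(row["systematic_id"], row["change_sequence_position_to"])
```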

Types of errors fixed:

old_coords_fix     1004
histone_fix          71
syntax_error         60
multi_shift_fix       7

Using the updated changelog, which covers the entire history of the genome (prior to the use of systematic_id and all), I was able to fix 1004 errors. We should discuss whether the multi-shift fixes make sense.

manulera commented 1 year ago

Mostly addressed in #21

ValWood commented 1 year ago

Excellent news! that's incredible.

manulera commented 1 year ago

@ValWood I downloaded the data that I originally used on the 14th of December. I have now re-run the analysis on the current data and picked up 95 more errors, all fixable using old coordinates. I see that most of them come from automatic assertions. I guess this means that whatever pipeline makes these predictions uses outdated coordinates (from UniProt, I guess). Full list attached.

dummy.tsv.zip

ValWood commented 1 year ago

There are 4 genes in this file:

| Systematic ID | Gene name | Product description |
|---|---|---|
| SPAPB1E7.05 | gde1 | glycerophosphoryl diester phosphodiesterase Gde1 |
| SPBC16E9.16c | lsd90 | Lsd90 protein |
| SPAC3G6.03c | ifs1 | Maf-like protein, nucleoside-triphosphate diphosphatase, human ASMTL ortholog |
| SPBP16F5.03c | tra1 | SAGA complex/ASTRA complex, phosphatidylinositol pseudokinase Tra1 |


tra1 updated here: https://github.com/pombase/curation/issues/3024

gde1 updated here: https://github.com/pombase/curation/issues/3384

Ifs1 and lsd90 have been updated, but not recently as far as I know?

lsd90
FT /controlled_curation="term=warning, gene structure
FT updated; cv=warning; db_xref=PMID:18079165; date=20090505"
FT /controlled_curation="term=warning, sequence error in
FT genomic data; db_xref=PMID:26615217; date=20170404"
FT /controlled_curation="term=warning, frameshifted;
FT cv=warning; db_xref=PMID:26615217; date=20170404"
FT /controlled_curation="term=warning, Chr_II:1948953!GA->G;
FT db_xref=PMID:26615217; date=20170404"
FT /controlled_curation="term=warning, Chr_II:1950050!A->AG;
FT db_xref=PMID:26615217; date=20170404"

ifs1
FT /controlled_curation="term=warning, gene structure
FT updated; db_xref=PMID:21511999; date=20110311"

And the phosphorylation data hasn't been edited at all since December.

manulera commented 1 year ago

Hi @ValWood, my mistake. For the first run I had used an outdated version of the genome, in which the changes to all those genes weren't there. Sorry about that.

Anyway, it means that since October we would have introduced that many errors by changing genome coordinates (lsd90 was changed then).

ifs1 was a silly mistake: there was no difference in errors between versions. The primary name had just been added, so the lines differed because of that.
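One way to avoid this kind of false difference is to compare runs on a few key columns rather than on whole lines. The sketch below is only an illustration; the column names and rows are hypothetical.

```python
def new_errors(old_rows: list[dict], new_rows: list[dict],
               key_fields: tuple[str, ...] = ("systematic_id", "sequence_position")) -> list[dict]:
    """Return rows of new_rows whose key columns do not appear in old_rows."""
    seen = {tuple(row[field] for field in key_fields) for row in old_rows}
    return [row for row in new_rows
            if tuple(row[field] for field in key_fields) not in seen]


# Hypothetical rows: GENE_A only gained a primary name, GENE_B is a new error.
OLD_RUN = [
    {"systematic_id": "GENE_A", "primary_name": "", "sequence_position": "S10"},
]
NEW_RUN = [
    {"systematic_id": "GENE_A", "primary_name": "abc1", "sequence_position": "S10"},
    {"systematic_id": "GENE_B", "primary_name": "xyz2", "sequence_position": "T20"},
]

for row in new_errors(OLD_RUN, NEW_RUN):
    print(row["systematic_id"], row["sequence_position"])
```

Comparing whole rows would flag both entries here, because the GENE_A row changed only in its name column.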

ValWood commented 1 year ago

OK these checks are good, and you picked up the 2 new changes OK.