pombase / allele_qc

Quality control for PomBase alleles
MIT License

Fix modifications #66

Closed manulera closed 1 year ago

manulera commented 1 year ago

Hi @kimrutherford, just summarising what we discussed in the call today. The pipeline generates files for protein modifications similar to those it generates for alleles, and they can be used to fix the modifications in the PHAF files and in Canto.

They are in this folder: https://github.com/pombase/allele_qc/tree/master/results

How to use the proposed fixes

For the fixing, the unique identifier of a fix is the combination of systematic_id, sequence_position and reference. The reference can probably be omitted, but let's use it just in case, because that's what the script uses as the unique identifier.
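A minimal sketch of how a fixing script might index the proposed fixes by that three-part key. The column names are the ones in the header of protein_modification_auto_fix.tsv; the function name `load_fixes` and the trimmed-down example table (only four of the columns) are assumptions made for illustration, not part of the pipeline.

```python
import csv
import io


def load_fixes(tsv_text):
    """Index fix rows by (systematic_id, sequence_position, reference)."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    fixes = {}
    for row in reader:
        key = (row["systematic_id"], row["sequence_position"], row["reference"])
        fixes[key] = row["change_sequence_position_to"]
    return fixes


# Trimmed example: only the columns needed for the lookup are included here.
example = (
    "systematic_id\tsequence_position\treference\tchange_sequence_position_to\n"
    "SPAC1834.04\tK4\tPB_REF:0000001\tK5\n"
    "SPAC1834.04\tK9\tPMID:14561399\tK10\n"
)

fixes = load_fixes(example)
print(fixes[("SPAC1834.04", "K4", "PB_REF:0000001")])
```

Keeping the reference in the key means the same position reported in two different papers is treated as two separate fixes.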

Important exceptions

This is important for when you write the script that applies the changes from protein_modification_auto_fix.tsv: the file has a column, solution_index, explained in https://github.com/pombase/canto/issues/2689#issue-1563394806. If this column has a value, it means that the pipeline found two possible solutions, and a decision has to be made.

TLDR: If there is a value in solution_index, do not apply the fix.

Another special case to take into account is described in #62. It can happen that someone has reported a modification on a residue that no longer exists in the current gene structure (probably assigned with a high-throughput pipeline). For those cases, I have set the value of the column change_sequence_position_to to ?. Right now we only have one example:

SPAC57A7.12 ssz1    MOD:00046   experimental evidence   S12 present_during(GO:0000087)  PMID:21712547   4896    2011-06-28  S12 ?   old_coords_fix, revision 8148: complement(join(1515089..1516663,1516789..1516914))

But more are likely to appear in the future. These can either be deleted, or kept in the knowledge that they have a sequence error.
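The two exceptions above could be checked per row along these lines. This is only a sketch: `classify_fix` and its return labels are hypothetical names, each row is assumed to be a dict keyed by the TSV column names, and the third example row is invented to show the solution_index case.

```python
def classify_fix(row):
    """Decide what to do with one fix row: 'ambiguous', 'unknown_position' or 'apply'."""
    # A value in solution_index means the pipeline found two possible solutions.
    if (row.get("solution_index") or "").strip():
        return "ambiguous"
    # "?" means the residue no longer exists in the current gene structure.
    if row["change_sequence_position_to"].strip() == "?":
        return "unknown_position"
    return "apply"


rows = [
    {"systematic_id": "SPAC57A7.12", "sequence_position": "S12",
     "change_sequence_position_to": "?", "solution_index": ""},
    {"systematic_id": "SPAC1834.04", "sequence_position": "K4",
     "change_sequence_position_to": "K5", "solution_index": ""},
    # Hypothetical row, invented to illustrate a non-empty solution_index.
    {"systematic_id": "HYPOTHETICAL.01", "sequence_position": "S10",
     "change_sequence_position_to": "S11", "solution_index": "0"},
]

for row in rows:
    print(row["systematic_id"], classify_fix(row))
```

Only rows classified as 'apply' would be changed automatically; the other two kinds need a manual decision.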

Related to #63

kimrutherford commented 1 year ago

I've now applied these changes to Canto. I'll apply the fixes to the modifications in SVN next.

08562810fce65760: changing SPCC622.08c PMID:20661445 MOD:00696 S128 to S129
08562810fce65760: changing SPAC19G12.06c PMID:20661445 MOD:00696 S127 to S128
08562810fce65760: changing SPCC622.08c PMID:20661445 MOD:00696 S128 to S129
08562810fce65760: changing SPCC622.08c PMID:20661445 MOD:00696 S128 to S129
08562810fce65760: changing SPAC19G12.06c PMID:20661445 MOD:00696 S127 to S128
08562810fce65760: changing SPAC19G12.06c PMID:20661445 MOD:00696 S127 to S128
08562810fce65760: changing SPCC622.08c PMID:20661445 MOD:00696 S128 to S129
08562810fce65760: changing SPAC19G12.06c PMID:20661445 MOD:00696 S127 to S128
08562810fce65760: changing SPCC622.08c PMID:20661445 MOD:00696 S128 to S129
08562810fce65760: changing SPAC19G12.06c PMID:20661445 MOD:00696 S127 to S128
08562810fce65760: changing SPCC622.08c PMID:20661445 MOD:00696 S128 to S129
08562810fce65760: changing SPCC622.08c PMID:20661445 MOD:00696 S128 to S129
08562810fce65760: changing SPAC19G12.06c PMID:20661445 MOD:00696 S127 to S128
08562810fce65760: changing SPAC19G12.06c PMID:20661445 MOD:00696 S127 to S128
08562810fce65760: changing SPCC622.08c PMID:20661445 MOD:00696 S128 to S129
28333b01f58bc586: changing SPBC4F6.12 PMID:34133210 MOD:00696 S3, S24, S31, T55, T64, S67, S97, S136, T214 to S3,S24,S31,T55,T64,S67,S97,S136,T214
521475f7c063d784: changing SPBC16G5.15c PMID:18235227 MOD:00046 S321A to S321
5339c3839d6a7634: changing SPAC1834.04 PMID:20299449 MOD:00723 K4 to K5
5339c3839d6a7634: changing SPAC1834.04 PMID:20299449 MOD:00723 K4 to K5
5339c3839d6a7634: changing SPAC1834.04 PMID:20299449 MOD:00723 K9 to K10
5339c3839d6a7634: changing SPAC1834.04 PMID:20299449 MOD:00723 K56 to K57
536dc2e074eee139: changing SPCC4E9.01c PMID:25993311 MOD:00696 S10|S22|S43|S150|S439|S496 to S10,S22,S43,S150,S439,S496
536dc2e074eee139: changing SPCC4E9.01c PMID:25993311 MOD:00696 T60|T70 to T60,T70
767451d8f8ef6abe: changing SPAC6G9.08 PMID:21182284 MOD:00046 S129 to S130                                                                                    
767451d8f8ef6abe: changing SPAC6G9.08 PMID:21182284 MOD:00046 S133 to S134                                                                                    
767451d8f8ef6abe: changing SPAC6G9.08 PMID:21182284 MOD:00046 S359 to S360                                                                                    
767451d8f8ef6abe: changing SPAC6G9.08 PMID:21182284 MOD:00047 T143 to T144                                                                                    
7bf1fc1e6f06a613: changing SPCC338.08 PMID:33836577 MOD:00047 T89, T154, T155 to T89,T154,T155                                                                
7bf1fc1e6f06a613: changing SPCC338.08 PMID:33836577 MOD:00046 S77, S151 to S77,S151                                                                           
884c35ae47e3fec8: changing SPBC1A4.03c PMID:30635402 MOD:00046 S1363, S1364 to S1363,S1364                                                                    
99f58cdf989ca814: changing SPCC622.08c PMID:19965387 MOD:00046 S121 to S122                                                                                   
99f58cdf989ca814: changing SPAC19G12.06c PMID:19965387 MOD:00046 S121 to S122
9b5edbe6f0efcb45: changing SPAC1834.04 PMID:17369611 MOD:00723 K56 to K57
9b5edbe6f0efcb45: changing SPBC8D2.04 PMID:17369611 MOD:00723 K56 to K57
9b5edbe6f0efcb45: changing SPBC1105.11c PMID:17369611 MOD:00723 K56 to K57
9d9a265db15a87cd: changing SPBP23A10.10 PMID:27191590 MOD:00696 S630, S632 to S630,S632                                                                      
a09af17a2956146d: changing SPAC1834.04 PMID:31468675 MOD:01148 K14 to K15
a09af17a2956146d: changing SPBC8D2.04 PMID:31468675 MOD:01148 K14 to K15
a09af17a2956146d: changing SPBC1105.11c PMID:31468675 MOD:01148 K14 to K15
b2ae716b0ad7c3cb: changing SPCC338.17c PMID:28438891 MOD:00046 S163, S164,S165, S174, S209, S216, S219,S223, S226, S444, S507, S544, S545, S553 to S163,S164,S165,S174,S209,S216,S219,S223,S226,S444,S507,S544,S545,S553
c0af69aa51ff9eff: changing SPAC1834.04 PMID:11792803 MOD:00046 S10 to S11
c0af69aa51ff9eff: changing SPBC8D2.04 PMID:11792803 MOD:00046 S10 to S11
c0af69aa51ff9eff: changing SPBC1105.11c PMID:11792803 MOD:00046 S10 to S11
d62844597282017d: changing SPBC14C8.07c PMID:9353247 MOD:00047 T10A to T10
d62844597282017d: changing SPBC14C8.07c PMID:9353247 MOD:00047 T46A to T46
d62844597282017d: changing SPBC14C8.07c PMID:9353247 MOD:00047 T60A to T60
d62844597282017d: changing SPBC14C8.07c PMID:9353247 MOD:00047 T104A to T104
d62844597282017d: changing SPBC14C8.07c PMID:9353247 MOD:00047 T134A to T134
d62844597282017d: changing SPBC14C8.07c PMID:9353247 MOD:00047 T374A to T374
db3533d819cff33d: changing SPCC162.07 PMID:23297348 MOD:00046 S220 to S216
dd0b314b0bd84119: changing SPCC962.02c PMID:20739936 MOD:00696 S202A to S202
dd0b314b0bd84119: changing SPCC962.02c PMID:20739936 MOD:00696 S229A to S229
dd0b314b0bd84119: changing SPCC962.02c PMID:20739936 MOD:00696 S244A to S244
dd0b314b0bd84119: changing SPCC962.02c PMID:20739936 MOD:00696 S278A to S278
dd0b314b0bd84119: changing SPCC962.02c PMID:20739936 MOD:00696 S294A to S294
dd0b314b0bd84119: changing SPCC962.02c PMID:20739936 MOD:00696 T393A to T393
dd0b314b0bd84119: changing SPCC962.02c PMID:20739936 MOD:00696 T831A to T831
dd0b314b0bd84119: changing SPCC962.02c PMID:20739936 MOD:00696 T908A to T908
e68d23abf86a3c7c: changing SPAC17G8.10c PMID:34674264 MOD:00046 S4, S20, S166, S251, S266 to S4,S20,S166,S251,S266                                           
e865b65eeb6f06b0: changing SPAC1834.04 PMID:29136238 MOD:00696 Y41 to Y42
e865b65eeb6f06b0: changing SPBC8D2.04 PMID:29136238 MOD:00696 Y41 to Y42
e865b65eeb6f06b0: changing SPBC1105.11c PMID:29136238 MOD:00696 Y41 to Y42
f30149c5fcc7f553: changing SPAC17G8.10c PMID:29975113 MOD:01148 K3, K26, K54, K82, K124, K164, K174, K237, K262 to K3,K26,K54,K82,K124,K164,K174,K237,K262
f45b7c9c20201a38: changing SPAC20H4.06c PMID:36361590 MOD:00046 S239, S308, S312 to S239,S308,S312
f45b7c9c20201a38: changing SPCC188.11 PMID:36361590 MOD:00046 S228, S236 to S228,S236
f45b7c9c20201a38: changing SPAC4D7.03 PMID:36361590 MOD:00047 T657, T666, T669 to T657,T666,T669
f7e6c33889ea1fa0: changing SPAC11E3.03 PMID:20935472 MOD:00696 S47 to S87
fe6e8e353ea78411: changing SPAP8A3.08 PMID:10364209 MOD:00046 S2A to S2
fe6e8e353ea78411: changing SPAP8A3.08 PMID:10364209 MOD:00046 S6A to S6

kimrutherford commented 1 year ago

I'll apply the fixes to the modifications in SVN next.

That's done too now. I'll check Chado after tomorrow's load.

Edit - these changes were made:

skipping change where new position is unknown: SPAC57A7.12 MOD:00046 S12->? PMID:21712547
external_data/modification_files/PMID_21712547_modifications.tsv: changing SPAC57A7.12 MOD:00046 S500 to S484
external_data/modification_files/PMID_29996109_modifications.tsv: changing SPBC25B2.07c MOD:00046 S200 to S184
external_data/modification_files/PMID_29996109_modifications.tsv: changing SPBC25B2.07c MOD:00046 S202 to S186
external_data/modification_files/PMID_29996109_modifications.tsv: changing SPBC25B2.07c MOD:00046 S212 to S196
external_data/modification_files/PMID_29996109_modifications.tsv: changing SPBC25B2.07c MOD:00046 S224 to S208
external_data/modification_files/PMID_29996109_modifications.tsv: changing SPBC25B2.07c MOD:00046 S229 to S213
external_data/modification_files/PMID_29996109_modifications.tsv: changing SPBC25B2.07c MOD:00046 S232 to S216
external_data/modification_files/PMID_29996109_modifications.tsv: changing SPBC25B2.07c MOD:00046 S237 to S221
external_data/modification_files/PMID_29996109_modifications.tsv: changing SPBC25B2.07c MOD:00046 S239 to S223
external_data/modification_files/PMID_29996109_modifications.tsv: changing SPBC25B2.07c MOD:00046 S309 to S293
external_data/modification_files/PMID_29996109_modifications.tsv: changing SPBC25B2.07c MOD:00046 S316 to S300
external_data/modification_files/PMID_29996109_modifications.tsv: changing SPBC25B2.07c MOD:00046 S337 to S321
external_data/modification_files/PMID_29996109_modifications.tsv: changing SPBC25B2.07c MOD:00046 S345 to S329
external_data/modification_files/PMID_29996109_modifications.tsv: changing SPBC25B2.07c MOD:00046 S354 to S338
external_data/modification_files/PMID_29996109_modifications.tsv: changing SPBC25B2.07c MOD:00046 S376 to S360
external_data/modification_files/PMID_29996109_modifications.tsv: changing SPBC25B2.07c MOD:00046 S381 to S365
external_data/modification_files/PMID_29996109_modifications.tsv: changing SPBC25B2.07c MOD:00046 S383 to S367
external_data/modification_files/PMID_29996109_modifications.tsv: changing SPBC25B2.07c MOD:00046 S409 to S393
external_data/modification_files/PMID_29996109_modifications.tsv: changing SPBC25B2.07c MOD:00046 S411 to S395
external_data/modification_files/PMID_29996109_modifications.tsv: changing SPBC25B2.07c MOD:00046 S419 to S403
external_data/modification_files/PMID_29996109_modifications.tsv: changing SPBC25B2.07c MOD:00046 S426 to S410
external_data/modification_files/PMID_29996109_modifications.tsv: changing SPBC25B2.07c MOD:00046 S445 to S429
external_data/modification_files/PMID_29996109_modifications.tsv: changing SPBC25B2.07c MOD:00046 S447 to S431
external_data/modification_files/PMID_29996109_modifications.tsv: changing SPBC25B2.07c MOD:00046 S455 to S439
external_data/modification_files/PMID_29996109_modifications.tsv: changing SPBC25B2.07c MOD:00047 T129 to T113
external_data/modification_files/PMID_29996109_modifications.tsv: changing SPBC25B2.07c MOD:00047 T257 to T241
external_data/modification_files/PMID_29996109_modifications.tsv: changing SPBC25B2.07c MOD:00047 T352 to T336
external_data/modification_files/PMID_29996109_modifications.tsv: changing SPBC25B2.07c MOD:00047 T375 to T359
external_data/modification_files/PMID_29996109_modifications.tsv: changing SPBC25B2.07c MOD:00047 T439 to T423
external_data/modification_files/PMID_30726745_modifications.tsv: changing SPAC3H1.05 MOD:00046 S440 to S410
external_data/modification_files/PMID_33823663_modifications.tsv: changing SPBC25B2.07c MOD:00046 S202 to S186
external_data/modification_files/PMID_33823663_modifications.tsv: changing SPBC25B2.07c MOD:00046 S212 to S196
external_data/modification_files/PMID_33823663_modifications.tsv: changing SPBC25B2.07c MOD:00046 S239 to S223
external_data/modification_files/PMID_33823663_modifications.tsv: changing SPBC25B2.07c MOD:00046 S309 to S293
external_data/modification_files/PMID_33823663_modifications.tsv: changing SPBC25B2.07c MOD:00046 S337 to S321
external_data/modification_files/PMID_33823663_modifications.tsv: changing SPBC25B2.07c MOD:00046 S345 to S329
external_data/modification_files/PMID_33823663_modifications.tsv: changing SPBC25B2.07c MOD:00046 S354 to S338
external_data/modification_files/PMID_33823663_modifications.tsv: changing SPBC25B2.07c MOD:00046 S381 to S365
external_data/modification_files/PMID_33823663_modifications.tsv: changing SPBC25B2.07c MOD:00046 S383 to S367
external_data/modification_files/PMID_33823663_modifications.tsv: changing SPBC25B2.07c MOD:00046 S434 to S418
external_data/modification_files/PMID_33823663_modifications.tsv: changing SPBC25B2.07c MOD:00047 T129 to T113
external_data/modification_files/PMID_33823663_modifications.tsv: changing SPBC25B2.07c MOD:00047 T257 to T241
external_data/modification_files/PMID_33823663_modifications.tsv: changing SPBC25B2.07c MOD:00047 T352 to T336
external_data/modification_files/PMID_33823663_modifications.tsv: changing SPBC25B2.07c MOD:00047 T375 to T359

manulera commented 1 year ago

Hi @kimrutherford, it seems like most of them went through, except for a few: the one with the "?" (expected), but also some histone_fix ones.

https://github.com/pombase/allele_qc/blob/master/results/protein_modification_auto_fix.tsv

systematic_id   primary_name    modification    evidence    sequence_position   annotation_extension    reference   taxon   date    sequence_error  change_sequence_position_to auto_fix_comment    solution_index
SPAC1834.03c    hhf1    MOD:00663   Inferred from Sequence or Structural Similarity K20     PB_REF:0000001  4896    2010-03-11  K20 K21 histone_fix 
SPAC1834.04 hht1    MOD:00663       K14 present_during(GO:0031508)  PMID:14561399   4896    2010-03-11  K14 K15 histone_fix 
SPAC1834.04 hht1    MOD:00663   Inferred from Sequence or Structural Similarity K4      PB_REF:0000001  4896    2010-03-11  K4  K5  histone_fix 
SPAC1834.04 hht1    MOD:00663       K9  present_during(GO:0031508)  PMID:14561399   4896    2010-03-11  K9  K10 histone_fix 
SPAC57A7.12 ssz1    MOD:00046   experimental evidence   S12 present_during(GO:0000087)  PMID:21712547   4896    2011-06-28  S12 ?   old_coords_fix, revision 8148: complement(join(1515089..1516663,1516789..1516914))  
SPBC1105.11c    hht3    MOD:00427   Inferred from Sequence or Structural Similarity K4      PB_REF:0000001  4896    2010-03-11  K4  K5  histone_fix 
SPBC1105.11c    hht3    MOD:00427   Inferred from Sequence or Structural Similarity K9      PB_REF:0000001  4896    2010-03-11  K9  K10 histone_fix 
SPBC1105.12 hhf3    MOD:00427   Inferred from Sequence or Structural Similarity K20     PB_REF:0000001  4896    2010-03-11  K20 K21 histone_fix 
SPBC8D2.03c hhf2    MOD:00427   Inferred from Sequence or Structural Similarity K20     PB_REF:0000001  4896    2010-03-11  K20 K21 histone_fix 
SPBC8D2.04  hht2    MOD:00427   Inferred from Sequence or Structural Similarity K4      PB_REF:0000001  4896    2010-03-11  K4  K5  histone_fix 
SPBC8D2.04  hht2    MOD:00427   Inferred from Sequence or Structural Similarity K9      PB_REF:0000001  4896    2010-03-11  K9  K10 histone_fix 
SPCC622.09  htb1    MOD:01148   Inferred from Direct Assay  K119        PMID:17374714   4896    2007-07-16  K119    K120    histone_fix 

manulera commented 1 year ago

I think I see why:

SPCC622.09  htb1    MOD:01148   Inferred from Direct Assay  K119        PMID:17374714   4896    2007-07-16  K119    K120    histone_fix 

manulera commented 1 year ago

Hi @kimrutherford, as I said today, some new alleles have appeared in the allele list that did not exist before, for instance:

SPAC13C5.03 D543->stop  tht1    tht1-D543*      nonsense mutation   PMID:9442101

The reason it did not appear before is that this allele has no annotations in Canto, so it was dropped and not run through the previous pipeline. Not sure how we want to handle that, maybe you can filter that list before exporting it. I am pretty sure there is a lot of garbage on alleles without annotations.

Also, the mysterious unfixed modification K119 might be related to the ones mentioned in https://github.com/pombase/allele_qc/issues/83?

kimrutherford commented 1 year ago

Not sure how we want to handle that, maybe you can filter that list before exporting it. I am pretty sure there is a lot of garbage on alleles without annotations.

Hi Manu. The Canto allele export file has an "annotation_count" column. Could you ignore alleles where that column is zero?

manulera commented 1 year ago

The Canto allele export file has an "anno

Yes, I can use that to filter them out.

manulera commented 1 year ago

Hi @kimrutherford, I did this in https://github.com/pombase/allele_qc/commit/3ef4c14acd29ccfed893cdf5a39073c1ebe31f38

I am not simply removing the alleles that have zero annotations in the Canto file, in case an allele has no annotations in Canto but does have annotations in the PHAF files. See below to check that it makes sense:

https://github.com/pombase/allele_qc/blob/master/filter_alleles_pombase.py
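For readers without the script at hand, the filtering rule described above can be sketched as follows. This is not the content of filter_alleles_pombase.py: the function name, the `allele_name` key and the use of (systematic_id, allele_name) to match PHAF alleles are assumptions, and the two HYPOTHETICAL rows are invented. Only the annotation_count column comes from the Canto export as described in this thread.

```python
def filter_alleles(canto_alleles, phaf_allele_keys):
    """Keep alleles annotated in Canto, plus unannotated ones that appear in PHAF."""
    kept = []
    for allele in canto_alleles:
        key = (allele["systematic_id"], allele["allele_name"])
        if int(allele["annotation_count"]) > 0 or key in phaf_allele_keys:
            kept.append(allele)
    return kept


canto_alleles = [
    # No annotations in Canto and not in PHAF: dropped.
    {"systematic_id": "SPAC13C5.03", "allele_name": "tht1-D543*", "annotation_count": "0"},
    # Hypothetical allele with Canto annotations: kept.
    {"systematic_id": "HYPOTHETICAL.01", "allele_name": "abc1-K10A", "annotation_count": "3"},
    # Hypothetical allele with no Canto annotations but present in PHAF: kept.
    {"systematic_id": "HYPOTHETICAL.02", "allele_name": "xyz1-S5A", "annotation_count": "0"},
]
phaf_allele_keys = {("HYPOTHETICAL.02", "xyz1-S5A")}

kept = filter_alleles(canto_alleles, phaf_allele_keys)
print([allele["allele_name"] for allele in kept])
```

With this rule, an allele such as tht1-D543* is excluded only when it has no annotations in either source.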

manulera commented 1 year ago

I think we can close this one