glygener / glygen-issues

Repository for public GlyGen tickets
GNU General Public License v3.0

OGlcNAc sites with non-numeric position number #1407

Closed kmartinez834 closed 1 month ago

kmartinez834 commented 4 months ago

We need to provide instructions to Robel to deal with the following issue:

OGlcNAc MCW source files have rows with sites in this format: "T291 (P05067-4);T292 (P05067-4);T576 (P05067-4)". The QC script is fine with the ";" delimiter but rejects these rows (flag aa_pos_non_numeric) because each site also includes the bracketed accession, e.g. "(P05067-4)".
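For illustration only, here is a minimal Python sketch of how a site string in this format could be split so that the numeric position and the bracketed accession end up in separate fields. The function name `parse_oglcnac_sites` and the regular expression are assumptions made for this example; they are not taken from the actual GlyGen scripts.

```python
import re

# Hypothetical pattern: one-letter residue, numeric position,
# then an optional bracketed accession such as "(P05067-4)".
SITE_RE = re.compile(r"^\s*([A-Z])(\d+)\s*(?:\(([^)]+)\))?\s*$")


def parse_oglcnac_sites(cell):
    """Split an oglcnac_sites cell into (residue, position, accession) tuples.

    accession is None when the site carries no bracketed part.
    """
    sites = []
    for token in cell.split(";"):
        if not token.strip():
            continue  # skip empty pieces, e.g. from a trailing ";"
        match = SITE_RE.match(token)
        if match is None:
            raise ValueError(f"unrecognized site token: {token!r}")
        residue, position, accession = match.groups()
        sites.append((residue, int(position), accession))
    return sites


if __name__ == "__main__":
    cell = "T291 (P05067-4);T292 (P05067-4);T576 (P05067-4)"
    for residue, position, accession in parse_oglcnac_sites(cell):
        print(residue, position, accession)
```

Keeping the accession as its own field, rather than discarding it, would also let a later step tell which isoform each position refers to.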

$ head /data/projects/glygen/generated/datasets/logs/human_proteoform_glycosylation_sites_o_glcnac_mcw.log 
uniprotkb_canonical_ac","glycosylation_site_uniprotkb","amino_acid","saccharide","glycosylation_type","xref_key","xref_id","src_xref_key","src_xref_id","uniprotkb_id","entry_name","organism","full_name","oglcnacscore","oglcnac_sites","phosphorylation_sites","pmids","sequence","eco_id","carb_name","glycosylation_subtype","status","uniprotkb_id","gene_name","recommended_name_full","peptide","filter_flags
"P05067-1","291 P05067-4","Thr","G49108TO","O-linked","protein_xref_oglcnac_db","P05067","protein_xref_oglcnac_db","P05067","P05067","A4_HUMAN","Homo sapiens","Amyloid-beta precursor protein","17.731082115423458","T291 (P05067-4);T292 (P05067-4);T576 (P05067-4)","S198;S206;S441;T497;T729;S730;T743;T757","31156159;34019948;21182826;28624365","MLPGLALLLLAAWTARALEVPTDGNAGLLAEPQIAMFCGRLNMHMNVQNGKWDSDPSGTKTCIDTKEGILQYCQEVYPELQITNVVEANQPVTIQNWCKRGRKQCKTHPHFVIPYRCLVGEFVSDALLVPDKCKFLHQERMDVCETHLHWHTVAKETCSEKSTNLHDYGMLLPCGIDKFRGVEFVCCPLAEESDNVDSADAEEDDSDVWWGGADTDYADGSEDKVVEVAEEEEVAEVEEEEADDDEDDEDGDEVEEEAEEPYEEATERTTSIATTTTTTTESVEEVVREVCSEQAETGPCRAMISRWYFDVTEGKCAPFFYGGCGGNRNNFDTEEYCMAVCGSAMSQSLLKTTQEPLARDPVKLPTTAASTPDAVDKYLETPGDENEHAHFQKAKERLEAKHRERMSQVMREWEEAERQAKNLPKADKKAVIQHFQEKVESLEQEAANERQQLVETHMARVEAMLNDRRRLALENYITALQAVPPRPRHVFNMLKKYVRAEQKDRQHTLKHFEHVRMVDPKKAAQIRSQVMTHLRVIYERMNQSLSLLYNVPAVAEEIQDEVDELLQKEQNYSDDVLANMISEPRISYGNDALMPSLTETKTTVELLPVNGEFSLDDLQPWHSFGADSVPANTENEVEPVDARPAADRGLTTRPGSGLTNIKTEEISEVKMDAEFRHDSGYEVHHQKLVFFAEDVGSNKGAIIGLMVGGVVIATVIVITLVMLKKKQYTSIHHGVVEVDAAVTPEERHLSKMQQNGYENPTYKFFEQMQN","ECO_0000269","GlcNac","O-GlcNAcylation","reviewed","A4_HUMAN","APP","Amyloid-beta precursor protein","","aa_pos_non_numeric"
katewarner commented 3 months ago

@rykahsay Please see Karina's comment above.

Please update your script for generating the "glycosylation_site_uniprotkb" column in all the MCW datasets so that it excludes or ignores the bracketed accession (AC) in the "oglcnac_sites" column of the MCW downloads.

For rows with sites in this format, "T291 (P05067-4);T292 (P05067-4);T576 (P05067-4)", it currently produces this:

uniprotkb_canonical_ac","glycosylation_site_uniprotkb","amino_acid","saccharide","glycosylation_type","xref_key","xref_id","src_xref_key","src_xref_id","uniprotkb_id","entry_name","organism","full_name","oglcnacscore","oglcnac_sites","phosphorylation_sites","pmids","sequence","eco_id","carb_name","glycosylation_subtype","status","uniprotkb_id","gene_name","recommended_name_full","peptide","filter_flags
"P05067-1","291 P05067-4","Thr","G49108TO","O-linked","protein_xref_oglcnac_db","P05067","protein_xref_oglcnac_db","P05067","P05067","A4_HUMAN","Homo sapiens","Amyloid-beta precursor protein","17.731082115423458","T291 (P05067-4);T292 (P05067-4);T576 (P05067-4)","S198;S206;S441;T497;T729;S730;T743;T757","31156159;34019948;21182826;28624365","MLPGLALLLLAAWTARALEVPTDGNAGLLAEPQIAMFCGRLNMHMNVQNGKWDSDPSGTKTCIDTKEGILQYCQEVYPELQITNVVEANQPVTIQNWCKRGRKQCKTHPHFVIPYRCLVGEFVSDALLVPDKCKFLHQERMDVCETHLHWHTVAKETCSEKSTNLHDYGMLLPCGIDKFRGVEFVCCPLAEESDNVDSADAEEDDSDVWWGGADTDYADGSEDKVVEVAEEEEVAEVEEEEADDDEDDEDGDEVEEEAEEPYEEATERTTSIATTTTTTTESVEEVVREVCSEQAETGPCRAMISRWYFDVTEGKCAPFFYGGCGGNRNNFDTEEYCMAVCGSAMSQSLLKTTQEPLARDPVKLPTTAASTPDAVDKYLETPGDENEHAHFQKAKERLEAKHRERMSQVMREWEEAERQAKNLPKADKKAVIQHFQEKVESLEQEAANERQQLVETHMARVEAMLNDRRRLALENYITALQAVPPRPRHVFNMLKKYVRAEQKDRQHTLKHFEHVRMVDPKKAAQIRSQVMTHLRVIYERMNQSLSLLYNVPAVAEEIQDEVDELLQKEQNYSDDVLANMISEPRISYGNDALMPSLTETKTTVELLPVNGEFSLDDLQPWHSFGADSVPANTENEVEPVDARPAADRGLTTRPGSGLTNIKTEEISEVKMDAEFRHDSGYEVHHQKLVFFAEDVGSNKGAIIGLMVGGVVIATVIVITLVMLKKKQYTSIHHGVVEVDAAVTPEERHLSKMQQNGYENPTYKFFEQMQN","ECO_0000269","GlcNac","O-GlcNAcylation","reviewed","A4_HUMAN","APP","Amyloid-beta precursor protein","","aa_pos_non_numeric"

But it should look like this:

uniprotkb_canonical_ac","glycosylation_site_uniprotkb","amino_acid","saccharide","glycosylation_type","xref_key","xref_id","src_xref_key","src_xref_id","uniprotkb_id","entry_name","organism","full_name","oglcnacscore","oglcnac_sites","phosphorylation_sites","pmids","sequence","eco_id","carb_name","glycosylation_subtype","status","uniprotkb_id","gene_name","recommended_name_full","peptide","filter_flags
"P05067-1","291","Thr","G49108TO","O-linked","protein_xref_oglcnac_db","P05067","protein_xref_oglcnac_db","P05067","P05067","A4_HUMAN","Homo sapiens","Amyloid-beta precursor protein","17.731082115423458","T291 (P05067-4);T292 (P05067-4);T576 (P05067-4)","S198;S206;S441;T497;T729;S730;T743;T757","31156159;34019948;21182826;28624365","MLPGLALLLLAAWTARALEVPTDGNAGLLAEPQIAMFCGRLNMHMNVQNGKWDSDPSGTKTCIDTKEGILQYCQEVYPELQITNVVEANQPVTIQNWCKRGRKQCKTHPHFVIPYRCLVGEFVSDALLVPDKCKFLHQERMDVCETHLHWHTVAKETCSEKSTNLHDYGMLLPCGIDKFRGVEFVCCPLAEESDNVDSADAEEDDSDVWWGGADTDYADGSEDKVVEVAEEEEVAEVEEEEADDDEDDEDGDEVEEEAEEPYEEATERTTSIATTTTTTTESVEEVVREVCSEQAETGPCRAMISRWYFDVTEGKCAPFFYGGCGGNRNNFDTEEYCMAVCGSAMSQSLLKTTQEPLARDPVKLPTTAASTPDAVDKYLETPGDENEHAHFQKAKERLEAKHRERMSQVMREWEEAERQAKNLPKADKKAVIQHFQEKVESLEQEAANERQQLVETHMARVEAMLNDRRRLALENYITALQAVPPRPRHVFNMLKKYVRAEQKDRQHTLKHFEHVRMVDPKKAAQIRSQVMTHLRVIYERMNQSLSLLYNVPAVAEEIQDEVDELLQKEQNYSDDVLANMISEPRISYGNDALMPSLTETKTTVELLPVNGEFSLDDLQPWHSFGADSVPANTENEVEPVDARPAADRGLTTRPGSGLTNIKTEEISEVKMDAEFRHDSGYEVHHQKLVFFAEDVGSNKGAIIGLMVGGVVIATVIVITLVMLKKKQYTSIHHGVVEVDAAVTPEERHLSKMQQNGYENPTYKFFEQMQN","ECO_0000269","GlcNac","O-GlcNAcylation","reviewed","A4_HUMAN","APP","Amyloid-beta precursor protein","","aa_pos_non_numeric"
rykahsay commented 1 month ago

Fixed, but the row is still rejected: the reported site is on P05067-4, so it is flagged aa_mismatch against P05067-1:

$ cat logs/human_proteoform_glycosylation_sites_oglcnac_mcw.log | grep P05067 | grep "\"291\"" | head -1
"P05067-1","291","Thr","G49108TO","O-linked","protein_xref_pubmed","28624365","protein_xref_oglcnac_db","P05067","P05067","A4_HUMAN","Homo sapiens","Amyloid-beta precursor protein","26.706452997372033","T291 (P05067-4);T292 (P05067-4);T576 (P05067-4)","S198;S206;S441;T497;T729;S730;T743;T757","28624365;21182826;34019948;38665916;31156159","MLPGLALLLLAAWTARALEVPTDGNAGLLAEPQIAMFCGRLNMHMNVQNGKWDSDPSGTKTCIDTKEGILQYCQEVYPELQITNVVEANQPVTIQNWCKRGRKQCKTHPHFVIPYRCLVGEFVSDALLVPDKCKFLHQERMDVCETHLHWHTVAKETCSEKSTNLHDYGMLLPCGIDKFRGVEFVCCPLAEESDNVDSADAEEDDSDVWWGGADTDYADGSEDKVVEVAEEEEVAEVEEEEADDDEDDEDGDEVEEEAEEPYEEATERTTSIATTTTTTTESVEEVVREVCSEQAETGPCRAMISRWYFDVTEGKCAPFFYGGCGGNRNNFDTEEYCMAVCGSAMSQSLLKTTQEPLARDPVKLPTTAASTPDAVDKYLETPGDENEHAHFQKAKERLEAKHRERMSQVMREWEEAERQAKNLPKADKKAVIQHFQEKVESLEQEAANERQQLVETHMARVEAMLNDRRRLALENYITALQAVPPRPRHVFNMLKKYVRAEQKDRQHTLKHFEHVRMVDPKKAAQIRSQVMTHLRVIYERMNQSLSLLYNVPAVAEEIQDEVDELLQKEQNYSDDVLANMISEPRISYGNDALMPSLTETKTTVELLPVNGEFSLDDLQPWHSFGADSVPANTENEVEPVDARPAADRGLTTRPGSGLTNIKTEEISEVKMDAEFRHDSGYEVHHQKLVFFAEDVGSNKGAIIGLMVGGVVIATVIVITLVMLKKKQYTSIHHGVVEVDAAVTPEERHLSKMQQNGYENPTYKFFEQMQN","ECO_0000269","GlcNac","O-GlcNAcylation","reviewed","A4_HUMAN","APP","Amyloid-beta precursor protein","","aa_mismatch"
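As a rough sketch of why a position reported on one isoform can fail against another, the check below compares the residue at a 1-based position in a sequence with the amino acid named in the row. The function `check_site`, the `THREE_TO_ONE` table, and the `aa_pos_out_of_range` flag are hypothetical names for this example; only the `aa_mismatch` flag appears in the log above, and the real QC logic may differ.

```python
# Three-letter to one-letter codes for a few residues (illustrative subset).
THREE_TO_ONE = {"Ser": "S", "Thr": "T", "Tyr": "Y", "Asn": "N"}


def check_site(sequence, position, amino_acid):
    """Return "" if the residue matches, otherwise a flag name.

    position is 1-based, as in the glycosylation_site_uniprotkb column.
    """
    if position < 1 or position > len(sequence):
        return "aa_pos_out_of_range"  # hypothetical flag name
    expected = THREE_TO_ONE.get(amino_acid)
    if sequence[position - 1] != expected:
        return "aa_mismatch"  # flag name as seen in the log
    return ""


if __name__ == "__main__":
    toy_sequence = "MSTA"  # toy sequence, not a real protein
    print(repr(check_site(toy_sequence, 3, "Thr")))  # residue 3 is T
    print(repr(check_site(toy_sequence, 2, "Thr")))  # residue 2 is S
```

Because isoforms can differ by inserted or deleted segments, a position that is correct in the isoform's own numbering may point at a different residue in the canonical sequence, which is the situation such a check would report.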
katewarner commented 1 month ago

@rykahsay It looks like this is related to #1641, so I will close this ticket while we assess what to do with AA mismatches. I can reopen it if you think I should.