Investigate why the EMBL datasets are failing global QC checks

katewarner commented 1 month ago

2,pos_out_of_seq_range,human_proteoform_glycosylation_sites_embl
32,pos_out_of_seq_range,mouse_proteoform_glycosylation_sites_embl
365,aa_mismatch,mouse_proteoform_glycosylation_sites_embl
50,aa_mismatch,human_proteoform_glycosylation_sites_embl
242,o_glycan_aa_mismatch,mouse_proteoform_glycosylation_sites_embl
142,o_glycan_aa_mismatch,human_proteoform_glycosylation_sites_embl

katewarner commented 1 month ago

I'm looking at the global QC flag logs for the human_proteoform_glycosylation_sites_embl.csv and the mouse_proteoform_glycosylation_sites_embl.csv datasets. Most of the flags seem correct, and the issue (at least for the "aa_mismatch" and "pos_out_of_seq_range" flags) appears to be a problem on their side, so I will be contacting them about it. However, while checking the "pos_out_of_seq_range" flags in the qc/logs/mouse_proteoform_glycosylation_sites_embl.csv file, I found this line, which appears to be a problem with our dataset because we map to the canonical AC, not the isoform:

`qc/logs/mouse_proteoform_glycosylation_sites_embl.csv`
"A2RS43-1","1130","Asn","G31852PQ","N-linked","protein_xref_doi","10.1101/2023.09.13.557529","protein_xref_glygen_ds","GLY_000889","UBERON:0000955","brain","","","1130","1130","N","N","TTPDGSVGEAEHMENDSR","high mannose","Pcdh7","HexNAc(2)Hex(7)","1540.5285","","","pos_out_of_seq_range"

But in the source file /embl/current/glygen_upload.csv the UniProt AC for this glycosite is a different isoform, and if you use "A0A0A6YY83" then the glycosylation site is correct:

"A0A0A6YY83","Pcdh7","1130","Asn","N-linked","1116","1133","TTPDGSVGEAEHMENDSR","10090","mus musculus","HexNAc(2)Hex(7) % 1540.5285","1540.5285","high mannose","","UBERON:0000955","brain","","","https://www.biorxiv.org/content/10.1101/2023.09.13.557529v1.full"

In this case both isoforms are unreviewed, but A2RS43 has a higher sequence similarity to the human reviewed entry. However, there are human unreviewed entries that have high sequence similarity to A0A0A6YY83 and evidence at the protein level, but they have not been merged into the reviewed entry.
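For reference, the two flags discussed here can be illustrated with a minimal sketch. This is not the actual global QC code; the function name `qc_flag`, the small residue table, and the toy sequences are all assumptions made for illustration. It only shows why the same position can be out of range on one accession and valid on another.

```python
from typing import Optional

# Three-letter to one-letter codes for a few residues (illustrative, not a full table).
THREE_TO_ONE = {"Asn": "N", "Ser": "S", "Thr": "T"}


def qc_flag(sequence: str, position: int, residue: str) -> Optional[str]:
    """Return a flag name, or None if the site is consistent with the sequence.

    Hypothetical helper: positions are 1-based, as in the dataset rows.
    """
    expected = THREE_TO_ONE.get(residue, residue)
    if position < 1 or position > len(sequence):
        return "pos_out_of_seq_range"
    if sequence[position - 1] != expected:
        return "aa_mismatch"
    return None


# Made-up sequences standing in for a shorter and a longer accession.
toy_short = "MKTNSA"
toy_long = "MKTNSAGGNDS"

print(qc_flag(toy_short, 9, "Asn"))  # position 9 is past the end of the short sequence
print(qc_flag(toy_long, 9, "Asn"))   # position 9 is N in the longer sequence
print(qc_flag(toy_long, 5, "Asn"))   # position 5 is S, not N
```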

https://github.com/glygener/glygen-issues/issues/1287

katewarner commented 1 month ago

These appear to be issues on their side - I will send them an email

2,pos_out_of_seq_range,human_proteoform_glycosylation_sites_embl

"Q96S16","JMJD8","279","Asn","N-linked","273","280","TPEFHPNK","9606","homo sapiens","HexNAc(2)Hex(5) % 1216.4229","1216.4229","high mannose","","UBERON:0002113","kidney","CVCL_0063","HEK293T","https://www.biorxiv.org/content/10.1101/2023.09.13.557529v1.full"

Peptide position is 203-210 and glycosite is N209

"Q96S15","WDR24","848","Asn","N-linked","834","854","AVSCLNQASTTLHVNCSHCKR","9606","homo sapiens","HexNAc(8)Hex(5)Fuc(1)NeuAc(2) % 3163.1479","3163.1479","sialylated","","UBERON:0002113","kidney","CVCL_0063","HEK293T","https://www.biorxiv.org/content/10.1101/2023.09.13.557529v1.full"

Peptide position is 704-724 and glycosite is N718

32,pos_out_of_seq_range,mouse_proteoform_glycosylation_sites_embl

"A2RS43-1","1130","Asn","G31852PQ","N-linked","protein_xref_doi","10.1101/2023.09.13.557529","protein_xref_glygen_ds","GLY_000889","UBERON:0000955","brain","","","1130","1130","N","N","TTPDGSVGEAEHMENDSR","high mannose","Pcdh7","HexNAc(2)Hex(7)","1540.5285","","","pos_out_of_seq_range"

"A0A0A6YY83","Pcdh7","1130","Asn","N-linked","1116","1133","TTPDGSVGEAEHMENDSR","10090","mus musculus","HexNAc(2)Hex(7) % 1540.5285","1540.5285","high mannose","","UBERON:0000955","brain","","","https://www.biorxiv.org/content/10.1101/2023.09.13.557529v1.full"

50,aa_mismatch,human_proteoform_glycosylation_sites_embl

"O94901","SUN1","732","Asn","N-linked","700","738","GSQGYLVVRLSMMIHPAAFTLEHIPKTLSPTGNISSAPK","9606","homo sapiens","HexNAc(9)Hex(10)Fuc(1)NeuAc(3)NeuGc(1) % 4774.6772","4774.6772","sialylated","","UBERON:0002113","kidney","CVCL_0063","HEK293T","https://www.biorxiv.org/content/10.1101/2023.09.13.557529v1.full"

Peptide position is 673-711 and glycosite is N705

"Q7Z7G0","ABI3BP","778","Asn","N-linked","752","788","PTGTPLERIETDIKQPTVPASGEELENITDFSSSPTR","9606","homo sapiens","HexNAc(8)Hex(7) % 2759.0048","2759.0048","complex/hybrid","","UBERON:0002113","kidney","CVCL_0063","HEK293T","https://www.biorxiv.org/content/10.1101/2023.09.13.557529v1.full"

Peptide position is 745-781 and glycosite is N771
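The corrected positions noted above follow from keeping the site's offset within the peptide and re-locating the peptide in the sequence. For Q96S16, for example, the reported peptide starts at 273 with the site at 279 (offset 6), and a peptide start of 203 gives 203 + 6 = 209. A minimal sketch of that calculation follows; `relocate_site` is a hypothetical helper, and the flanking residues in the example sequence are made up (only the peptide TPEFHPNK comes from the row above).

```python
from typing import Optional, Tuple


def relocate_site(
    sequence: str, peptide: str, reported_start: int, reported_site: int
) -> Optional[Tuple[int, int, int]]:
    """Find the peptide in the sequence and recompute the 1-based site position.

    Returns (start, end, site), or None if the peptide is absent or occurs
    more than once (ambiguous placement).
    """
    offset = reported_site - reported_start  # site offset within the peptide
    idx = sequence.find(peptide)
    if idx == -1:
        return None
    if sequence.find(peptide, idx + 1) != -1:
        return None
    start = idx + 1                # convert 0-based index to 1-based position
    end = idx + len(peptide)
    return start, end, start + offset


# Toy sequence: made-up flanks around the real peptide from the Q96S16 row.
toy_sequence = "AAAA" + "TPEFHPNK" + "GG"
result = relocate_site(toy_sequence, "TPEFHPNK", 273, 279)
print(result)
```

In the toy sequence the peptide is found at 5-12, so the site moves to position 11, which is the Asn of the peptide.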

katewarner commented 1 month ago

o_glycan_aa_mismatch - This is a similar issue to the pdc_ccrc dataset

G57006OK G85582CB G01194JA G03019OW G16136AD G27391WQ G29068FM G29879MS G30706TN G43061MO G48698AX G53434XO G57006OK G58001LT G71784JC G84862VB G98877FE

katewarner commented 3 weeks ago

Discuss at next main meeting: We could map the sites in the isoform to the canonical protein using the isoform mapper. If so, we would need to add a new field called "notes" to the datasets to tell the user that the site was mapped to the canonical protein.

ReneRanzinger commented 1 week ago

From meeting:

We will use the isoform mapper to fix this if the provided protein is an isoform.
After mapping, the QC check needs to be performed again.
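A rough sketch of the agreed flow (map first, then run the check again) is below. It does not use the real isoform mapper, whose interface is not described in this thread; the `mapper` callable, the `canonical_of` lookup, the row keys, and the toy sequences are all assumptions for illustration. The "notes" field is the one proposed above.

```python
from typing import Callable, Dict, Optional

# Hypothetical mapper signature: (isoform_ac, canonical_ac, position) -> mapped position or None.
Mapper = Callable[[str, str, int], Optional[int]]


def check_site(sequence: str, pos: int, aa: str) -> Optional[str]:
    """Simplified stand-in for the position/residue QC check (1-based positions)."""
    if not 1 <= pos <= len(sequence):
        return "pos_out_of_seq_range"
    if sequence[pos - 1] != aa:
        return "aa_mismatch"
    return None


def map_then_recheck(
    row: dict, sequences: Dict[str, str], canonical_of: Dict[str, str], mapper: Mapper
) -> dict:
    """Map an isoform site to the canonical accession if possible, then re-run the check."""
    ac, pos, aa = row["ac"], row["pos"], row["aa"]
    notes = ""
    canonical = canonical_of.get(ac, ac)
    if canonical != ac:
        mapped = mapper(ac, canonical, pos)
        if mapped is not None:
            notes = f"mapped from {ac}:{pos} to canonical"
            ac, pos = canonical, mapped
    # The check runs after mapping, on whichever accession the site ended up on.
    flag = check_site(sequences[ac], pos, aa)
    return {"ac": ac, "pos": pos, "aa": aa, "notes": notes, "flag": flag}


# Toy data: the canonical sequence has three extra residues after position 2.
sequences = {"ISO-2": "MKNDSAA", "CAN": "MKGGGNDSAA"}
canonical_of = {"ISO-2": "CAN"}


def toy_mapper(isoform_ac: str, canonical_ac: str, pos: int) -> Optional[int]:
    return pos + 3 if pos > 2 else pos


out = map_then_recheck({"ac": "ISO-2", "pos": 3, "aa": "N"}, sequences, canonical_of, toy_mapper)
print(out)
```

Rows that are already on the canonical accession pass through unchanged with an empty "notes" value, so only mapped sites carry the note.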

glygener / glygen-issues

Investigate why the EMBL datasets are failing global QC checks #1855