glygener / glygen-issues

Repository for public GlyGen tickets

Global QC for glycosylation/phosphorylation/glycation site dataset files #1641

Closed rykahsay closed 1 week ago

rykahsay commented 2 months ago

I am applying strict QC to all glycosylation/phosphorylation/glycation site dataset files. The filtered-out rows are stored in logs/proteoform.global.log files, and I have also created a script to analyze the stats of the rows in those files (see bottom).

Please read carefully and document. As you can see at the bottom, some of the flags filter out a significant number of rows from the datasets, and we need to discuss at our next Wednesday group meeting whether we want to apply them (please add this to the agenda).

Types of flags

aa_mismatch

Amino acid reported at the position does not match the amino acid in canonical sequence
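
A minimal sketch of what this check amounts to, assuming 1-based positions and three-letter residue codes as in the log rows shown in this ticket. The function name, the small lookup table, and the sequence are hypothetical and are not taken from the actual QC code.

```python
# Hypothetical three-letter to one-letter map; only a few residues are listed.
THREE_TO_ONE = {"Asn": "N", "Ser": "S", "Thr": "T"}


def has_aa_mismatch(canonical_seq: str, position: int, reported_aa: str) -> bool:
    """Return True when the reported residue differs from the canonical one.

    `position` is 1-based and is assumed to lie inside the sequence.
    """
    expected = canonical_seq[position - 1]
    return THREE_TO_ONE.get(reported_aa) != expected


seq = "MKNSTA"  # made-up sequence, for illustration only
print(has_aa_mismatch(seq, 3, "Asn"))  # position 3 is N, so no mismatch
print(has_aa_mismatch(seq, 3, "Ser"))  # Ser reported where the sequence has N
```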

glycan_without_glytype

The reported glycan is not in the glycan_classification.csv dataset file, meaning we do not know the type of the glycan (N-linked, O-linked, etc.).

gtc2glytype_n-linked|o-linked

The reported glycan has both types (N-linked and O-linked) in the glycan_classification.csv dataset. See the example below.

$ cat logs/human_proteoform_glycosylation_sites_unicarbkb.global.log | grep gtc2glytype_n |head -1
"P01861-2","177","Asn","G68318VE","N-linked","protein_xref_pubmed","24841998","protein_xref_unicarbkb_ds","GLY_000040","177","177","Asn","Asn","N","known_site","known_legacy_human_mouse_rat_glygen","","comp_HexNAc5Hex4dHex1NeuAc0NeuGc0Pent0S0P0KDN0HexA0","","","","","","","","","","NST","NXT","gtc2glytype_n-linked|o-linked

$ cat reviewed/glycan_classification.csv | awk -F, '{print $1, $2}' | grep G68318VE |sort -u
"G68318VE" "N-linked"
"G68318VE" "O-linked"

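The two glycan-type flags above (glycan_without_glytype and gtc2glytype_n-linked|o-linked) can be sketched as a single lookup. This assumes, as in the awk command above, that the first two columns of glycan_classification.csv are the GlyTouCan accession and the glycan type. The function names and the accession G00000XX are hypothetical.

```python
import csv
import io
from collections import defaultdict


def load_glycan_types(csv_text: str) -> dict:
    """Map each accession to the set of glycan types listed for it."""
    types = defaultdict(set)
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) >= 2:
            types[row[0]].add(row[1])
    return types


def glytype_flag(glytoucan_ac: str, types: dict):
    """Return a QC flag name, or None when the glycan has one known type."""
    found = types.get(glytoucan_ac, set())
    if not found:
        return "glycan_without_glytype"
    if {"N-linked", "O-linked"} <= found:
        return "gtc2glytype_n-linked|o-linked"
    return None


# Rows copied from the examples in this ticket.
classification = '"G68318VE","N-linked"\n"G68318VE","O-linked"\n"G36670VW","N-linked"\n'
types = load_glycan_types(classification)
print(glytype_flag("G68318VE", types))  # listed as both N-linked and O-linked
print(glytype_flag("G36670VW", types))  # single type, no flag
print(glytype_flag("G00000XX", types))  # hypothetical accession, not listed
```
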
n_glycan_aa_mismatch

A glycan that glycan_classification.csv classifies as N-linked is reported on an amino acid that misc/aadict.csv lists as o-linked. In the example below, G36670VW is N-linked only, yet it is reported on Ser, which is an o-linked amino acid:

$ cat logs/human_proteoform_glycosylation_sites_unicarbkb.global.log | grep n_glycan_aa_mismatch | grep G36670VW 
"P05155-1","31","Ser","G36670VW","O-linked","protein_xref_pubmed","30459171","protein_xref_unicarbkb_ds","GLY_000040","31","31","Ser","Ser","S","known_site","known_legacy_human_mouse_rat_glygen","","comp_HexNAc5Hex5dHex1NeuAc2NeuGc0Pent0S0P0KDN0HexA0","","","","","","","","","","","","n_glycan_aa_mismatch"

$ cat reviewed/glycan_classification.csv | awk -F, '{print $1, $2}' | grep G36670VW |sort -u
"G36670VW" "N-linked"

$ cat generated/misc/aadict.csv | grep Ser
"Serine","Ser","S","o-linked"

o_glycan_aa_mismatch

A glycan that glycan_classification.csv classifies as O-linked is reported on an amino acid that misc/aadict.csv lists as n-linked. In the example below, G58001LT is O-linked only, yet it is reported on Asn, which is an n-linked amino acid:

$ cat logs/human_proteoform_glycosylation_sites_unicarbkb.global.log | grep o_glycan_aa_mismatch | head -1
"P11279-1","249","Asn","G58001LT","N-linked","protein_xref_pubmed","29741879","protein_xref_unicarbkb_ds","GLY_000040","249","249","Asn","Asn","N","known_site","known_legacy_human_mouse_rat_glygen","","comp_HexNAc2Hex1dHex0NeuAc0NeuGc0Pent0S0P0KDN0HexA0","","","","","","","","","","NTT","NXT","o_glycan_aa_mismatch"

$ cat reviewed/glycan_classification.csv | awk -F, '{print $1, $2}' | grep G58001LT |sort -u
"G58001LT" "O-linked"

$ cat generated/misc/aadict.csv | grep Asn
"Asparagine","Asn","N","n-linked"

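The two glycan/amino-acid mismatch flags (n_glycan_aa_mismatch and o_glycan_aa_mismatch) can be sketched as one comparison between the glycan type and the linkage listed for the amino acid. The lookup below holds only the two aadict.csv rows shown in this ticket, and the function name is hypothetical.

```python
# Linkage per three-letter amino acid code, from the aadict.csv rows shown above.
AA_LINKAGE = {"Asn": "n-linked", "Ser": "o-linked"}


def glycan_aa_flag(glycan_type: str, amino_acid: str):
    """Return a mismatch flag name, or None when glycan type and residue agree.

    `glycan_type` is the type from glycan_classification.csv, not the type
    column reported in the site row itself.
    """
    aa_linkage = AA_LINKAGE.get(amino_acid)
    if glycan_type == "N-linked" and aa_linkage == "o-linked":
        return "n_glycan_aa_mismatch"
    if glycan_type == "O-linked" and aa_linkage == "n-linked":
        return "o_glycan_aa_mismatch"
    return None


print(glycan_aa_flag("N-linked", "Ser"))  # G36670VW on Ser
print(glycan_aa_flag("O-linked", "Asn"))  # G58001LT on Asn
print(glycan_aa_flag("N-linked", "Asn"))  # consistent, no flag
```
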
pos_out_of_seq_range

The reported position is outside the canonical sequence range.
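
As a rough sketch, assuming 1-based positions, the range check reduces to a single comparison against the sequence length. The function name and the sequence are hypothetical.

```python
def pos_out_of_seq_range(canonical_seq: str, position: int) -> bool:
    """Return True when a 1-based position falls outside the sequence."""
    return not (1 <= position <= len(canonical_seq))


seq = "MKNSTA"  # made-up sequence of length 6
print(pos_out_of_seq_range(seq, 6))  # last residue, in range
print(pos_out_of_seq_range(seq, 7))  # one past the end
```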

Dumping global qc stats

$ cd /software/glygen/
$ python3 dump-global-qc-stats.py  > logs/rykahsay_global_qc_stats.csv &
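
The script dump-global-qc-stats.py is not shown in this ticket. A minimal sketch of how such counts could be produced, assuming the flag is the last quoted column of each log row (as in the example rows above). The function name and the shortened sample rows are illustrative only.

```python
import csv
from collections import Counter


def count_flags(log_lines) -> Counter:
    """Count rows per QC flag, taking the flag from the last CSV column."""
    counts = Counter()
    for row in csv.reader(log_lines):
        if row:  # skip empty lines
            counts[row[-1]] += 1
    return counts


# Shortened, illustrative rows; real log rows have many more columns.
sample = [
    '"P05155-1","31","Ser","G36670VW","n_glycan_aa_mismatch"',
    '"P11279-1","249","Asn","G58001LT","o_glycan_aa_mismatch"',
    '"P11279-1","250","Asn","G58001LT","o_glycan_aa_mismatch"',
]
for flag, n in sorted(count_flags(sample).items()):
    # Same column order as the stats file: n_filtered_out,flag,file_name
    print(f"{n},{flag},sample_file")
```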

Viewing global qc stats

$ cat logs/rykahsay_global_qc_stats.csv |sort -n
n_filtered_out,flag,file_name
1,aa_mismatch,mouse_proteoform_glycosylation_sites_predicted_isoglyp
1,aa_mismatch,rat_proteoform_glycosylation_sites_predicted_isoglyp
1,gtc2glytype_n-linked|o-linked,mouse_proteoform_glycosylation_sites_unicarbkb
1,o_glycan_aa_mismatch,mouse_proteoform_glycosylation_sites_unicarbkb
1,o_glycan_aa_mismatch,pig_proteoform_glycosylation_sites_glyconnect
2,glycan_without_glytype,sarscov2_proteoform_glycosylation_sites_unicarbkb
2,pos_out_of_seq_range,human_proteoform_glycosylation_sites_embl
4,glycan_without_glytype,rat_proteoform_glycosylation_sites_unicarbkb
5,glycan_without_glytype,human_proteoform_glycosylation_sites_carbbank
6,glycan_without_glytype,dicty_proteoform_glycosylation_sites_glyconnect
8,glycan_without_glytype,fruitfly_proteoform_glycosylation_sites_glyconnect
9,n_glycan_aa_mismatch,human_proteoform_glycosylation_sites_glyconnect
11,o_glycan_aa_mismatch,rat_proteoform_glycosylation_sites_unicarbkb
14,n_glycan_aa_mismatch,human_proteoform_glycosylation_sites_unicarbkb
15,aa_mismatch,human_proteoform_glycosylation_sites_predicted_isoglyp
15,glycan_without_glytype,chicken_proteoform_glycosylation_sites_glyconnect
16,glycan_without_glytype,sarscov2_proteoform_glycosylation_sites_glyconnect
16,gtc2glytype_n-linked|o-linked,hcv1a_proteoform_glycosylation_sites_literature
22,glycan_without_glytype,rat_proteoform_glycosylation_sites_glyconnect
23,glycan_without_glytype,pig_proteoform_glycosylation_sites_glyconnect
24,o_glycan_aa_mismatch,sarscov2_proteoform_glycosylation_sites_glyconnect
30,glycan_without_glytype,human_proteoform_glycosylation_sites_platelet
32,pos_out_of_seq_range,mouse_proteoform_glycosylation_sites_embl
38,gtc2glytype_n-linked|o-linked,fruitfly_proteoform_glycosylation_sites_glyconnect
50,aa_mismatch,human_proteoform_glycosylation_sites_embl
53,glycan_without_glytype,human_proteoform_glycosylation_sites_unicarbkb
55,o_glycan_aa_mismatch,human_proteoform_glycosylation_sites_unicarbkb
76,gtc2glytype_n-linked|o-linked,pig_proteoform_glycosylation_sites_glyconnect
79,glycan_without_glytype,mouse_proteoform_glycosylation_sites_glyconnect
87,o_glycan_aa_mismatch,human_proteoform_glycosylation_sites_glyconnect
89,gtc2glytype_n-linked|o-linked,rat_proteoform_glycosylation_sites_glyconnect
99,gtc2glytype_n-linked|o-linked,chicken_proteoform_glycosylation_sites_glyconnect
157,o_glycan_aa_mismatch,mouse_proteoform_glycosylation_sites_glyconnect
184,glycan_without_glytype,human_proteoform_glycosylation_sites_o_gluc
214,gtc2glytype_n-linked|o-linked,rat_proteoform_glycosylation_sites_unicarbkb
236,gtc2glytype_n-linked|o-linked,human_proteoform_glycosylation_sites_gptwiki
365,aa_mismatch,mouse_proteoform_glycosylation_sites_embl
374,o_glycan_aa_mismatch,human_proteoform_glycosylation_sites_embl
374,o_glycan_aa_mismatch,mouse_proteoform_glycosylation_sites_embl
428,gtc2glytype_n-linked|o-linked,sarscov2_proteoform_glycosylation_sites_glyconnect
562,gtc2glytype_n-linked|o-linked,human_proteoform_glycosylation_sites_unicarbkb
650,glycan_without_glytype,human_proteoform_glycosylation_sites_glyconnect
938,gtc2glytype_n-linked|o-linked,mouse_proteoform_glycosylation_sites_glyconnect
1505,gtc2glytype_n-linked|o-linked,human_proteoform_glycosylation_sites_embl
1505,gtc2glytype_n-linked|o-linked,mouse_proteoform_glycosylation_sites_embl
1620,glycan_without_glytype,human_proteoform_glycosylation_sites_embl
1620,glycan_without_glytype,mouse_proteoform_glycosylation_sites_embl
3494,gtc2glytype_n-linked|o-linked,human_proteoform_glycosylation_sites_glyconnect
47558,o_glycan_aa_mismatch,human_proteoform_glycosylation_sites_pdc_ccrcc
80256,aa_mismatch,human_proteoform_glycosylation_sites_pdc_ccrcc
239823,glycan_without_glytype,human_proteoform_glycosylation_sites_pdc_ccrcc
1391092,gtc2glytype_n-linked|o-linked,human_proteoform_glycosylation_sites_pdc_ccrcc
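
To see which flags remove the most rows overall, the per-file counts can be summed per flag. The sketch below uses six rows copied from the table above; the function name is hypothetical, and the column names come from the header of the stats file.

```python
import csv
import io
from collections import defaultdict


def totals_by_flag(stats_csv_text: str) -> dict:
    """Sum n_filtered_out per flag across all dataset files."""
    totals = defaultdict(int)
    for row in csv.DictReader(io.StringIO(stats_csv_text)):
        totals[row["flag"]] += int(row["n_filtered_out"])
    return dict(totals)


stats = """n_filtered_out,flag,file_name
365,aa_mismatch,mouse_proteoform_glycosylation_sites_embl
3494,gtc2glytype_n-linked|o-linked,human_proteoform_glycosylation_sites_glyconnect
47558,o_glycan_aa_mismatch,human_proteoform_glycosylation_sites_pdc_ccrcc
80256,aa_mismatch,human_proteoform_glycosylation_sites_pdc_ccrcc
239823,glycan_without_glytype,human_proteoform_glycosylation_sites_pdc_ccrcc
1391092,gtc2glytype_n-linked|o-linked,human_proteoform_glycosylation_sites_pdc_ccrcc
"""
for flag, total in sorted(totals_by_flag(stats).items(), key=lambda kv: -kv[1]):
    print(total, flag)
```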

rykahsay commented 2 months ago

@ReneRanzinger @edwardsnj ... I am adding you to this ticket so that you have some background when we discuss this in our next meeting to make decisions on some of these flags.

rykahsay commented 2 months ago

@ubhuiyan ... following Rene's recommendation, the following cases have been relaxed (please document):

For a given "glytoucan_ac", if the "glytoucan_type" field in "glycan_masterlist.csv" is in ["Composition", "BaseComposition"], the QC flags "glycan_without_glytype" and "gtc2glytype_n-linked|o-linked" are not applied to glycosylation_site rows involving that "glytoucan_ac".
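
A minimal sketch of this relaxation, assuming the glytoucan_type values from glycan_masterlist.csv have already been loaded into a dictionary keyed by accession. The function name, the accessions, and the "Topology" value in the sample are hypothetical.

```python
RELAXED_TYPES = {"Composition", "BaseComposition"}
RELAXED_FLAGS = {"glycan_without_glytype", "gtc2glytype_n-linked|o-linked"}


def apply_relaxation(flag, glytoucan_ac: str, glytoucan_type_by_ac: dict):
    """Drop the two glycan-type flags for composition-level glycans.

    Returns None when the flag is waived, otherwise the flag unchanged.
    """
    is_relaxed_type = glytoucan_type_by_ac.get(glytoucan_ac) in RELAXED_TYPES
    if flag in RELAXED_FLAGS and is_relaxed_type:
        return None
    return flag


# Hypothetical accessions and types, for illustration only.
masterlist = {"G_COMP": "Composition", "G_TOPO": "Topology"}
print(apply_relaxation("glycan_without_glytype", "G_COMP", masterlist))  # waived
print(apply_relaxation("glycan_without_glytype", "G_TOPO", masterlist))  # kept
print(apply_relaxation("aa_mismatch", "G_COMP", masterlist))  # other flags still apply
```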

ubhuiyan commented 2 months ago

@rykahsay I have documented this in the Dataset Checking (QC) document. Please let me know if there's any information I have missed.

rykahsay commented 2 months ago

Given below are the new stats on removed rows:

$ cat logs/rykahsay_global_qc_stats.csv |sort -n
n_filtered_out,flag,file_name
1,aa_mismatch,mouse_proteoform_glycosylation_sites_predicted_isoglyp
1,aa_mismatch,rat_proteoform_glycosylation_sites_predicted_isoglyp
1,o_glycan_aa_mismatch,mouse_proteoform_glycosylation_sites_unicarbkb
1,o_glycan_aa_mismatch,pig_proteoform_glycosylation_sites_glyconnect
2,glycan_without_glytype,sarscov2_proteoform_glycosylation_sites_glyconnect
2,pos_out_of_seq_range,human_proteoform_glycosylation_sites_embl
5,glycan_without_glytype,human_proteoform_glycosylation_sites_carbbank
6,glycan_without_glytype,dicty_proteoform_glycosylation_sites_glyconnect
8,glycan_without_glytype,chicken_proteoform_glycosylation_sites_glyconnect
8,glycan_without_glytype,fruitfly_proteoform_glycosylation_sites_glyconnect
9,n_glycan_aa_mismatch,human_proteoform_glycosylation_sites_glyconnect
11,o_glycan_aa_mismatch,rat_proteoform_glycosylation_sites_unicarbkb
14,n_glycan_aa_mismatch,human_proteoform_glycosylation_sites_unicarbkb
15,aa_mismatch,human_proteoform_glycosylation_sites_predicted_isoglyp
20,glycan_without_glytype,rat_proteoform_glycosylation_sites_glyconnect
21,glycan_without_glytype,mouse_proteoform_glycosylation_sites_glyconnect
21,glycan_without_glytype,pig_proteoform_glycosylation_sites_glyconnect
22,glycan_without_glytype,human_proteoform_glycosylation_sites_platelet
24,o_glycan_aa_mismatch,sarscov2_proteoform_glycosylation_sites_glyconnect
32,pos_out_of_seq_range,mouse_proteoform_glycosylation_sites_embl
50,aa_mismatch,human_proteoform_glycosylation_sites_embl
55,o_glycan_aa_mismatch,human_proteoform_glycosylation_sites_unicarbkb
87,o_glycan_aa_mismatch,human_proteoform_glycosylation_sites_glyconnect
141,o_glycan_aa_mismatch,human_proteoform_glycosylation_sites_embl
157,o_glycan_aa_mismatch,mouse_proteoform_glycosylation_sites_glyconnect
184,glycan_without_glytype,human_proteoform_glycosylation_sites_o_gluc
233,o_glycan_aa_mismatch,mouse_proteoform_glycosylation_sites_embl
289,glycan_without_glytype,human_proteoform_glycosylation_sites_glyconnect
365,aa_mismatch,mouse_proteoform_glycosylation_sites_embl
47558,o_glycan_aa_mismatch,human_proteoform_glycosylation_sites_pdc_ccrcc
80256,aa_mismatch,human_proteoform_glycosylation_sites_pdc_ccrcc

ReneRanzinger commented 2 months ago

For now we will investigate the cases with the biggest numbers:

Once these are figured out, we can rerun the statistics and see what is left.