Closed rykahsay closed 1 week ago
@ReneRanzinger @edwardsnj ... I am adding you to this ticket so that you have some background when we discuss this in our next meeting to make decisions on some of these flags.
@ubhuiyan ... following Rene's recommendation, the following case has been relaxed (please document):
For a given "glytoucan_ac", if the "glytoucan_type" field in "glycan_masterlist.csv" is in ["Composition", "BaseComposition"], the QC flags "glycan_without_glytype" and "gtc2glytype_n-linked|o-linked" are not applied to glycosylation_site rows involving that "glytoucan_ac".
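To make the relaxed rule concrete, here is a minimal sketch of how it could be expressed. The function name, the in-memory dict built from "glycan_masterlist.csv", and the placeholder accessions are assumptions for illustration only, not the actual pipeline code.

```python
# Glycan types for which the two glytype flags are skipped (taken from the rule above).
RELAXED_GLYTOUCAN_TYPES = {"Composition", "BaseComposition"}
RELAXED_FLAGS = {"glycan_without_glytype", "gtc2glytype_n-linked|o-linked"}


def applicable_flags(flags, glytoucan_ac, glytoucan_type_by_ac):
    """Return the flags that still apply to one glycosylation_site row.

    flags: flags raised for the row before the relaxation (hypothetical input).
    glytoucan_type_by_ac: dict mapping glytoucan_ac -> glytoucan_type, assumed
        to have been loaded from glycan_masterlist.csv beforehand.
    """
    if glytoucan_type_by_ac.get(glytoucan_ac) in RELAXED_GLYTOUCAN_TYPES:
        # Drop only the two relaxed flags; every other flag is kept.
        return [f for f in flags if f not in RELAXED_FLAGS]
    return list(flags)


# Placeholder accessions and a placeholder non-relaxed type, for illustration only.
masterlist = {"AC_EXAMPLE_1": "Composition", "AC_EXAMPLE_2": "OtherType"}
print(applicable_flags(["glycan_without_glytype", "aa_mismatch"], "AC_EXAMPLE_1", masterlist))
```

Note that the other flags (for example "aa_mismatch") are left untouched by this rule; only the two glytype flags are skipped.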
@rykahsay I have documented this in the Dataset Checking (QC) document. Please let me know if there's any information I have missed.
The new stats on removed rows are given below:
$ cat logs/rykahsay_global_qc_stats.csv |sort -n
n_filtered_out,flag,file_name
1,aa_mismatch,mouse_proteoform_glycosylation_sites_predicted_isoglyp
1,aa_mismatch,rat_proteoform_glycosylation_sites_predicted_isoglyp
1,o_glycan_aa_mismatch,mouse_proteoform_glycosylation_sites_unicarbkb
1,o_glycan_aa_mismatch,pig_proteoform_glycosylation_sites_glyconnect
2,glycan_without_glytype,sarscov2_proteoform_glycosylation_sites_glyconnect
2,pos_out_of_seq_range,human_proteoform_glycosylation_sites_embl
5,glycan_without_glytype,human_proteoform_glycosylation_sites_carbbank
6,glycan_without_glytype,dicty_proteoform_glycosylation_sites_glyconnect
8,glycan_without_glytype,chicken_proteoform_glycosylation_sites_glyconnect
8,glycan_without_glytype,fruitfly_proteoform_glycosylation_sites_glyconnect
9,n_glycan_aa_mismatch,human_proteoform_glycosylation_sites_glyconnect
11,o_glycan_aa_mismatch,rat_proteoform_glycosylation_sites_unicarbkb
14,n_glycan_aa_mismatch,human_proteoform_glycosylation_sites_unicarbkb
15,aa_mismatch,human_proteoform_glycosylation_sites_predicted_isoglyp
20,glycan_without_glytype,rat_proteoform_glycosylation_sites_glyconnect
21,glycan_without_glytype,mouse_proteoform_glycosylation_sites_glyconnect
21,glycan_without_glytype,pig_proteoform_glycosylation_sites_glyconnect
22,glycan_without_glytype,human_proteoform_glycosylation_sites_platelet
24,o_glycan_aa_mismatch,sarscov2_proteoform_glycosylation_sites_glyconnect
32,pos_out_of_seq_range,mouse_proteoform_glycosylation_sites_embl
50,aa_mismatch,human_proteoform_glycosylation_sites_embl
55,o_glycan_aa_mismatch,human_proteoform_glycosylation_sites_unicarbkb
87,o_glycan_aa_mismatch,human_proteoform_glycosylation_sites_glyconnect
141,o_glycan_aa_mismatch,human_proteoform_glycosylation_sites_embl
157,o_glycan_aa_mismatch,mouse_proteoform_glycosylation_sites_glyconnect
184,glycan_without_glytype,human_proteoform_glycosylation_sites_o_gluc
233,o_glycan_aa_mismatch,mouse_proteoform_glycosylation_sites_embl
289,glycan_without_glytype,human_proteoform_glycosylation_sites_glyconnect
365,aa_mismatch,mouse_proteoform_glycosylation_sites_embl
47558,o_glycan_aa_mismatch,human_proteoform_glycosylation_sites_pdc_ccrcc
80256,aa_mismatch,human_proteoform_glycosylation_sites_pdc_ccrcc
For now we will investigate the cases with the biggest numbers:
Once these are figured out, we can rerun the statistics and see what is left.
I am applying strict QC to all glycosylation/phosphorylation/glycation site dataset files. The filtered-out rows are stored in logs/proteoform.global.log files, and I have created a script to analyze the stats of the rows in those files (see bottom).
Please read carefully and document. As you can see at the bottom, some of the flags filter out a significant number of rows from the datasets, and we need to discuss at our next Wednesday group meeting whether we want to apply them (please add this to the agenda).
Types of flags
aa_mismatch
Amino acid reported at the position does not match the amino acid in the canonical sequence
glycan_without_glytype
Reported glycan is not in the glycan_classification.csv dataset file, meaning we do not know the type of the glycan (whether it is n-linked, o-linked, etc.)
gtc2glytype_n-linked|o-linked
Reported glycan has both types (n-linked and o-linked) in the glycan_classification.csv dataset. See example below
n_glycan_aa_mismatch
Glycan that is N-linked according to glycan_classification.csv is reported as glycosylating an amino acid that is O-linked according to misc/aadict.csv. In the example given below
o_glycan_aa_mismatch
Glycan that is O-linked according to glycan_classification.csv is reported as glycosylating an amino acid that is N-linked according to misc/aadict.csv. In the example given below
pos_out_of_seq_range
Reported position is outside the canonical sequence range
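The flag definitions above can be sketched as a single per-row check. This is only an illustration under stated assumptions: the row keys, the 1-based position convention, and the dict forms of glycan_classification.csv and misc/aadict.csv are hypothetical, and the real script may evaluate the flags in a different order.

```python
def qc_flags(row, canonical_seq, glytypes_by_ac, aa_linkage):
    """Return the QC flags raised for one site row.

    row: dict with 'pos' (int, assumed 1-based), 'aa' (one-letter code) and an
        optional 'glytoucan_ac' (hypothetical key names).
    glytypes_by_ac: glytoucan_ac -> set of types, assumed loaded from
        glycan_classification.csv.
    aa_linkage: amino acid -> 'n-linked' or 'o-linked', assumed loaded from
        misc/aadict.csv.
    """
    flags = []
    pos = row["pos"]
    if pos < 1 or pos > len(canonical_seq):
        flags.append("pos_out_of_seq_range")
    elif canonical_seq[pos - 1] != row["aa"]:
        flags.append("aa_mismatch")

    ac = row.get("glytoucan_ac")
    if ac:
        glytypes = glytypes_by_ac.get(ac, set())
        site_type = aa_linkage.get(row["aa"])
        if not glytypes:
            flags.append("glycan_without_glytype")
        elif {"n-linked", "o-linked"} <= glytypes:
            flags.append("gtc2glytype_n-linked|o-linked")
        elif "n-linked" in glytypes and site_type == "o-linked":
            flags.append("n_glycan_aa_mismatch")
        elif "o-linked" in glytypes and site_type == "n-linked":
            flags.append("o_glycan_aa_mismatch")
    return flags


# Illustrative inputs only; the accession and the sequence are made up.
aa_linkage = {"N": "n-linked", "S": "o-linked", "T": "o-linked"}
glytypes_by_ac = {"AC_EXAMPLE_N": {"n-linked"}}
row = {"pos": 3, "aa": "S", "glytoucan_ac": "AC_EXAMPLE_N"}
print(qc_flags(row, "MNST", glytypes_by_ac, aa_linkage))
```

In this sketch the position checks and the glycan checks are independent, so a single row can carry more than one flag.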
Dumping global qc stats
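As a rough sketch of what the dumping step could look like: count filtered-out rows per (flag, file_name) pair and write them under the same "n_filtered_out,flag,file_name" header that appears in the stats file. The function name and the input form (one tuple per filtered-out row) are assumptions; how those tuples are parsed out of the logs/proteoform.global.log files is not shown here because the log layout is not described in this ticket.

```python
import csv
import io
from collections import Counter


def dump_qc_stats(filtered_rows, out_handle):
    """Write one 'n_filtered_out,flag,file_name' line per (flag, file_name) pair.

    filtered_rows: iterable of (flag, file_name) tuples, one per filtered-out
        row (hypothetical input form).
    """
    counts = Counter(filtered_rows)
    writer = csv.writer(out_handle, lineterminator="\n")
    writer.writerow(["n_filtered_out", "flag", "file_name"])
    # Smallest counts first, ties broken by flag and file name.
    for (flag, file_name), n in sorted(counts.items(), key=lambda kv: (kv[1], kv[0])):
        writer.writerow([n, flag, file_name])


# Illustrative input; the file names are placeholders.
rows = [
    ("aa_mismatch", "example_file_a"),
    ("aa_mismatch", "example_file_a"),
    ("pos_out_of_seq_range", "example_file_b"),
]
buf = io.StringIO()
dump_qc_stats(rows, buf)
print(buf.getvalue(), end="")
```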
Viewing global qc stats
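Besides sorting the stats file numerically, it can help to total the removed rows per flag when deciding which flags to discuss. The following is a small sketch, assuming the "n_filtered_out,flag,file_name" layout of the stats file; the function name is hypothetical and the sample lines are copied from the listing in this ticket.

```python
import csv
from collections import defaultdict


def totals_by_flag(lines):
    """Sum n_filtered_out per flag from 'n_filtered_out,flag,file_name' lines."""
    totals = defaultdict(int)
    for rec in csv.DictReader(lines):
        totals[rec["flag"]] += int(rec["n_filtered_out"])
    # Largest totals first, so the most impactful flags are listed at the top.
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


sample = [
    "n_filtered_out,flag,file_name",
    "2,pos_out_of_seq_range,human_proteoform_glycosylation_sites_embl",
    "32,pos_out_of_seq_range,mouse_proteoform_glycosylation_sites_embl",
    "9,n_glycan_aa_mismatch,human_proteoform_glycosylation_sites_glyconnect",
    "14,n_glycan_aa_mismatch,human_proteoform_glycosylation_sites_unicarbkb",
]
for flag, total in totals_by_flag(sample):
    print(f"{total}\t{flag}")
```

To run it on the real file, an open file handle can be passed in place of the sample list, for example `totals_by_flag(open("logs/rykahsay_global_qc_stats.csv"))`.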