Open ReneRanzinger opened 2 months ago
@ReneRanzinger Attached is a CSV table of all the GlyTouCan IDs in our datasets (see below) that are not integrated into GlyGen due to the global QC flag "glycan_without_glytype". These are glycans that are not in the glycan_classification.csv dataset file, meaning we do not know the type of the glycan (N-linked, O-linked, etc.).
5,glycan_without_glytype,human_proteoform_glycosylation_sites_carbbank
2,glycan_without_glytype,sarscov2_proteoform_glycosylation_sites_glyconnect
6,glycan_without_glytype,dicty_proteoform_glycosylation_sites_glyconnect
8,glycan_without_glytype,chicken_proteoform_glycosylation_sites_glyconnect
8,glycan_without_glytype,fruitfly_proteoform_glycosylation_sites_glyconnect
20,glycan_without_glytype,rat_proteoform_glycosylation_sites_glyconnect
21,glycan_without_glytype,mouse_proteoform_glycosylation_sites_glyconnect
21,glycan_without_glytype,pig_proteoform_glycosylation_sites_glyconnect
184,glycan_without_glytype,human_proteoform_glycosylation_sites_o_gluc
I've added additional columns to the table to help with the assessment, such as taxonomy, xref, publication, etc. glycan_without_glytype_logs.csv
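For readers unfamiliar with the flag, here is a minimal sketch of how a "glycan_without_glytype" check could work: any site row whose GlyTouCan ID is absent from the classification file is flagged, and the flagged rows are counted per dataset. This is not the actual GlyGen QC code. The column names (`dataset`, `glytoucan_ac`, `glycan_type`), the accessions, and the dataset names are all made-up placeholders.

```python
import csv
import io
from collections import Counter


def glycans_without_glytype(site_rows, classified_ids, id_field="glytoucan_ac"):
    """Return the rows whose glycan ID is missing from the classification set."""
    return [row for row in site_rows if row[id_field] not in classified_ids]


# In-memory stand-ins for glycan_classification.csv and the site datasets.
# All IDs, column names, and dataset names here are hypothetical.
classification_csv = "glytoucan_ac,glycan_type\nG_EXAMPLE_1,N-linked\n"
sites_csv = (
    "dataset,glytoucan_ac\n"
    "example_dataset_a,G_EXAMPLE_1\n"
    "example_dataset_a,G_EXAMPLE_2\n"
    "example_dataset_b,G_EXAMPLE_2\n"
)

classified = {
    row["glytoucan_ac"] for row in csv.DictReader(io.StringIO(classification_csv))
}
site_rows = list(csv.DictReader(io.StringIO(sites_csv)))
flagged = glycans_without_glytype(site_rows, classified)

# Summarise flagged rows per dataset in the same shape as the table above.
counts = Counter(row["dataset"] for row in flagged)
for dataset, n in sorted(counts.items(), key=lambda kv: kv[1]):
    print(f"{n},glycan_without_glytype,{dataset}")
```

The sketch counts flagged rows. Whether the real log counts rows or unique GlyTouCan IDs is not stated in this thread.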
@edwardsnj the attached Excel file has the unique GlyTouCan IDs from @katewarner's file. These are glycans reported on proteins (N and O). Since these glycans do not have a class, they cause these sites to be filtered out. Screening them, I see 3 major issues:
@ReneRanzinger Could you provide me with examples of each of the cases you enumerate? There are 170 accessions listed in the spreadsheet with no context. The first few are Glc-core O-glycans.
@edwardsnj I added another column with my classification. If you want to know the database records (GlyConnect), they are in Kate's files. GlyTouCan IDS.xlsx
Let's put that on the agenda in two weeks.
@mtiemeyer0919 and @ReneRanzinger met and reviewed the spreadsheet provided by @edwardsnj. I moved the changes we want to make to the classification into a separate ticket (#1787) and will use this ticket for @katewarner's reporting of errors to the dataset owners.
@katewarner the following spreadsheet identifies the GlyTouCan IDs that we consider invalid as glycans on proteins and that therefore should not show up as glycans on sites.
Glycosylation glycan error filter - Kate.xlsx
Please compile a report for the dataset owners requesting that they review and fix the corresponding datasets. It would be good to send them spreadsheet(s) with the rows from their original dataset that contain the erroneous entries, to make it easier for them to find the corresponding rows. Similar to #1671:
@katewarner please wait till @edwardsnj has reviewed #1787. He might push back on some structures that we may have to include as errors as well.
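As a rough illustration of the per-owner spreadsheets requested above, the following sketch keeps only the rows of an original dataset that reference a GlyTouCan ID on the error list and writes them back out as CSV. It is an assumption about how such a report could be assembled, not an existing GlyGen script. The column names (`uniprot_ac`, `site`, `glytoucan_ac`) and all accessions are hypothetical placeholders.

```python
import csv
import io


def rows_with_invalid_glycans(dataset_rows, invalid_ids, id_field="glytoucan_ac"):
    """Keep only the original rows that reference an ID on the error list."""
    return [row for row in dataset_rows if row[id_field] in invalid_ids]


def to_csv(rows, fieldnames):
    """Serialise rows to CSV text, preserving the original column order."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Hypothetical error list and dataset content, for illustration only.
invalid_ids = {"G_EXAMPLE_2"}
dataset_csv = (
    "uniprot_ac,site,glytoucan_ac\n"
    "P_EXAMPLE_1,10,G_EXAMPLE_1\n"
    "P_EXAMPLE_2,25,G_EXAMPLE_2\n"
)

reader = csv.DictReader(io.StringIO(dataset_csv))
rows = list(reader)
report = to_csv(rows_with_invalid_glycans(rows, invalid_ids), reader.fieldnames)
print(report)
```

Keeping the original columns and their order means the dataset owner can match each reported row directly against their source file.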
Based on #1641. Provide a table with the two columns:
for all cases of glycan_without_glytype.
After this, assign the ticket to @ReneRanzinger to look into it. We should figure out why Nathan does not assign them glycan types even though they are used in glycosylation.