glygener / glygen-issues

Repository for public GlyGen tickets
GNU General Public License v3.0

Look into glycan_without_glytype in the QC pipeline #1672

Open ReneRanzinger opened 2 months ago

ReneRanzinger commented 2 months ago

Based on #1641. Provide a table with the two columns:

for all cases of glycan_without_glytype.

After that, assign the ticket to @ReneRanzinger to look into it. We should figure out why Nathan does not assign these glycans a glycan type although they are used in glycosylation.

katewarner commented 2 months ago

@ReneRanzinger Attached is a CSV table of all the GlyTouCan IDs in our datasets (see below) that are not integrated into GlyGen because of the global QC flag "glycan_without_glytype". These are glycans that are not in the glycan_classification.csv dataset file, meaning we do not know the type of the glycan (N-linked, O-linked, etc.).

5,glycan_without_glytype,human_proteoform_glycosylation_sites_carbbank
2,glycan_without_glytype,sarscov2_proteoform_glycosylation_sites_glyconnect
6,glycan_without_glytype,dicty_proteoform_glycosylation_sites_glyconnect
8,glycan_without_glytype,chicken_proteoform_glycosylation_sites_glyconnect
8,glycan_without_glytype,fruitfly_proteoform_glycosylation_sites_glyconnect
20,glycan_without_glytype,rat_proteoform_glycosylation_sites_glyconnect
21,glycan_without_glytype,mouse_proteoform_glycosylation_sites_glyconnect
21,glycan_without_glytype,pig_proteoform_glycosylation_sites_glyconnect
184,glycan_without_glytype,human_proteoform_glycosylation_sites_o_gluc

I've added extra columns to the table, such as taxonomy, xref, and publication, to help with the assessment.

glycan_without_glytype_logs.csv
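As an illustration of the check described in this comment, here is a minimal sketch of how a site row could end up flagged as glycan_without_glytype: its GlyTouCan ID is simply missing from the classification file. This is not the actual GlyGen QC code. The column names (`glytoucan_ac`, `glycan_type`, `uniprotkb_ac`, `position`), the inline CSV contents, and the accessions are made-up placeholders.

```python
import csv
import io

# Hypothetical stand-in for glycan_classification.csv (column names are assumptions).
CLASSIFICATION_CSV = """glytoucan_ac,glycan_type
G00001AA,N-linked
G00002BB,O-linked
"""

# Hypothetical stand-in for one glycosylation-site dataset (placeholder rows).
SITES_CSV = """uniprotkb_ac,position,glytoucan_ac
P00001,12,G00001AA
P00002,45,G00003CC
P00003,7,G00003CC
P00004,88,G00002BB
"""


def classified_ids(classification_text):
    """Return the set of GlyTouCan IDs that have a non-empty glycan type."""
    reader = csv.DictReader(io.StringIO(classification_text))
    return {row["glytoucan_ac"] for row in reader if row["glycan_type"].strip()}


def flag_without_glytype(sites_text, known_ids):
    """Return the site rows whose glycan has no entry in the classification."""
    reader = csv.DictReader(io.StringIO(sites_text))
    return [row for row in reader if row["glytoucan_ac"] not in known_ids]


known = classified_ids(CLASSIFICATION_CSV)
flagged = flag_without_glytype(SITES_CSV, known)

# Unique IDs are what a reviewer would screen; flagged rows are what gets filtered out.
unique_flagged_ids = sorted({row["glytoucan_ac"] for row in flagged})

print(len(flagged), "glycan_without_glytype")
print(unique_flagged_ids)
```

The number of flagged rows and the number of unique flagged IDs can differ, because the same unclassified glycan may be reported on several sites.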

ReneRanzinger commented 2 months ago

@edwardsnj the attached Excel file has the unique GlyTouCan IDs from @katewarner's file. These are glycans reported on proteins (N and O). Since these glycans do not have a class, they cause these sites to be filtered out. Looking through them, I see 3 major issues:

GlyTouCan IDS.xlsx

edwardsnj commented 2 months ago

@ReneRanzinger Could you provide me with examples of each of the cases you enumerate? There are 170 accessions listed in the spreadsheet with no context. The first few are Glc-core O-glycans.

ReneRanzinger commented 2 months ago

@edwardsnj I added another column with my classification. If you want to know the database records (GlyConnect), they are in Kate's files.

GlyTouCan IDS.xlsx

ReneRanzinger commented 2 months ago

Let's put that on the agenda in two weeks.

ReneRanzinger commented 1 month ago

@mtiemeyer0919 and @ReneRanzinger met and reviewed the spreadsheet provided by @edwardsnj. I moved the changes we want to make to the classification into a separate ticket (#1787) and will use this ticket for @katewarner's reporting of errors to the dataset owners.

ReneRanzinger commented 1 month ago

@katewarner the following spreadsheet identifies the GlyTouCan IDs that we consider invalid as glycans on proteins and therefore should not show up as glycans on sites.

Glycosylation glycan error filter - Kate.xlsx

Please compile a report for the dataset owners asking them to review and fix the corresponding datasets. It would be good to send them spreadsheet(s) with the rows from their original dataset that contain the erroneous entries, to make it easier for them to find the corresponding rows. Similar to #1671:

ReneRanzinger commented 1 month ago

@katewarner please wait till @edwardsnj has reviewed #1787. He might push back on some structures that we may have to include as errors as well.