Closed katewarner closed 2 months ago
Looks good, just a couple things:
/data/projects/glygen/generated/misc/pdc_glytoucan_mapping.csv
as an example. You can name it platelet_o_linked_mapping.csv
Source Field | Output Field | Notes |
---|---|---|
Glycans NHFAGNa | saccharide | For these compositions in Glycans NHFAGNa, use the following GlyTouCan accessions: Hex(1) = G81399MY, Fuc(1) = G96881BQ, HexNAc(1) = G29068FM, Hex(1)Fuc(1) = G42494UJ, Hex(1)Pent(2) = G42518JM, HexNAc(1)Hex(1)NeuAc(1) = G17015OC, HexNAc(1)Hex(1)NeuAc(2) = G23729WG |
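The composition-to-accession rule in the Notes column could be held as a simple lookup. This is only a sketch: the names `COMPOSITION_TO_GLYTOUCAN` and `composition_to_saccharide` are hypothetical, and it assumes the composition strings in the source column are spelled exactly as in the table (only surrounding whitespace is tolerated).

```python
# Hypothetical lookup built from the Notes column of the table above.
# Only the seven listed compositions are covered; anything else is unmapped.
COMPOSITION_TO_GLYTOUCAN = {
    "Hex(1)": "G81399MY",
    "Fuc(1)": "G96881BQ",
    "HexNAc(1)": "G29068FM",
    "Hex(1)Fuc(1)": "G42494UJ",
    "Hex(1)Pent(2)": "G42518JM",
    "HexNAc(1)Hex(1)NeuAc(1)": "G17015OC",
    "HexNAc(1)Hex(1)NeuAc(2)": "G23729WG",
}


def composition_to_saccharide(composition):
    """Return the GlyTouCan accession for a composition, or "" if unmapped."""
    return COMPOSITION_TO_GLYTOUCAN.get(composition.strip(), "")
```

Returning an empty string for unmapped compositions is a design choice for the sketch; raising an error instead may be preferable if every row is expected to map.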
downloads/download_scripts/xlsx_to_tsv.py
file (make sure to navigate to the downloads/user_submission/platelet_o_linked/current/ folder first). Please keep the csv files in this folder (not compiled per Robel).
@kmartinez834 and @katewarner My apologies for not doing this sooner. I've converted the user-submitted data to TSV format. The new source should be as follows:
Source: downloads/user_submission/platelet_o_linked/current/1-s2.0-S1535947624000070-mmc7.tsv
@jeet-vora @kmartinez834 Please review my draft instructions for creating the user dataset.
For now we are only going to extract the manually verified ("unambiguous") sites and c-mannosylation data from the user data file. The c-mannosylation data is all ambiguous and difficult to explain easily, so I've added those sites at the bottom of this ticket for Robel to integrate into the dataset.
Draft ticket
Source = 1-s2.0-S1535947624000070-mmc7.tsv
Output = human_proteoform_glycosylation_sites_platelet_olinked.csv
Mapping Files:
unreviewed/human_protein_masterlist.csv
unreviewed/human_protein_allsequences.fasta
misc/platelet_olinked_mapping.csv
The output file should have the following headers:
Instructions for Robel:
For this release we are only going to extract the manually verified ("unambiguous") sites and c-mannosylation data from the user data file, so ignore all the rows with "ambiguous" in the "Glycosylation site Localisation Assignment" column.
The instructions below are for extracting the unambiguous sites which are all Ser and Thr glycosylation sites.
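Keeping only the unambiguous rows could look roughly like the sketch below. The function name `unambiguous_rows` is hypothetical, and the code assumes the TSV has a header row and that the assignment column holds the literal value "unambiguous" (compared case-insensitively, ignoring surrounding whitespace); check the actual values in the file before relying on this.

```python
import csv

# Column name as quoted in the instructions above.
ASSIGNMENT_COLUMN = "Glycosylation site Localisation Assignment"


def unambiguous_rows(tsv_lines):
    """Yield rows whose assignment column reads "unambiguous".

    tsv_lines is any iterable of text lines, e.g. an open file handle.
    """
    reader = csv.DictReader(tsv_lines, delimiter="\t")
    for row in reader:
        value = (row.get(ASSIGNMENT_COLUMN) or "").strip().lower()
        # Exact match, so rows marked "ambiguous" are skipped.
        if value == "unambiguous":
            yield row
```

With the real file this would be called on a handle opened with `open(path, newline="")`.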
The c-mannosylation data is all ambiguous and is spread across a few tables and the manuscript. It will be easier to provide instructions for these sites when we also start to extract all the ambiguous data from the dataset, so for this release I've added those sites at the bottom of this ticket for you to integrate straight into the dataset.
Once the glycosylation site position and amino acid are extracted, please check that they are correct by mapping them to the proteins in our human_protein_allsequences.fasta file.
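One way to run this check is to load the FASTA into a dictionary and compare the residue at each reported position. The sketch below is illustrative only: `read_fasta` and `site_matches` are hypothetical helpers, positions are assumed to be 1-based, and headers are assumed to follow the UniProt style (`>sp|ACCESSION|NAME ...`), falling back to the first header token otherwise.

```python
def read_fasta(lines):
    """Parse FASTA lines into {accession: sequence}."""
    chunks = {}
    accession = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            header = fields[0] if fields else ""
            # Assumption: UniProt-style header, accession is the second field.
            parts = header.split("|")
            accession = parts[1] if len(parts) >= 2 else header
            chunks[accession] = []
        elif accession is not None:
            chunks[accession].append(line)
    return {acc: "".join(seq) for acc, seq in chunks.items()}


def site_matches(sequences, accession, position, residue):
    """True if the 1-based position in the sequence holds the given residue."""
    sequence = sequences.get(accession)
    if sequence is None or not 1 <= position <= len(sequence):
        return False
    return sequence[position - 1] == residue
```

Rows for which `site_matches` returns `False` (unknown accession, out-of-range position, or a residue other than the reported Ser/Thr) would be the ones to flag for manual review.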
misc/platelet_olinked_mapping.csv
Example
Input file:
downloads/user_submission/platelet_olinked/current/1-s2.0-S1535947624000070-mmc7.tsv
Output file:
human_proteoform_glycosylation_sites_platelet_olinked.csv
C-mannosylation sites