glygener / glygen-issues

Repository for public GlyGen tickets
GNU General Public License v3.0

Process O-GlcNAc Atlas files #50

Closed jeet-vora closed 1 year ago

jeet-vora commented 1 year ago

Process both the data files, ambiguous and unambiguous

README (Read the instructions for proper processing)

image

O-GlcNAc Atlas

Input
/data/projects/glygen/downloads/atlas_oglcnac/current (ambiguous_sites_version_2.0.csv | unambiguous_sites_version_2.0.csv)
masterlists
canonical.fasta
misc/aadict.csv
data/projects/glygen/generated/misc/sample_mapping.csv

Species human, mouse, rat and fruitfly

Output

human_proteoform_glycosylation_sites_o_glcnac_atlas.csv
mouse_proteoform_glycosylation_sites_o_glcnac_atlas.csv
rat_proteoform_glycosylation_sites_o_glcnac_atlas.csv
fruitfly_proteoform_glycosylation_sites_o_glcnac_atlas.csv

human_proteoform_citations_glycosylation_o_glcnac_atlas.csv
mouse_proteoform_citations_glycosylation_o_glcnac_atlas.csv
rat_proteoform_citations_glycosylation_o_glcnac_atlas.csv
fruitfly_proteoform_citations_glycosylation_o_glcnac_atlas.csv

human_protein_xref_oglcnac_atlas.csv
mouse_protein_xref_oglcnac_atlas.csv
rat_protein_xref_oglcnac_atlas.csv
fruitfly_protein_xref_oglcnac_atlas.csv

log files
image image

Step 0 Merge both input files and extract the entries per species: human, mouse, rat and drosophila

Step 1 From the accession field, extract the UniProt accession (ac) and canonicalize it against the masterlist into the field uniprotkb_canonical_ac

Step 2 From the O-GlcNAc sites, extract the single-letter amino acid and the position, and map the amino acid to its three-letter code using aadict.csv and the GlyGen canonical sequence. Add the mapped position and amino acid to the fields glycosylation_site_uniprotkb and amino_acid.

Step 3 Add the field saccharide|glycosylation_type to the dataset with value G49108TO|O-linked respectively to all entries

Step 4 Add the field glycosylation_type with the value as O-linked for all the entries

Step 5 Add the field glycosylation_type with the value as O-linked and field carb_name with the value as O-GlcNAc for all the entries. Make sure the casing is correct for O-GlcNAc

Step 6 Extract the PMIDs from the field PMIDS and the accession from uniprotkb_canonical_ac, and add them to the fields xref_key xref_id src_xref_key src_xref_id : protein_xref_oglcnac_atlas | protein_xref_pubmed

Step 7 Based on the sample_type, map the value to the ontological values from /misc/sample_mapping.csv and add it to the fields ontology_term_id | ontology_term_name | sample_type. There can be multiple sample types in the field, sometimes separated by , or (). The individual terms have been mapped in the sample_mapping file. Thus there should be multiple rows for such entries, one per sample type.

Please note that the fields source_tissue_id source_tissue_name source_cell_line_cellosaurus_id source_cell_line_cellosaurus_name aren't a good fit here because there are ontologies other than Uberon and Cellosaurus.

Step 8 Extract comments into the notes field. If you are using comments in other datasets, we need to be uniform across datasets

Step 9 Add the fields method and analytical_throughput with the corresponding values from the source file

Step 10 Generate the log files for unmapped accessions and amino acid values

Step 11 Create citations and xref files
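Steps 2 and 6 above can be sketched roughly as follows. This is only an illustration under stated assumptions: the layout of aadict.csv is not shown in this ticket, so a small inline dictionary stands in for it, and the function names, the flag strings and the toy sequence are all hypothetical.

```python
import re

# Hypothetical stand-in for misc/aadict.csv; the real file layout is not shown in this ticket.
AADICT = {"S": "Ser", "T": "Thr"}


def map_site(canon_seq, residue, position):
    """Check a 1-based site against the canonical sequence (Step 2).

    Returns (three_letter_aa, flag). The flag is "" for a valid site,
    otherwise a short reason that could be written to the log (Step 10).
    The flag strings are illustrative, not the ones used by GlyGen.
    """
    if residue not in AADICT:
        return "", "unknown_aa"
    if not 1 <= position <= len(canon_seq):
        return AADICT[residue], "position_out_of_range"
    if canon_seq[position - 1] != residue:
        return AADICT[residue], "aa_mismatch"
    return AADICT[residue], ""


def xref_rows(accession, pmid_field, src_key="protein_xref_oglcnac_atlas"):
    """Build one source xref row plus one row per PMID (Step 6)."""
    src = {"src_xref_key": src_key, "src_xref_id": accession}
    rows = [dict(src, xref_key=src_key, xref_id=accession)]
    # Assumption: PMIDs are runs of digits, whatever separator the field uses.
    for pmid in re.findall(r"\d+", pmid_field):
        rows.append(dict(src, xref_key="protein_xref_pubmed", xref_id=pmid))
    return rows


toy_seq = "MASTK"  # toy sequence, not a real canonical sequence
print(map_site(toy_seq, "S", 3))  # valid site
print(map_site(toy_seq, "T", 3))  # residue does not match the sequence
for row in xref_rows("P36507", "20563614"):
    print(row)
```

Returning a flag instead of raising keeps invalid sites available for the log files while valid rows go to the dataset.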

rykahsay commented 1 year ago

@jeet-vora I thought I did this, please give details on what are the remaining issue(s) ONLY!

jeet-vora commented 1 year ago

@rykahsay The ticket was half done and is here because you processed only one file out of two. You processed only the unambiguous file; the ambiguous file is yet to be processed.

Also, the file has not been processed as per the readme, as some fields are missing. The current dataset has been processed as per the o_glcnac_mcw workflow, which is not needed for this dataset. The o_glcnac_mcw dataset was processed in that way for the paper.

Once reprocessed update the citations_o_glcnac_atlas.csv

rykahsay commented 1 year ago

@jeet-vora "You had only processed unambiguous file, ambiguous file is yet to be processed." --> NOT TRUE

" ... processed as per the readme as some fields are missing" --> which fields????

Like I said before, you need to be specific and tell me what is wrong with the dataset I have created so far. I need bullet point items of issues with the dataset "unreviewed/human_proteoform_glycosylation_sites_o_glcnac_atlas.csv" that was created on Dec 8.

jeet-vora commented 1 year ago

@rykahsay The dataset has not been processed as per the readme. Issues with the dataset are below.

1) The ambiguous dataset hasn't been processed - downloads/atlas_oglcnac/current/ambiguous_sites_version_2.0.csv, e.g. the O88737 ac is missing from the unreviewed dataset. Please process both the ambiguous and unambiguous datasets.

2) The dataset has been processed like o-glcnac MCW. The processing of this dataset is different from o-glcnac MCW, as not all fields are required.

3) Remove these fields from the dataset: 'blank', sample_type, accession, accession_source, peptide_seq, site_residue, position_in_peptide, position_in_protein, start_pos, end_pos, start_aa, end_aa, site_seq, glycosylation_subtype, status, uniprotkb_id, gene_name, recommended_name_full

4) These should be the fields in the dataset (see the above readme and screenshot for the fields): uniprotkb_canonical_ac, glycosylation_site_uniprotkb, amino_acid, saccharide, glycosylation_type, glycosylation_subtype, carb_name, xref_key, xref_id, src_xref_key, src_xref_id, ontology_term_id, ontology_term_name, sample_type, notes, method, analytical_throughput

rykahsay commented 1 year ago

These are the fields you wanted on the dataset:

And these are the fields I have now (the extra fields I have are required in all proteoform datasets):

rykahsay commented 1 year ago

I see O88737 in the mouse dataset:

cat unreviewed/mouse_proteoform_glycosylation_sites_o_glcnac_atlas.csv | grep O88737

"O88737-1","1005","Ser","G49108TO","Olinked","protein_xref_oglcnac_db","O88737","protein_xref_oglcnac_db","O88737","synaptosome","MS","HTP","","GlcNac","YTSGT-S-PTSLS","","","","","","","1005","1005","Ser","Ser","S"

"O88737-1","1005","Ser","G49108TO","Olinked","protein_xref_pubmed","34678516","protein_xref_oglcnac_db","O88737","synaptosome","MS","HTP","","GlcNac","YTSGT-S-PTSLS","","","","","","","1005","1005","Ser","Ser","S"

jeet-vora commented 1 year ago

Issues:

1) The sample types other than Cellosaurus names and Uberon names are not being extracted, e.g. in the human dataset brain (synaptosome) or T cells. The mapping is present in sample_mapping.csv.
U000167 human brain (synaptosome) P36507 UniProt MP2K2_HUMAN Dual specificity mitogen-activated protein kinase kinase 2 MAP2K2 LNQPGTPTRTAV T 8 396 MS LTP 20563614

2) The carb name should be GlcNAc; the A should be in uppercase.

3) There are entries in the source file that have multiple sample types, which have to be extracted as separate entries in the dataset.
U000001 rat lens, heart P23928 UniProt CRYAB_RAT Alpha-crystallin B chain Cryab EEKPAVTAAPK T 7 170 Edman degradation LTP 8639509

4) There are a large number of entries in the log file for unknown_canon. Once the server is up I will check those.

rykahsay commented 1 year ago

field_name1, field_name2, ....
U000167 human brain (synaptosome) P36507 UniProt MP2K2_HUMAN Dual specificity mitogen-activated protein kinase kinase 2 MAP2K2 LNQPGTPTRTAV T 8 396 MS LTP 20563614

jeet-vora commented 1 year ago

Issues:

See issues 1 and 2 together.

1) The sample types other than Cellosaurus names and Uberon names are not being extracted from the source dataset, e.g. brain (synaptosome) or T cells. The mapping is present in sample_mapping.csv. Please find the term in the field "in_dataset" and use the field "term_name" for the ontology name.

Example entry in the source file with header (target field_3 sample_type)
Header - id,species,sample_type,accession,accession_source,entry_name,protein_name,gene_name,peptide_seq,site_residue,position_in_peptide,position_in_protein,method,analytical_throughput,pmid,comments
Example - 167,U000167,human,brain (synaptosome),P36507,UniProt,MP2K2_HUMAN,Dual specificity mitogen-activated protein kinase kinase 2,MAP2K2,LNQPGTPTRTAV,T,8,396,MS,LTP,20563614,

sample_mapping.csv
Header - term_id,term_name,sample_type,ontology_name,in_dataset
Example - MESH:D013574,synaptosomes,artificial_structure,Medical Subject Headings,brain (synaptosome),synaptosome
Note - some sample type "in_dataset" fields are comma separated.

2) There are entries in the source dataset that have multiple sample_type values (field_3), which have to be extracted as separate entries in the dataset. The multiple sample types are separated by "," and also by ", and" and are in double quotes.
Header - id,species,sample_type,accession,accession_source,entry_name,protein_name,gene_name,peptide_seq,site_residue,position_in_peptide,position_in_protein,method,analytical_throughput,pmid,comments
Example -
1,U000001,rat,"lens, heart",P23928,UniProt,CRYAB_RAT,Alpha-crystallin B chain,Cryab,EEKPAVTAAPK,T,7,170,Edman degradation,LTP,8639509,
9569,U009569,human,HEK 293T cells,Q12934,UniProt,BFSP1_HUMAN,Filensin,BFSP1,SYVFQTRK,T,6,10,MS,HTP,30620550,

3) The carb name should be GlcNAc; the A should be in uppercase.

4) In the log files, there are a large number of entries for unknown_canon that do not belong to the organism, e.g. rat accession entries in the fruitfly log file. Please keep only erroneous entries of the corresponding species in each log file. Also, entries are being repeated, one with ac and one without ac. See below.

/data/projects/glygen/generated/datasets/logs/fruitfly_proteoform_glycosylation_sites_o_glcnac_atlas.log
Header - "uniprotkb_canonical_ac","glycosylation_site_uniprotkb","amino_acid","saccharide","glycosylation_type","xref_key","xref_id","src_xref_key","src_xref_id","sample_type","method","analytical_throughput","notes","carb_name","glycosylation_subtype","source_tissue_id","source_tissue_name","source_cell_line_cellosaurus_id","source_cell_line_cellosaurus_name","ontology_term_id","ontology_term_name","filter_flags"

Example without ac - "","170","Thr","G49108TO","O-linked","protein_xref_oglcnac_db","","protein_xref_oglcnac_db","","lens, heart","Edman degradation","LTP","","","","uknown_canon"

Example with ac "P23928","170","Thr","G49108TO","O-linked","protein_xref_pubmed","8639509","protein_xref_oglcnac_db","","lens, heart","Edman degradation","LTP","","","","uknown_canon"
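The log clean-up requested in point 4 could look something like the sketch below. It assumes each flagged row still carries the species and accession values from the source file; the dictionary keys and the function name are hypothetical, and collapsing repeats to one entry per site is one possible reading of the request.

```python
def filter_log_rows(rows, species):
    """Keep one log entry per unmapped site of the given species."""
    seen = set()
    kept = []
    for row in rows:
        if row["species"] != species:
            continue  # e.g. rat accessions do not belong in the fruitfly log
        key = (row["accession"], row["position"], row["amino_acid"])
        if key in seen:
            continue  # drop repeats of the same site
        seen.add(key)
        kept.append(row)
    return kept


# Two flagged rows for the same rat site, as in the example above.
flagged = [
    {"species": "rat", "accession": "P23928", "position": "170", "amino_acid": "Thr"},
    {"species": "rat", "accession": "P23928", "position": "170", "amino_acid": "Thr"},
]
print(filter_log_rows(flagged, "fruitfly"))
print(filter_log_rows(flagged, "rat"))
```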

image
rykahsay commented 1 year ago

@jeet-vora -- I have corrected your sample_mapping.csv file (now moved to generated/misc/sample_mapping.csv-backup). Next time, please make a proper CSV file with values in double quotes and with the same number of fields in all rows.
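A quick check along these lines can catch rows with the wrong number of fields before a mapping file is handed over. This is a minimal sketch using Python's standard csv module; the function names are made up for illustration, and the sample row is the unquoted example quoted earlier in this thread.

```python
import csv
import io


def check_field_counts(handle):
    """Return (line_number, field_count) for rows whose width differs from the header."""
    reader = csv.reader(handle)
    header = next(reader)
    return [(reader.line_num, len(row)) for row in reader if len(row) != len(header)]


def write_quoted(rows, handle):
    """Write rows with every value in double quotes."""
    csv.writer(handle, quoting=csv.QUOTE_ALL, lineterminator="\n").writerows(rows)


# Unquoted: the comma inside the in_dataset value produces a sixth field.
bad = (
    "term_id,term_name,sample_type,ontology_name,in_dataset\n"
    "MESH:D013574,synaptosomes,artificial_structure,Medical Subject Headings,"
    "brain (synaptosome),synaptosome\n"
)
print(check_field_counts(io.StringIO(bad)))

# Quoted: the same value stays in one field.
fixed = io.StringIO()
write_quoted(
    [
        ["term_id", "term_name", "sample_type", "ontology_name", "in_dataset"],
        ["MESH:D013574", "synaptosomes", "artificial_structure",
         "Medical Subject Headings", "brain (synaptosome),synaptosome"],
    ],
    fixed,
)
print(check_field_counts(io.StringIO(fixed.getvalue())))
```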

rykahsay commented 1 year ago

@jeet-vora --> All issues like (1) and (2) will go away once you edit the sample_mapping.csv file further so that "lens, heart" appears in the "in_dataset" field. These kinds of issues will go away once you put all possible sample_type values from the dataset in the "in_dataset" field of sample_mapping.csv.

I have fixed the other issues --> please check

kmartinez834 commented 1 year ago

Issue is dependent on update of sample_mapping.csv file, see ticket https://github.com/glygener/glygen-issues/issues/70

kmartinez834 commented 1 year ago

@rykahsay The generated/misc/sample_mapping.csv file has been updated (#70), please reprocess the following datasets:

fruitfly_proteoform_glycosylation_sites_o_glcnac_atlas.csv
mouse_proteoform_glycosylation_sites_o_glcnac_atlas.csv
human_proteoform_glycosylation_sites_o_glcnac_atlas.csv
rat_proteoform_glycosylation_sites_o_glcnac_atlas.csv

NOTE: In the example above, "lens, heart" refers to two tissues, which need to be mapped separately as "lens" and "heart" to the following IDs:

"term_id","term_name","sample_type","ontology_name","in_dataset"
"NCIT:C12743","Lens","tissue","NCI Thesaurus","lens"
"UBERON:0000948","heart","tissue","Uber-anatomy ontology","heart"
rykahsay commented 1 year ago

That will not be the correct way: the "in_dataset" field should contain the exact value from the input file. Given below is what you should have. Please edit to correct all such cases.

"term_id","term_name","sample_type","ontology_name","in_dataset"
"NCIT:C12743","Lens","tissue","NCI Thesaurus","lens, heart"
"UBERON:0000948","heart","tissue","Uber-anatomy ontology","lens, heart"

jeet-vora commented 1 year ago

I have updated the sample_mapping.csv. Please check now.



kmartinez834 commented 1 year ago

I updated with UnicarbKB terms as well.

@rykahsay do we need to worry about case-sensitivity and leading/trailing spaces?

Examples from datasets: " cerebrospinal fluid" "Synaptosomes" "synaptosomes"

rykahsay commented 1 year ago

Please check now -- I have modified the script to make it case insensitive
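The modified script is not shown in this thread. As a rough sketch of the idea, the lookup below matches the full sample_type value from the source file against "in_dataset" and returns every term listed for it, ignoring case. Trimming and collapsing whitespace is an extra assumption that goes beyond what the comment above states, and the function names are hypothetical.

```python
import csv
import io
from collections import defaultdict

# Rows taken from the "lens, heart" example earlier in this thread.
MAPPING_CSV = '''"term_id","term_name","sample_type","ontology_name","in_dataset"
"NCIT:C12743","Lens","tissue","NCI Thesaurus","lens, heart"
"UBERON:0000948","heart","tissue","Uber-anatomy ontology","lens, heart"
'''


def normalize(value):
    """Collapse whitespace and ignore case, so ' Lens,  Heart' matches 'lens, heart'."""
    return " ".join(value.split()).casefold()


def load_mapping(handle):
    """Group mapping rows by their normalized in_dataset value."""
    mapping = defaultdict(list)
    for row in csv.DictReader(handle):
        mapping[normalize(row["in_dataset"])].append(row)
    return mapping


def ontology_terms(sample_type, mapping):
    """Return (term_id, term_name) pairs for a source sample_type value."""
    return [(t["term_id"], t["term_name"]) for t in mapping.get(normalize(sample_type), [])]


mapping = load_mapping(io.StringIO(MAPPING_CSV))
print(ontology_terms(" Lens, Heart", mapping))  # one pair per mapped term
print(ontology_terms("not in the file", mapping))  # no match gives an empty list
```

Each returned pair would become its own output row, which is how a single source entry with "lens, heart" ends up as two dataset rows.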

kmartinez834 commented 1 year ago
kmartinez834 commented 1 year ago

@rykahsay the following terms did not map to cell line:

"DU145 cells, U2OS cells"
"HEK 293T cells, HeLa cells"
"Bel-7402 cells, SMMC-7721 cells"
"HEK 293F cells, HEK 293T cells"
"MCF-7, T47D and MDA-MB-231 cells"
"BCPAP, KTC-1, and TPC-1 cells"

Specific rows from datasets are below...

"Q8IZD2-1","435","Ser","G49108TO","O-linked","protein_xref_oglcnac_db","Q8IZD2","protein_xref_oglcnac_db","Q8IZD2","HEK 293T cells, HeLa cells","MS","LTP","","GlcNAc","IYSIH-S-IPKGT","","","","","","","435","435","Ser","Ser","S"
"Q8IZD2-1","435","Ser","G49108TO","O-linked","protein_xref_pubmed","26678539","protein_xref_oglcnac_db","Q8IZD2","HEK 293T cells, HeLa cells","MS","LTP","","GlcNAc","IYSIH-S-IPKGT","","","","","","","435","435","Ser","Ser","S"
"Q8IZD2-1","440","Thr","G49108TO","O-linked","protein_xref_oglcnac_db","Q8IZD2","protein_xref_oglcnac_db","Q8IZD2","HEK 293T cells, HeLa cells","MS","LTP","","GlcNAc","SIPKG-T-EITIA","","","","","","","440","440","Thr","Thr","T"
"Q8IZD2-1","440","Thr","G49108TO","O-linked","protein_xref_pubmed","26678539","protein_xref_oglcnac_db","Q8IZD2","HEK 293T cells, HeLa cells","MS","LTP","","GlcNAc","SIPKG-T-EITIA","","","","","","","440","440","Thr","Thr","T"

"P46937-1","241","Thr","G49108TO","O-linked","protein_xref_oglcnac_db","P46937","protein_xref_oglcnac_db","P46937","Bel-7402 cells, SMMC-7721 cells","MS, site-directed mutagenesis","LTP","","GlcNAc","WEQAM-T-QDGEI","","","","","","","241","241","Thr","Thr","T"
"P46937-1","241","Thr","G49108TO","O-linked","protein_xref_pubmed","28474680","protein_xref_oglcnac_db","P46937","Bel-7402 cells, SMMC-7721 cells","MS, site-directed mutagenesis","LTP","","GlcNAc","WEQAM-T-QDGEI","","","","","","","241","241","Thr","Thr","T"

"Q13330-1","237","Ser","G49108TO","O-linked","protein_xref_oglcnac_db","Q13330","protein_xref_oglcnac_db","Q13330","MCF-7, T47D and MDA-MB-231 cells","Site-specific mutagenesis","LTP","","GlcNAc","SVRQP-S-LHMSA","","","","","","","237","237","Ser","Ser","S"
"Q13330-1","237","Ser","G49108TO","O-linked","protein_xref_pubmed","34019948","protein_xref_oglcnac_db","Q13330","MCF-7, T47D and MDA-MB-231 cells","Site-specific mutagenesis","LTP","","GlcNAc","SVRQP-S-LHMSA","","","","","","","237","237","Ser","Ser","S"
"Q13330-1","241","Ser","G49108TO","O-linked","protein_xref_oglcnac_db","Q13330","protein_xref_oglcnac_db","Q13330","MCF-7, T47D and MDA-MB-231 cells","Site-specific mutagenesis","LTP","","GlcNAc","PSLHM-S-AAAAS","","","","","","","241","241","Ser","Ser","S"
"Q13330-1","241","Ser","G49108TO","O-linked","protein_xref_pubmed","34019948","protein_xref_oglcnac_db","Q13330","MCF-7, T47D and MDA-MB-231 cells","Site-specific mutagenesis","LTP","","GlcNAc","PSLHM-S-AAAAS","","","","","","","241","241","Ser","Ser","S"
"Q13330-1","246","Ser","G49108TO","O-linked","protein_xref_oglcnac_db","Q13330","protein_xref_oglcnac_db","Q13330","MCF-7, T47D and MDA-MB-231 cells","Site-specific mutagenesis","LTP","","GlcNAc","SAAAA-S-RDITL","","","","","","","246","246","Ser","Ser","S"
"Q13330-1","246","Ser","G49108TO","O-linked","protein_xref_pubmed","34019948","protein_xref_oglcnac_db","Q13330","MCF-7, T47D and MDA-MB-231 cells","Site-specific mutagenesis","LTP","","GlcNAc","SAAAA-S-RDITL","","","","","","","246","246","Ser","Ser","S"

"P46937-1","109","Ser","G49108TO","O-linked","protein_xref_oglcnac_db","P46937","protein_xref_oglcnac_db","P46937","BCPAP, KTC-1, and TPC-1 cells","Site-specific mutagenesis","LTP","","GlcNAc","HSRQA-S-TDAGT","","","","","","","109","109","Ser","Ser","S"
"P46937-1","109","Ser","G49108TO","O-linked","protein_xref_pubmed","34155345","protein_xref_oglcnac_db","P46937","BCPAP, KTC-1, and TPC-1 cells","Site-specific mutagenesis","LTP","","GlcNAc","HSRQA-S-TDAGT","","","","","","","109","109","Ser","Ser","S"


- mouse_proteoform_glycosylation_sites_oglcnac_atlas.csv

"Q61985-1","448","Ser","G49108TO","O-linked","protein_xref_oglcnac_db","Q61985","protein_xref_oglcnac_db","Q61985","HEK 293F cells, HEK 293T cells","prediction and site-directed mutagenesis","LTP","No mass spec. data provided; the peptide sequence is inferred from tryptic cleavage","GlcNAc","EFDSD-S-GLSLD","","","","","","","448","448","Ser","Ser","S"
"Q61985-1","448","Ser","G49108TO","O-linked","protein_xref_pubmed","29941490","protein_xref_oglcnac_db","Q61985","HEK 293F cells, HEK 293T cells","prediction and site-directed mutagenesis","LTP","No mass spec. data provided; the peptide sequence is inferred from tryptic cleavage","GlcNAc","EFDSD-S-GLSLD","","","","","","","448","448","Ser","Ser","S"
"Q61985-1","451","Ser","G49108TO","O-linked","protein_xref_oglcnac_db","Q61985","protein_xref_oglcnac_db","Q61985","HEK 293F cells, HEK 293T cells","prediction and site-directed mutagenesis","LTP","No mass spec. data provided; the peptide sequence is inferred from tryptic cleavage","GlcNAc","SDSGL-S-LDSSH","","","","","","","451","451","Ser","Ser","S"
"Q61985-1","451","Ser","G49108TO","O-linked","protein_xref_pubmed","29941490","protein_xref_oglcnac_db","Q61985","HEK 293F cells, HEK 293T cells","prediction and site-directed mutagenesis","LTP","No mass spec. data provided; the peptide sequence is inferred from tryptic cleavage","GlcNAc","SDSGL-S-LDSSH","","","","","","","451","451","Ser","Ser","S"

rykahsay commented 1 year ago

Tried to fix it -- please check
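The fix itself is not shown in the thread. For combined values like the ones reported above, one possible fallback (an assumption, not necessarily what the script does) is to split the value on ",", ", and", "and" and parentheses, then look up each part separately when the full string is not listed in "in_dataset". The function name is hypothetical.

```python
import re

# ", and" is listed before "," so the longer separator wins at the same position.
_SEPARATORS = re.compile(r"\s*,\s*and\s+|\s*,\s*|\s+and\s+|\s*[()]\s*")


def split_sample_type(value):
    """Split a combined sample_type value into its individual terms."""
    return [part.strip() for part in _SEPARATORS.split(value) if part.strip()]


for value in (
    "HEK 293T cells, HeLa cells",
    "MCF-7, T47D and MDA-MB-231 cells",
    "BCPAP, KTC-1, and TPC-1 cells",
    "brain (synaptosome)",
):
    print(value, "->", split_sample_type(value))
```

A split like this would wrongly break up any single term that itself contains a comma or the word "and", which is one argument for keeping the exact-value entries in sample_mapping.csv as the primary lookup.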

image image
kmartinez834 commented 1 year ago

👍 Looks good