Closed kmartinez834 closed 1 year ago
@jeet-vora
Why are the headers changing? Will they change again?
Headers will not change again. We are now taking the final allbiomarkers-all.csv file from data.oncomx.org rather than a file that was manually prepared/edited.
Done --> please check unreviewed/human_protein_biomarkers.csv
@jeet-vora see #13 for comment about sample mapping
👍 Dataset created, moved issues to https://github.com/glygener/glygen-issues/issues/135 for next data release
The Biomarker DB (aka OncoMX) source file now includes both protein and glycan entries. Follow the steps below to update the existing human_protein_biomarkers_cancer.csv dataset (change the file name to human_protein_biomarkers.csv).
Input file: /data/projects/glygen/downloads/biomarkerdb/current/allbiomarkers-all.csv
Output file: reviewed/human_protein_biomarkers.csv
1. Input file column names have changed
From: biomarker_id,main_xref,assessed_biomarker_entity,biomarker,best_biomarker_type,specimen_type,loinc_code,disease_name,literature_evidence,pmid,notes
To: "Assessed biomarker entity ID","Main x-ref","Assessed biomarker entity","Biomarker","BEST biomarker type","Specimen type","LOINC code","Disease name","Assessed entity type","Literature evidence","Notes"
Note: the "pmid" column has been removed, and "Assessed entity type" is a new field
2. Add new column "assessed_entity_type" to output file from source "Assessed entity type"
3. Extract only entries where "Main x-ref" starts with UPKB:
Note: There may be more than one PMID for each entry
5. All other processing steps are the same as in the last update
6. Create citations file: citations_human_protein_biomarkers.csv
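For reference, steps 1 to 3 could be sketched roughly as below. This is only an illustration, not the actual GlyGen pipeline code: the function name `convert` and the `HEADER_MAP` dictionary are hypothetical, and the pairing of new headers with old column names is assumed from the order of the two header lists above. It also assumes the output file keeps the old snake_case column names. PMID handling and the citations file are not covered.

```python
import csv

# Assumed mapping from the new source headers to the old snake_case names.
# The pairing is inferred from the order of the "From" and "To" lists above.
HEADER_MAP = {
    "Assessed biomarker entity ID": "biomarker_id",
    "Main x-ref": "main_xref",
    "Assessed biomarker entity": "assessed_biomarker_entity",
    "Biomarker": "biomarker",
    "BEST biomarker type": "best_biomarker_type",
    "Specimen type": "specimen_type",
    "LOINC code": "loinc_code",
    "Disease name": "disease_name",
    "Assessed entity type": "assessed_entity_type",  # new column (step 2)
    "Literature evidence": "literature_evidence",
    "Notes": "notes",
}


def convert(in_handle, out_handle):
    """Copy protein rows from the source CSV to the output CSV.

    Keeps only rows whose "Main x-ref" starts with "UPKB:" (step 3)
    and returns the number of rows written.
    """
    reader = csv.DictReader(in_handle)
    writer = csv.DictWriter(
        out_handle,
        fieldnames=list(HEADER_MAP.values()),
        quoting=csv.QUOTE_ALL,
    )
    writer.writeheader()
    kept = 0
    for row in reader:
        # Skip glycan entries and anything else without a UPKB x-ref.
        if not (row.get("Main x-ref") or "").startswith("UPKB:"):
            continue
        writer.writerow({new: row.get(old, "") for old, new in HEADER_MAP.items()})
        kept += 1
    return kept
```

It would be called with two open file handles, e.g. `convert(open(in_path, newline=""), open(out_path, "w", newline=""))`, where `in_path` and `out_path` are the input and output files listed above.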