Update human_protein_biomarkers_cancer.csv dataset (new name human_protein_biomarkers.csv)

kmartinez834 commented 1 year ago

The Biomarker DB (aka OncoMX) source file now includes both protein and glycan entries. Follow the steps below to update the existing human_protein_biomarkers_cancer.csv dataset and rename it to human_protein_biomarkers.csv.

Input file: /data/projects/glygen/downloads/biomarkerdb/current/allbiomarkers-all.csv

Output file: reviewed/human_protein_biomarkers.csv

1. Input file column names have changed

From: biomarker_id,main_xref,assessed_biomarker_entity,biomarker,best_biomarker_type,specimen_type,loinc_code,disease_name,literature_evidence,pmid,notes
To: "Assessed biomarker entity ID","Main x-ref","Assessed biomarker entity","Biomarker","BEST biomarker type","Specimen type","LOINC code","Disease name","Assessed entity type","Literature evidence","Notes"
Note: the "pmid" column has been removed; "Assessed entity type" is a new field
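
A minimal sketch of how the renamed headers could be mapped back to the old snake_case names, so the existing processing code keeps working. The pairing below is an assumption based on the order of the From/To lists above (in particular, that "Assessed biomarker entity ID" corresponds to biomarker_id); `NEW_TO_OLD` and `normalize_row` are hypothetical names, not part of the existing pipeline.

```python
# Assumed pairing of new source headers to the old column names,
# taken from the order of the From/To lists in this issue.
NEW_TO_OLD = {
    "Assessed biomarker entity ID": "biomarker_id",
    "Main x-ref": "main_xref",
    "Assessed biomarker entity": "assessed_biomarker_entity",
    "Biomarker": "biomarker",
    "BEST biomarker type": "best_biomarker_type",
    "Specimen type": "specimen_type",
    "LOINC code": "loinc_code",
    "Disease name": "disease_name",
    "Assessed entity type": "assessed_entity_type",  # new field
    "Literature evidence": "literature_evidence",
    "Notes": "notes",
}


def normalize_row(row):
    """Return a copy of a csv.DictReader row keyed by the old column names.

    Headers that are not in the mapping are passed through unchanged.
    """
    return {NEW_TO_OLD.get(key, key): value for key, value in row.items()}
```

Since "pmid" no longer exists as a column, any code that read it directly would need to take PMIDs from "Literature evidence" instead (step 4).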

2. Add new column "assessed_entity_type" to output file from source "Assessed entity type"

New headers for output file: "uniprotkb_canonical_ac","biomarker_id","assessed_biomarker_entity","biomarker","best_biomarker_type","loinc_code","notes","anatomical_entity","uberon_id","do_id","do_name","assessed_entity_type","xref_key","xref_id","src_xref_key","src_xref_id"

3. Extract only entries where "Main x-ref" starts with UPKB:

Ex. UPKB:P05231
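
A rough sketch of this filter using only the standard library. `protein_rows` is a hypothetical helper, the sample rows are made up for illustration (only the UPKB:P05231 value comes from this issue), and treating the text after the prefix as the accession is an assumption.

```python
import csv
import io

UPKB_PREFIX = "UPKB:"


def protein_rows(handle):
    """Yield (accession, row) for rows whose "Main x-ref" starts with UPKB:."""
    for row in csv.DictReader(handle):
        xref = (row.get("Main x-ref") or "").strip()
        if xref.startswith(UPKB_PREFIX):
            # Assumption: everything after the prefix is the accession.
            yield xref[len(UPKB_PREFIX):], row


# Illustrative input only; "OTHER:123" stands in for any non-UPKB x-ref.
SAMPLE = (
    '"Main x-ref","Biomarker"\n'
    '"UPKB:P05231","example protein biomarker"\n'
    '"OTHER:123","example non-protein biomarker"\n'
)

if __name__ == "__main__":
    for accession, row in protein_rows(io.StringIO(SAMPLE)):
        print(accession, row["Biomarker"])
```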

4. Extract PMID from "Literature evidence" field

Input: The blood count results showed anaemia in 21 (75%) patients, leucopaenia in 9 (32.1%) patients, and lymphopaenia in 23 (82.1%) patients. Patients developed severe clinical events; 6 (21.4%) patients were admitted to ICU, 10 (35.7%) patients had life-threatening complications, and 8 (28.6%) of the patients died. [PMID:32224151] Post-COVID-19 infection, lower hemoglobin levels, higher total white blood cell (WBC) counts, and higher absolute neutrophil counts were associated with increased mortality (Table 3). Analysis of other serologic biomarkers demonstrated that elevated D-dimer, lactate, and lactate dehydrogenase (LDH) in patients were significantly correlated with dying (Table 3). [PMID:32357994]

Output: 32224151, 32357994

Note: There may be more than one PMID for each entry
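
One possible way to pull the PMIDs out is a regular expression over the bracketed `[PMID:...]` tags, as in the example above. This is only a sketch: `extract_pmids` is a hypothetical name, and dropping repeated PMIDs while keeping first-seen order is an assumption, not something this issue specifies.

```python
import re

# Matches tags such as [PMID:32224151]; optional whitespace after the colon
# is tolerated as a precaution.
PMID_PATTERN = re.compile(r"\[PMID:\s*(\d+)\]")


def extract_pmids(evidence):
    """Return the PMIDs found in a "Literature evidence" value, in order."""
    pmids = []
    for pmid in PMID_PATTERN.findall(evidence):
        if pmid not in pmids:  # assumption: report each PMID once per entry
            pmids.append(pmid)
    return pmids


if __name__ == "__main__":
    text = "First finding. [PMID:32224151] Second finding. [PMID:32357994]"
    print(", ".join(extract_pmids(text)))
```

An entry with no `[PMID:...]` tag yields an empty list, so the caller can decide how to handle entries without literature evidence.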

5. All other processing steps same as last update

6. Create citations file: citations_human_protein_biomarkers.csv

kmartinez834 commented 1 year ago

@jeet-vora

rykahsay commented 1 year ago

Why are the headers changing? Will they change again?

kmartinez834 commented 1 year ago

Headers will not change again. We are now taking the final allbiomarkers-all.csv file from data.oncomx.org rather than a file that was manually prepared/edited.

rykahsay commented 1 year ago

Done --> please check unreviewed/human_protein_biomarkers.csv

kmartinez834 commented 1 year ago

@jeet-vora see #13 for comment about sample mapping

kmartinez834 commented 1 year ago

👍 Dataset created, moved issues to https://github.com/glygener/glygen-issues/issues/135 for next data release

glygener / glygen-issues

Update human_protein_biomarkers_cancer.csv dataset (new name human_protein_biomarkers.csv) #12