Update biomarker file

kmartinez834 commented 4 weeks ago

The source file we are consuming is here: https://data.glygen.org/ln2downloads/biomarkerdb/current/allbiomarkers-all.csv

Provide new csv file to the GlyGen data management team (Karina, Urnisha, Kate)

seankim658 commented 4 weeks ago

From our call, what I understood is the GlyGen team wanted to add the ID structure based on how the biomarker partnership handles IDs. Right now, the existing ID column in the GlyGen biomarker CSV is the same as the biomarker partnership canonical ID.

The canonical ID (and existing Assessed biomarker entity ID field) criteria is based on the biomarker field indicating the change in the entity.

There is a slight difference in the way the biomarker partnership delineates a second level ID (called biomarker_id) vs how GlyGen delineates instances. The biomarker partnership catalogues second level IDs based on the canonical ID and the disease, whereas GlyGen has some additional criteria for each instance. In order to create the second level ID on the allbiomarkers-all.csv file, I just mapped the existing Assessed biomarker entity ID against the disease for that row.

The script that does this, id_assign.py, is in this repository. The updated file is allbiomarkers-updated.csv, and the intermediate mapping file that the script creates to keep track of the ID mapping is id_map.json.
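
The mapping described above can be sketched roughly as follows. This is not the code in id_assign.py; it is a minimal illustration that assumes each row is a dict with hypothetical keys `canonical_id` and `disease`, and that the second-level ID takes the form `<canonical ID>-<n>` (the pattern seen in IDs such as AA4686-1 elsewhere in this thread).

```python
from collections import defaultdict


def assign_second_level_ids(rows):
    """Assign one second-level ID per (canonical ID, disease) pair.

    `rows` is a list of dicts with the hypothetical keys `canonical_id`
    and `disease`; the real column names in the CSV may differ.
    Returns the rows (updated in place) and the mapping that was built.
    """
    # For each canonical ID, remember which index each disease received.
    id_map = defaultdict(dict)
    for row in rows:
        diseases = id_map[row["canonical_id"]]
        # A disease seen for the first time under this canonical ID gets
        # the next index (1, 2, 3, ...); a repeat reuses its old index.
        index = diseases.setdefault(row["disease"], len(diseases) + 1)
        row["biomarker_id"] = f"{row['canonical_id']}-{index}"
    return rows, {key: dict(value) for key, value in id_map.items()}
```

The returned mapping plays the role of the intermediate file: writing it to disk with `json.dump` would let a later run reuse the same assignments.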

rykahsay commented 3 weeks ago

@seankim658 Some issues with the latest tsv file and the data model:

Values for biomarker_id look like plain integers. I recommend you create A00* accessions to make them more authoritative.
It looks like you are not done changing your data model: there is no biomarker_canonical_id in your recent version. Are you moving away from the concept of canonical vs isoform IDs? The biomarker APIs rely on the field names you are using, and I don't want to keep changing back and forth.

seankim658 commented 3 weeks ago

@rykahsay Updated the repository here: https://github.com/GW-HIVE/biomarker-glygen-conversion.

The ID-assigned and finalized TSV file is allbiomarker-final.tsv. The annotations file with the synonym data is annotations.json.

rykahsay commented 3 weeks ago

Why do you have "evidence_source" within each "biomarker_component" object and then again outside of "biomarker_component"? It looks redundant.

 {
        "biomarker_id": "AA4686-1",
        "biomarker_component": [
            {
                "biomarker": "increased IL6 level",
                "assessed_biomarker_entity": {
                    "recommended_name": "Interleukin-6",
                    "synonyms": [
                        {
                            "synonym": "IL-6"
                        },
                        {
                            "synonym": "B-cell stimulatory factor 2"
                        },
                        {
                            "synonym": "BSF-2"
                        },
                        {
                            "synonym": "CTL differentiation factor"
                        },
                        {
                            "synonym": "CDF"
                        },
                        {
                            "synonym": "Hybridoma growth factor"
                        },
                        {
                            "synonym": "Interferon beta-2"
                        },
                        {
                            "synonym": "IFN-beta-2"
                        }
                    ]
                },
                "assessed_biomarker_entity_id": "UPKB:P05231",
                "assessed_entity_type": "protein",
                "specimen": [
                    {
                        "name": "blood",
                        "id": "UBERON:0000178",
                        "name_space": "Uberon",
                        "url": "http://purl.obolibrary.org/obo/UBERON_0000178",
                        "loinc_code": "26881-3"
                    },
                    {
                        "name": "blood serum",
                        "id": "UBERON:0001977",
                        "name_space": "Uberon",
                        "url": "http://purl.obolibrary.org/obo/UBERON_0001977",
                        "loinc_code": "26881-3"
                    }
                ],
                "evidence_source": [
                    {
                        "id": "10914713",
                        "database": "Pubmed",
                        "url": "https://pubmed.ncbi.nlm.nih.gov/10914713",
                        "evidence_list": [
                            {
                                "evidence": "Univariate analysis of all patients demonstrated that an extent of disease (EOD) on bone scanning > or = 1, IL-6 > or = 7 pg/ml, PS > or = 1, PSA > 100 ng/ml, and ALP > 620 IU/liter were associated with a significantly lower survival rate than their respective counterparts. In multivariate analysis, however, the only two significant prognostic factors were EOD and IL-6. These results indicate that the serum IL-6 level is a significant prognostic factor for prostate cancer as well as EOD."
                            }
                        ],
                        "tags": [
                            {
                                "tag": "biomarker"
                            },
                            {
                                "tag": "assessed_biomarker_entity"
                            },
                            {
                                "tag": "assessed_biomarker_entity_id"
                            },
                            {
                                "tag": "assessed_entity_type"
                            }
                        ]
                    },
                    {
                        "id": "32479790",
                        "database": "Pubmed",
                        "url": "https://pubmed.ncbi.nlm.nih.gov/32479790",
                        "evidence_list": [
                            {
                                "evidence": "IL-6 plays multifaceted roles in regulation of vascular leakage, complement activation, and coagulation pathways, which ultimately causes poor outcomes for acute respiratory distress syndrome, multiple organ dysfunction syndrome, and SARS."
                            }
                        ],
                        "tags": [
                            {
                                "tag": "biomarker"
                            },
                            {
                                "tag": "assessed_biomarker_entity"
                            },
                            {
                                "tag": "assessed_biomarker_entity_id"
                            },
                            {
                                "tag": "assessed_entity_type"
                            }
                        ]
                    },
                    {
                        "id": "32369209",
                        "database": "Pubmed",
                        "url": "https://pubmed.ncbi.nlm.nih.gov/32369209",
                        "evidence_list": [
                            {
                                "evidence": "Low lymphocytes, increased IL-6, CRP, PCT, D dimer, and LDH, these finds were similar to previous studies. The increase of these inflammatory indexes indicates that the infected patients were in inflammatory state, which may be closely related to the inflammatory storm. The increase of the cancer biomarkers in patients with COVID-19, especially in severe and critical patients, suggests that inflammation is closely related to the development of COVID-19."
                            }
                        ],
                        "tags": [
                            {
                                "tag": "biomarker"
                            },
                            {
                                "tag": "assessed_biomarker_entity"
                            },
                            {
                                "tag": "assessed_biomarker_entity_id"
                            },
                            {
                                "tag": "assessed_entity_type"
                            }
                        ]
                    }
                ]
            }
        ],
        "best_biomarker_role": [
            {
                "role": "prognostic"
            }
        ],
        "condition": {
            "id": "DOID:10283",
            "recommended_name": {
                "id": "DOID:10283",
                "name": "prostate cancer",
                "description": "A male reproductive organ cancer that is located_in the prostate.",
                "resource": "Disease Ontology",
                "url": "http://purl.obolibrary.org/obo/DOID_10283"
            },
            "synonyms": [
                {
                    "id": "DOID:10283",
                    "name": "prostate neoplasm",
                    "resource": "Disease Ontology",
                    "url": "http://purl.obolibrary.org/obo/DOID_10283"
                },
                {
                    "id": "DOID:10283",
                    "name": "NGP - new growth of prostate",
                    "resource": "Disease Ontology",
                    "url": "http://purl.obolibrary.org/obo/DOID_10283"
                },
                {
                    "id": "DOID:10283",
                    "name": "tumor of the prostate",
                    "resource": "Disease Ontology",
                    "url": "http://purl.obolibrary.org/obo/DOID_10283"
                },
                {
                    "id": "DOID:10283",
                    "name": "prostate cancer, familial",
                    "resource": "Disease Ontology",
                    "url": "http://purl.obolibrary.org/obo/DOID_10283"
                },
                {
                    "id": "DOID:10283",
                    "name": "prostatic neoplasm",
                    "resource": "Disease Ontology",
                    "url": "http://purl.obolibrary.org/obo/DOID_10283"
                },
                {
                    "id": "DOID:10283",
                    "name": "malignant tumor of the prostate",
                    "resource": "Disease Ontology",
                    "url": "http://purl.obolibrary.org/obo/DOID_10283"
                },
                {
                    "id": "DOID:10283",
                    "name": "hereditary prostate cancer",
                    "resource": "Disease Ontology",
                    "url": "http://purl.obolibrary.org/obo/DOID_10283"
                },
                {
                    "id": "DOID:10283",
                    "name": "prostatic cancer",
                    "resource": "Disease Ontology",
                    "url": "http://purl.obolibrary.org/obo/DOID_10283"
                }
            ]
        },
        "evidence_source": [
            {
                "id": "10914713",
                "database": "Pubmed",
                "url": "https://pubmed.ncbi.nlm.nih.gov/10914713",
                "evidence_list": [
                    {
                        "evidence": "Univariate analysis of all patients demonstrated that an extent of disease (EOD) on bone scanning > or = 1, IL-6 > or = 7 pg/ml, PS > or = 1, PSA > 100 ng/ml, and ALP > 620 IU/liter were associated with a significantly lower survival rate than their respective counterparts. In multivariate analysis, however, the only two significant prognostic factors were EOD and IL-6. These results indicate that the serum IL-6 level is a significant prognostic factor for prostate cancer as well as EOD."
                    }
                ],
                "tags": [
                    {
                        "tag": "condition"
                    }
                ]
            },
            {
                "id": "32479790",
                "database": "Pubmed",
                "url": "https://pubmed.ncbi.nlm.nih.gov/32479790",
                "evidence_list": [
                    {
                        "evidence": "IL-6 plays multifaceted roles in regulation of vascular leakage, complement activation, and coagulation pathways, which ultimately causes poor outcomes for acute respiratory distress syndrome, multiple organ dysfunction syndrome, and SARS."
                    }
                ],
                "tags": [
                    {
                        "tag": "condition"
                    }
                ]
            },
            {
                "id": "32369209",
                "database": "Pubmed",
                "url": "https://pubmed.ncbi.nlm.nih.gov/32369209",
                "evidence_list": [
                    {
                        "evidence": "Low lymphocytes, increased IL-6, CRP, PCT, D dimer, and LDH, these finds were similar to previous studies. The increase of these inflammatory indexes indicates that the infected patients were in inflammatory state, which may be closely related to the inflammatory storm. The increase of the cancer biomarkers in patients with COVID-19, especially in severe and critical patients, suggests that inflammation is closely related to the development of COVID-19."
                    }
                ],
                "tags": [
                    {
                        "tag": "condition"
                    }
                ]
            }
        ],
        "citation": [
            {
                "title": "Serum interleukin 6 as a prognostic factor in patients with prostate cancer.",
                "journal": "Clinical cancer research : an official journal of the American Association for Cancer Research",
                "authors": "Nakashima J, Tachibana M, Horiguchi Y, Oya M, Ohigashi T, Asakura H, Murai M",
                "date": "2000-07-29",
                "evidence": [],
                "reference": [
                    {
                        "id": "10914713",
                        "type": "Pubmed",
                        "url": "https://pubmed.ncbi.nlm.nih.gov/10914713"
                    }
                ]
            },
            {
                "title": "Clinical characteristics and risk factors associated with COVID-19 disease severity in patients with cancer in Wuhan, China: a multicentre, retrospective, cohort study.",
                "journal": "The Lancet. Oncology",
                "authors": "Tian J, Yuan X, Xiao J, Zhong Q, Yang C, Liu B, Cai Y, Lu Z, Wang J, Wang Y, Liu S, Cheng B, Wang J, Zhang M, Wang L, Niu S, Yao Z, Deng X, Zhou F, Wei W, Li Q, Chen X, Chen W, Yang Q, Wu S, Fan J, Shu B, Hu Z, Wang S, Yang XP, Liu W, Miao X, Wang Z",
                "date": "2020-06-02",
                "evidence": [],
                "reference": [
                    {
                        "id": "32479790",
                        "type": "Pubmed",
                        "url": "https://pubmed.ncbi.nlm.nih.gov/32479790"
                    }
                ]
            },
            {
                "title": "Clinical characteristics and outcomes of cancer patients with COVID-19.",
                "journal": "Journal of medical virology",
                "authors": "Yang F, Shi S, Zhu J, Shi J, Dai K, Chen X",
                "date": "2020-05-06",
                "evidence": [],
                "reference": [
                    {
                        "id": "32369209",
                        "type": "Pubmed",
                        "url": "https://pubmed.ncbi.nlm.nih.gov/32369209"
                    }
                ]
            }
        ],
        "biomarker_canonical_id": "AA4686",
        "collision": 1
    }

rykahsay commented 3 weeks ago

@seankim658 --- looks like fields "exposure_agent" and "exposure_agent_id" are not being consumed in your data model

seankim658 commented 3 weeks ago

@rykahsay For the evidence source, the idea is that multi-component biomarkers allow evidence to be tied to a particular component. For example, if one component's assessed entity in a multi-component biomarker is tested through a different specimen, the evidence for that should be attached only to that component and not to the other components. The top-level evidence_source key is for evidence relating to top-level fields in the entry: condition, exposure agent, biomarker role, etc.

As for the exposure agent: in our data model, condition and exposure agent are mutually exclusive, so a biomarker can be related to either a specific condition or an exposure agent, but not both. Our JSON schema only allows one or the other. Since the GlyGen data doesn't have any exposure-agent-related biomarkers, nothing is captured from that field. @DaniallMasood can explain this further if you have more questions; I only understand it from the data model perspective, not the scientific reasoning behind it.
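
As a rough illustration of the one-or-the-other rule, the check below accepts an entry only when exactly one of the two keys is populated. The key name `exposure_agent` follows the field name used in this thread; its position and shape inside the JSON entry are assumptions, and the real JSON schema is not reproduced here.

```python
def has_condition_xor_exposure_agent(entry):
    """Return True when exactly one of `condition` / `exposure_agent` is set.

    Empty values ({}, "", None) count as "not set". The key layout is an
    assumption for illustration, not the partnership's actual schema.
    """
    has_condition = bool(entry.get("condition"))
    has_exposure_agent = bool(entry.get("exposure_agent"))
    # Inequality of two booleans is an exclusive-or.
    return has_condition != has_exposure_agent
```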

seankim658 commented 3 weeks ago

@rykahsay I also noticed I left the collision key in the JSON file. That can be ignored. That is not part of our data model and I was using it for sanity checking the data conversion. Collision values of 1 mean that this biomarker is a duplicate/exact match of a biomarker we have in our database. I'll remove that field from the JSON and reupload it.

Edit: The JSON has been reuploaded.

rykahsay commented 3 weeks ago

@seankim658 ... from the tsv file ("evidence_source" field), how do you distinguish between PMIDs that are evidence at the component level vs evidence at the top level?

evidence_source Pubmed:10914713 Pubmed:32479790 Pubmed:32369209 Pubmed:10914713 Pubmed:32479790 Pubmed:32369209 Pubmed:32259560 Pubmed:32428990 Pubmed:32677844

seankim658 commented 3 weeks ago

@rykahsay That is done based on the tags for the specific evidence source.

If the PMID only relates to a top-level field (such as condition or best_biomarker_role), then it will only be in the top-level evidence with the top-level field tag. If the evidence is tagged as relating only to specific biomarker component fields, then it is populated in that component's evidence source with the appropriate tags.

However, for this GlyGen data the tags are all the same in that they include mostly biomarker component evidence and the condition tag. In this case it will be populated in the component with the specific component tags and in the top level with the condition tag.

All the data evidence tags are the same right now in the GlyGen data but I know Raja wants this data QC'ed soon by Daniall and some students so when that happens the tags and evidence will be more specific. For now it will create some redundancy but in the future it allows us to be explicit about connecting evidence to specific parts of the biomarker data.
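
A minimal sketch of this tag-based routing, assuming the tags arrive as one semicolon-separated string (as in the last column of the TSV rows quoted in this thread). The two tag groups below only list tags mentioned in this discussion and are an assumption, not the full list from the data model.

```python
# Assumed tag groups, limited to tags that appear in this thread.
COMPONENT_TAGS = {
    "biomarker",
    "assessed_biomarker_entity",
    "assessed_biomarker_entity_id",
    "assessed_entity_type",
}
TOP_LEVEL_TAGS = {"condition", "best_biomarker_role", "exposure_agent"}


def split_tags(tag_field):
    """Split a semicolon-separated tag string into two lists:
    tags for the component-level evidence and tags for the top level."""
    tags = [tag.strip() for tag in tag_field.split(";") if tag.strip()]
    component = [tag for tag in tags if tag in COMPONENT_TAGS]
    top_level = [tag for tag in tags if tag in TOP_LEVEL_TAGS]
    return component, top_level
```

With the tag string currently used on every row, the same PMID ends up in both places: once in the component with the four component tags, and once at the top level with the condition tag.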

DaniallMasood commented 3 weeks ago

To answer the question about exposure agent and condition being mutually exclusive: as Sean said, it will be one or the other. This is because some biomarkers are associated with a condition directly (i.e., risk factor, diagnostic, prognostic). Exposure agents for biomarkers are treatments, therapies, or anything else a biomarker is exposed to while still being related to a condition. This means the biomarker can predict or monitor the response to a treatment or exposure agent and how it is affecting an individual.

rykahsay commented 3 weeks ago

When I send @sujeetvkulkarni (the frontend) evidence objects, I am making the changes shown below. I am assuming information contained in "evidence_list" and "tags" will not be displayed in the frontend. I am doing this for consistency so that the frontend code does not need to change in displaying evidence icons.

On the other hand, I can leave it as is if @sujeetvkulkarni and @ReneRanzinger think the frontend can handle it without these changes.

From

"evidence_source": [
            {
                "id": "10914713",
                "database": "Pubmed",
                "url": "https://pubmed.ncbi.nlm.nih.gov/10914713",
                "evidence_list": [
                    {
                        "evidence": "Univariate analysis of all patients demonstrated that an extent of disease (EOD) on bone scanning > or = 1, IL-6 > or = 7 pg/ml, PS > or = 1, PSA > 100 ng/ml, and ALP > 620 IU/liter were associated with a significantly lower survival rate than their respective counterparts. In multivariate analysis, however, the only two significant prognostic factors were EOD and IL-6. These results indicate that the serum IL-6 level is a significant prognostic factor for prostate cancer as well as EOD."
                    }
                ],
                "tags": [
                    {
                        "tag": "condition"
                    }
                ]
            }
]

To

"evidence": [
            {
                "id": "10914713",
                "database": "Pubmed",
                "url": "https://pubmed.ncbi.nlm.nih.gov/10914713"
            }
]
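
The From/To change above amounts to keeping only three keys per evidence item. A small sketch, assuming the input is the list stored under "evidence_source" (the function name is hypothetical):

```python
def simplify_evidence_source(evidence_source):
    """Reduce each evidence item to `id`, `database`, and `url`,
    dropping `evidence_list` and `tags`."""
    keep = ("id", "database", "url")
    return [
        {key: item[key] for key in keep if key in item}
        for item in evidence_source
    ]
```

The caller would store the result under the "evidence" key, as in the To example.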

ReneRanzinger commented 3 weeks ago

Remember, the aim is to make the two object models (GlyGen Biomarker, CFDE Biomarker) match, at least on the API level, meaning both details APIs return the same JSON. If that means we have to change the current JSON, then this is fine. @sujeetvkulkarni will have to deal with these changes.

rykahsay commented 3 weeks ago

Then I will leave everything as is

sujeetvkulkarni commented 3 weeks ago

@ReneRanzinger @rykahsay ok, that's fine.

rykahsay commented 3 weeks ago

@seankim658 while comparing the JSON objects I am constructing from the tsv file to the objects you created (given in the GitHub file allbiomarker-final.json), I discovered this issue: the combination assessed_entity_type=glycan and assessed_biomarker_entity_id=UPKB:P01009 does not make sense.

"biomarker_id": "AN4134-1", "assessed_biomarker_entity_id": "UPKB:P01009", "assessed_entity_type": "glycan",

rykahsay commented 3 weeks ago

@seankim658 ... the following rows in the tsv file are lacking PubMed/DOI evidence -- I don't know why they should be considered in making the JSON objects

$ cat downloads/biomarkerdb/current/allbiomarkers.tsv | grep AN4108 |grep None
AN4108  AN4108-1    presence of COBAS lung cancer EGFR G719X, L858R, L861Q, S768I, T790M mutations panel    UPKB:P00533 gene    lung cancer DOID:1324           diagnostic  lung    UBERON:0002048  21665-None:None The Cobas EGFR Mutation Test v2 is a real-time PCR test for the qualitative detection of defined mutations of the epidermal growth factor receptor (EGFR) gene in non-small cell lung cancer (NSCLC) patients. Defined EGFR mutations are detected using DNA isolated from formalin-fixed paraffin-embedded tumor tissue (FFPET) or circulating-free tumor DNA (cfDNA) from plasma derived from EDTA anti-coagulated peripheral whole blood.The test is indicated as a companion diagnostic to aid in selecting NSCLC patients for treatment with the targeted therapies listed ... Drug FFPET PlasmaTARCEVA  (erlotinib) Exon 19 deletions and L858R Exon 19 deletions and L858R TAGRISSO (osimertinib) 790M T790M ... Table 2 below that are also detected by the cobas  EGFR Mutation Test v2: Table 2Drug FFPET PlasmaTARCEVA  (erlotinib) G719X, exon 20 insertions, T790M, S768I and L861Q G719X, exon 20 insertions, T790M, S768I and L861QTAGRISSO (osimertinib) G719X, exon 19 deletions, L858R, exon 20 insertions, S768I, and L861Q G719X, exon 19 deletions, L858R, exon 20 insertions, S768I, and L861Q.   biomarker;assessed_biomarker_entity;assessed_biomarker_entity_id;assessed_entity_type;condition
AN4108  AN4108-1    presence of COBAS lung cancer EGFR G719X, L858R, L861Q, S768I, T790M mutations panel    UPKB:P00533 gene    lung cancer DOID:1324           diagnostic  blood   UBERON:0000178  21665-None:None The Cobas EGFR Mutation Test v2 is a real-time PCR test for the qualitative detection of defined mutations of the epidermal growth factor receptor (EGFR) gene in non-small cell lung cancer (NSCLC) patients. Defined EGFR mutations are detected using DNA isolated from formalin-fixed paraffin-embedded tumor tissue (FFPET) or circulating-free tumor DNA (cfDNA) from plasma derived from EDTA anti-coagulated peripheral whole blood.The test is indicated as a companion diagnostic to aid in selecting NSCLC patients for treatment with the targeted therapies listed ... Drug FFPET PlasmaTARCEVA  (erlotinib) Exon 19 deletions and L858R Exon 19 deletions and L858R TAGRISSO (osimertinib) 790M T790M ... Table 2 below that are also detected by the cobas  EGFR Mutation Test v2: Table 2Drug FFPET PlasmaTARCEVA  (erlotinib) G719X, exon 20 insertions, T790M, S768I and L861Q G719X, exon 20 insertions, T790M, S768I and L861QTAGRISSO (osimertinib) G719X, exon 19 deletions, L858R, exon 20 insertions, S768I, and L861Q G719X, exon 19 deletions, L858R, exon 20 insertions, S768I, and L861Q.   biomarker;assessed_biomarker_entity;assessed_biomarker_entity_id;assessed_entity_type;condition
AN4108  AN4108-1    presence of THERASCREEN lung cancer EGFR mutation panel UPKB:P00533 gene    lung cancer DOID:1324           predictive  lung    UBERON:0002048  55764-5 None:None   THERASCREEN EGFR RGQ PCR KIT is a real-time pcr test for the qualitative detection of exon 19 deletions and exon 21 (L858R) substitution mutations of the epidermal growth factor receptor (EGFR) gene in DNA derived from formalin-fixed paraffin-embedded (FFPE) non-small cell lung cancer (NSCLC) tumor tissue. the test is intended to be used to select patients with NSCLC for whom gilotrjf (afatinib), an EGFR TYROSINE KINASE INHIBITOR (TKI), is indicated. Safety and efficacy of gilotrif (afatinib) have not been established in patients whose tumors have L861Q, G719X, 87681, exon 20 insertions, and T790M mutations, which are also detected by the THERASCREEN  EGFR RGQ PCR KIT. Specimens are processed using the QIAAMP  DSP DNA FFPE TISSUE KIT for manual sample preparation and the rotor-gene Q MDX instrument for automated amplification and detection.    biomarker;assessed_biomarker_entity;assessed_biomarker_entity_id;assessed_entity_type;condition
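
Rows like these could be filtered out before building the JSON objects. The sketch below assumes the evidence column holds whitespace- or semicolon-separated `Database:id` tokens (for example `Pubmed:10914713` or `Doi:10.1101/...`) and that `None:None` marks missing evidence; the exact column format is a guess based on the rows quoted in this thread.

```python
def has_evidence(evidence_field):
    """Return True if at least one token names a real database and ID."""
    for token in evidence_field.replace(";", " ").split():
        # Split on the first colon only, so IDs may contain further colons.
        database, _, identifier = token.partition(":")
        if database not in ("", "None") and identifier not in ("", "None"):
            return True
    return False
```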

rykahsay commented 3 weeks ago

Why is "AA4761-2" missing best_biomarker_role=monitoring in your allbiomarker-final.json file? The tsv file, as shown below, contains rows for "AA4761-2" with best_biomarker_role=monitoring

$ cat downloads/biomarkerdb/current/allbiomarkers.tsv | grep AA4761-2  |grep monitoring 
AA4761  AA4761-2    Increased HGF level Hepatocyte growth factor    UPKB:P14210 protein COVID-19    DOID:0080600            monitoring  blood   UBERON:0000178  79394-3 Doi:10.1101/2020.05.31.20118315 Hepatocyte growth factor (HGF) classified severe from nonsevere COVID-19 patients with a sensitivity of 84.6% and a specificity of 97.9% under a cutoff value of 1128 pg/ml. The level of this cytokine did not increase in nonsevere patients but was significantly elevated in severe patients. Considering its potent antiinflammatory function, we suggest that HGF might be a new candidate therapy for critical COVID-19. biomarker;assessed_biomarker_entity;assessed_biomarker_entity_id;assessed_entity_type;condition
AA4761  AA4761-2    Increased HGF level Hepatocyte growth factor    UPKB:P14210 protein COVID-19    DOID:0080600            monitoring  blood serum UBERON:0001977  79394-3 Doi:10.1101/2020.05.31.2011831Hepatocyte growth factor (HGF) classified severe from nonsevere COVID-19 patients with a sensitivity of 84.6% and a specificity of 97.9% under a cutoff value of 1128 pg/ml. The level of this cytokine did not increase in nonsevere patients but was significantly elevated in severe patients. Considering its potent antiinflammatory function, we suggest that HGF might be a new candidate therapy for critical COVID-19.   biomarker;assessed_biomarker_entity;assessed_biomarker_entity_id;assessed_entity_type;condition

seankim658 commented 3 weeks ago

@rykahsay There are a fair number of issues with the GlyGen biomarker data, which is why Raja wants @DaniallMasood /students/interns QC’ing it over the coming months. I believe there are also rows in the GlyGen data that have UPKB IDs despite having the assessed_entity_type as DNA. There are additional issues, especially with the panel/multicomponent biomarkers.
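
Mismatches of this kind could be flagged automatically ahead of manual QC. The sketch below only encodes the two combinations called out in this thread (glycan or DNA paired with a UPKB ID); the set of suspect pairs is an assumption and would need input from the curators.

```python
# (entity type, ID prefix) combinations reported as inconsistent in this thread.
SUSPECT_PAIRS = {("glycan", "UPKB"), ("dna", "UPKB")}


def is_suspect(entity_type, entity_id):
    """Return True when the entity type and the ID namespace look inconsistent."""
    prefix = entity_id.split(":", 1)[0]
    return (entity_type.strip().lower(), prefix) in SUSPECT_PAIRS
```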

The missing evidence is another issue, those could be dropped if you decide. I just worked with the data to convert it into our data format and make it work from a data engineering perspective. The QC’ers and people with science knowledge will have to make the corrections most likely.

It looks like AA4761-2 got missed for some reason in converting the biomarker roles in the TSV file. There are four rows for AA4761-2, two with the role “prognostic” and two for the role “monitoring”. All four rows should have the best_biomarker_role value “prognostic;monitoring”.
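
The intended merge can be sketched as below, assuming each row is a dict with `biomarker_id` and `best_biomarker_role` keys and that roles are joined with semicolons in first-seen order. It illustrates the fix described above and is not the conversion code itself.

```python
def merge_roles(rows):
    """Give every row of a biomarker_id the union of that ID's roles."""
    roles_by_id = {}
    # First pass: collect the distinct roles per biomarker_id, in order seen.
    for row in rows:
        roles = roles_by_id.setdefault(row["biomarker_id"], [])
        for role in row["best_biomarker_role"].split(";"):
            role = role.strip()
            if role and role not in roles:
                roles.append(role)
    # Second pass: write the merged role string back to each row.
    for row in rows:
        row["best_biomarker_role"] = ";".join(roles_by_id[row["biomarker_id"]])
    return rows
```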

rykahsay commented 3 weeks ago

@seankim658 ... in biomarker_id= AA4737-4 in the tsv file, why is assessed_biomarker_entity_id=UPKB:P02768 associated with

    "assessed_biomarker_entity": "Albumin",
    "biomarker": "decreased ALB level",

and

   "assessed_biomarker_entity": "Albumin to Globulin ratio",
    "biomarker": "decreased ALB-GLO ratio",

When you made your JSON object for AA4737-4, it looks like you chose the first one (shown in screenshot at the bottom).

In other words, for a given biomarker_id in the tsv file, there is a one-to-many relationship between assessed_biomarker_entity_id and assessed_biomarker_entity/biomarker. If this relationship is real, then you should have multiple components with the same assessed_biomarker_entity_id to house the multiple assessed_biomarker_entity/biomarker values.

$ cat downloads/biomarkerdb/current/allbiomarkers.tsv | grep AA4737-4

{
    "biomarker_id": "AA4737-4",
    "assessed_biomarker_entity_id": "UPKB:P02768",
    "assessed_biomarker_entity": "Albumin",
    "biomarker": "decreased ALB level",
    "specimen": "blood",
    "specimen_id": "UBERON:0000178",
}
{
    "biomarker_id": "AA4737-4",
    "assessed_biomarker_entity_id": "UPKB:P02768",
    "assessed_biomarker_entity": "Albumin",
    "biomarker": "decreased ALB level",
    "specimen": "blood plasma",
    "specimen_id": "UBERON:0001969",
}
{
    "biomarker_id": "AA4737-4",
    "assessed_biomarker_entity_id": "UPKB:P02768",
    "assessed_biomarker_entity": "Albumin to Globulin ratio",
    "biomarker": "decreased ALB-GLO ratio",
    "specimen": "blood",
    "specimen_id": "UBERON:0000178",
}
{
    "biomarker_id": "AA4737-4",
    "assessed_biomarker_entity_id": "UPKB:P02768",
    "assessed_biomarker_entity": "Albumin to Globulin ratio",
    "biomarker": "decreased ALB-GLO ratio",
    "specimen": "blood plasma",
    "specimen_id": "UBERON:0001969",
}
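
If the one-to-many relationship is real, the rows for one biomarker_id could be grouped into components keyed on the full (entity ID, entity name, biomarker) triple rather than on the entity ID alone. A rough sketch, using the field names from the objects above; the output layout is simplified and is not the full data model.

```python
def build_components(rows):
    """Group the rows of one biomarker_id into components.

    Rows that share entity ID, entity name, and biomarker text become one
    component; their specimens are collected without duplicates.
    """
    components = {}
    for row in rows:
        key = (
            row["assessed_biomarker_entity_id"],
            row["assessed_biomarker_entity"],
            row["biomarker"],
        )
        component = components.setdefault(key, {
            "assessed_biomarker_entity_id": key[0],
            "assessed_biomarker_entity": key[1],
            "biomarker": key[2],
            "specimen": [],
        })
        specimen = {"name": row["specimen"], "id": row["specimen_id"]}
        if specimen not in component["specimen"]:
            component["specimen"].append(specimen)
    # Dicts keep insertion order, so components come back in first-seen order.
    return list(components.values())
```

Applied to the four AA4737-4 rows, this yields two components that share UPKB:P02768, each with the blood and blood plasma specimens.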

rykahsay commented 3 weeks ago

@seankim658 ... there is no such thing as "GlyGen biomarker data". The biomarker data in GlyGen came from the biomarker team. The GlyGen team shared the last biomarker input data we downloaded from your team so that your team can give us back an updated version of it.

The biomarker team has to look at the issues I am raising very carefully and address them urgently.

The biomarker functionalities are the most important aspects in 2.5 and we cannot afford to miss finishing all our tasks.

seankim658 commented 3 weeks ago

@rykahsay Talked to Raja; @kmartinez834 and @DaniallMasood will have to do the data QC today, especially on the panel biomarkers. This is out of my scope now.

seankim658 commented 3 weeks ago

@rykahsay Daniall and Karina are fixing the data. When they send it to me, I'll re-assign IDs and then send it back to you.

seankim658 commented 3 weeks ago

@rykahsay Karina and Daniall sent the data back to me, corrected for the issues you mentioned, and I re-assigned IDs. The TSV and JSON are in the repo. There are no longer records with duplicate biomarker_ids, and Karina entered the missing evidence sources.

rykahsay commented 3 weeks ago

@seankim658 ... for biomarker_id=AN4490-1, your value for "assessed_biomarker_entity_id" ("UPKB:Q8N726") is not a UniProtKB canonical accession. From the GlyGen perspective, we map any non-canonical UniProtKB accession to the corresponding canonical accession (in this case "P42771-1"). Unless you do the same, this will be one of the areas where my objects will differ from yours.
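
One way to apply that mapping on the partnership side is sketched below. This is only an illustration under stated assumptions: the lookup table `NONCANONICAL_TO_CANONICAL` is hypothetical and holds just the one pair quoted above (a real table would have to come from a UniProtKB mapping resource), and keeping the "UPKB:" prefix on the result is also an assumption, not something confirmed in this thread.

```python
# Hypothetical lookup table; the only entry is the pair quoted in the comment above.
NONCANONICAL_TO_CANONICAL = {"Q8N726": "P42771-1"}


def to_canonical(entity_id, mapping=NONCANONICAL_TO_CANONICAL):
    """Replace a non-canonical accession with its canonical one, if known.

    Accepts IDs with or without a "PREFIX:" part and preserves the prefix
    (an assumption about the desired output format).
    """
    prefix, sep, accession = entity_id.partition(":")
    if not sep:
        # No prefix present: the whole string is the accession.
        return mapping.get(entity_id, entity_id)
    return prefix + sep + mapping.get(accession, accession)


print(to_canonical("UPKB:Q8N726"))
print(to_canonical("UPKB:P02768"))
```

Accessions that are absent from the table are returned unchanged, so already-canonical IDs pass through untouched.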

rykahsay commented 3 weeks ago

@seankim658 --- in the "specimen" objects, please use "namespace" not "name_space"
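
A small sketch of that rename, assuming the key can appear anywhere in a nested record. The helper name `rename_key` and the layout of the sample specimen object are assumptions for illustration; only the key names "name_space" and "namespace" come from the comment above.

```python
def rename_key(obj, old="name_space", new="namespace"):
    """Return a copy of a nested dict/list structure with `old` keys renamed to `new`."""
    if isinstance(obj, dict):
        return {(new if k == old else k): rename_key(v, old, new) for k, v in obj.items()}
    if isinstance(obj, list):
        return [rename_key(v, old, new) for v in obj]
    return obj


# Hypothetical record layout, used only to exercise the helper.
sample = {
    "biomarker_id": "AA4737-4",
    "specimen": [{"name": "blood", "name_space": "UBERON", "id": "UBERON:0000178"}],
}

fixed = rename_key(sample)
print(fixed["specimen"][0])
```

The helper builds a new structure rather than mutating its input, so the original record stays available for comparison.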

DaniallMasood commented 3 weeks ago

For biomarker_id=AN4490-1, will it be a blocker to keep the assessed_biomarker_entity_id we already have in place? UPKB:Q8N726 is the reported entity ID on UniProt and was easy for us to map to. From our perspective, we are mapping the UniProt ID to the matching UPKB, which should not be an issue.

rykahsay commented 3 weeks ago

@DaniallMasood ... the GlyGen APIs and the partnership APIs need to return the same response for biomarker_id=AN4490-1.

rykahsay commented 3 weeks ago

@seankim658 ... I don't know where you are compiling disease/condition information from. I would imagine GlyGen has a more comprehensive integration of disease information. So, for us to synchronize on disease information, I suggest you use the following:

Download disease objects from our ftp site: https://data.glygen.org/ln2releases/ftp/v-2.4.1/

Given a specific DOID ("10283" for example), the function below can be used as:

disease_obj = get_disease_info("10283", "", "")

import json
import os

def get_disease_info(do_id, mondo_id, mim_id):
    # Use the first non-empty ID, in the order DOID, MONDO, OMIM.
    disease_info = {}
    json_file = ""
    if do_id != "":
        json_file = "jsondb/diseasedb/doid.%s.json" % (do_id.replace("DOID:", ""))
    elif mondo_id != "":
        json_file = "jsondb/diseasedb/mondo.%s.json" % (mondo_id.replace("MONDO:", ""))
    elif mim_id != "":
        json_file = "jsondb/diseasedb/mim.%s.json" % (mim_id.replace("OMIM:", ""))
    # Return an empty dict when no ID was given or the file does not exist.
    if os.path.isfile(json_file):
        with open(json_file, "r") as fh:
            disease_info = json.load(fh)

    return disease_info

As an example, consider "condition" for AN4208-1

in my JSON object

"condition": {
        "recommended_name": {
            "id": "DOID:2394",
            "resource": "Disease Ontology",
            "url": "http://disease-ontology.org/term/DOID:2394",
            "name": "ovarian cancer",
            "description": "A female reproductive organ cancer that is located_in the ovary."
        },
        "synonyms": [
            {
                "id": "DOID:2394",
                "resource": "Disease Ontology",
                "url": "http://disease-ontology.org/term/DOID:2394",
                "name": "malignant Ovarian tumor",
                "description": ""
            },
            {
                "id": "DOID:2394",
                "resource": "Disease Ontology",
                "url": "http://disease-ontology.org/term/DOID:2394",
                "name": "malignant tumour of ovary",
                "description": ""
            },
            {
                "id": "DOID:2394",
                "resource": "Disease Ontology",
                "url": "http://disease-ontology.org/term/DOID:2394",
                "name": "ovarian neoplasm",
                "description": ""
            },
            {
                "id": "DOID:2394",
                "resource": "Disease Ontology",
                "url": "http://disease-ontology.org/term/DOID:2394",
                "name": "ovary neoplasm",
                "description": ""
            },
            {
                "id": "DOID:2394",
                "resource": "Disease Ontology",
                "url": "http://disease-ontology.org/term/DOID:2394",
                "name": "primary ovarian cancer",
                "description": ""
            },
            {
                "id": "DOID:2394",
                "resource": "Disease Ontology",
                "url": "http://disease-ontology.org/term/DOID:2394",
                "name": "tumor of the Ovary",
                "description": ""
            },
            {
                "id": "MONDO:0011931",
                "resource": "MONDO",
                "url": "https://monarchinitiative.org/disease/MONDO:0011931",
                "name": "ovarian cancer, susceptibility to, 1",
                "description": ""
            },
            {
                "id": "MONDO:0011931",
                "resource": "MONDO",
                "url": "https://monarchinitiative.org/disease/MONDO:0011931",
                "name": "OVCAS1",
                "description": ""
            },
            {
                "id": "MONDO:0011931",
                "resource": "MONDO",
                "url": "https://monarchinitiative.org/disease/MONDO:0011931",
                "name": "ovarian cancer, susceptibility to",
                "description": ""
            },
            {
                "id": "MONDO:0011931",
                "resource": "MONDO",
                "url": "https://monarchinitiative.org/disease/MONDO:0011931",
                "name": "ovarian cancer, susceptibility to, 1",
                "description": ""
            },
            {
                "id": "MONDO:0008170",
                "resource": "MONDO",
                "url": "https://monarchinitiative.org/disease/MONDO:0008170",
                "name": "ovarian cancer",
                "description": "A primary or metastatic malignant neoplasm involving the ovary. Most primary malignant ovarian neoplasms are either carcinomas (serous, mucinous, or endometrioid adenocarcinomas) or malignant germ cell tumors. Metastatic malignant neoplasms to the ovary include carcinomas, lymphomas, and melanomas."
            },
            {
                "id": "MONDO:0008170",
                "resource": "MONDO",
                "url": "https://monarchinitiative.org/disease/MONDO:0008170",
                "name": "ovarian neoplasm",
                "description": ""
            },
            {
                "id": "MONDO:0008170",
                "resource": "MONDO",
                "url": "https://monarchinitiative.org/disease/MONDO:0008170",
                "name": "ovary neoplasm",
                "description": ""
            },
            {
                "id": "MONDO:0008170",
                "resource": "MONDO",
                "url": "https://monarchinitiative.org/disease/MONDO:0008170",
                "name": "tumor of the ovary",
                "description": ""
            },
            {
                "id": "MONDO:0008170",
                "resource": "MONDO",
                "url": "https://monarchinitiative.org/disease/MONDO:0008170",
                "name": "tumour of the ovary",
                "description": ""
            },
            {
                "id": "MONDO:0008170",
                "resource": "MONDO",
                "url": "https://monarchinitiative.org/disease/MONDO:0008170",
                "name": "cancer of ovary",
                "description": ""
            },
            {
                "id": "MONDO:0008170",
                "resource": "MONDO",
                "url": "https://monarchinitiative.org/disease/MONDO:0008170",
                "name": "cancer of the ovary",
                "description": ""
            },
            {
                "id": "MONDO:0008170",
                "resource": "MONDO",
                "url": "https://monarchinitiative.org/disease/MONDO:0008170",
                "name": "malignant neoplasm of ovary",
                "description": ""
            },
            {
                "id": "MONDO:0008170",
                "resource": "MONDO",
                "url": "https://monarchinitiative.org/disease/MONDO:0008170",
                "name": "malignant neoplasm of the ovary",
                "description": ""
            },
            {
                "id": "MONDO:0008170",
                "resource": "MONDO",
                "url": "https://monarchinitiative.org/disease/MONDO:0008170",
                "name": "malignant ovarian neoplasm",
                "description": ""
            },
            {
                "id": "MONDO:0008170",
                "resource": "MONDO",
                "url": "https://monarchinitiative.org/disease/MONDO:0008170",
                "name": "malignant ovarian tumor",
                "description": ""
            },
            {
                "id": "MONDO:0008170",
                "resource": "MONDO",
                "url": "https://monarchinitiative.org/disease/MONDO:0008170",
                "name": "malignant ovarian tumour",
                "description": ""
            },
            {
                "id": "MONDO:0008170",
                "resource": "MONDO",
                "url": "https://monarchinitiative.org/disease/MONDO:0008170",
                "name": "malignant ovary neoplasm",
                "description": ""
            },
            {
                "id": "MONDO:0008170",
                "resource": "MONDO",
                "url": "https://monarchinitiative.org/disease/MONDO:0008170",
                "name": "malignant tumor of ovary",
                "description": ""
            },
            {
                "id": "MONDO:0008170",
                "resource": "MONDO",
                "url": "https://monarchinitiative.org/disease/MONDO:0008170",
                "name": "malignant tumor of the ovary",
                "description": ""
            },
            {
                "id": "MONDO:0008170",
                "resource": "MONDO",
                "url": "https://monarchinitiative.org/disease/MONDO:0008170",
                "name": "malignant tumour of ovary",
                "description": ""
            },
            {
                "id": "MONDO:0008170",
                "resource": "MONDO",
                "url": "https://monarchinitiative.org/disease/MONDO:0008170",
                "name": "malignant tumour of the ovary",
                "description": ""
            },
            {
                "id": "MONDO:0008170",
                "resource": "MONDO",
                "url": "https://monarchinitiative.org/disease/MONDO:0008170",
                "name": "ovarian cancer",
                "description": ""
            },
            {
                "id": "MONDO:0008170",
                "resource": "MONDO",
                "url": "https://monarchinitiative.org/disease/MONDO:0008170",
                "name": "ovarian cancer, somatic",
                "description": ""
            },
            {
                "id": "MONDO:0008170",
                "resource": "MONDO",
                "url": "https://monarchinitiative.org/disease/MONDO:0008170",
                "name": "ovarian malignant tumor",
                "description": ""
            },
            {
                "id": "MONDO:0008170",
                "resource": "MONDO",
                "url": "https://monarchinitiative.org/disease/MONDO:0008170",
                "name": "ovarian malignant tumour",
                "description": ""
            },
            {
                "id": "MONDO:0008170",
                "resource": "MONDO",
                "url": "https://monarchinitiative.org/disease/MONDO:0008170",
                "name": "ovary cancer",
                "description": ""
            },
            {
                "id": "MONDO:0008170",
                "resource": "MONDO",
                "url": "https://monarchinitiative.org/disease/MONDO:0008170",
                "name": "ovarian cancer, epithelial",
                "description": ""
            },
            {
                "id": "MONDO:0008170",
                "resource": "MONDO",
                "url": "https://monarchinitiative.org/disease/MONDO:0008170",
                "name": "primary ovarian cancer",
                "description": ""
            },
            {
                "id": "MIM:167000",
                "resource": "MIM",
                "url": "https://www.omim.org/entry/167000",
                "name": "Ovarian cancer",
                "description": "Ovarian cancer"
            }
        ],
        "id": "DOID:2394"
    }

in your JSON object

"condition": {
            "id": "DOID:2394",
            "recommended_name": {
                "id": "DOID:2394",
                "name": "ovarian cancer",
                "description": "A female reproductive organ cancer that is located_in the ovary.",
                "resource": "Disease Ontology",
                "url": "http://purl.obolibrary.org/obo/DOID_2394"
            },
            "synonyms": [
                {
                    "id": "DOID:2394",
                    "name": "primary ovarian cancer",
                    "resource": "Disease Ontology",
                    "url": "http://purl.obolibrary.org/obo/DOID_2394"
                },
                {
                    "id": "DOID:2394",
                    "name": "tumor of the Ovary",
                    "resource": "Disease Ontology",
                    "url": "http://purl.obolibrary.org/obo/DOID_2394"
                },
                {
                    "id": "DOID:2394",
                    "name": "ovary neoplasm",
                    "resource": "Disease Ontology",
                    "url": "http://purl.obolibrary.org/obo/DOID_2394"
                },
                {
                    "id": "DOID:2394",
                    "name": "malignant tumour of ovary",
                    "resource": "Disease Ontology",
                    "url": "http://purl.obolibrary.org/obo/DOID_2394"
                },
                {
                    "id": "DOID:2394",
                    "name": "malignant Ovarian tumor",
                    "resource": "Disease Ontology",
                    "url": "http://purl.obolibrary.org/obo/DOID_2394"
                },
                {
                    "id": "DOID:2394",
                    "name": "ovarian neoplasm",
                    "resource": "Disease Ontology",
                    "url": "http://purl.obolibrary.org/obo/DOID_2394"
                }
            ]
        }
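
The differences between the two objects can be listed programmatically. The sketch below is one possible comparison, not an existing GlyGen or partnership tool: the helper names are assumptions, synonyms are matched on ID plus case-insensitive name (also an assumption), and the two sample objects are trimmed copies of the ones quoted above.

```python
def synonym_keys(condition):
    """Collect (id, lowercased name) pairs for the synonyms of a condition object."""
    return {(s["id"], s["name"].lower()) for s in condition.get("synonyms", [])}


def diff_conditions(mine, yours):
    """Summarize synonym and URL differences between two condition objects."""
    a, b = synonym_keys(mine), synonym_keys(yours)
    return {
        "only_in_mine": sorted(a - b),
        "only_in_yours": sorted(b - a),
        "url_mismatch": mine["recommended_name"]["url"] != yours["recommended_name"]["url"],
    }


# Trimmed versions of the two "condition" objects shown above.
mine = {
    "recommended_name": {"id": "DOID:2394", "url": "http://disease-ontology.org/term/DOID:2394"},
    "synonyms": [
        {"id": "DOID:2394", "name": "ovary neoplasm"},
        {"id": "MONDO:0008170", "name": "ovary cancer"},
        {"id": "MIM:167000", "name": "Ovarian cancer"},
    ],
}
yours = {
    "recommended_name": {"id": "DOID:2394", "url": "http://purl.obolibrary.org/obo/DOID_2394"},
    "synonyms": [{"id": "DOID:2394", "name": "ovary neoplasm"}],
}

report = diff_conditions(mine, yours)
print(report)
```

On the full objects, a report like this would surface the MONDO and MIM synonyms that appear only on the GlyGen side, as well as the differing Disease Ontology URL forms.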

glygener / glygen-issues

Update biomarker file #1251
