opentargets / issues

Issue tracker for Open Targets Platform and Open Targets Genetics Portal
https://platform.opentargets.org https://genetics.opentargets.org
Apache License 2.0

Load Biosample dataset and expose through API #3538

Open d0choa opened 1 month ago

d0choa commented 1 month ago

We want to load the biosampleIndex that @Tobi1kenobi has created in #3445. Actions:

Note: @jdhayhurst no need to expose the index at the API root-level for now. We don't have a good use case for that.

DSuveges commented 1 month ago

@Tobi1kenobi, could you please save a biosample dataset as JSON and share it with @jdhayhurst? Also, pasting the output of .printSchema() under this ticket would help us know which fields are going to be picked up. The backend can then start working on the ingestion.

Tobi1kenobi commented 1 month ago

Sure thing, see here for the schema @jdhayhurst:

root
 |-- biosampleId: string (nullable = false)
 |-- biosampleName: string (nullable = false)
 |-- description: string (nullable = true)
 |-- xrefs: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- synonyms: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- parents: array (nullable = true)
 |    |-- element: string (containsNull = false)
 |-- ancestors: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- children: array (nullable = true)
 |    |-- element: string (containsNull = false)
 |-- descendants: array (nullable = true)
 |    |-- element: string (containsNull = true)

Tobi1kenobi commented 1 month ago

Subset of the dataset in JSONL:

{"biosampleId":"CL_0000653","biosampleName":"podocyte","description":"A specialized kidney epithelial cell, contained within a glomerulus, that contains \"feet\" that interdigitate with the \"feet\" of other podocytes.","xrefs":["BTO:0002295","FMA:70967","ZFA:0009285"],"synonyms":["epithelial cell of visceral layer of glomerular capsule","glomerular podocyte","glomerular visceral epithelial cell","kidney podocyte","renal podocyte"],"parents":["CL_0002522","CL_1000450"],"ancestors":["CL_1000450","CL_0002522"],"children":["CL_4030008","CL_0002525","CL_0002523"],"descendants":["CL_0002523","CL_0002525","CL_4030008"]}
{"biosampleId":"CL_0000654","biosampleName":"primary oocyte","description":"A primary oocyte is an oocyte that has not completed female meosis I.","xrefs":["BTO:0000512","FMA:18645"],"synonyms":["primary oogonium"]}
{"biosampleId":"CL_0000655","biosampleName":"secondary oocyte","description":"A secondary oocyte is an oocyte that has not completed meiosis II.","xrefs":["BTO:0003094","FMA:18646"],"synonyms":["primary oogonium"],"parents":["CL_0000023"],"ancestors":["CL_0000023"]}
{"biosampleId":"CL_0000656","biosampleName":"primary spermatocyte","description":"A diploid cell that has derived from a spermatogonium and can subsequently begin meiosis and divide into two haploid secondary spermatocytes.","xrefs":["BTO:0001115","CALOHA:TS-2194","FMA:72292"]}
{"biosampleId":"CL_0000657","biosampleName":"secondary spermatocyte","description":"One of the two haploid cells into which a primary spermatocyte divides, and which in turn gives origin to spermatids.","xrefs":["BTO:0000709","CALOHA:TS-2195","FBbt:00004941","FMA:72293"]}
{"biosampleId":"CL_0000658","biosampleName":"cuticle secreting cell","description":"An epithelial cell that secretes cuticle."}
{"biosampleId":"CL_0000659","biosampleName":"eggshell secreting cell","description":"An extracellular matrix secreting cell that secretes eggshell."}
{"biosampleId":"CL_1000451","biosampleName":"obsolete epithelial cell of visceral layer of glomerular capsule"}

d0choa commented 1 month ago

👌 Please pass the whole dataset to @jdhayhurst if you haven't done it already

jdhayhurst commented 1 month ago

@Tobi1kenobi There are a number of strange-looking biosamples in the data, e.g. {"biosampleId":"http://www.ncbi.nlm.nih.gov/gene/416018","biosampleName":"http://www.ncbi.nlm.nih.gov/gene/416018","xrefs":[],"synonyms":[],"ancestors":[],"descendants":[]}

Tobi1kenobi commented 1 month ago

If they're an issue I can drop these easily enough I think.

I chose to keep the logic for inclusion simple as the primary/only use of this index at present is a left join to existing biosample data (i.e. the strange entries will never be used).
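
To illustrate why the odd entries are harmless under a left join, here is a rough sketch in plain Python. The `existing` rows, the `studyId` field, and the `left_join` helper are hypothetical; the real join presumably runs in Spark against the actual biosample data.

```python
def left_join(rows, index, key="biosampleId"):
    """Left join: keep every row and attach the index name, or None if unmatched."""
    by_id = {entry[key]: entry for entry in index}
    return [
        {**row, "biosampleName": by_id.get(row[key], {}).get("biosampleName")}
        for row in rows
    ]


# Tiny stand-in for the index, including one of the strange entries.
biosample_index = [
    {"biosampleId": "CL_0000653", "biosampleName": "podocyte"},
    {
        "biosampleId": "http://www.ncbi.nlm.nih.gov/gene/416018",
        "biosampleName": "http://www.ncbi.nlm.nih.gov/gene/416018",
    },
]

# Hypothetical existing biosample data (the left side of the join).
existing = [
    {"biosampleId": "CL_0000653", "studyId": "hypothetical_study_1"},
    {"biosampleId": "CL_0000656", "studyId": "hypothetical_study_2"},
]

joined = left_join(existing, biosample_index)
```

Only IDs that occur on the left side appear in `joined`, so the gene-URL entry in the index never reaches the output unless some existing row actually references it.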

d0choa commented 1 month ago

Out of curiosity, where do these come from? Indeed, they are strange-looking

Tobi1kenobi commented 1 month ago

So it's a combination of things. Firstly, there are ontology descriptors in the JSON as nodes, e.g.

{
  "id" : "http://purl.obolibrary.org/obo/uberon/core#defined_by_ordinal_series",
  "type" : "PROPERTY",
  "propertyType" : "ANNOTATION",
  "meta" : {
    "comments" : [ "classes that are defined by relative position counting from first in a series of elements along an axis in an individual organism rather than by strict homology" ]
  }
}

But there are also some other things like a few (~10) genes:

{
  "id" : "http://www.ncbi.nlm.nih.gov/gene/396351",
  "type" : "CLASS"
}

Other than the genes, most of these instances could be caught by filtering for type == "CLASS".

The id == name is a result of a discussion with @DSuveges to have neither be nullable, so I do f.coalesce(id, name).
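
A plain-Python sketch of the two steps described above (the type filter and the non-null fallback), using the two nodes quoted in this comment plus one illustrative class node. The `lbl` key for the node label and the fallback order (label first, then ID) are assumptions made so that the name falls back to the ID; the actual pipeline code is in Spark and is not shown in this thread.

```python
def coalesce(*values):
    """Return the first value that is not None, similar in spirit to Spark's coalesce."""
    return next((v for v in values if v is not None), None)


def to_biosamples(nodes):
    """Keep only CLASS nodes and make sure neither ID nor name is null."""
    biosamples = []
    for node in nodes:
        if node.get("type") != "CLASS":
            continue  # drops ontology descriptors such as PROPERTY nodes
        biosamples.append(
            {
                "biosampleId": node["id"],
                # Assumed label key "lbl"; fall back to the ID when it is absent.
                "biosampleName": coalesce(node.get("lbl"), node["id"]),
            }
        )
    return biosamples


nodes = [
    {
        "id": "http://purl.obolibrary.org/obo/uberon/core#defined_by_ordinal_series",
        "type": "PROPERTY",
        "propertyType": "ANNOTATION",
    },
    {"id": "http://www.ncbi.nlm.nih.gov/gene/396351", "type": "CLASS"},
    # Illustrative class node with a label.
    {"id": "http://purl.obolibrary.org/obo/CL_0000653", "type": "CLASS", "lbl": "podocyte"},
]

result = to_biosamples(nodes)
```

The PROPERTY node is dropped, while the gene node survives the filter with its name equal to its ID, which matches the behaviour described above.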

I was afraid to break potential relationships before but I don't think it's a big deal to drop these.

d0choa commented 1 month ago

I think we will have to be pragmatic here and try not to build code that is ad hoc or might not scale to future cases.

Unless there is any obvious thing we need to fix, I would continue and see what surprises the future brings

Tobi1kenobi commented 1 month ago

Perhaps everything not under anatomical entity? But it's not clear-cut:

image

If nothing is breaking, I would lean towards inclusiveness or at least minimal exclusiveness (such as dropping those with type != "CLASS"). As you say, ad-hoc solutions won't make sense for future ontologies and I think any attempt to carefully curate the ontologies would be more effort than it's worth, for our purposes.
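
For comparison, the stricter option (keeping only terms under a chosen root) could be sketched against the index's own `ancestors` field. The thread does not give an ID for "anatomical entity", so the root is a parameter here, and `CL_0000023` is used purely because it appears in the sample data; `keep_under` is a hypothetical helper.

```python
def keep_under(records, root_id):
    """Keep records that are the root itself or list the root among their ancestors."""
    return [
        record
        for record in records
        if record["biosampleId"] == root_id
        or root_id in (record.get("ancestors") or [])
    ]


records = [
    {"biosampleId": "CL_0000655", "biosampleName": "secondary oocyte", "ancestors": ["CL_0000023"]},
    # A legitimate-looking cell type with no ancestors listed in the subset above.
    {"biosampleId": "CL_0000658", "biosampleName": "cuticle secreting cell"},
    {
        "biosampleId": "http://www.ncbi.nlm.nih.gov/gene/416018",
        "biosampleName": "http://www.ncbi.nlm.nih.gov/gene/416018",
        "ancestors": [],
    },
]

kept = keep_under(records, "CL_0000023")
```

With this toy input, the gene entry is dropped, but so is `CL_0000658`, which has no ancestors recorded. That is one way a root-based filter can turn out to be less clear-cut than it first appears.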

jdhayhurst commented 1 month ago

As far as I can see it's not breaking anything in the backend