Load Biosample dataset and expose through API

d0choa commented 1 month ago

We want to load the biosampleIndex that @Tobi1kenobi has created in #3445. Actions:

[x] Create a version of the dataset in JSONL (pyspark write.json)
[x] Provide the schema of the dataset in this ticket
[x] Load the index in OpenSearch using the provided schema
[x] Use the index to resolve the biosampleFromSourceId in current StudyIndex #3357

Note: @jdhayhurst no need to expose the index at the API root-level for now. We don't have a good use case for that.
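
For reference, the "load the index in OpenSearch" step could look roughly like the sketch below. It only builds the newline-delimited body that the OpenSearch `_bulk` endpoint expects from JSONL records. The index name `biosample` and the choice of `biosampleId` as the document `_id` are assumptions made for illustration, not something decided in this ticket.

```python
import json

# Hypothetical index name: the ticket does not say what the index is called.
INDEX_NAME = "biosample"


def to_bulk_payload(jsonl_lines, index_name=INDEX_NAME):
    """Turn JSONL records into an OpenSearch _bulk request body (NDJSON)."""
    out = []
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        doc = json.loads(line)
        # Action line: index the document, using biosampleId as _id (assumption).
        out.append(json.dumps({"index": {"_index": index_name, "_id": doc["biosampleId"]}}))
        # Source line: the document itself, on a single line.
        out.append(json.dumps(doc))
    # The bulk API requires the body to end with a newline.
    return "\n".join(out) + "\n"


sample = [
    '{"biosampleId":"CL_0000658","biosampleName":"cuticle secreting cell","description":"An epithelial cell that secretes cuticle."}',
    '{"biosampleId":"CL_1000451","biosampleName":"obsolete epithelial cell of visceral layer of glomerular capsule"}',
]
payload = to_bulk_payload(sample)
```

The resulting string would be sent with `Content-Type: application/x-ndjson`; how the backend actually performs the ingestion is not described here.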

DSuveges commented 1 month ago

@Tobi1kenobi, could you please save the biosample dataset as JSON and share it with @jdhayhurst? Also, pasting the .printSchema() output under this ticket would help us know which fields are going to be picked up. The backend can then start working on the ingestion.

Tobi1kenobi commented 1 month ago

Sure thing, see here for the schema @jdhayhurst:

root
 |-- biosampleId: string (nullable = false)
 |-- biosampleName: string (nullable = false)
 |-- description: string (nullable = true)
 |-- xrefs: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- synonyms: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- parents: array (nullable = true)
 |    |-- element: string (containsNull = false)
 |-- ancestors: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- children: array (nullable = true)
 |    |-- element: string (containsNull = false)
 |-- descendants: array (nullable = true)
 |    |-- element: string (containsNull = true)

Tobi1kenobi commented 1 month ago

Subset of the dataset in JSONL:

{"biosampleId":"CL_0000653","biosampleName":"podocyte","description":"A specialized kidney epithelial cell, contained within a glomerulus, that contains \"feet\" that interdigitate with the \"feet\" of other podocytes.","xrefs":["BTO:0002295","FMA:70967","ZFA:0009285"],"synonyms":["epithelial cell of visceral layer of glomerular capsule","glomerular podocyte","glomerular visceral epithelial cell","kidney podocyte","renal podocyte"],"parents":["CL_0002522","CL_1000450"],"ancestors":["CL_1000450","CL_0002522"],"children":["CL_4030008","CL_0002525","CL_0002523"],"descendants":["CL_0002523","CL_0002525","CL_4030008"]}
{"biosampleId":"CL_0000654","biosampleName":"primary oocyte","description":"A primary oocyte is an oocyte that has not completed female meosis I.","xrefs":["BTO:0000512","FMA:18645"],"synonyms":["primary oogonium"]}
{"biosampleId":"CL_0000655","biosampleName":"secondary oocyte","description":"A secondary oocyte is an oocyte that has not completed meiosis II.","xrefs":["BTO:0003094","FMA:18646"],"synonyms":["primary oogonium"],"parents":["CL_0000023"],"ancestors":["CL_0000023"]}
{"biosampleId":"CL_0000656","biosampleName":"primary spermatocyte","description":"A diploid cell that has derived from a spermatogonium and can subsequently begin meiosis and divide into two haploid secondary spermatocytes.","xrefs":["BTO:0001115","CALOHA:TS-2194","FMA:72292"]}
{"biosampleId":"CL_0000657","biosampleName":"secondary spermatocyte","description":"One of the two haploid cells into which a primary spermatocyte divides, and which in turn gives origin to spermatids.","xrefs":["BTO:0000709","CALOHA:TS-2195","FBbt:00004941","FMA:72293"]}
{"biosampleId":"CL_0000658","biosampleName":"cuticle secreting cell","description":"An epithelial cell that secretes cuticle."}
{"biosampleId":"CL_0000659","biosampleName":"eggshell secreting cell","description":"An extracellular matrix secreting cell that secretes eggshell."}
{"biosampleId":"CL_1000451","biosampleName":"obsolete epithelial cell of visceral layer of glomerular capsule"}
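
Note that several records above omit fields entirely rather than carrying nulls or empty arrays, so a consumer has to treat every nullable field as optional. A minimal sketch of reading one line against the schema posted earlier, assuming the consumer prefers empty lists over missing keys (the function name and that normalisation choice are illustrative, not part of the pipeline):

```python
import json

# Fields declared nullable = false in the schema above.
REQUIRED = ("biosampleId", "biosampleName")
# Array-of-string fields, all nullable in the schema above.
ARRAYS = ("xrefs", "synonyms", "parents", "ancestors", "children", "descendants")


def parse_biosample(line):
    """Parse one JSONL line and normalise absent optional fields."""
    record = json.loads(line)
    for field in REQUIRED:
        if not isinstance(record.get(field), str):
            raise ValueError(f"missing or non-string required field: {field}")
    normalised = {
        "biosampleId": record["biosampleId"],
        "biosampleName": record["biosampleName"],
        "description": record.get("description"),  # stays None when absent
    }
    for field in ARRAYS:
        # Absent or null arrays become empty lists (an assumption of this sketch).
        normalised[field] = list(record.get(field) or [])
    return normalised


full = parse_biosample(
    '{"biosampleId":"CL_0000656","biosampleName":"primary spermatocyte",'
    '"description":"A diploid cell that has derived from a spermatogonium and can subsequently begin meiosis and divide into two haploid secondary spermatocytes.",'
    '"xrefs":["BTO:0001115","CALOHA:TS-2194","FMA:72292"]}'
)
sparse = parse_biosample(
    '{"biosampleId":"CL_1000451","biosampleName":"obsolete epithelial cell of visceral layer of glomerular capsule"}'
)
```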

d0choa commented 1 month ago

👌 Please pass the whole dataset to @jdhayhurst if you haven't done it already

jdhayhurst commented 1 month ago

@Tobi1kenobi There are a number of strange-looking biosamples in the data, e.g. {"biosampleId":"http://www.ncbi.nlm.nih.gov/gene/416018","biosampleName":"http://www.ncbi.nlm.nih.gov/gene/416018","xrefs":[],"synonyms":[],"ancestors":[],"descendants":[]}

Tobi1kenobi commented 1 month ago

If they're an issue I can drop these easily enough I think.

I chose to keep the logic for inclusion simple as the primary/only use of this index at present is a left join to existing biosample data (i.e. the strange entries will never be used).
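
To illustrate why the strange entries are harmless under a left join, here is a plain-Python sketch (not the actual PySpark code). The study records, the `studyId` values and the `NOT_IN_INDEX` identifier are made up for the example; only `biosampleFromSourceId` and the index fields come from this ticket.

```python
def left_join_biosamples(studies, biosample_index):
    """Left join: every study is kept, index rows are only used on a match."""
    by_id = {b["biosampleId"]: b for b in biosample_index}
    joined = []
    for study in studies:
        match = by_id.get(study.get("biosampleFromSourceId"))
        joined.append({
            **study,
            # None when the source ID is not present in the index.
            "biosampleName": match["biosampleName"] if match else None,
        })
    return joined


biosample_index = [
    {"biosampleId": "CL_0000653", "biosampleName": "podocyte"},
    # One of the strange entries: present in the index, never referenced by a study.
    {"biosampleId": "http://www.ncbi.nlm.nih.gov/gene/416018",
     "biosampleName": "http://www.ncbi.nlm.nih.gov/gene/416018"},
]
# Hypothetical study rows.
studies = [
    {"studyId": "study_1", "biosampleFromSourceId": "CL_0000653"},
    {"studyId": "study_2", "biosampleFromSourceId": "NOT_IN_INDEX"},
]
joined = left_join_biosamples(studies, biosample_index)
```

Because the join is driven by the study side, an index row only surfaces if some study points at its ID, so the gene-like rows never appear in the output.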

d0choa commented 1 month ago

Out of curiosity, where do these come from? Indeed, they are strange-looking

Tobi1kenobi commented 1 month ago

So it's a combination of things. Firstly there are ontology descriptors in the JSON as nodes e.g.

{
      "id" : "http://purl.obolibrary.org/obo/uberon/core#defined_by_ordinal_series",
      "type" : "PROPERTY",
      "propertyType" : "ANNOTATION",
      "meta" : {
        "comments" : [ "classes that are defined by relative position counting from first in a series of elements along an axis in an individual organism rather than by strict homology" ]
      }
    }

But there are also some other things like a few (~10) genes:

{
      "id" : "http://www.ncbi.nlm.nih.gov/gene/396351",
      "type" : "CLASS"
    },

Other than the genes, most of these instances could be caught by filtering for type == "CLASS".

The id == name is a result of a discussion with @DSuveges to have neither be nullable, so I do f.coalesce(id, name).

I was afraid to break potential relationships before but I don't think it's a big deal to drop these.
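
A plain-Python sketch of the two behaviours described above, standing in for the PySpark logic: keeping only nodes with type "CLASS", and falling back to the ID when a node has no name. The label key `lbl` and the `coalesce` helper are assumptions for illustration; the real column names and argument order may differ.

```python
def coalesce(*values):
    """Return the first value that is not None, mirroring SQL COALESCE."""
    for value in values:
        if value is not None:
            return value
    return None


def nodes_to_biosamples(nodes):
    """Keep CLASS nodes only and make sure neither id nor name is null."""
    rows = []
    for node in nodes:
        if node.get("type") != "CLASS":
            continue  # drops PROPERTY nodes such as ontology descriptors
        rows.append({
            "biosampleId": node["id"],
            # Fall back to the id when the node carries no label ("lbl" is assumed).
            "biosampleName": coalesce(node.get("lbl"), node["id"]),
        })
    return rows


nodes = [
    {"id": "http://purl.obolibrary.org/obo/uberon/core#defined_by_ordinal_series",
     "type": "PROPERTY", "propertyType": "ANNOTATION"},
    {"id": "http://www.ncbi.nlm.nih.gov/gene/396351", "type": "CLASS"},
    {"id": "http://purl.obolibrary.org/obo/CL_0000653", "type": "CLASS", "lbl": "podocyte"},
]
rows = nodes_to_biosamples(nodes)
```

As noted above, the gene node survives the type filter because it is itself a CLASS, and it ends up with its name equal to its ID.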

d0choa commented 1 month ago

I think we will have to be pragmatic here and try not to build code that is ad hoc or might not scale for future cases.

@jdhayhurst, is there any real issue with these, or were you just flagging potential problems in the data?
@Tobi1kenobi, if you visualise the ontology (e.g. using the app Protégé), is there any obvious branch we should eliminate?

Unless there is any obvious thing we need to fix, I would continue and see what surprises the future brings

Tobi1kenobi commented 1 month ago

Perhaps everything not under anatomical entity? But it's not clear-cut:

If nothing is breaking, I would lean towards inclusiveness or at least minimal exclusiveness (such as dropping those with type != "CLASS"). As you say, ad-hoc solutions won't make sense for future ontologies and I think any attempt to carefully curate the ontologies would be more effort than it's worth, for our purposes.

jdhayhurst commented 1 month ago

As far as I can see it's not breaking anything in the backend

Load Biosample dataset and expose through API #3538