Open d0choa opened 1 month ago
@Tobi1kenobi, could you please save a biosample dataset in JSON and share it with @jdhayhurst? Also, pasting the .printSchema() output
under this ticket would help us know which fields are going to be picked up. The backend can then start working on the ingestion.
Sure thing, see here for the schema @jdhayhurst:
root
|-- biosampleId: string (nullable = false)
|-- biosampleName: string (nullable = false)
|-- description: string (nullable = true)
|-- xrefs: array (nullable = true)
| |-- element: string (containsNull = true)
|-- synonyms: array (nullable = true)
| |-- element: string (containsNull = true)
|-- parents: array (nullable = true)
| |-- element: string (containsNull = false)
|-- ancestors: array (nullable = true)
| |-- element: string (containsNull = true)
|-- children: array (nullable = true)
| |-- element: string (containsNull = false)
|-- descendants: array (nullable = true)
| |-- element: string (containsNull = true)
Subset of the dataset in JSONL:
{"biosampleId":"CL_0000653","biosampleName":"podocyte","description":"A specialized kidney epithelial cell, contained within a glomerulus, that contains \"feet\" that interdigitate with the \"feet\" of other podocytes.","xrefs":["BTO:0002295","FMA:70967","ZFA:0009285"],"synonyms":["epithelial cell of visceral layer of glomerular capsule","glomerular podocyte","glomerular visceral epithelial cell","kidney podocyte","renal podocyte"],"parents":["CL_0002522","CL_1000450"],"ancestors":["CL_1000450","CL_0002522"],"children":["CL_4030008","CL_0002525","CL_0002523"],"descendants":["CL_0002523","CL_0002525","CL_4030008"]}
{"biosampleId":"CL_0000654","biosampleName":"primary oocyte","description":"A primary oocyte is an oocyte that has not completed female meosis I.","xrefs":["BTO:0000512","FMA:18645"],"synonyms":["primary oogonium"]}
{"biosampleId":"CL_0000655","biosampleName":"secondary oocyte","description":"A secondary oocyte is an oocyte that has not completed meiosis II.","xrefs":["BTO:0003094","FMA:18646"],"synonyms":["primary oogonium"],"parents":["CL_0000023"],"ancestors":["CL_0000023"]}
{"biosampleId":"CL_0000656","biosampleName":"primary spermatocyte","description":"A diploid cell that has derived from a spermatogonium and can subsequently begin meiosis and divide into two haploid secondary spermatocytes.","xrefs":["BTO:0001115","CALOHA:TS-2194","FMA:72292"]}
{"biosampleId":"CL_0000657","biosampleName":"secondary spermatocyte","description":"One of the two haploid cells into which a primary spermatocyte divides, and which in turn gives origin to spermatids.","xrefs":["BTO:0000709","CALOHA:TS-2195","FBbt:00004941","FMA:72293"]}
{"biosampleId":"CL_0000658","biosampleName":"cuticle secreting cell","description":"An epithelial cell that secretes cuticle."}
{"biosampleId":"CL_0000659","biosampleName":"eggshell secreting cell","description":"An extracellular matrix secreting cell that secretes eggshell."}
{"biosampleId":"CL_1000451","biosampleName":"obsolete epithelial cell of visceral layer of glomerular capsule"}
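For anyone reading the JSONL subset above: several records leave out the nullable fields entirely rather than writing them as null. Below is a minimal sketch of parsing one line, using only the Python standard library. The helper name parse_biosample is hypothetical, and defaulting absent arrays to empty lists is an assumption about what a consumer might want, not something agreed in this ticket.

```python
import json

# Field names taken from the schema pasted above.
REQUIRED_FIELDS = ("biosampleId", "biosampleName")
ARRAY_FIELDS = ("xrefs", "synonyms", "parents", "ancestors", "children", "descendants")


def parse_biosample(line: str) -> dict:
    """Parse one JSONL line and normalise the optional fields (hypothetical helper)."""
    record = json.loads(line)
    # biosampleId and biosampleName are declared nullable = false in the schema.
    for field in REQUIRED_FIELDS:
        if not record.get(field):
            raise ValueError(f"missing required field: {field}")
    # description is nullable; make the key present even when it was omitted.
    record.setdefault("description", None)
    # Assumption: treat an absent or null array as an empty list.
    for field in ARRAY_FIELDS:
        if record.get(field) is None:
            record[field] = []
    return record


sample = (
    '{"biosampleId":"CL_0000658","biosampleName":"cuticle secreting cell",'
    '"description":"An epithelial cell that secretes cuticle."}'
)
parsed = parse_biosample(sample)
```

With this, parsed["xrefs"] and the other array fields are empty lists for the record above, so downstream code does not need to check for missing keys.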
👌 Please pass the whole dataset to @jdhayhurst if you haven't done it already
@Tobi1kenobi There are a number of strange looking biosamples in the data e.g.
{"biosampleId":"http://www.ncbi.nlm.nih.gov/gene/416018","biosampleName":"http://www.ncbi.nlm.nih.gov/gene/416018","xrefs":[],"synonyms":[],"ancestors":[],"descendants":[]}
If they're an issue I can drop these easily enough I think.
I chose to keep the logic for inclusion simple as the primary/only use of this index at present is a left join to existing biosample data (i.e. the strange entries will never be used).
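To illustrate why the strange entries are harmless under a left join, here is a minimal sketch in plain Python. The function name left_join_biosamples, the rowId field, and the id CL_9999999 are hypothetical; the real join runs against the existing biosample data and is not shown in this ticket.

```python
def left_join_biosamples(rows, biosample_index):
    """Left join: keep every existing row, attach the name when the id matches."""
    by_id = {b["biosampleId"]: b for b in biosample_index}
    joined = []
    for row in rows:
        match = by_id.get(row["biosampleId"])
        joined.append({**row, "biosampleName": match["biosampleName"] if match else None})
    return joined


biosample_index = [
    {"biosampleId": "CL_0000653", "biosampleName": "podocyte"},
    # One of the strange entries from the index: id and name are both a gene URL.
    {
        "biosampleId": "http://www.ncbi.nlm.nih.gov/gene/416018",
        "biosampleName": "http://www.ncbi.nlm.nih.gov/gene/416018",
    },
]
existing_rows = [
    {"rowId": "row_1", "biosampleId": "CL_0000653"},  # hypothetical existing row
    {"rowId": "row_2", "biosampleId": "CL_9999999"},  # hypothetical id with no match
]
joined = left_join_biosamples(existing_rows, biosample_index)
```

Because the join is driven by the existing rows, the gene URL entry never appears in the output unless an existing row references it.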
Out of curiosity, where do these come from? Indeed, they are strange-looking
So it's a combination of things. Firstly there are ontology descriptors in the JSON as nodes e.g.
{
  "id" : "http://purl.obolibrary.org/obo/uberon/core#defined_by_ordinal_series",
  "type" : "PROPERTY",
  "propertyType" : "ANNOTATION",
  "meta" : {
    "comments" : [ "classes that are defined by relative position counting from first in a series of elements along an axis in an individual organism rather than by strict homology" ]
  }
}
But there are also some other things like a few (~10) genes:
{
  "id" : "http://www.ncbi.nlm.nih.gov/gene/396351",
  "type" : "CLASS"
},
Other than the genes, most of these instances could be caught by filtering for type == "CLASS".
The id == name is a result of a discussion with @DSuveges to have neither be nullable, so I do f.coalesce(id, name).
I was afraid to break potential relationships before but I don't think it's a big deal to drop these.
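The two rules discussed here (keep only nodes with type == "CLASS", and fall back to the id when a node has no name) can be sketched in plain Python as follows. This is not the actual PySpark code: the function name to_biosample is hypothetical, and reading the label from a "lbl" key is an assumption about the ontology JSON, since the node examples in this thread do not show a label field.

```python
from typing import Optional


def to_biosample(node: dict) -> Optional[dict]:
    """Return a biosample record for CLASS nodes, or None for anything else."""
    if node.get("type") != "CLASS":
        return None
    node_id = node["id"]
    # Assumption: the label lives under "lbl"; when absent, reuse the id
    # so that neither field is null (the coalesce step described above).
    name = node.get("lbl") or node_id
    return {"biosampleId": node_id, "biosampleName": name}


property_node = {
    "id": "http://purl.obolibrary.org/obo/uberon/core#defined_by_ordinal_series",
    "type": "PROPERTY",
    "propertyType": "ANNOTATION",
}
gene_node = {"id": "http://www.ncbi.nlm.nih.gov/gene/396351", "type": "CLASS"}

dropped = to_biosample(property_node)
kept = to_biosample(gene_node)
```

The property node is dropped, while the gene node passes the filter with its name equal to its id, which matches the observation that the type filter alone does not catch the genes.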
I think we will have to be pragmatic here and try not to build code that is ad hoc or might not scale to future cases.
Unless there is any obvious thing we need to fix, I would continue and see what surprises the future brings
Perhaps everything not under anatomical entity? But it's not clear-cut:
If nothing is breaking, I would lean towards inclusiveness or at least minimal exclusiveness (such as dropping those with type != "CLASS"). As you say, ad-hoc solutions won't make sense for future ontologies and I think any attempt to carefully curate the ontologies would be more effort than it's worth, for our purposes.
As far as I can see it's not breaking anything in the backend
We want to load the biosampleIndex that @Tobi1kenobi has created in #3445. Actions:
write.json)
biosampleFromSourceId in current StudyIndex #3357
Note: @jdhayhurst no need to expose the index at the API root-level for now. We don't have a good use case for that.