laminlabs / bionty-base

Access public biological ontologies.
https://bionty-base.lamin.ai
Apache License 2.0
15 stars 3 forks source link

make ontology work with any kind of parental relationships #540

Open jkobject opened 4 months ago

jkobject commented 4 months ago

makes it much slower if one chooses something like "part_of" for certain kinds of ontologies (90mns for "part_of" on all tissues)

netlify[bot] commented 4 months ago

Deploy Preview for bionty-base-091d ready!

Name Link
Latest commit fdd2c4ce445e1631b53940b9bb5f2d6a79aef8ad
Latest deploy log https://app.netlify.com/sites/bionty-base-091d/deploys/65d4bd1442ee3500098d418e
Deploy Preview https://deploy-preview-540--bionty-base-091d.netlify.app
Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

sunnyosun commented 4 months ago

Thank you for this PR @jkobject ! Just that it won't have an effect yet for users who want customized parsing, since we pull the reference parquet files from the s3.

To enable user to define their own relationship types, we need to either enable a local mode for the reference files (the risk is that users won't be able to get new references automatically when it's updated); or we figure something out specifically for lamindb.

@Zethson @falexwolf how do you think we should go about this?

Related issue: https://github.com/laminlabs/bionty-base/issues/529

jkobject commented 4 months ago

I am not sure I understand what you mean but on my end doing: bt.Tissue.public().pronto().to_df(include_rel ="part_of") seems to work

sunnyosun commented 4 months ago

I am not sure I understand what you mean but on my end doing: bt.Tissue.public().pronto().to_df(include_rel ="part_of") seems to work

Yes, this works, but it won't work for canonical accessors such as .from_public() or .from_values() or public.df().

Does accessing the df with pronto().to_df(...) alone already meet what you need?

jkobject commented 4 months ago

I might be able to rewrite my code but right now it is true that I am using bt.Tissue.filter().df(include=["parents__ontology_id"]) and it wouldn't work for this..

Zethson commented 4 months ago

@jkobject thank you for the PR and sorry for chipping in so late. I think I accidentally deleted the Github notification email.

So we could of course upload monstrous .parquet files to our S3 that include all relationships, but this will probably greatly increase the download and parsing times. Maybe there'd be a way to provide partial reading support for a subset of columns of the parquet file then if we store things appropriately.

I'd probably suggest a solution that involves lamindb?