laminlabs / lamindb

A data framework for biology.
https://docs.lamin.ai
Apache License 2.0
127 stars 10 forks source link

✨ Enable to run `save_vitessce_config()` on artifacts with custom storage paths #1950

Closed keller-mark closed 1 month ago

keller-mark commented 1 month ago

Transferred artifacts do not necessarily include a lamindb uid within the trailing filename of their S3 path.

For example:

s3://cellxgene-data-public/cell-census/2023-12-15/h5ads/abcd-12345-not-a-lamindb-uid.h5ad

However the current save_vitessce_config assumes that all artifacts include a lamindb uid in their S3 path, so the error on this line is thrown for transferred artifacts from cellxgene: https://github.com/laminlabs/lamindb/blob/ca4ce4a9f54cded358a01e73be8494e28dc8a796/lamindb/integrations/_vitessce.py#L72

The current assumption is that S3 paths are structured like

s3://lamin-us-east-1/TwPtYAyQwlyO/.lamindb/abc-some-lamindb-uid.h5ad
falexwolf commented 1 month ago

Right, we'll need to rethink save_vitessce_config() for artifacts with manually generated storage paths. It was out-of-scope for the initial version of this function and that's why the validation error is thrown.

Can you give a PR a try? It'd be very meaningful for the public use cases of the Vitessce integration, both cellxgene and the Vitessce example datasets aren't managed via lamindb and hence violated the f".lamindb/{UID}{SUFFIX}" pattern.

I suggest to pass a mapping between artifact uids & storage paths to save_vitessce_config() or within the VitessceConfig option. But the UX here should be mainly dictated by what you'd see as most convenient, @keller-mark. It comes down to identifying whatever is a storage url with an artifact (and doing this for potentially multiple artifacts).