single-cell-data / TileDB-SOMA

Python and R SOMA APIs using TileDB’s cloud-native format. Ideal for single-cell data at any scale.
https://tiledbsoma.readthedocs.io
MIT License

H5ad to Soma Ingestion on Databricks (POSIX Error) #2619

Closed (danishzmalik closed this issue 4 months ago)

danishzmalik commented 4 months ago

I'm trying to ingest multiple h5ad files into a single SOMA object, following this tutorial: https://tiledbsoma.readthedocs.io/en/latest/notebooks/tutorial_soma_append_mode.html

I'm doing this on Databricks. Things work fine when I set the SOMA experiment path to the cluster driver storage, but my goal is to create the experiment on an S3 bucket that is mounted on DBFS.

While doing so, I'm getting the following error: [image]

Config:

import tiledb

cfg = tiledb.Config({"vfs.s3.no_sign_request": True, "vfs.file.posix_file_permissions": 644})
vfs = tiledb.VFS(config=cfg)

Ingestion function:

import tiledbsoma.io

# initial load: the experiment URI here is the DBFS mount path
tiledbsoma.io.from_anndata(
    experiment_uri="/dbfs/mnt/s3_bucket_alias/soma_object/",
    anndata=adata,
    measurement_name="RNA",
)

I would appreciate some guidance on this.

eddelbuettel commented 4 months ago

Hi @danishzmalik -- Thanks for filing an issue. As you may know, we build, test and run rather extensively on AWS S3, GCS and Azure FS. We test the S3 behavior via the MinIO driver. I do not think we have any access to the Databricks emulation of S3, so this may be difficult for us to replicate and debug. We will discuss and may come back asking you to run with specific debug flags.

eddelbuettel commented 4 months ago

@danishzmalik With the caveat that we do not have a Databricks instance here to test: our code usually writes to S3 using URI strings that start with s3://..., so it might be worth a try to rewrite your URIs as s3://dbfs/mnt/s3_bucket_alias/soma_object/ so that the dispatching to the relevant S3 code happens.

danishzmalik commented 4 months ago

Hi @eddelbuettel, the issue was indeed with the URL string. I tried giving the full S3 path, i.e. s3://, instead of the DBFS mount location, and it worked. Thank you.
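
For readers who hit the same error, here is a minimal sketch of the fix described above: translate a DBFS mount path into the s3:// URI of the bucket behind it before passing it as experiment_uri. The helper to_s3_uri, the MOUNTS table and the bucket name example-bucket are hypothetical and only for illustration. The real bucket name was not given in this thread and has to come from your own workspace configuration.

```python
from urllib.parse import urlparse

# Hypothetical mapping from DBFS mount points to the buckets behind them.
# "example-bucket" is a placeholder, not a value from the original report.
MOUNTS = {"/dbfs/mnt/s3_bucket_alias": "s3://example-bucket"}


def to_s3_uri(path, mounts=MOUNTS):
    """Return an s3:// URI for path, translating known DBFS mount prefixes."""
    if urlparse(path).scheme == "s3":
        return path  # already an S3 URI, nothing to translate
    for mount, bucket in mounts.items():
        if path == mount or path.startswith(mount + "/"):
            # Keep everything after the mount point as the object key prefix.
            return bucket.rstrip("/") + path[len(mount):]
    raise ValueError(f"no S3 mapping known for {path!r}")


experiment_uri = to_s3_uri("/dbfs/mnt/s3_bucket_alias/soma_object/")
print(experiment_uri)

# Assuming tiledbsoma is installed and S3 credentials are available, the
# ingestion call from the original post would then use the translated URI:
# tiledbsoma.io.from_anndata(
#     experiment_uri=experiment_uri, anndata=adata, measurement_name="RNA"
# )
```

The point of the translation is that the URI scheme selects the storage backend: a path under /dbfs/ is handled as a local POSIX file, while an s3:// URI is dispatched to the S3 code path.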

eddelbuettel commented 4 months ago

@danishzmalik -- thanks for reporting back, confirming and closing the circle. The issue will provide a useful documentation snippet until we get around to including platforms such as Databricks in our regular tests.