scverse / spatialdata-io

BSD 3-Clause "New" or "Revised" License
43 stars 29 forks source link

Wrong file loaded when building new SpatialData #147

Open dawe opened 6 months ago

dawe commented 6 months ago

I am working with some data with many .h5ad files stored in the same directory. Whenever I create a new SpatialData object I specify the path to the correct AnnData counts file. I noticed that the file that is loaded is typically the wrong one and the first .h5ad in the directory is loaded instead. I believe this is because the first element of a list is considered here

https://github.com/scverse/spatialdata-io/blob/e3c53b5d6789bbaec49181f33dfe368170747ce4/src/spatialdata_io/readers/dbit.py#L282

I guess the same happens for the barcodes that are specified in the line below

https://github.com/scverse/spatialdata-io/blob/e3c53b5d6789bbaec49181f33dfe368170747ce4/src/spatialdata_io/readers/dbit.py#L285

The only workaround right now is to have multiple paths for different files.

LucaMarconato commented 6 months ago

From a quick look I think the reason is what you described. @lillux could you have a look at this please?

lillux commented 6 months ago

@dawe @LucaMarconato This has been solved in #139, that also describe how the parsing behavior has been changed. Now the reader gives priority the file of which the path has been specified, instead of prioritizing the pattern matching. I have just tested the above scenario with spatialdata-io version 0.1.3.dev117+g3c03009, at commit 3c03009, and an Exception is raised when multiple matching files are found. Please @dawe let me know if you still see the problem with the latest version of spatialdata-io.

Implementation details on path resolution behavior

This is in part to describe implementation details, in part to get opinions in improving it. Just to clarify why we take index [0] of _check_path() output in the 2 calls below:

https://github.com/scverse/spatialdata-io/blob/3c03009408da2eebfb6ddf12fafd8b93e87f0fda/src/spatialdata_io/readers/dbit.py#L282-L287

In the actual implementation, _check_path() takes some arguments and return a tuple: https://github.com/scverse/spatialdata-io/blob/3c03009408da2eebfb6ddf12fafd8b93e87f0fda/src/spatialdata_io/readers/dbit.py#L26-L32

Index [0] of the tuple can be a path or None.\ Index [1] is always a bool that indicates that the path has been resolved. This is needed for downstream tasks with optional arguments (optional_arg = True).

Index [0] is always a path for mandatory arguments (optional_arg = False), like the .h5ad and barcode_list, and an Error is raised by _check_path() if the path is not resolved, or if multiple match are found when optional_arg = False. We do not care about index [1] when optional_arg = False, so we do not save its value when checking for .h5ad and barcode_list.

Instead we care about index [1] when optional_arg = True, for example in the case of an optional image: https://github.com/scverse/spatialdata-io/blob/3c03009408da2eebfb6ddf12fafd8b93e87f0fda/src/spatialdata_io/readers/dbit.py#L288-L294

Here we unpack the tuple, and take in account both the values in the downstream tasks. When optional_arg = True and multiple match are found, a warning is printed and None is returned instead of a path, neglecting the optional argument while warning the user.