scverse / spatialdata-io

BSD 3-Clause "New" or "Revised" License

Issues with reading 10X VISIUM Cytassist data SpaceRanger Output #76

Closed thjimmylee closed 1 year ago

thjimmylee commented 1 year ago

Hi, this is a cool spatial tool, but I ran into an issue that might be specific to the new Visium CytAssist Space Ranger output. For instance, if I read the Space Ranger output directly with spatialdata_io.visium, I get the error below:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[21], line 1
----> 1 spatialdata_io.visium('./spaceranger210_count_47058_WTSI_GRCh38-2020-A')

File ~/mambaforge/envs/spatialdata/lib/python3.9/site-packages/spatialdata_io/readers/visium.py:92, in visium(path, dataset_id, counts_file, fullres_image_file, tissue_positions_file, scalefactors_file, imread_kwargs, image_models_kwargs, **kwargs)
     90         library_id = first_file.replace(f"_{VisiumKeys.COUNTS_FILE}", "")
     91     else:
---> 92         raise ValueError(
     93             f"Cannot determine the library_id. Expecting a file with format <library_id>_{VisiumKeys.COUNTS_FILE}. Has "
     94             f"the files been renamed?"
     95         )
     96     counts_file = f"{library_id}_{VisiumKeys.COUNTS_FILE}"
     97 except IndexError as e:

ValueError: Cannot determine the library_id. Expecting a file with format <library_id>_filtered_feature_bc_matrix.h5. Has the files been renamed?

From the error message, I gathered that the tool expects the matrix.h5 file name to start with a library_id, which the Space Ranger output does not necessarily include. I renamed the file with a random string and that worked, but then I hit another error, shown below:

/Users/tl7/mambaforge/envs/spatialdata/lib/python3.9/site-packages/anndata/_core/anndata.py:1840: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
  utils.warn_names_duplicates("var")
/Users/tl7/mambaforge/envs/spatialdata/lib/python3.9/site-packages/anndata/_core/anndata.py:1840: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
  utils.warn_names_duplicates("var")
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[22], line 1
----> 1 spatialdata_io.visium('./spaceranger210_count_47058_WTSI_GRCh38-2020-A')

File ~/mambaforge/envs/spatialdata/lib/python3.9/site-packages/spatialdata_io/readers/visium.py:166, in visium(path, dataset_id, counts_file, fullres_image_file, tissue_positions_file, scalefactors_file, imread_kwargs, image_models_kwargs, **kwargs)
    161 transform_hires = Scale(
    162     np.array([scalefactors[VisiumKeys.SCALEFACTORS_HIRES], scalefactors[VisiumKeys.SCALEFACTORS_HIRES]]),
    163     axes=("y", "x"),
    164 )
    165 shapes = {}
--> 166 circles = ShapesModel.parse(
    167     coords,
    168     geometry=0,
    169     radius=scalefactors["spot_diameter_fullres"] / 2.0,
    170     index=adata.obs["spot_id"].copy(),
    171     transformations={
    172         "global": transform_original,
    173         "downscaled_hires": transform_hires,
    174         "downscaled_lowres": transform_lowres,
    175     },
    176 )
    177 shapes[dataset_id] = circles
    178 adata.obs["region"] = dataset_id

File ~/mambaforge/envs/spatialdata/lib/python3.9/functools.py:938, in singledispatchmethod.__get__.<locals>._method(*args, **kwargs)
    936 def _method(*args, **kwargs):
    937     method = self.dispatcher.dispatch(args[0].__class__)
--> 938     return method.__get__(obj, cls)(*args, **kwargs)

File ~/mambaforge/envs/spatialdata/lib/python3.9/site-packages/spatialdata/models/models.py:382, in ShapesModel._(cls, data, geometry, offsets, radius, index, transformations)
    370 @parse.register(np.ndarray)
    371 @classmethod
    372 def _(
   (...)
    379     transformations: MappingToCoordinateSystem_t | None = None,
    380 ) -> GeoDataFrame:
    381     geometry = GeometryType(geometry)
--> 382     data = from_ragged_array(geometry_type=geometry, coords=data, offsets=offsets)
    383     geo_df = GeoDataFrame({"geometry": data})
    384     if GeometryType(geometry).name == "POINT":

File ~/mambaforge/envs/spatialdata/lib/python3.9/site-packages/shapely/_ragged_array.py:440, in from_ragged_array(geometry_type, coords, offsets)
    438 if geometry_type == GeometryType.POINT:
    439     assert offsets is None or len(offsets) == 0
--> 440     return _point_from_flatcoords(coords)
    441 if geometry_type == GeometryType.LINESTRING:
    442     return _linestring_from_flatcoords(coords, *offsets)

File ~/mambaforge/envs/spatialdata/lib/python3.9/site-packages/shapely/_ragged_array.py:303, in _point_from_flatcoords(coords)
    302 def _point_from_flatcoords(coords):
--> 303     result = creation.points(coords)
    305     # Older versions of GEOS (<= 3.9) don't automatically convert NaNs
    306     # to empty points -> do manually
    307     empties = np.isnan(coords).all(axis=1)

File ~/mambaforge/envs/spatialdata/lib/python3.9/site-packages/shapely/decorators.py:77, in multithreading_enabled.<locals>.wrapped(*args, **kwargs)
     75     for arr in array_args:
     76         arr.flags.writeable = False
---> 77     return func(*args, **kwargs)
     78 finally:
     79     for arr, old_flag in zip(array_args, old_flags):

File ~/mambaforge/envs/spatialdata/lib/python3.9/site-packages/shapely/creation.py:74, in points(coords, y, z, indices, out, **kwargs)
     72 coords = _xyz_to_coords(coords, y, z)
     73 if indices is None:
---> 74     return lib.points(coords, out=out, **kwargs)
     75 else:
     76     return simple_geometries_1d(coords, indices, GeometryType.POINT, out=out)

TypeError: ufunc 'points' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

Below is the file tree of the spaceranger output:

./spaceranger210_count_47058_WTSI_GRCh38-2020-A
├── _invocation
├── analysis
│   ├── clustering
│   │   ├── gene_expression_graphclust
│   │   │   └── clusters.csv
│   │   ├── gene_expression_kmeans_10_clusters
│   │   │   └── clusters.csv
│   │   ├── gene_expression_kmeans_2_clusters
│   │   │   └── clusters.csv
│   │   ├── gene_expression_kmeans_3_clusters
│   │   │   └── clusters.csv
│   │   ├── gene_expression_kmeans_4_clusters
│   │   │   └── clusters.csv
│   │   ├── gene_expression_kmeans_5_clusters
│   │   │   └── clusters.csv
│   │   ├── gene_expression_kmeans_6_clusters
│   │   │   └── clusters.csv
│   │   ├── gene_expression_kmeans_7_clusters
│   │   │   └── clusters.csv
│   │   ├── gene_expression_kmeans_8_clusters
│   │   │   └── clusters.csv
│   │   └── gene_expression_kmeans_9_clusters
│   │       └── clusters.csv
│   ├── diffexp
│   │   ├── gene_expression_graphclust
│   │   │   └── differential_expression.csv
│   │   ├── gene_expression_kmeans_10_clusters
│   │   │   └── differential_expression.csv
│   │   ├── gene_expression_kmeans_2_clusters
│   │   │   └── differential_expression.csv
│   │   ├── gene_expression_kmeans_3_clusters
│   │   │   └── differential_expression.csv
│   │   ├── gene_expression_kmeans_4_clusters
│   │   │   └── differential_expression.csv
│   │   ├── gene_expression_kmeans_5_clusters
│   │   │   └── differential_expression.csv
│   │   ├── gene_expression_kmeans_6_clusters
│   │   │   └── differential_expression.csv
│   │   ├── gene_expression_kmeans_7_clusters
│   │   │   └── differential_expression.csv
│   │   ├── gene_expression_kmeans_8_clusters
│   │   │   └── differential_expression.csv
│   │   └── gene_expression_kmeans_9_clusters
│   │       └── differential_expression.csv
│   ├── pca
│   │   └── gene_expression_10_components
│   │       ├── components.csv
│   │       ├── dispersion.csv
│   │       ├── features_selected.csv
│   │       ├── projection.csv
│   │       └── variance.csv
│   ├── tsne
│   │   └── gene_expression_2_components
│   │       └── projection.csv
│   └── umap
│       └── gene_expression_2_components
│           └── projection.csv
├── cloupe.cloupe
├── deconvolution
│   ├── deconvolution_k10
│   │   ├── deconvolution_topic_features_k10.csv
│   │   └── deconvolved_spots_k10.csv
│   ├── deconvolution_k11
│   │   ├── deconvolution_topic_features_k11.csv
│   │   └── deconvolved_spots_k11.csv
│   ├── deconvolution_k12
│   │   ├── deconvolution_topic_features_k12.csv
│   │   └── deconvolved_spots_k12.csv
│   ├── deconvolution_k13
│   │   ├── deconvolution_topic_features_k13.csv
│   │   └── deconvolved_spots_k13.csv
│   ├── deconvolution_k14
│   │   ├── deconvolution_topic_features_k14.csv
│   │   └── deconvolved_spots_k14.csv
│   ├── deconvolution_k15
│   │   ├── deconvolution_topic_features_k15.csv
│   │   └── deconvolved_spots_k15.csv
│   ├── deconvolution_k16
│   │   ├── deconvolution_topic_features_k16.csv
│   │   └── deconvolved_spots_k16.csv
│   ├── deconvolution_k17
│   │   ├── deconvolution_topic_features_k17.csv
│   │   └── deconvolved_spots_k17.csv
│   ├── deconvolution_k18
│   │   ├── deconvolution_topic_features_k18.csv
│   │   └── deconvolved_spots_k18.csv
│   ├── deconvolution_k19
│   │   ├── deconvolution_topic_features_k19.csv
│   │   └── deconvolved_spots_k19.csv
│   ├── deconvolution_k2
│   │   ├── deconvolution_topic_features_k2.csv
│   │   └── deconvolved_spots_k2.csv
│   ├── deconvolution_k3
│   │   ├── deconvolution_topic_features_k3.csv
│   │   └── deconvolved_spots_k3.csv
│   ├── deconvolution_k4
│   │   ├── deconvolution_topic_features_k4.csv
│   │   └── deconvolved_spots_k4.csv
│   ├── deconvolution_k5
│   │   ├── deconvolution_topic_features_k5.csv
│   │   └── deconvolved_spots_k5.csv
│   ├── deconvolution_k6
│   │   ├── deconvolution_topic_features_k6.csv
│   │   └── deconvolved_spots_k6.csv
│   ├── deconvolution_k7
│   │   ├── deconvolution_topic_features_k7.csv
│   │   └── deconvolved_spots_k7.csv
│   ├── deconvolution_k8
│   │   ├── deconvolution_topic_features_k8.csv
│   │   └── deconvolved_spots_k8.csv
│   ├── deconvolution_k9
│   │   ├── deconvolution_topic_features_k9.csv
│   │   └── deconvolved_spots_k9.csv
│   ├── dendrogram_k19.png
│   └── dendrogram_k19_distances.png
├── filtered_feature_bc_matrix
│   ├── barcodes.tsv.gz
│   ├── features.tsv.gz
│   └── matrix.mtx.gz
├── filtered_feature_bc_matrix.h5
├── metrics_summary.csv
├── molecule_info.h5
├── probe_set.csv
├── raw_feature_bc_matrix
│   ├── barcodes.tsv.gz
│   ├── features.tsv.gz
│   └── matrix.mtx.gz
├── raw_feature_bc_matrix.h5
├── raw_probe_bc_matrix.h5
├── spaceranger210_count_47058_GRCh38-2020-A.html
├── spatial
│   ├── aligned_fiducials.jpg
│   ├── aligned_tissue_image.jpg
│   ├── cytassist_image.tiff
│   ├── detected_tissue_image.jpg
│   ├── scalefactors_json.json
│   ├── spatial_enrichment.csv
│   ├── tissue_hires_image.png
│   ├── tissue_lowres_image.png
│   ├── tissue_positions.csv
│   └── tissue_positions_list.csv

I am currently using the latest version of Space Ranger 2.0.1 (January 18, 2023).

LucaMarconato commented 1 year ago

Hi Jimmy, thanks for reporting. Can you try using the latest main version? @giovp worked on a related problem on https://github.com/scverse/spatialdata-io/pull/51 and therefore it could be fixed now.

Otherwise, @giovp could you please have a look? Maybe we could test the various SpaceRanger versions with scripts in the spatialdata-sandbox that I run nightly, wdyt?

thjimmylee commented 1 year ago

Hi @LucaMarconato, thanks for your reply. Yes, I am using the latest version, 0.0.7, which has this error, and this is how I read the Space Ranger output:

import spatialdata_io
sp_data = spatialdata_io.visium('./spaceranger210_count_47058_WTSI_GRCh38-2020-A')

ilia-kats commented 1 year ago

Having the same issue here, and I think it's definitely related to #51. @giovp, which data sets did you test it on? I know that files downloaded from the 10x website do have a library_id prepended, but this is never the case for actual spaceranger output, which is why I had removed it in #44.

grst commented 1 year ago

Same issue here... the IO function should ideally support both h5 files with and without library_id prefix.
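
For illustration, accepting both naming schemes could look roughly like the sketch below. This is not the code in spatialdata-io; `find_counts_file` is a hypothetical helper, and the only thing taken from the thread is the file name `filtered_feature_bc_matrix.h5` and the `<library_id>_` prefix convention from the error message.

```python
from pathlib import Path
from typing import Optional, Tuple

# File name as it appears in the Space Ranger output tree above.
COUNTS_FILE = "filtered_feature_bc_matrix.h5"


def find_counts_file(path: str) -> Tuple[Optional[str], Path]:
    """Hypothetical helper: find the counts file with or without a library_id prefix.

    Returns (library_id, file_path); library_id is None for an unprefixed file.
    """
    root = Path(path)

    # Case 1: plain Space Ranger output, no prefix.
    plain = root / COUNTS_FILE
    if plain.is_file():
        return None, plain

    # Case 2: files named <library_id>_filtered_feature_bc_matrix.h5.
    suffix = f"_{COUNTS_FILE}"
    matches = sorted(p for p in root.glob(f"*{suffix}") if p.is_file())
    if len(matches) == 1:
        library_id = matches[0].name[: -len(suffix)]
        return library_id, matches[0]

    # Zero or several candidates: refuse to guess.
    raise ValueError(
        f"Expected {COUNTS_FILE} or exactly one <library_id>{suffix} in {root}, "
        f"found {len(matches)} prefixed candidates."
    )
```

Checking the unprefixed name first means an untouched Space Ranger directory works without renaming anything, while downloads that carry a prefix still resolve as long as there is only one candidate.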

benedekp commented 1 year ago

I just had the same issue with the naming of the files; it would be very useful to have the option to load without the prefix. I also encountered the second error message: TypeError: ufunc 'points' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''. For me it was related to the two types of SpaceRanger outputs: "tissue_positions.csv" and "tissue_positions_list.csv". When using squidpy, I had renamed this file so that I could use the sc.read_visium() command and then corrected for the format, and that is what caused the problem now. I see that @thjimmylee also had both files under the spatial folder, which probably caused the same mismatch between naming and format.
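
A minimal sketch of a reader that tolerates this mismatch is shown below. It rests on an assumption that is not confirmed in this thread: that "tissue_positions.csv" starts with a header row whose first field is "barcode", while "tissue_positions_list.csv" has no header. `read_tissue_positions` is a hypothetical helper, not part of spatialdata-io, and it decides by file content rather than by file name, so a renamed file still parses. One plausible reading of the TypeError is that a header row ended up in the coordinates as text, but that is a guess.

```python
import csv
from pathlib import Path
from typing import Dict, List, Union

# Assumed column order of both file variants (not stated in the thread).
COLUMNS = [
    "barcode",
    "in_tissue",
    "array_row",
    "array_col",
    "pxl_row_in_fullres",
    "pxl_col_in_fullres",
]


def read_tissue_positions(spatial_dir: str) -> List[Dict[str, Union[str, int, float]]]:
    """Hypothetical helper: read either tissue positions file into typed records."""
    spatial = Path(spatial_dir)
    new_style = spatial / "tissue_positions.csv"
    old_style = spatial / "tissue_positions_list.csv"
    source = new_style if new_style.is_file() else old_style

    with source.open(newline="") as handle:
        rows = [row for row in csv.reader(handle) if row]

    # Detect the header from the content, not from the file name.
    if rows and rows[0][0] == COLUMNS[0]:
        rows = rows[1:]

    # Convert explicitly so the pixel coordinates are numeric, never strings.
    return [
        {
            "barcode": r[0],
            "in_tissue": int(r[1]),
            "array_row": int(r[2]),
            "array_col": int(r[3]),
            "pxl_row_in_fullres": float(r[4]),
            "pxl_col_in_fullres": float(r[5]),
        }
        for r in rows
    ]
```

If both files are present, as in the tree above, this sketch prefers "tissue_positions.csv"; that preference is a design choice for the example, not documented behaviour of any reader.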

LucaMarconato commented 1 year ago

The PR https://github.com/scverse/spatialdata-io/pull/91 should fix the problem.

I haven't run a full test of the various SpaceRanger versions like @giovp did in https://github.com/scverse/spatialdata-io/pull/51, but I am testing against three datasets (see details in the PR), including one that doesn't contain the dataset_id in the file name. I am also now testing these three datasets in a nightly job, which should keep this bug from coming back in the future.

@grst @benedekp @ilia-kats @thjimmylee, if you have the chance, please let me know whether this fixes your problem. If not, I am happy to be more systematic and include more datasets in the nightly job.