single-cell-data / TileDB-SOMA

Python and R SOMA APIs using TileDB’s cloud-native format. Ideal for single-cell data at any scale.
MIT License

Error while ingesting multiple H5ADs into one SOMA object: ArrowInvalid: Failed to parse string: '<category column>' as a scalar of type double [Bug] #2696

Closed: danishzmalik closed this issue 1 month ago

danishzmalik commented 1 month ago

Describe the bug
I'm trying to ingest multiple h5ad files into one SOMA object in S3; the steps are shown in the code below.

Both files have the same obs and var schemas with the same data types. If I ingest the problematic AnnData individually it works fine; it only errors when I try to append it.

To Reproduce

def readfile(filepath):
    with vfs.open(filepath) as h5ad:
        adata = ad.read_h5ad(h5ad)

    # <required_columns> is a placeholder for the list of required column names.
    adata.obs = pd.DataFrame(adata.obs[[<required_columns>]])
    adata.var = pd.DataFrame(adata.var[[<required_columns>]])
    adata.obsp = None
    adata.varp = None

    return adata

def initial_ingestion(adata):
    print("Initial Ingestion")
    tiledbsoma.io.from_anndata(
        experiment_uri=s3_target,
        anndata=adata,
        uns_keys=[],
        measurement_name="RNA",
    )

def register_adata(adata):
    print("Registering Anndata")
    rd = tiledbsoma.io.register_anndatas(
        experiment_uri=s3_target,
        adatas=[adata],
        measurement_name="RNA",
        obs_field_name="obs_id",
        var_field_name="var_id",
        append_obsm_varm=True,
    )
    return rd

def ingest_to_soma(rd, adata):
    print("Ingesting to SOMA")
    tiledbsoma.io.from_anndata(
        experiment_uri=s3_target,
        anndata=adata,
        measurement_name="RNA",
        uns_keys=[],
        ingest_mode="write",
        registration_mapping=rd,
    )
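
For reference, a minimal driver loop tying these functions together (a sketch reconstructed from the traceback in a later comment; filepaths is a hypothetical list of the h5ad URIs):

filepaths = [...]  # hypothetical list of h5ad file URIs in S3

for i, filepath in enumerate(filepaths):
    adata = readfile(filepath)
    if i == 0:
        initial_ingestion(adata)
    else:
        rd = register_adata(adata)
        ingest_to_soma(rd, adata)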

The error suggests that the script is trying to convert some of the category columns to double. Why could this be happening?
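
A quick way to spot the offending columns before ingesting is to compare the Arrow schemas inferred from each file's obs (a sketch, assuming the two AnnData objects are loaded as adata1 and adata2):

import pyarrow as pa

# Infer the Arrow schema pyarrow would build from each obs dataframe.
schema1 = pa.Schema.from_pandas(adata1.obs, preserve_index=False)
schema2 = pa.Schema.from_pandas(adata2.obs, preserve_index=False)

# Report any columns whose inferred types differ between the two files.
# (Assumes both obs dataframes have the same column order.)
for f1, f2 in zip(schema1, schema2):
    if f1.type != f2.type:
        print(f"mismatch in {f1.name}: {f1.type} vs {f2.type}")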


Additional context
The error trace is as follows:

> ArrowInvalid: Failed to parse string: '<Category Column>' as a scalar of type double
> ---------------------------------------------------------------------------
> ArrowInvalid                              Traceback (most recent call last)
> File <command-2655924637900263>, line 1
> ----> 1 tiledbsoma.io.from_anndata(
>       2     experiment_uri = "s3://rit-rdm-dev-zone0-n1q5n51i2wyc/raw/rdm/soma_object2",
>       3     anndata= adata_test2,
>       4     registration_mapping=rd,
>       5     uns_keys= [],
>       6     measurement_name="RNA",
>       7     
>       8 )
> 
> File /local_disk0/.ephemeral_nfs/envs/pythonEnv-c3e6456f-5da4-4137-873c-abf3762cb9a6/lib/python3.10/site-packages/tiledbsoma/io/ingest.py:495, in from_anndata(experiment_uri, anndata, measurement_name, context, platform_config, obs_id_name, var_id_name, X_layer_name, raw_X_layer_name, ingest_mode, use_relative_uri, X_kind, registration_mapping, uns_keys, additional_metadata)
>     492 # - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
>     493 # OBS
>     494 df_uri = _util.uri_joinpath(experiment_uri, "obs")
> --> 495 with _write_dataframe(
>     496     df_uri,
>     497     conversions.decategoricalize_obs_or_var(anndata.obs),
>     498     id_column_name=obs_id_name,
>     499     platform_config=platform_config,
>     500     axis_mapping=jidmaps.obs_axis,
>     501     **ingest_ctx,
>     502 ) as obs:
>     503     _maybe_set(experiment, "obs", obs, use_relative_uri=use_relative_uri)
>     505 # - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
>     506 # MS
>     507 
>    (...)
>     511 # be of the form tiledb://namespace/uuid. Only for the former is it suitable
>     512 # to append "/ms" so that is what we do here.
> 
> File /local_disk0/.ephemeral_nfs/envs/pythonEnv-c3e6456f-5da4-4137-873c-abf3762cb9a6/lib/python3.10/site-packages/tiledbsoma/io/ingest.py:1168, in _write_dataframe(df_uri, df, id_column_name, ingestion_params, additional_metadata, platform_config, context, axis_mapping)
>    1164 df[SOMA_JOINID] = np.asarray(axis_mapping.data, dtype=np.int64)
>    1166 df.set_index(SOMA_JOINID, inplace=True)
> -> 1168 return _write_dataframe_impl(
>    1169     df,
>    1170     df_uri,
>    1171     id_column_name,
>    1172     ingestion_params=ingestion_params,
>    1173     additional_metadata=additional_metadata,
>    1174     original_index_name=original_index_name,
>    1175     platform_config=platform_config,
>    1176     context=context,
>    1177 )
> 
> File /local_disk0/.ephemeral_nfs/envs/pythonEnv-c3e6456f-5da4-4137-873c-abf3762cb9a6/lib/python3.10/site-packages/tiledbsoma/io/ingest.py:1242, in _write_dataframe_impl(df, df_uri, id_column_name, ingestion_params, additional_metadata, original_index_name, platform_config, context)
>    1239 tiledb_create_options = TileDBCreateOptions.from_platform_config(platform_config)
>    1241 if arrow_table:
> -> 1242     _write_arrow_table(arrow_table, soma_df, tiledb_create_options)
>    1244 # Save the original index name for outgest. We use JSON for elegant indication of index name
>    1245 # being None (in Python anyway).
>    1246 soma_df.metadata[_DATAFRAME_ORIGINAL_INDEX_NAME_JSON] = json.dumps(
>    1247     original_index_name
>    1248 )
> 
> File /local_disk0/.ephemeral_nfs/envs/pythonEnv-c3e6456f-5da4-4137-873c-abf3762cb9a6/lib/python3.10/site-packages/tiledbsoma/io/ingest.py:1128, in _write_arrow_table(arrow_table, handle, tiledb_create_options)
>    1123 else:
>    1124     logging.log_io(
>    1125         None,
>    1126         f"Write Arrow table num_rows={len(arrow_table)} num_bytes={arrow_table.nbytes} cap={cap}",
>    1127     )
> -> 1128     handle.write(arrow_table)
> 
> File /local_disk0/.ephemeral_nfs/envs/pythonEnv-c3e6456f-5da4-4137-873c-abf3762cb9a6/lib/python3.10/site-packages/tiledbsoma/_dataframe.py:448, in DataFrame.write(self, values, platform_config)
>     444 _util.check_type("values", values, (pa.Table,))
>     446 clib_dataframe = self._handle._handle
> --> 448 values = _util.cast_values_to_target_schema(clib_dataframe, values, self.schema)
>     450 for batch in values.to_batches():
>     451     clib_dataframe.write(batch)
> 
> File /local_disk0/.ephemeral_nfs/envs/pythonEnv-c3e6456f-5da4-4137-873c-abf3762cb9a6/lib/python3.10/site-packages/tiledbsoma/_util.py:384, in cast_values_to_target_schema(clib_array, values, schema)
>     380         target_schema.append(target_field)
>     382 new_schema = pa.schema(target_schema, values.schema.metadata)
> --> 384 return values.cast(new_schema)
> 
> File /local_disk0/.ephemeral_nfs/envs/pythonEnv-c3e6456f-5da4-4137-873c-abf3762cb9a6/lib/python3.10/site-packages/pyarrow/table.pxi:4457, in pyarrow.lib.Table.cast()
> 
> File /local_disk0/.ephemeral_nfs/envs/pythonEnv-c3e6456f-5da4-4137-873c-abf3762cb9a6/lib/python3.10/site-packages/pyarrow/table.pxi:574, in pyarrow.lib.ChunkedArray.cast()
> 
> File /local_disk0/.ephemeral_nfs/envs/pythonEnv-c3e6456f-5da4-4137-873c-abf3762cb9a6/lib/python3.10/site-packages/pyarrow/compute.py:404, in cast(arr, target_type, safe, options, memory_pool)
>     402     else:
>     403         options = CastOptions.safe(target_type)
> --> 404 return call_function("cast", [arr], options, memory_pool)
> 
> File /local_disk0/.ephemeral_nfs/envs/pythonEnv-c3e6456f-5da4-4137-873c-abf3762cb9a6/lib/python3.10/site-packages/pyarrow/_compute.pyx:590, in pyarrow._compute.call_function()
> 
> File /local_disk0/.ephemeral_nfs/envs/pythonEnv-c3e6456f-5da4-4137-873c-abf3762cb9a6/lib/python3.10/site-packages/pyarrow/_compute.pyx:385, in pyarrow._compute.Function.call()
> 
> File /local_disk0/.ephemeral_nfs/envs/pythonEnv-c3e6456f-5da4-4137-873c-abf3762cb9a6/lib/python3.10/site-packages/pyarrow/error.pxi:154, in pyarrow.lib.pyarrow_internal_check_status()
> 
> File /local_disk0/.ephemeral_nfs/envs/pythonEnv-c3e6456f-5da4-4137-873c-abf3762cb9a6/lib/python3.10/site-packages/pyarrow/error.pxi:91, in pyarrow.lib.check_status()
> 
> ArrowInvalid: Failed to parse string: '<Category Column>' as a scalar of type double
> 
danishzmalik commented 1 month ago

PS: It doesn't happen with all category columns. I tried keeping 'orig.ident' as the only category column in my obs layer and it worked fine, i.e. I was able to append all the files. So the behaviour is somewhat unpredictable.

Also, the string shown in the error is actually the column value, not the column name.

danishzmalik commented 1 month ago

@johnkerl
Hi, any update on this? Could it be a potential bug?

johnkerl commented 1 month ago

My apologies @danishzmalik -- I'll look into this today

johnkerl commented 1 month ago

@danishzmalik can you please do the following and then share the outputs?

danishzmalik commented 1 month ago

@johnkerl
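
Here are the obs schemas. They can be read roughly like this (a sketch, assuming direct access to each experiment's obs dataframe; the "/obs" URI join is an assumption):

import tiledbsoma

# Open the obs dataframe of an experiment and print its Arrow schema.
with tiledbsoma.open(s3_target + "/obs") as obs:
    print(obs.schema)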

For URI1:

> soma_joinid: int64 not null
> obs_id: large_string
> orig.ident: dictionary<values=string, indices=int32, ordered=0> not null
> nFeature_RNA: int64 not null
> nCount_RNA: float not null
> percent.mt: float not null
> high_mt: bool not null
> high_ct: bool not null
> basic_qc: dictionary<values=string, indices=int32, ordered=0> not null
> sample_id: dictionary<values=string, indices=int32, ordered=0> not null
> author_cell_type: double
> author_cell_type_cell_ontology_id: double
> author_cell_type_cell_ontology_name: double
> study_id: dictionary<values=string, indices=int32, ordered=0> not null
> sample_name: dictionary<values=string, indices=int32, ordered=0> not null
> donor_id: dictionary<values=string, indices=int32, ordered=0> not null
> sample_type: dictionary<values=string, indices=int32, ordered=0> not null

For URI2:

> soma_joinid: int64 not null
> obs_id: large_string
> orig.ident: dictionary<values=string, indices=int32, ordered=0> not null
> nFeature_RNA: int64 not null
> nCount_RNA: float not null
> percent.mt: float not null
> high_mt: bool not null
> high_ct: bool not null
> basic_qc: dictionary<values=string, indices=int32, ordered=0> not null
> sample_id: dictionary<values=string, indices=int32, ordered=0> not null
> author_cell_type: dictionary<values=string, indices=int32, ordered=0> not null
> author_cell_type_cell_ontology_id: dictionary<values=string, indices=int32, ordered=0> not null
> author_cell_type_cell_ontology_name: dictionary<values=string, indices=int32, ordered=0> not null
> study_id: dictionary<values=string, indices=int32, ordered=0> not null
> sample_name: dictionary<values=string, indices=int32, ordered=0> not null
> donor_id: dictionary<values=string, indices=int32, ordered=0> not null
> sample_type: dictionary<values=string, indices=int32, ordered=0> not null
johnkerl commented 1 month ago

Thanks @danishzmalik ! Diffing these I see:

11,13c11,13
< > author_cell_type: double
< > author_cell_type_cell_ontology_id: double
< > author_cell_type_cell_ontology_name: double
---
> > author_cell_type: dictionary<values=string, indices=int32, ordered=0> not null
> > author_cell_type_cell_ontology_id: dictionary<values=string, indices=int32, ordered=0> not null
> > author_cell_type_cell_ontology_name: dictionary<values=string, indices=int32, ordered=0> not null

which suggests that these three columns contain floating-point data for the first input, but strings for the second input ...
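
A common way this arises (an assumption about this dataset, not something confirmed from it): pandas types a column that is entirely empty/NaN as float64, which Arrow reports as double, while the same column holding actual category values in another file is inferred as a string dictionary. A minimal sketch:

import pandas as pd
import pyarrow as pa

# An all-NaN column is inferred as float64, i.e. Arrow double ...
df_nan = pd.DataFrame({"author_cell_type": [float("nan")] * 3})
# ... while the same column with string categories becomes a dictionary type.
df_str = pd.DataFrame({"author_cell_type": pd.Categorical(["T cell", "B cell", "T cell"])})

print(pa.Schema.from_pandas(df_nan, preserve_index=False).field("author_cell_type").type)  # double
print(pa.Schema.from_pandas(df_str, preserve_index=False).field("author_cell_type").type)  # dictionary<...>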

Are you able to share the data contents for these columns for both inputs?

Also, I forgot to ask: can you share ... ?

danishzmalik commented 1 month ago

@johnkerl Apologies for the delay in response.

Thanks for your last reply. I did notice there was a difference in data types between the two files, so before processing I converted the float columns to category columns:

column_dtypes = {
    "author_cell_type": "category",
    "author_cell_type_cell_ontology_id": "category",
    "author_cell_type_cell_ontology_name": "category",
}
for col, dtype in column_dtypes.items():
    adata.obs[col] = adata.obs[col].astype(dtype)

After doing so, I'm no longer seeing the error I posted earlier. However, there is another issue now that I'm facing, to do with ID columns.

The error is: ArrowIndexError: Index -1 out of bounds

I have tried to troubleshoot this, but I've hit a roadblock.

The dtypes now look like this:

adata1.obs.dtypes

orig.ident                             category
nFeature_RNA                              int64
nCount_RNA                              float32
percent.mt                              float32
high_mt                                    bool
high_ct                                    bool
basic_qc                               category
sample_id                              category
author_cell_type                       category
author_cell_type_cell_ontology_id      category
author_cell_type_cell_ontology_name    category
study_id                               category
sample_name                            category
donor_id                               category
sample_type                            category
dtype: object

adata1.var.dtypes

gene_ids             object
highly_variable        bool
means               float64
dispersions         float64
dispersions_norm    float32
dtype: object

adata2.obs.dtypes

orig.ident                             category
nFeature_RNA                              int64
nCount_RNA                              float32
percent.mt                              float32
high_mt                                    bool
high_ct                                    bool
basic_qc                               category
sample_id                              category
author_cell_type                       category
author_cell_type_cell_ontology_id      category
author_cell_type_cell_ontology_name    category
study_id                               category
sample_name                            category
donor_id                               category
sample_type                            category
dtype: object

adata2.var.dtypes

gene_ids             object
highly_variable        bool
means               float64
dispersions         float64
dispersions_norm    float32
dtype: object

Full Error Trace

ArrowIndexError: Index -1 out of bounds

ArrowIndexError                           Traceback (most recent call last)
File , line 12
     10 else:
     11     rd = register_adata(adata)
---> 12     ingest_to_soma(rd, adata)

File , line 3, in ingest_to_soma(rd, adata)
      1 def ingest_to_soma(rd, adata):
      2     print("Ingesting to SOMA")
----> 3     tiledbsoma.io.from_anndata(
      4         experiment_uri = s3_target,
      5         anndata= adata,
      6         measurement_name="RNA",
      7         uns_keys=[],
      8         ingest_mode="write",
      9         registration_mapping=rd,
     10     )

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-a56b4fec-a1fa-4341-bb15-a28f687139c5/lib/python3.10/site-packages/tiledbsoma/io/ingest.py:495, in from_anndata(experiment_uri, anndata, measurement_name, context, platform_config, obs_id_name, var_id_name, X_layer_name, raw_X_layer_name, ingest_mode, use_relative_uri, X_kind, registration_mapping, uns_keys, additional_metadata)
    492 # - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
    493 # OBS
    494 df_uri = _util.uri_joinpath(experiment_uri, "obs")
--> 495 with _write_dataframe(
    496     df_uri,
    497     conversions.decategoricalize_obs_or_var(anndata.obs),
    498     id_column_name=obs_id_name,
    499     platform_config=platform_config,
    500     axis_mapping=jidmaps.obs_axis,
    501     **ingest_ctx,
    502 ) as obs:
    503     _maybe_set(experiment, "obs", obs, use_relative_uri=use_relative_uri)
    505 # - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
    506 # MS
    507
   (...)
    511 # be of the form tiledb://namespace/uuid. Only for the former is it suitable
    512 # to append "/ms" so that is what we do here.

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-a56b4fec-a1fa-4341-bb15-a28f687139c5/lib/python3.10/site-packages/tiledbsoma/io/ingest.py:1168, in _write_dataframe(df_uri, df, id_column_name, ingestion_params, additional_metadata, platform_config, context, axis_mapping)
   1164 df[SOMA_JOINID] = np.asarray(axis_mapping.data, dtype=np.int64)
   1166 df.set_index(SOMA_JOINID, inplace=True)
-> 1168 return _write_dataframe_impl(
   1169     df,
   1170     df_uri,
   1171     id_column_name,
   1172     ingestion_params=ingestion_params,
   1173     additional_metadata=additional_metadata,
   1174     original_index_name=original_index_name,
   1175     platform_config=platform_config,
   1176     context=context,
   1177 )

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-a56b4fec-a1fa-4341-bb15-a28f687139c5/lib/python3.10/site-packages/tiledbsoma/io/ingest.py:1204, in _write_dataframe_impl(df, df_uri, id_column_name, ingestion_params, additional_metadata, original_index_name, platform_config, context)
   1200 if id_column_name is None:
   1201     # Nominally, nil id_column_name only happens for uns append and we do not append uns,
   1202     # which is a concern for our caller. This is a second-level check.
   1203     raise ValueError("internal coding error: id_column_name unspecified")
-> 1204 arrow_table = _extract_new_values_for_append(
   1205     df_uri, arrow_table, id_column_name, context
   1206 )
   1208 try:
   1209     soma_df = DataFrame.create(
   1210         df_uri,
   1211         schema=arrow_table.schema,
   1212         platform_config=platform_config,
   1213         context=context,
   1214     )

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-a56b4fec-a1fa-4341-bb15-a28f687139c5/lib/python3.10/site-packages/tiledbsoma/io/ingest.py:1095, in _extract_new_values_for_append(df_uri, arrow_table, id_column_name, context)
   1091 with _factory.open(
   1092     df_uri, "r", soma_type=DataFrame, context=context
   1093 ) as previous_soma_dataframe:
   1094     previous_table = previous_soma_dataframe.read().concat()
-> 1095     previous_df = previous_table.to_pandas()
   1096 previous_join_ids = set(
   1097     int(e) for e in get_dataframe_values(previous_df, SOMA_JOINID)
   1098 )
   1099 mask = [
   1100     e.as_py() not in previous_join_ids for e in arrow_table[SOMA_JOINID]
   1101 ]

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-a56b4fec-a1fa-4341-bb15-a28f687139c5/lib/python3.10/site-packages/pyarrow/array.pxi:872, in pyarrow.lib._PandasConvertible.to_pandas()

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-a56b4fec-a1fa-4341-bb15-a28f687139c5/lib/python3.10/site-packages/pyarrow/table.pxi:4904, in pyarrow.lib.Table._to_pandas()

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-a56b4fec-a1fa-4341-bb15-a28f687139c5/lib/python3.10/site-packages/pyarrow/pandas_compat.py:779, in table_to_dataframe(options, table, categories, ignore_metadata, types_mapper)
    776 columns = _deserialize_column_index(table, all_columns, column_indexes)
    778 column_names = table.column_names
--> 779 result = pa.lib.table_to_blocks(options, table, categories,
    780                                 list(ext_columns_dtypes.keys()))
    781 if _pandas_api.is_ge_v3():
    782     from pandas.api.internals import create_dataframe_from_blocks

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-a56b4fec-a1fa-4341-bb15-a28f687139c5/lib/python3.10/site-packages/pyarrow/table.pxi:3771, in pyarrow.lib.table_to_blocks()

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-a56b4fec-a1fa-4341-bb15-a28f687139c5/lib/python3.10/site-packages/pyarrow/error.pxi:91, in pyarrow.lib.check_status()

ArrowIndexError: Index -1 out of bounds

johnkerl commented 1 month ago

Thanks @danishzmalik !

Can you please share the output of tiledbsoma.show_package_versions()?

danishzmalik commented 1 month ago

@johnkerl

show_package_versions() doesn't really work on Databricks, so I ran %pip list instead.

[screenshot of %pip list output omitted]

Python version: 3.10.12. Does this help?
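
For what it's worth, the installed version can also be read directly (a sketch; assumes the package exposes __version__, as recent releases do):

import tiledbsoma
print(tiledbsoma.__version__)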

johnkerl commented 1 month ago

That will work! That ArrowIndexError: Index -1 out of bounds is fixed as of 1.11.4 -- can you upgrade and test this out on your dataset?
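
On Databricks the upgrade could look something like this (a sketch; dbutils.library.restartPython() is Databricks' helper to restart the Python process so the new version is picked up):

%pip install --upgrade "tiledbsoma>=1.11.4"

dbutils.library.restartPython()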

danishzmalik commented 1 month ago

@johnkerl Thanks, let me try that and get back to you

danishzmalik commented 1 month ago

Hi @johnkerl, upgrading the tiledbsoma package to 1.11.4 resolved the issue. Thank you!

I started experiencing the data-type issue again; however, I was able to resolve it by converting the columns to 'string' instead of 'category'.

column_dtypes = {
    "author_cell_type": "string",
    "author_cell_type_cell_ontology_id": "string",
    "author_cell_type_cell_ontology_name": "string",
}
for col, dtype in column_dtypes.items():
    adata.obs[col] = adata.obs[col].astype(dtype)
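
Casting to plain 'string' sidesteps the per-file category mismatch entirely: every file then presents these columns as plain strings, rather than as dictionaries whose category sets differ from file to file, so the appended schemas line up.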

I am currently running a script which should ingest around 200 AnnData objects into a single SOMA object. In case I encounter any further issues, should I open another thread or shall I continue using this one?

johnkerl commented 1 month ago

Upgrading the tiledbsoma package to 1.11.4 resolved the issue. Thank you!

Fantastic! :)

In case I encounter any further issues, should I open another thread or shall I continue using this one?

Please open a new issue -- thank you!