Describe the bug
I'm trying to ingest multiple h5ad files into one SOMA object in S3. Steps:

Both files have the same obs and var schemas with the same data types. If I ingest the problematic AnnData individually it works fine; it only gives an error when I try to append it.

To Reproduce

The error suggests that the script is trying to convert some of the category columns to double. Why could this be happening?

Versions (please complete the following information):

Additional context
The error trace is as follows:
PS. It doesn't happen with all category columns. I tried keeping 'orig.ident' as the only category column in my obs layer and it worked fine, i.e. I was able to append all the files. So the behaviour is somewhat unpredictable.
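For context, the append flow under discussion looks roughly like this (a sketch under assumptions: s3_target, h5ad_paths, and the obs/var field names are illustrative, not taken from the thread):

import anndata as ad
import tiledbsoma.io

s3_target = "s3://bucket/experiment"       # illustrative URI
h5ad_paths = ["file1.h5ad", "file2.h5ad"]  # illustrative inputs

for i, path in enumerate(h5ad_paths):
    adata = ad.read_h5ad(path)
    if i == 0:
        # The first file creates the experiment.
        tiledbsoma.io.from_anndata(s3_target, adata, measurement_name="RNA")
    else:
        # Later files are registered against it, then appended.
        rd = tiledbsoma.io.register_anndatas(
            s3_target,
            [adata],
            measurement_name="RNA",
            obs_field_name="obs_id",
            var_field_name="var_id",
        )
        tiledbsoma.io.from_anndata(
            s3_target,
            adata,
            measurement_name="RNA",
            registration_mapping=rd,
        )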
Also, @johnkerl: hi, any update on this? Could it be a potential bug?
My apologies @danishzmalik -- I'll look into this today
@danishzmalik can you please do the following and then share the outputs?
with tiledbsoma.Experiment.open(URI1) as exp:
    print(exp.obs.schema)

with tiledbsoma.Experiment.open(URI2) as exp:
    print(exp.obs.schema)
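To go beyond eyeballing the two printouts, a small diff helper along these lines can be used (a sketch, not from the thread; URI1/URI2 as in the snippet above):

import tiledbsoma

# Print only the obs fields whose Arrow types differ between the two
# experiments.
with tiledbsoma.Experiment.open(URI1) as exp1, tiledbsoma.Experiment.open(URI2) as exp2:
    s1, s2 = exp1.obs.schema, exp2.obs.schema
    for name in sorted(set(s1.names) | set(s2.names)):
        t1 = s1.field(name).type if name in s1.names else None
        t2 = s2.field(name).type if name in s2.names else None
        if t1 != t2:
            print(f"{name}: {t1} vs {t2}")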
@johnkerl
For URI1:
> soma_joinid: int64 not null
> obs_id: large_string
> orig.ident: dictionary<values=string, indices=int32, ordered=0> not null
> nFeature_RNA: int64 not null
> nCount_RNA: float not null
> percent.mt: float not null
> high_mt: bool not null
> high_ct: bool not null
> basic_qc: dictionary<values=string, indices=int32, ordered=0> not null
> sample_id: dictionary<values=string, indices=int32, ordered=0> not null
> author_cell_type: double
> author_cell_type_cell_ontology_id: double
> author_cell_type_cell_ontology_name: double
> study_id: dictionary<values=string, indices=int32, ordered=0> not null
> sample_name: dictionary<values=string, indices=int32, ordered=0> not null
> donor_id: dictionary<values=string, indices=int32, ordered=0> not null
> sample_type: dictionary<values=string, indices=int32, ordered=0> not null
For URI2:
> soma_joinid: int64 not null
> obs_id: large_string
> orig.ident: dictionary<values=string, indices=int32, ordered=0> not null
> nFeature_RNA: int64 not null
> nCount_RNA: float not null
> percent.mt: float not null
> high_mt: bool not null
> high_ct: bool not null
> basic_qc: dictionary<values=string, indices=int32, ordered=0> not null
> sample_id: dictionary<values=string, indices=int32, ordered=0> not null
> author_cell_type: dictionary<values=string, indices=int32, ordered=0> not null
> author_cell_type_cell_ontology_id: dictionary<values=string, indices=int32, ordered=0> not null
> author_cell_type_cell_ontology_name: dictionary<values=string, indices=int32, ordered=0> not null
> study_id: dictionary<values=string, indices=int32, ordered=0> not null
> sample_name: dictionary<values=string, indices=int32, ordered=0> not null
> donor_id: dictionary<values=string, indices=int32, ordered=0> not null
> sample_type: dictionary<values=string, indices=int32, ordered=0> not null
Thanks @danishzmalik! Diffing these I see:
11,13c11,13
< > author_cell_type: double
< > author_cell_type_cell_ontology_id: double
< > author_cell_type_cell_ontology_name: double
---
> > author_cell_type: dictionary<values=string, indices=int32, ordered=0> not null
> > author_cell_type_cell_ontology_id: dictionary<values=string, indices=int32, ordered=0> not null
> > author_cell_type_cell_ontology_name: dictionary<values=string, indices=int32, ordered=0> not null
which suggests that these three columns contain floating-point data for the first input, but strings for the second input ...
Are you able to share the data contents for these columns for both inputs? In particular, do you have np.nan or pd.NA within these columns?

Also, I forgot to ask: can you share

adata1.obs.dtypes
adata2.obs.dtypes

?
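For intuition on how the same logical column can come out as double in one file and dictionary&lt;string&gt; in another, here is a minimal illustration (mine, not from the thread):

import numpy as np
import pandas as pd
import pyarrow as pa

# An obs column that is entirely NaN is inferred by pandas as float64,
# which Arrow renders as `double`; labels stored as a pandas
# categorical become an Arrow dictionary<string> column instead.
all_nan = pd.Series([np.nan, np.nan])
labels = pd.Series(pd.Categorical(["T cell", "B cell"]))

print(pa.Array.from_pandas(all_nan).type)  # double
print(pa.Array.from_pandas(labels).type)   # dictionary<values=string, indices=int8, ordered=0>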
@johnkerl Apologies for the delay in response.
Thanks for your last reply. I did notice a difference in data types between the two files, so before processing I converted the float columns to category columns:
column_dtypes = {
    "author_cell_type": "category",
    "author_cell_type_cell_ontology_id": "category",
    "author_cell_type_cell_ontology_name": "category",
}
for col, dtype in column_dtypes.items():
    adata.obs[col] = adata.obs[col].astype(dtype)
After doing so, I'm no longer seeing the error I posted earlier. However, there is another issue I'm now facing, to do with ID columns.

The error is: ArrowIndexError: Index -1 out of bounds

I have tried to troubleshoot this, but I've hit a roadblock.

Here is what the dtypes look like now:
adata1.obs.dtypes

orig.ident                             category
nFeature_RNA                              int64
nCount_RNA                              float32
percent.mt                              float32
high_mt                                    bool
high_ct                                    bool
basic_qc                               category
sample_id                              category
author_cell_type                       category
author_cell_type_cell_ontology_id      category
author_cell_type_cell_ontology_name    category
study_id                               category
sample_name                            category
donor_id                               category
sample_type                            category
dtype: object

adata1.var.dtypes

gene_ids             object
highly_variable        bool
means               float64
dispersions         float64
dispersions_norm    float32
dtype: object

adata2.obs.dtypes

orig.ident                             category
nFeature_RNA                              int64
nCount_RNA                              float32
percent.mt                              float32
high_mt                                    bool
high_ct                                    bool
basic_qc                               category
sample_id                              category
author_cell_type                       category
author_cell_type_cell_ontology_id      category
author_cell_type_cell_ontology_name    category
study_id                               category
sample_name                            category
donor_id                               category
sample_type                            category
dtype: object

adata2.var.dtypes

gene_ids             object
highly_variable        bool
means               float64
dispersions         float64
dispersions_norm    float32
dtype: object
Full Error Trace
ArrowIndexError                           Traceback (most recent call last)
File , line 12
     10 else:
     11     rd = register_adata(adata)
---> 12     ingest_to_soma(rd, adata)

File , line 3, in ingest_to_soma(rd, adata)
      1 def ingest_to_soma(rd, adata):
      2     print("Ingesting to SOMA")
----> 3     tiledbsoma.io.from_anndata(
      4         experiment_uri=s3_target,
      5         anndata=adata,
      6         measurement_name="RNA",
      7         uns_keys=[],
      8         ingest_mode="write",
      9         registration_mapping=rd,
     10     )

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-a56b4fec-a1fa-4341-bb15-a28f687139c5/lib/python3.10/site-packages/tiledbsoma/io/ingest.py:495, in from_anndata(experiment_uri, anndata, measurement_name, context, platform_config, obs_id_name, var_id_name, X_layer_name, raw_X_layer_name, ingest_mode, use_relative_uri, X_kind, registration_mapping, uns_keys, additional_metadata)
    492 # - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
    493 # OBS
    494 df_uri = _util.uri_joinpath(experiment_uri, "obs")
--> 495 with _write_dataframe(
    496     df_uri,
    497     conversions.decategoricalize_obs_or_var(anndata.obs),
    498     id_column_name=obs_id_name,
    499     platform_config=platform_config,
    500     axis_mapping=jidmaps.obs_axis,
    501     **ingest_ctx,
    502 ) as obs:
    503     _maybe_set(experiment, "obs", obs, use_relative_uri=use_relative_uri)
    505 # - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
    506 # MS
    (...)
    511 # be of the form tiledb://namespace/uuid. Only for the former is it suitable
    512 # to append "/ms" so that is what we do here.

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-a56b4fec-a1fa-4341-bb15-a28f687139c5/lib/python3.10/site-packages/tiledbsoma/io/ingest.py:1168, in _write_dataframe(df_uri, df, id_column_name, ingestion_params, additional_metadata, platform_config, context, axis_mapping)
   1164 df[SOMA_JOINID] = np.asarray(axis_mapping.data, dtype=np.int64)
   1166 df.set_index(SOMA_JOINID, inplace=True)
-> 1168 return _write_dataframe_impl(
   1169     df,
   1170     df_uri,
   1171     id_column_name,
   1172     ingestion_params=ingestion_params,
   1173     additional_metadata=additional_metadata,
   1174     original_index_name=original_index_name,
   1175     platform_config=platform_config,
   1176     context=context,
   1177 )

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-a56b4fec-a1fa-4341-bb15-a28f687139c5/lib/python3.10/site-packages/tiledbsoma/io/ingest.py:1204, in _write_dataframe_impl(df, df_uri, id_column_name, ingestion_params, additional_metadata, original_index_name, platform_config, context)
   1200 if id_column_name is None:
   1201     # Nominally, nil id_column_name only happens for uns append and we do not append uns,
   1202     # which is a concern for our caller. This is a second-level check.
   1203     raise ValueError("internal coding error: id_column_name unspecified")
-> 1204 arrow_table = _extract_new_values_for_append(
   1205     df_uri, arrow_table, id_column_name, context
   1206 )
   1208 try:
   1209     soma_df = DataFrame.create(
   1210         df_uri,
   1211         schema=arrow_table.schema,
   1212         platform_config=platform_config,
   1213         context=context,
   1214     )

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-a56b4fec-a1fa-4341-bb15-a28f687139c5/lib/python3.10/site-packages/tiledbsoma/io/ingest.py:1095, in _extract_new_values_for_append(df_uri, arrow_table, id_column_name, context)
   1091 with _factory.open(
   1092     df_uri, "r", soma_type=DataFrame, context=context
   1093 ) as previous_soma_dataframe:
   1094     previous_table = previous_soma_dataframe.read().concat()
-> 1095 previous_df = previous_table.to_pandas()
   1096 previous_join_ids = set(
   1097     int(e) for e in get_dataframe_values(previous_df, SOMA_JOINID)
   1098 )
   1099 mask = [
   1100     e.as_py() not in previous_join_ids for e in arrow_table[SOMA_JOINID]
   1101 ]

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-a56b4fec-a1fa-4341-bb15-a28f687139c5/lib/python3.10/site-packages/pyarrow/array.pxi:872, in pyarrow.lib._PandasConvertible.to_pandas()

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-a56b4fec-a1fa-4341-bb15-a28f687139c5/lib/python3.10/site-packages/pyarrow/table.pxi:4904, in pyarrow.lib.Table._to_pandas()

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-a56b4fec-a1fa-4341-bb15-a28f687139c5/lib/python3.10/site-packages/pyarrow/pandas_compat.py:779, in table_to_dataframe(options, table, categories, ignore_metadata, types_mapper)
    776 columns = _deserialize_column_index(table, all_columns, column_indexes)
    778 column_names = table.column_names
--> 779 result = pa.lib.table_to_blocks(options, table, categories,
    780                                 list(ext_columns_dtypes.keys()))
    781 if _pandas_api.is_ge_v3():
    782     from pandas.api.internals import create_dataframe_from_blocks

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-a56b4fec-a1fa-4341-bb15-a28f687139c5/lib/python3.10/site-packages/pyarrow/table.pxi:3771, in pyarrow.lib.table_to_blocks()

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-a56b4fec-a1fa-4341-bb15-a28f687139c5/lib/python3.10/site-packages/pyarrow/error.pxi:91, in pyarrow.lib.check_status()

ArrowIndexError: Index -1 out of bounds
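A possible connection to the -1 in that error (my illustration, not verified against the tiledbsoma internals): NaN entries cast to 'category' are stored as null codes of -1, and a -1 dictionary index is exactly the kind of value an affected pyarrow/tiledbsoma code path could trip over.

import numpy as np
import pandas as pd

# NaN in a categorical column has no category; its code is -1.
s = pd.Series([np.nan, "T cell"]).astype("category")
print(s.cat.categories.tolist())  # ['T cell']
print(s.cat.codes.tolist())       # [-1, 0] -- NaN becomes code -1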
Thanks @danishzmalik! Can you please share the output of tiledbsoma.show_package_versions()?
@johnkerl
show_package_versions() doesn't really work on Databricks, so I did %pip list instead.

Python version: 3.10.12

Does this help?
That will work! That ArrowIndexError: Index -1 out of bounds is fixed as of 1.11.4 -- can you upgrade and test this out on your dataset?
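On Databricks, one way to pick up the fix is an in-notebook upgrade (assuming a pip-managed notebook environment; the PyPI package name is tiledbsoma):

%pip install --upgrade "tiledbsoma>=1.11.4"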
@johnkerl Thanks, let me try that and get back to you
Hi @johnkerl, upgrading the tiledbsoma package to 1.11.4 resolved the issue. Thank you!
I started experiencing the data type issue again; however, I was able to resolve it by converting the data types to 'string' instead of 'category':
column_dtypes = {
    "author_cell_type": "string",
    "author_cell_type_cell_ontology_id": "string",
    "author_cell_type_cell_ontology_name": "string",
}
for col, dtype in column_dtypes.items():
    adata.obs[col] = adata.obs[col].astype(dtype)
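A reusable form of the same conversion (my sketch; the column list is the one from this thread) can be applied to every AnnData before registration, so all inputs present one Arrow type when appended:

STRING_COLS = [
    "author_cell_type",
    "author_cell_type_cell_ontology_id",
    "author_cell_type_cell_ontology_name",
]

def normalize_obs_dtypes(adata):
    # Cast the known-problematic columns to pandas' string dtype.
    for col in STRING_COLS:
        if col in adata.obs.columns:
            adata.obs[col] = adata.obs[col].astype("string")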
I am currently running a script which should ingest around 200 AnnData objects into a single SOMA object. In case I encounter any further issues, should I open another thread or continue using this one?
> Upgrading the tiledbsoma package to 1.11.4 resolved the issue. Thank you!
Fantastic! :)
> In case I encounter any further issues, should I open another thread or continue using this one?
Please open a new issue -- thank you!