single-cell-data / TileDB-SOMA

Python and R SOMA APIs using TileDB’s cloud-native format. Ideal for single-cell data at any scale.
https://tiledbsoma.readthedocs.io
MIT License

Census-builder unit test failing with 1.11.3 #2663

Closed: johnkerl closed this issue 4 months ago

johnkerl commented 4 months ago

From @prathapsridharan

pytest tools/cellxgene_census_builder/tests/test_builder.py::test_base_builder_creation[False-census_build_args0] -v
tiledb.cc.TileDBError: Enumeration: Unable to extend an enumeration without a data buffer.

[sc-48539]

johnkerl commented 4 months ago

@prathapsridharan to evaluate https://github.com/single-cell-data/TileDB-SOMA/pull/2629

prathapsridharan commented 4 months ago

Attaching a script that reproduces this case:

import pandas as pd
import pyarrow as pa

import tiledbsoma 

from cellxgene_census_builder.build_soma.globals import CENSUS_OBS_TABLE_SPEC

# 16 rows; most columns hold a single repeated value.
n_rows = 16

df_dict = {
    'cell_type_ontology_term_id': ['CL:0000192'] * n_rows,
    # Alternates in blocks of four rows.
    'assay_ontology_term_id': (['EFO:0009922'] * 4 + ['EFO:0008931'] * 4) * 2,
    'disease_ontology_term_id': ['PATO:0000461'] * n_rows,
    'organism_ontology_term_id': ['NCBITaxon:9606'] * n_rows,
    'sex_ontology_term_id': ['unknown'] * n_rows,
    'tissue_ontology_term_id': ['CL:0000192'] * n_rows,
    'is_primary_data': [False] * n_rows,
    'self_reported_ethnicity_ontology_term_id': ['na'] * n_rows,
    'development_stage_ontology_term_id': ['MmusDv:0000003'] * n_rows,
    'donor_id': ['donor_2'] * n_rows,
    'suspension_type': ['na'] * n_rows,
    'assay': ['test'] * n_rows,
    'cell_type': ['test'] * n_rows,
    'development_stage': ['test'] * n_rows,
    'disease': ['test'] * n_rows,
    'self_reported_ethnicity': ['test'] * n_rows,
    'sex': ['test'] * n_rows,
    'tissue': ['test'] * n_rows,
    'organism': ['test'] * n_rows,
    'tissue_type': ['tissue'] * n_rows,
    'observation_joinid': ['test'] * n_rows,
    # Four rows each for homo_sapiens_0 .. homo_sapiens_3.
    'dataset_id': [f'homo_sapiens_{i // 4}' for i in range(n_rows)],
    'soma_joinid': list(range(n_rows)),
    'tissue_general_ontology_term_id': ['CL:0000192'] * n_rows,
    'tissue_general': ['smooth muscle cell'] * n_rows,
    'raw_sum': [11.0, 11.0, 12.0, 18.0, 11.0, 10.0, 6.0, 7.0, 13.0, 10.0, 8.0, 8.0, 13.0, 15.0, 15.0, 15.0],
    'nnz': [4] * 4 + [3] * 8 + [4] * 4,
    'raw_mean_nnz': [2.75, 2.75, 3.0, 4.5, 3.6666666666666665, 3.3333333333333335, 2.0, 2.3333333333333335, 4.333333333333333, 3.3333333333333335, 2.6666666666666665, 2.6666666666666665, 3.25, 3.75, 3.75, 3.75],
    'raw_variance_nnz': [2.25, 4.25, 0.6666666666666666, 1.0, 2.3333333333333335, 4.333333333333333, 3.0, 2.333333333333333, 1.3333333333333335, 0.33333333333333337, 0.33333333333333337, 2.3333333333333335, 1.5833333333333333, 0.25, 3.5833333333333335, 3.5833333333333335],
    'n_measured_vars': [4] * 4 + [3] * 8 + [4] * 4,
}

obs_df = pd.DataFrame.from_dict(df_dict)

obs_df = CENSUS_OBS_TABLE_SPEC.recategoricalize(obs_df)

obs_schema = CENSUS_OBS_TABLE_SPEC.to_arrow_schema(obs_df)

pa_table = pa.Table.from_pandas(obs_df, preserve_index=False, schema=obs_schema)

df_uri = "test_dataframe"

tiledbsoma.DataFrame.create(df_uri, schema=obs_schema, index_column_names=["soma_joinid"]).close()
with tiledbsoma.DataFrame.open(df_uri, "w") as sdf:
    sdf.write(pa_table)

prathapsridharan commented 4 months ago

More info: with the dataframe from the script in the comment above, the call to clib_array.extend_enumeration() fails for the assay column:

clib_array.extend_enumeration(assay, 
-- dictionary:
  [
    "test"
  ]
-- indices:
  [
    0,
    0,
    0,
    0,
    0,
    0,
    0,
    0,
    0,
    0,
    0,
    0,
    0,
    0,
    0,
    0
  ])

But it seems to have passed for dataset_id:

clib_array.extend_enumeration(dataset_id, 
-- dictionary:
  [
    "homo_sapiens_0",
    "homo_sapiens_1",
    "homo_sapiens_2",
    "homo_sapiens_3"
  ]
-- indices:
  [
    0,
    0,
    0,
    0,
    1,
    1,
    1,
    1,
    2,
    2,
    2,
    2,
    3,
    3,
    3,
    3
  ])

Note that the two columns have different types: dataset_id is a dictionary with string values, while assay is a dictionary with large_string values. Only assay, the large_string one, fails with the error tiledb.cc.TileDBError: Enumeration: Unable to extend an enumeration without a data buffer. The column types are:

dataset_id: dictionary (values: string)
assay: dictionary (values: large_string)

prathapsridharan commented 4 months ago

The issue seems to be fixed by this commit (verified by passing unit tests on census_builder): https://github.com/single-cell-data/TileDB-SOMA/commit/3869abcacb284ba654a61977d5543dbbddde3bfd

However, that commit is part of a larger body of work that has not yet landed on main.

johnkerl commented 4 months ago

Fix upcoming in 1.11.4
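
For anyone who wants to guard a build against the affected release, a version check could look roughly like the sketch below. It assumes the fix did ship in 1.11.4 as announced above; the function names are hypothetical, and only the standard library is used. Pre-release builds of 1.11.4 are treated as having the fix, which may not hold.

```python
from importlib import metadata


def release_tuple(version):
    """Turn '1.11.3' into (1, 11, 3), stopping at the first non-numeric part."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def has_enumeration_fix(version):
    """True if the version is at least 1.11.4 (assumed to contain the fix)."""
    return release_tuple(version) >= (1, 11, 4)


def installed_tiledbsoma_has_fix():
    """Check the installed tiledbsoma package; None if it is not installed."""
    try:
        return has_enumeration_fix(metadata.version("tiledbsoma"))
    except metadata.PackageNotFoundError:
        return None


print(has_enumeration_fix("1.11.3"))
print(has_enumeration_fix("1.11.4"))
```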