single-cell-data / TileDB-SOMA

Python and R SOMA APIs using TileDB’s cloud-native format. Ideal for single-cell data at any scale.
https://tiledbsoma.readthedocs.io
MIT License

[python] Correctly handle string vs large string in `Enumeration`s #2668

Closed nguyenv closed 3 months ago

nguyenv commented 3 months ago

Issue and/or context:

This fix has been tested against the code below and confirmed passing:

import pandas as pd
import pyarrow as pa

import tiledbsoma 

from cellxgene_census_builder.build_soma.globals import CENSUS_OBS_TABLE_SPEC

df_dict = {'cell_type_ontology_term_id': {0: 'CL:0000192', 1: 'CL:0000192', 2: 'CL:0000192', 3: 'CL:0000192', 4: 'CL:0000192', 5: 'CL:0000192', 6: 'CL:0000192', 7: 'CL:0000192', 8: 'CL:0000192', 9: 'CL:0000192', 10: 'CL:0000192', 11: 'CL:0000192', 12: 'CL:0000192', 13: 'CL:0000192', 14: 'CL:0000192', 15: 'CL:0000192'}, 'assay_ontology_term_id': {0: 'EFO:0009922', 1: 'EFO:0009922', 2: 'EFO:0009922', 3: 'EFO:0009922', 4: 'EFO:0008931', 5: 'EFO:0008931', 6: 'EFO:0008931', 7: 'EFO:0008931', 8: 'EFO:0009922', 9: 'EFO:0009922', 10: 'EFO:0009922', 11: 'EFO:0009922', 12: 'EFO:0008931', 13: 'EFO:0008931', 14: 'EFO:0008931', 15: 'EFO:0008931'}, 'disease_ontology_term_id': {0: 'PATO:0000461', 1: 'PATO:0000461', 2: 'PATO:0000461', 3: 'PATO:0000461', 4: 'PATO:0000461', 5: 'PATO:0000461', 6: 'PATO:0000461', 7: 'PATO:0000461', 8: 'PATO:0000461', 9: 'PATO:0000461', 10: 'PATO:0000461', 11: 'PATO:0000461', 12: 'PATO:0000461', 13: 'PATO:0000461', 14: 'PATO:0000461', 15: 'PATO:0000461'}, 'organism_ontology_term_id': {0: 'NCBITaxon:9606', 1: 'NCBITaxon:9606', 2: 'NCBITaxon:9606', 3: 'NCBITaxon:9606', 4: 'NCBITaxon:9606', 5: 'NCBITaxon:9606', 6: 'NCBITaxon:9606', 7: 'NCBITaxon:9606', 8: 'NCBITaxon:9606', 9: 'NCBITaxon:9606', 10: 'NCBITaxon:9606', 11: 'NCBITaxon:9606', 12: 'NCBITaxon:9606', 13: 'NCBITaxon:9606', 14: 'NCBITaxon:9606', 15: 'NCBITaxon:9606'}, 'sex_ontology_term_id': {0: 'unknown', 1: 'unknown', 2: 'unknown', 3: 'unknown', 4: 'unknown', 5: 'unknown', 6: 'unknown', 7: 'unknown', 8: 'unknown', 9: 'unknown', 10: 'unknown', 11: 'unknown', 12: 'unknown', 13: 'unknown', 14: 'unknown', 15: 'unknown'}, 'tissue_ontology_term_id': {0: 'CL:0000192', 1: 'CL:0000192', 2: 'CL:0000192', 3: 'CL:0000192', 4: 'CL:0000192', 5: 'CL:0000192', 6: 'CL:0000192', 7: 'CL:0000192', 8: 'CL:0000192', 9: 'CL:0000192', 10: 'CL:0000192', 11: 'CL:0000192', 12: 'CL:0000192', 13: 'CL:0000192', 14: 'CL:0000192', 15: 'CL:0000192'}, 'is_primary_data': {0: False, 1: False, 2: False, 3: False, 4: False, 5: False, 
6: False, 7: False, 8: False, 9: False, 10: False, 11: False, 12: False, 13: False, 14: False, 15: False}, 'self_reported_ethnicity_ontology_term_id': {0: 'na', 1: 'na', 2: 'na', 3: 'na', 4: 'na', 5: 'na', 6: 'na', 7: 'na', 8: 'na', 9: 'na', 10: 'na', 11: 'na', 12: 'na', 13: 'na', 14: 'na', 15: 'na'}, 'development_stage_ontology_term_id': {0: 'MmusDv:0000003', 1: 'MmusDv:0000003', 2: 'MmusDv:0000003', 3: 'MmusDv:0000003', 4: 'MmusDv:0000003', 5: 'MmusDv:0000003', 6: 'MmusDv:0000003', 7: 'MmusDv:0000003', 8: 'MmusDv:0000003', 9: 'MmusDv:0000003', 10: 'MmusDv:0000003', 11: 'MmusDv:0000003', 12: 'MmusDv:0000003', 13: 'MmusDv:0000003', 14: 'MmusDv:0000003', 15: 'MmusDv:0000003'}, 'donor_id': {0: 'donor_2', 1: 'donor_2', 2: 'donor_2', 3: 'donor_2', 4: 'donor_2', 5: 'donor_2', 6: 'donor_2', 7: 'donor_2', 8: 'donor_2', 9: 'donor_2', 10: 'donor_2', 11: 'donor_2', 12: 'donor_2', 13: 'donor_2', 14: 'donor_2', 15: 'donor_2'}, 'suspension_type': {0: 'na', 1: 'na', 2: 'na', 3: 'na', 4: 'na', 5: 'na', 6: 'na', 7: 'na', 8: 'na', 9: 'na', 10: 'na', 11: 'na', 12: 'na', 13: 'na', 14: 'na', 15: 'na'}, 'assay': {0: 'test', 1: 'test', 2: 'test', 3: 'test', 4: 'test', 5: 'test', 6: 'test', 7: 'test', 8: 'test', 9: 'test', 10: 'test', 11: 'test', 12: 'test', 13: 'test', 14: 'test', 15: 'test'}, 'cell_type': {0: 'test', 1: 'test', 2: 'test', 3: 'test', 4: 'test', 5: 'test', 6: 'test', 7: 'test', 8: 'test', 9: 'test', 10: 'test', 11: 'test', 12: 'test', 13: 'test', 14: 'test', 15: 'test'}, 'development_stage': {0: 'test', 1: 'test', 2: 'test', 3: 'test', 4: 'test', 5: 'test', 6: 'test', 7: 'test', 8: 'test', 9: 'test', 10: 'test', 11: 'test', 12: 'test', 13: 'test', 14: 'test', 15: 'test'}, 'disease': {0: 'test', 1: 'test', 2: 'test', 3: 'test', 4: 'test', 5: 'test', 6: 'test', 7: 'test', 8: 'test', 9: 'test', 10: 'test', 11: 'test', 12: 'test', 13: 'test', 14: 'test', 15: 'test'}, 'self_reported_ethnicity': {0: 'test', 1: 'test', 2: 'test', 3: 'test', 4: 'test', 5: 'test', 6: 'test', 7: 
'test', 8: 'test', 9: 'test', 10: 'test', 11: 'test', 12: 'test', 13: 'test', 14: 'test', 15: 'test'}, 'sex': {0: 'test', 1: 'test', 2: 'test', 3: 'test', 4: 'test', 5: 'test', 6: 'test', 7: 'test', 8: 'test', 9: 'test', 10: 'test', 11: 'test', 12: 'test', 13: 'test', 14: 'test', 15: 'test'}, 'tissue': {0: 'test', 1: 'test', 2: 'test', 3: 'test', 4: 'test', 5: 'test', 6: 'test', 7: 'test', 8: 'test', 9: 'test', 10: 'test', 11: 'test', 12: 'test', 13: 'test', 14: 'test', 15: 'test'}, 'organism': {0: 'test', 1: 'test', 2: 'test', 3: 'test', 4: 'test', 5: 'test', 6: 'test', 7: 'test', 8: 'test', 9: 'test', 10: 'test', 11: 'test', 12: 'test', 13: 'test', 14: 'test', 15: 'test'}, 'tissue_type': {0: 'tissue', 1: 'tissue', 2: 'tissue', 3: 'tissue', 4: 'tissue', 5: 'tissue', 6: 'tissue', 7: 'tissue', 8: 'tissue', 9: 'tissue', 10: 'tissue', 11: 'tissue', 12: 'tissue', 13: 'tissue', 14: 'tissue', 15: 'tissue'}, 'observation_joinid': {0: 'test', 1: 'test', 2: 'test', 3: 'test', 4: 'test', 5: 'test', 6: 'test', 7: 'test', 8: 'test', 9: 'test', 10: 'test', 11: 'test', 12: 'test', 13: 'test', 14: 'test', 15: 'test'}, 'dataset_id': {0: 'homo_sapiens_0', 1: 'homo_sapiens_0', 2: 'homo_sapiens_0', 3: 'homo_sapiens_0', 4: 'homo_sapiens_1', 5: 'homo_sapiens_1', 6: 'homo_sapiens_1', 7: 'homo_sapiens_1', 8: 'homo_sapiens_2', 9: 'homo_sapiens_2', 10: 'homo_sapiens_2', 11: 'homo_sapiens_2', 12: 'homo_sapiens_3', 13: 'homo_sapiens_3', 14: 'homo_sapiens_3', 15: 'homo_sapiens_3'}, 'soma_joinid': {0: 0, 1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6, 7: 7, 8: 8, 9: 9, 10: 10, 11: 11, 12: 12, 13: 13, 14: 14, 15: 15}, 'tissue_general_ontology_term_id': {0: 'CL:0000192', 1: 'CL:0000192', 2: 'CL:0000192', 3: 'CL:0000192', 4: 'CL:0000192', 5: 'CL:0000192', 6: 'CL:0000192', 7: 'CL:0000192', 8: 'CL:0000192', 9: 'CL:0000192', 10: 'CL:0000192', 11: 'CL:0000192', 12: 'CL:0000192', 13: 'CL:0000192', 14: 'CL:0000192', 15: 'CL:0000192'}, 'tissue_general': {0: 'smooth muscle cell', 1: 'smooth muscle cell',
2: 'smooth muscle cell', 3: 'smooth muscle cell', 4: 'smooth muscle cell', 5: 'smooth muscle cell', 6: 'smooth muscle cell', 7: 'smooth muscle cell', 8: 'smooth muscle cell', 9: 'smooth muscle cell', 10: 'smooth muscle cell', 11: 'smooth muscle cell', 12: 'smooth muscle cell', 13: 'smooth muscle cell', 14: 'smooth muscle cell', 15: 'smooth muscle cell'}, 'raw_sum': {0: 11.0, 1: 11.0, 2: 12.0, 3: 18.0, 4: 11.0, 5: 10.0, 6: 6.0, 7: 7.0, 8: 13.0, 9: 10.0, 10: 8.0, 11: 8.0, 12: 13.0, 13: 15.0, 14: 15.0, 15: 15.0}, 'nnz': {0: 4, 1: 4, 2: 4, 3: 4, 4: 3, 5: 3, 6: 3, 7: 3, 8: 3, 9: 3, 10: 3, 11: 3, 12: 4, 13: 4, 14: 4, 15: 4}, 'raw_mean_nnz': {0: 2.75, 1: 2.75, 2: 3.0, 3: 4.5, 4: 3.6666666666666665, 5: 3.3333333333333335, 6: 2.0, 7: 2.3333333333333335, 8: 4.333333333333333, 9: 3.3333333333333335, 10: 2.6666666666666665, 11: 2.6666666666666665, 12: 3.25, 13: 3.75, 14: 3.75, 15: 3.75}, 'raw_variance_nnz': {0: 2.25, 1: 4.25, 2: 0.6666666666666666, 3: 1.0, 4: 2.3333333333333335, 5: 4.333333333333333, 6: 3.0, 7: 2.333333333333333, 8: 1.3333333333333335, 9: 0.33333333333333337, 10: 0.33333333333333337, 11: 2.3333333333333335, 12: 1.5833333333333333, 13: 0.25, 14: 3.5833333333333335, 15: 3.5833333333333335}, 'n_measured_vars': {0: 4, 1: 4, 2: 4, 3: 4, 4: 3, 5: 3, 6: 3, 7: 3, 8: 3, 9: 3, 10: 3, 11: 3, 12: 4, 13: 4, 14: 4, 15: 4}}

obs_df = pd.DataFrame.from_dict(df_dict)

obs_df = CENSUS_OBS_TABLE_SPEC.recategoricalize(obs_df)

obs_schema = CENSUS_OBS_TABLE_SPEC.to_arrow_schema(obs_df)

pa_table = pa.Table.from_pandas(obs_df, preserve_index=False, schema=obs_schema)

df_uri = "test_dataframe"

tiledbsoma.DataFrame.create(df_uri, schema=obs_schema, index_column_names=["soma_joinid"]).close()
with tiledbsoma.DataFrame.open(df_uri, "w") as sdf:
    sdf.write(pa_table)

Changes:

This is a quick fix that casts each dictionary column in the Arrow Table to the type recorded in the on-disk schema. As a result, dictionary values of type large_string are cast to string, and values of type large_binary are cast to binary.

This will be replaced by https://github.com/single-cell-data/TileDB-SOMA/pull/2629 in C++ once that PR is complete (it has also already been confirmed to work).

codecov[bot] commented 3 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 90.19%. Comparing base (b85e5c6) to head (9d8befd).

Additional details and impacted files

```diff
@@           Coverage Diff           @@
##             main    #2668   +/-   ##
=======================================
  Coverage   90.19%   90.19%
=======================================
  Files          37       37
  Lines        4018     4019    +1
=======================================
+ Hits         3624     3625    +1
  Misses        394      394
```

| [Flag](https://app.codecov.io/gh/single-cell-data/TileDB-SOMA/pull/2668/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=single-cell-data) | Coverage Δ | |
|---|---|---|
| [python](https://app.codecov.io/gh/single-cell-data/TileDB-SOMA/pull/2668/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=single-cell-data) | `90.19% <100.00%> (+<0.01%)` | :arrow_up: |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=single-cell-data#carryforward-flags-in-the-pull-request-comment) to find out more.

| [Components](https://app.codecov.io/gh/single-cell-data/TileDB-SOMA/pull/2668/components?src=pr&el=components&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=single-cell-data) | Coverage Δ | |
|---|---|---|
| [python_api](https://app.codecov.io/gh/single-cell-data/TileDB-SOMA/pull/2668/components?src=pr&el=component&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=single-cell-data) | `90.19% <100.00%> (+<0.01%)` | :arrow_up: |
| [libtiledbsoma](https://app.codecov.io/gh/single-cell-data/TileDB-SOMA/pull/2668/components?src=pr&el=component&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=single-cell-data) | `∅ <ø> (∅)` | |