single-cell-data / TileDB-SOMA

Python and R SOMA APIs using TileDB’s cloud-native format. Ideal for single-cell data at any scale.
https://tiledbsoma.readthedocs.io
MIT License
90 stars 25 forks source link

[c++] Apply `ArraySchemaEvolution` to specified timestamp #2895

Closed johnkerl closed 2 months ago

johnkerl commented 2 months ago

Issue and/or context: #2879

Changes: If we've got the array open at a timestamp, and we need to evolve it, evolve from that same timestamp.

Notes for Reviewer:

Material from @nguyenv to be incorporated into unit-test material:

Script:

$ cat ts.py
import tiledb
import tiledbsoma as soma
import pyarrow as pa
import tempfile
import numpy as np
import pandas as pd
import time

with tempfile.TemporaryDirectory() as uri:
    asch = pa.schema([("foo", pa.dictionary(pa.int8(), pa.large_string()))])
    soma.DataFrame.create(uri, schema=asch, tiledb_timestamp=1).close()

    pydict = {}
    pydict["soma_joinid"] = [0, 1, 2]
    pydict["foo"] = pd.Series(['a', 'b', 'a'], dtype='category')
    rb = pa.Table.from_pydict(pydict)

    with soma.DataFrame.open(uri, "w", tiledb_timestamp=2) as sdf:
        sdf.write(rb)

    pydict = {}
    pydict["soma_joinid"] = [3, 4]
    pydict["foo"] = pd.Series(['b', 'b'], dtype='category')
    rb = pa.Table.from_pydict(pydict)

    with soma.DataFrame.open(uri, "w", tiledb_timestamp=3) as sdf:
        sdf.write(rb)

    with soma.DataFrame.open(uri) as sdf:
        table = sdf.read().concat()
        print(table)

    with soma.DataFrame.open(uri, tiledb_timestamp=1) as sdf:
        table = sdf.read().concat()
        print(table)

    with soma.DataFrame.open(uri, tiledb_timestamp=2) as sdf:
        table = sdf.read().concat()
        print(table)

    with soma.DataFrame.open(uri, tiledb_timestamp=3) as sdf:
        table = sdf.read().concat()
        print(table)

Before:

pyarrow.Table
soma_joinid: int64
foo: dictionary<values=string, indices=int8, ordered=0>
----
soma_joinid: [[0,1,2,3,4]]
foo: [  -- dictionary:
["a","b"]  -- indices:
[0,1,0,0,0]]
pyarrow.Table
soma_joinid: int64
foo: dictionary<values=string, indices=int8, ordered=0>
----
soma_joinid: [[]]
foo: [  -- dictionary:
[]  -- indices:
[]]
pyarrow.Table
soma_joinid: int64
foo: dictionary<values=string, indices=int8, ordered=0>
----
soma_joinid: [[0,1,2]]
foo: [  -- dictionary:
[]  -- indices:
[0,1,0]]
pyarrow.Table
soma_joinid: int64
foo: dictionary<values=string, indices=int8, ordered=0>
----
soma_joinid: [[0,1,2,3,4]]
foo: [  -- dictionary:
[]  -- indices:
[0,1,0,0,0]]

After:

pyarrow.Table
soma_joinid: int64
foo: dictionary<values=string, indices=int8, ordered=0>
----
soma_joinid: [[0,1,2,3,4]]
foo: [  -- dictionary:
["a","b"]  -- indices:
[0,1,0,0,0]]
pyarrow.Table
soma_joinid: int64
foo: dictionary<values=string, indices=int8, ordered=0>
----
soma_joinid: [[]]
foo: [  -- dictionary:
[]  -- indices:
[]]
pyarrow.Table
soma_joinid: int64
foo: dictionary<values=string, indices=int8, ordered=0>
----
soma_joinid: [[0,1,2]]
foo: [  -- dictionary:
["a","b"]  -- indices:
[0,1,0]]
pyarrow.Table
soma_joinid: int64
foo: dictionary<values=string, indices=int8, ordered=0>
----
soma_joinid: [[0,1,2,3,4]]
foo: [  -- dictionary:
["a","b"]  -- indices:
[0,1,0,0,0]]

Difference:

23c23
< []  -- indices:
---
> ["a","b"]  -- indices:
31c31
< []  -- indices:
---
> ["a","b"]  -- indices:
codecov[bot] commented 2 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 89.98%. Comparing base (9f82dad) to head (d87cca3). Report is 4 commits behind head on main.

Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #2895 +/- ## ========================================== + Coverage 89.83% 89.98% +0.15% ========================================== Files 37 37 Lines 3926 3926 ========================================== + Hits 3527 3533 +6 + Misses 399 393 -6 ``` | [Flag](https://app.codecov.io/gh/single-cell-data/TileDB-SOMA/pull/2895/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=single-cell-data) | Coverage Δ | | |---|---|---| | [python](https://app.codecov.io/gh/single-cell-data/TileDB-SOMA/pull/2895/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=single-cell-data) | `89.98% <ø> (+0.15%)` | :arrow_up: | Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=single-cell-data#carryforward-flags-in-the-pull-request-comment) to find out more. | [Components](https://app.codecov.io/gh/single-cell-data/TileDB-SOMA/pull/2895/components?src=pr&el=components&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=single-cell-data) | Coverage Δ | | |---|---|---| | [python_api](https://app.codecov.io/gh/single-cell-data/TileDB-SOMA/pull/2895/components?src=pr&el=component&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=single-cell-data) | `89.98% <ø> (+0.15%)` | :arrow_up: | | [libtiledbsoma](https://app.codecov.io/gh/single-cell-data/TileDB-SOMA/pull/2895/components?src=pr&el=component&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=single-cell-data) | `∅ <ø> (∅)` | |
eddelbuettel commented 2 months ago

Good. That should also help for R where we documented this first,

johnkerl commented 2 months ago

There is something going on, since

["a","b"]  -- indices:
[0,1,0,0,0]]

should have been

["a","b"]  -- indices:
[0,1,0,1,1]]

since we wrote 'b', 'b' at slots 3,4

    pydict = {}
    pydict["soma_joinid"] = [3, 4]
    pydict["foo"] = pd.Series(['b', 'b'], dtype='category')
    rb = pa.Table.from_pydict(pydict)
johnkerl commented 2 months ago

I've verified that ☝️ is a separate bug. Running this on a built checkout without this PR:

ubuntu@segge[prod][kerl/h2h][TileDB-SOMA]$ cat ts.py
import tiledb
import tiledbsoma as soma
import pyarrow as pa
import tempfile
import numpy as np
import pandas as pd
import time

with tempfile.TemporaryDirectory() as uri:

    asch = pa.schema([
        ("foo", pa.dictionary(pa.int8(), pa.large_string()))
    ])

    # Create at time 1
    soma.DataFrame.create(uri, schema=asch, tiledb_timestamp=1).close()

    # Write at time 2
    atbl = pa.Table.from_pydict({
        "soma_joinid": [0, 1, 2],
        "foo": pd.Series(['a', 'b', 'a'], dtype='category'),
    })
    with soma.DataFrame.open(uri, "w", tiledb_timestamp=2) as sdf:
        sdf.write(atbl)

    # Write at time 3
    atbl = pa.Table.from_pydict({
        "soma_joinid": [3, 4],
        #"foo": pd.Series(['b', 'c'], dtype='category'),
        "foo": pd.Series(['b', 'b'], dtype='category'),
    })
    with soma.DataFrame.open(uri, "w", tiledb_timestamp=3) as sdf:
        sdf.write(atbl)

    with soma.DataFrame.open(uri) as sdf:
        table = sdf.read().concat()
        print()
        print("TABLE AT NOW")
        print(table)
        print()
        print(table.to_pandas())
        print()

I get

TABLE AT NOW
pyarrow.Table
soma_joinid: int64
foo: dictionary<values=string, indices=int8, ordered=0>
----
soma_joinid: [[0,1,2,3,4]]
foo: [  -- dictionary:
["a","b"]  -- indices:
[0,1,0,0,0]]

   soma_joinid foo
0            0   a
1            1   b
2            2   a
3            3   a <------ note
4            4   a <------ note
johnkerl commented 2 months ago

Good. That should also help for R where we documented this first,

@eddelbuettel great! If you have a ilnk for that "where" handy, please share it here -- that will be helpful.

johnkerl commented 2 months ago

Given #2896 and the proof above that the issue I identified is orthogonal to this PR (and pre-dates it), I think we should move forward with this in order to facilitate ongoing R-side development. And #2896 remains open.

I'll add a Python unit-test case as noted above.

johnkerl commented 2 months ago

Good. That should also help for R where we documented this first,

@eddelbuettel great! If you have a ilnk for that "where" handy, please share it here -- that will be helpful.

Never mind -- I found #2879

eddelbuettel commented 2 months ago

@johnkerl Here is my actual test script from last week (which is not yet committed to the 'scraps' repo I use for these). With the SOMA version I have here (a few days old) it just comes up empty (no 'strings' for the enumeration). Worth testing on your patch.

```r !#!/usr/bin/env Rscript uri <- "/tmp/tiledb/somadf-evolve-ts" if (dir.exists(uri)) unlink(uri, recursive=TRUE) schema <- arrow::schema(arrow::field("soma_joinid", arrow::int64(), nullable = FALSE), arrow::field("int", arrow::int32(), nullable = FALSE), arrow::field("str", arrow::dictionary(index_type = arrow::int8(), value_type = arrow::utf8(), ordered = FALSE), nullable=FALSE)) print(schema) ts1 <- rep(as.POSIXct(1, tz="UTC"), 2) sdf <- tiledbsoma::SOMADataFrameCreate(uri, schema, tiledb_timestamp=ts1[1]) data <- arrow::arrow_table(soma_joinid = bit64::as.integer64(1L:5L), int = 101:105L, str = factor(c('a','b','b','a','b'))) cat("\n") print(tibble::as_tibble(data)) ts2 <- rep(as.POSIXct(2, tz="UTC"), 2) sdf$write(data, ts2) data <- arrow::arrow_table(soma_joinid = bit64::as.integer64(6L:10L), int = 106:110L, str = factor(c('c','b','c','c','b'))) cat("\n") print(tibble::as_tibble(data)) ts3 <- rep(as.POSIXct(3, tz="UTC"), 2) sdf$write(data, ts3) sdf$close() cat("\n") tiledb::tiledb_array(uri, return_as="data.table")[] sdf <- tiledbsoma::SOMADataFrameOpen(uri, tiledb_timestamp=ts3) res <- sdf$read()$concat() print(res) print(res$str) #tibble::as_tibble(res) ```
johnkerl commented 2 months ago

Worth testing on your patch.

@eddelbuettel before & after this PR I get the following when I run that:

Schema
soma_joinid: int64 not null
int: int32 not null
str: dictionary<values=string, indices=int8> not null
Error in self$open("WRITE", internal_use_only = "allowed_use") :
  tiledb_timestamp not yet supported for WRITE mode
Calls: <Anonymous> -> <Anonymous> -> <Anonymous> -> stopifnot
Execution halted

This PR should be merged to remove a known defect from the C++ implementation and we can continue from there.

eddelbuettel commented 2 months ago

@johnkerl when you said above

[...] when I run that:

where you on main or on my branch adding this feature? For me, on the branch, freshly rebased to the current main, all is now good. I altered the test script to show a tibble as we now get a correct arrow dictionary and all is well:

[ ... ]
# A tibble: 5 × 3
  soma_joinid   int str  
        <int> <int> <fct>
1           6   106 c    
2           7   107 b    
3           8   108 c    
4           9   109 c    
5          10   110 b    
$ 

But that is on the de/sc-52516/timestamprange branch.

johnkerl commented 2 months ago

Good to know -- awesome @eddelbuettel ! :)

eddelbuettel commented 2 months ago

Branch should hopefully be ready "Real Soon Now (TM)" too.