Closed johnkerl closed 2 months ago
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 89.98%. Comparing base (
9f82dad
) to head (d87cca3
). Report is 4 commits behind head on main.
Good. That should also help for R where we documented this first,
There is something going on, since
["a","b"] -- indices:
[0,1,0,0,0]]
should have been
["a","b"] -- indices:
[0,1,0,1,1]]
since we wrote 'b', 'b'
at slots 3,4
pydict = {}
pydict["soma_joinid"] = [3, 4]
pydict["foo"] = pd.Series(['b', 'b'], dtype='category')
rb = pa.Table.from_pydict(pydict)
I've verified that ☝️ is a separate bug. Running this on a built checkout without this PR:
ubuntu@segge[prod][kerl/h2h][TileDB-SOMA]$ cat ts.py
import tiledb
import tiledbsoma as soma
import pyarrow as pa
import tempfile
import numpy as np
import pandas as pd
import time
with tempfile.TemporaryDirectory() as uri:
asch = pa.schema([
("foo", pa.dictionary(pa.int8(), pa.large_string()))
])
# Create at time 1
soma.DataFrame.create(uri, schema=asch, tiledb_timestamp=1).close()
# Write at time 2
atbl = pa.Table.from_pydict({
"soma_joinid": [0, 1, 2],
"foo": pd.Series(['a', 'b', 'a'], dtype='category'),
})
with soma.DataFrame.open(uri, "w", tiledb_timestamp=2) as sdf:
sdf.write(atbl)
# Write at time 3
atbl = pa.Table.from_pydict({
"soma_joinid": [3, 4],
#"foo": pd.Series(['b', 'c'], dtype='category'),
"foo": pd.Series(['b', 'b'], dtype='category'),
})
with soma.DataFrame.open(uri, "w", tiledb_timestamp=3) as sdf:
sdf.write(atbl)
with soma.DataFrame.open(uri) as sdf:
table = sdf.read().concat()
print()
print("TABLE AT NOW")
print(table)
print()
print(table.to_pandas())
print()
I get
TABLE AT NOW
pyarrow.Table
soma_joinid: int64
foo: dictionary<values=string, indices=int8, ordered=0>
----
soma_joinid: [[0,1,2,3,4]]
foo: [ -- dictionary:
["a","b"] -- indices:
[0,1,0,0,0]]
soma_joinid foo
0 0 a
1 1 b
2 2 a
3 3 a <------ note
4 4 a <------ note
Good. That should also help for R where we documented this first,
@eddelbuettel great! If you have a ilnk for that "where" handy, please share it here -- that will be helpful.
Given #2896 and the proof above that the issue I identified is orthogonal to this PR (and pre-dates it), I think we should move forward with this in order to facilitate ongoing R-side development. And #2896 remains open.
I'll add a Python unit-test case as noted above.
Good. That should also help for R where we documented this first,
@eddelbuettel great! If you have a ilnk for that "where" handy, please share it here -- that will be helpful.
Never mind -- I found #2879
@johnkerl Here is my actual test script from last week (which is not yet committed to the 'scraps' repo I use for these). With the SOMA version I have here (a few days old) it just comes up empty (no 'strings' for the enumeration). Worth testing on your patch.
Worth testing on your patch.
@eddelbuettel before & after this PR I get the following when I run that:
Schema
soma_joinid: int64 not null
int: int32 not null
str: dictionary<values=string, indices=int8> not null
Error in self$open("WRITE", internal_use_only = "allowed_use") :
tiledb_timestamp not yet supported for WRITE mode
Calls: <Anonymous> -> <Anonymous> -> <Anonymous> -> stopifnot
Execution halted
This PR should be merged to remove a known defect from the C++ implementation and we can continue from there.
@johnkerl when you said above
[...] when I run that:
where you on main
or on my branch adding this feature? For me, on the branch, freshly rebased to the current main, all is now good. I altered the test script to show a tibble as we now get a correct arrow dictionary and all is well:
[ ... ]
# A tibble: 5 × 3
soma_joinid int str
<int> <int> <fct>
1 6 106 c
2 7 107 b
3 8 108 c
4 9 109 c
5 10 110 b
$
But that is on the de/sc-52516/timestamprange branch.
Good to know -- awesome @eddelbuettel ! :)
Branch should hopefully be ready "Real Soon Now (TM)" too.
Issue and/or context: #2879
Changes: If we've got the array open at a timestamp, and we need to evolve it, evolve from that same timestamp.
Notes for Reviewer:
Material from @nguyenv to be incorporated into unit-test material:
Script:
Before:
After:
Difference: