man-group / ArcticDB

ArcticDB is a high performance, serverless DataFrame database built for the Python Data Science ecosystem.
http://arcticdb.io

Updating a subset of columns of a symbol in a dynamic schema library causes data not appearing in the update to be deleted #1946

Open rorymcstay opened 1 month ago

rorymcstay commented 1 month ago

Describe the bug

I am wondering whether the following is the intended behaviour. To me, the behaviour described under Expected Results would be quite a nice feature to have. Otherwise, it would be useful to state the current behaviour explicitly in the update method documentation.

In a library with a dynamic schema:

  1. Add some data for all columns to the symbol
  2. Update the data for one column only, for a subset of the index
  3. Within the updated index range, data for columns that did not appear in the update no longer exists in the latest version

Steps/Code to Reproduce

import pandas as pd
from arcticdb import Arctic, LibraryOptions
mem = Arctic("mem://")

df = pd.DataFrame(1.1, columns=["ABCD", "EFGH"], index=pd.date_range("2024-01-01", "2024-10-24"))

lib = mem.get_library("prices", library_options=LibraryOptions(dynamic_schema=True), create_if_missing=True)

lib.update("mid.close", upsert=True, data=df)
print(lib.read("mid.close").data.tail(10))
#             ABCD  EFGH
# 2024-10-15   1.1   1.1
# 2024-10-16   1.1   1.1
# 2024-10-17   1.1   1.1
# 2024-10-18   1.1   1.1
# 2024-10-19   1.1   1.1
# 2024-10-20   1.1   1.1
# 2024-10-21   1.1   1.1
# 2024-10-22   1.1   1.1
# 2024-10-23   1.1   1.1
# 2024-10-24   1.1   1.1

subset = lib.read("mid.close").data.tail(10)[["ABCD"]]
print(subset)
#             ABCD
# 2024-10-15   1.1
# 2024-10-16   1.1
# 2024-10-17   1.1
# 2024-10-18   1.1
# 2024-10-19   1.1
# 2024-10-20   1.1
# 2024-10-21   1.1
# 2024-10-22   1.1
# 2024-10-23   1.1
# 2024-10-24   1.1
lib.update("mid.close", upsert=True, data=subset)
print(lib.read("mid.close").data.tail(15))
#             ABCD  EFGH
# 2024-10-10   1.1   1.1
# 2024-10-11   1.1   1.1
# 2024-10-12   1.1   1.1
# 2024-10-13   1.1   1.1
# 2024-10-14   1.1   1.1
# 2024-10-15   1.1   NaN
# 2024-10-16   1.1   NaN
# 2024-10-17   1.1   NaN
# 2024-10-18   1.1   NaN
# 2024-10-19   1.1   NaN
# 2024-10-20   1.1   NaN
# 2024-10-21   1.1   NaN
# 2024-10-22   1.1   NaN
# 2024-10-23   1.1   NaN
# 2024-10-24   1.1   NaN

Expected Results

Existing column data not appearing in update is unmodified.

            ABCD  EFGH
2024-10-15   1.1   1.1
2024-10-16   1.1   1.1
2024-10-17   1.1   1.1
2024-10-18   1.1   1.1
2024-10-19   1.1   1.1
2024-10-20   1.1   1.1
2024-10-21   1.1   1.1
2024-10-22   1.1   1.1
2024-10-23   1.1   1.1
2024-10-24   1.1   1.1

OS, Python Version and ArcticDB Version

Python: 3.11.9 (main, Apr 6 2024, 17:59:24) [GCC 9.4.0]
OS: Linux-5.15.0-124-generic-x86_64-with-glibc2.31
ArcticDB: 4.5.0

Backend storage used

AWS S3, MEM, LMDB

Additional Context

No response

vasil-pashov commented 3 weeks ago

Hi @rorymcstay, what you are observing is the expected behavior. I'll update the docs (#1972) and leave this as a feature request in the backlog. Note that this is a non-trivial feature and is of low priority. It's unlikely to appear on the roadmap in the near future.
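
Since update replaces whole rows in the covered index range, one possible client-side workaround is to read the stored rows first, overlay the changed columns in pandas, and write the full-width frame back. The sketch below is only an illustration under assumptions: `fill_missing_columns` and `update_columns` are hypothetical helper names, not part of ArcticDB, and the read-modify-write sequence is not atomic, so it may be unsuitable with concurrent writers.

```python
import pandas as pd


def fill_missing_columns(existing: pd.DataFrame, partial: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: widen `partial` to the stored columns.

    The result has exactly the index of `partial`. Columns present in
    `partial` take its values; all other columns keep the stored values
    (NaN for index entries that were not stored before).
    """
    # Align the stored rows to the rows being updated.
    merged = existing.reindex(partial.index)
    # Overwrite (or add) only the columns supplied by the caller.
    for col in partial.columns:
        merged[col] = partial[col]
    return merged


def update_columns(lib, symbol: str, partial: pd.DataFrame) -> None:
    """Hypothetical helper: read, merge, then update an ArcticDB symbol.

    Assumes `lib` is an ArcticDB Library, the symbol already exists, and
    `partial` has a sorted, non-empty DatetimeIndex.
    """
    # Read only the stored rows covered by the update.
    existing = lib.read(symbol, date_range=(partial.index[0], partial.index[-1])).data
    # Write back full-width rows so the other columns are preserved.
    lib.update(symbol, data=fill_missing_columns(existing, partial))
```

With the reproduction above, calling `update_columns(lib, "mid.close", subset)` instead of `lib.update("mid.close", upsert=True, data=subset)` would be expected to leave EFGH at 1.1 for the updated dates, at the cost of an extra read.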