single-cell-data / TileDB-SOMA

Python and R SOMA APIs using TileDB’s cloud-native format. Ideal for single-cell data at any scale.
https://tiledbsoma.readthedocs.io
MIT License
93 stars 25 forks source link

[r] [Bug] Writing metadata of type `bit64::integer64()` yields in corrupted data #1512

Closed mojaveazure closed 1 year ago

mojaveazure commented 1 year ago

Describe the bug Writing array-level metadata where the type is bit64::integer64() yeilds corrupted data. This doesn't happen with regular integers or doubles

To Reproduce

withr::with_tempdir({
  # Create array
  array <- SOMASparseNDArrayCreate("array", type = arrow::int64(), shape = c(80, 230))
  # Write metadata
  array$set_metadata(list(
    integers = c(10L, 20L),
    doubles = c(10.0, 20.0),
    int64 = bit64::as.integer64(c(10L, 20L))
  ))
  # Close to finalize writes
  array$close()
  # Re-open and read in metadata
  array <- SOMASparseNDArrayOpen(array$uri)
  print(array$get_metadata("integers"))
  print(array$get_metadata("doubles"))
  print(array$get_metadata("int64"))
})
#> [1] 10 20
#> [1] 10 20
#> [1] 4.940656e-323 9.881313e-323

Versions (please complete the following information):

> sessionInfo(c('tiledbsoma', 'tiledb', 'bit64'))$otherPkgs
$tiledbsoma
Package: tiledbsoma
Type: Package
Title: TileDB SOMA
Description: Interface for working with
     'TileDB'-based Stack of Matrices,
     Annotated ('SOMA'): an open data model
     for representing annotated matrices, like
     those commonly used for single cell data
     analysis.
Version: 0.0.0.9030
Authors@R: c( person(given = "Aaron", family =
     "Wolen", role = c("cre", "aut"), email =
     "aaron@tiledb.com", comment = c(ORCID =
     "0000-0003-2542-2202")), person(given =
     "Dirk", family = "Eddelbuettel", email =
     "dirk@tiledb.com", role = "aut", comment
     = c(ORCID = "0000-0001-6419-907X")),
     person(given = "Paul", family =
     "Hoffman", email =
     "paul.hoffman@tiledb.com", role = "aut",
     comment = c(ORCID =
     "0000-0002-7693-8957")), person(given =
     "John", family = "Kerl", email =
     "john.kerl@tiledb.com", role = "aut"),
     person(given = "TileDB, Inc.", role =
     c("cph", "fnd")) )
URL:
     https://github.com/single-cell-data/TileDB-SOMA
BugReports:
     https://github.com/single-cell-data/TileDB-SOMA/issues
License: MIT + file LICENSE
Encoding: UTF-8
Depends: R (>= 4.0.0)
Imports: R6, methods, Matrix, stats, bit64,
     tiledb (>= 0.19.1), arrow, utils, fs,
     glue, urltools, Rcpp, data.table, spdl,
     rlang, tools
LinkingTo: Rcpp, RcppSpdlog
Additional_repositories:
     https://ghrr.github.io/drat
Roxygen: list(markdown = TRUE)
RoxygenNote: 7.2.3
Suggests: rmarkdown, knitr, withr, testthat
     (>= 3.0.0), pbmc3k.tiledb, SeuratObject
     (>= 4.1.0), datasets
VignetteBuilder: knitr
Config/testthat/edition: 2
OS_type: unix
SystemRequirements: cmake, git

-- File: /home/paul/software/TileDB-SOMA/apis/r/DESCRIPTION 

$tiledb
Package: tiledb
Type: Package
Version: 0.19.1
Title: Universal Storage Engine for Sparse and
     Dense Multidimensional Arrays
Authors@R: c(person("TileDB, Inc.", role =
     c("aut", "cph")), person("Dirk",
     "Eddelbuettel", email =
     "dirk@tiledb.com", role = "cre"))
Description: The universal storage engine
     'TileDB' introduces a powerful on-disk
     format for multi-dimensional arrays. It
     supports dense and sparse arrays,
     dataframes and key-values stores, cloud
     storage ('S3', 'GCS', 'Azure'), chunked
     arrays, multiple compression, encryption
     and checksum filters, uses a fully
     multi-threaded implementation, supports
     parallel I/O, data versioning ('time
     travel'), metadata and groups. It is
     implemented as an embeddable
     cross-platform C++ library with APIs from
     several languages, and integrations.
Copyright: TileDB, Inc.
License: MIT + file LICENSE
URL: https://github.com/TileDB-Inc/TileDB-R
BugReports:
     https://github.com/TileDB-Inc/TileDB-R/issues
SystemRequirements: A C++17 compiler is
     required, and on macOS compilation for
     version 11.14 is required. Optionally
     cmake (only when TileDB source build
     selected), git (only when TileDB source
     build selected); on x86_64 and M1
     platforms pre-built TileDB Embedded
     libraries are available at GitHub and are
     used if no TileDB installation is
     detected, and no other option to build or
     download was specified by the user.
Imports: methods, Rcpp (>= 1.0.8), nanotime,
     spdl
LinkingTo: Rcpp
Suggests: tinytest, simplermarkdown, curl,
     bit64, Matrix, palmerpenguins,
     nycflights13, data.table, tibble
VignetteBuilder: simplermarkdown
RoxygenNote: 7.2.3
Encoding: UTF-8
NeedsCompilation: yes
Packaged: 2023-04-27 00:09:50 UTC; edd
Author: TileDB, Inc. [aut, cph], Dirk
     Eddelbuettel [cre]
Maintainer: Dirk Eddelbuettel
     <dirk@tiledb.com>
Repository: CRAN
Date/Publication: 2023-04-27 02:20:02 UTC
Built: R 4.2.3; x86_64-pc-linux-gnu;
     2023-05-03 22:02:35 UTC; unix

-- File: /home/paul/R/x86_64-pc-linux-gnu-library/4.2/tiledb/Meta/package.rds 

$bit64
Package: bit64
Type: Package
Title: A S3 Class for Vectors of 64bit
     Integers
Version: 4.0.5
Date: 2020-08-29
Author: Jens Oehlschlägel [aut, cre], Leonardo
     Silvestri [ctb]
Maintainer: Jens Oehlschlägel
     <Jens.Oehlschlaegel@truecluster.com>
Depends: R (>= 3.0.1), bit (>= 4.0.0), utils,
     methods, stats
Description: Package 'bit64' provides
     serializable S3 atomic 64bit (signed)
     integers.  These are useful for handling
     database keys and exact counting in
     +-2^63.  WARNING: do not use them as
     replacement for 32bit integers, integer64
     are not supported for subscripting by
     R-core and they have different semantics
     when combined with double, e.g. integer64
     + double => integer64.  Class integer64
     can be used in vectors, matrices, arrays
     and data.frames.  Methods are available
     for coercion from and to logicals,
     integers, doubles, characters and factors
     as well as many elementwise and summary
     functions.  Many fast algorithmic
     operations such as 'match' and 'order'
     support inter- active data exploration
     and manipulation and optionally leverage
     caching.
License: GPL-2 | GPL-3
LazyLoad: yes
ByteCompile: yes
URL: https://github.com/truecluster/bit64
Encoding: UTF-8
Repository: CRAN
Repository/R-Forge/Project: ff
Repository/R-Forge/Revision: 177
Repository/R-Forge/DateTimeStamp: 2018-08-17
     17:45:18
Date/Publication: 2020-08-30 07:20:02 UTC
NeedsCompilation: yes
Packaged: 2020-08-29 10:56:45 UTC; jo
Built: R 4.2.2; x86_64-pc-linux-gnu;
     2023-01-09 22:25:16 UTC; unix

-- File: /home/paul/R/x86_64-pc-linux-gnu-library/4.2/bit64/Meta/package.rds 

Additional context Note, this does not happen in Python

Python example ```python import os import tempfile import numpy import pyarrow import tiledbsoma with tempfile.TemporaryDirectory() as tmpdir: # Create array array = tiledbsoma.SparseNDArray.create( os.path.join(tmpdir, "array"), type=pyarrow.int64(), shape=[80, 230] ) # Write metadata array.metadata["integers"] = numpy.array([10, 20], dtype=numpy.int32) array.metadata["doubles"] = numpy.array([10.0, 20.0]) array.metadata["int64"] = numpy.array([10, 20], dtype=numpy.int64) # Close to finalize writes array.close() # Re-open and read in metadata array = tiledbsoma.SparseNDArray.open(array.uri) print(array.metadata["integers"]) print(array.metadata["doubles"]) print(array.metadata["int64"]) ```
eddelbuettel commented 1 year ago

Thanks for reporting this, and I can confirm. It is actually a tiledb-r bug. When we write metadate from R via tiledb_put_metadata() we call the internal function libtiledb_array_put_metadata(). Which receives the array, the key under which we write, and the value. And a (casted) bit64::integer64() becomes a REALSXP as the payload is stored in a double and then a double is returned. So I will have to catch this when we write it and make sure I differentiate between a double itself and a double acting on behalf of an int64_t representation -- and in the latter case write that.

eddelbuettel commented 1 year ago

PR coming in a moment but we have this now:

$ Rscript soma_gh_issue512.R
[1] 10 20
[1] 10 20
[1] 10 20
$ 

with a minimally modified version of your most helpful example.

withcast <- function() {
    setwd("/tmp/tiledb")
    uri <- "soma_issue_1512"
    if (dir.exists(uri)) unlink(uri, recursive=TRUE)

    ## Create array
    array <- SOMASparseNDArrayCreate(uri, type = arrow::int64(), shape = c(80, 230))
    ## Write metadata
    array$set_metadata(list(integers = c(10L, 20L),
                            doubles = c(10.0, 20.0),
                            int64 = bit64::as.integer64(c(10L, 20L))))
    ## Close to finalize writes
    array$close()
    ## Re-open and read in metadata
    array <- SOMASparseNDArrayOpen(array$uri)
    print(array$get_metadata("integers"))
    print(array$get_metadata("doubles"))
    print(array$get_metadata("int64"))
}

library(tiledbsoma)
withcast()
eddelbuettel commented 1 year ago

Addressed by https://github.com/TileDB-Inc/TileDB-R/pull/564 over on the other repo, reviews welcome!

eddelbuettel commented 1 year ago

I think this can be closed, maybe in conjunction with a version depends on tiledb-r. It was never a bug in tiledbsoma as the metadata writes are handled by tiledb-r.