gzuidhof / zarr.js

Javascript implementation of Zarr
https://guido.io/zarr.js
Apache License 2.0

How to access a column in a dataframe? #128

Closed slowkow closed 2 years ago

slowkow commented 2 years ago

Question

How do we read a dataframe?

Example

Suppose we have a structure like this:

$ tree myfile.zarr/obs/
myfile.zarr/obs/
├── __categories
│   ├── leiden_labels
│   │   └── 0
│   └── sample_id
│       └── 0
├── _index
│   └── 0
├── leiden_labels
│   └── 0
├── n_cells
│   └── 0
└── sample_id
    └── 0

7 directories, 6 files

Here is obs/.zattrs:

{
    "_index": "_index",
    "column-order": [
        "sample_id",
        "leiden_labels",
        "n_cells"
    ],
    "encoding-type": "dataframe",
    "encoding-version": "0.1.0"
}

Suppose we would like to fetch the leiden_labels column.

Here's my code:

const store = new zarr.HTTPStore('http://127.0.0.1:8000/myfile.zarr')
const x = await store.getItem("obs/leiden_labels")
x
ArrayBuffer(530)
byteLength: 530
[[Prototype]]: ArrayBuffer
[[Int8Array]]: Int8Array(530)
[[Uint8Array]]: Uint8Array(530)
[[Int16Array]]: Int16Array(265)
[[ArrayBufferByteLength]]: 530
[[ArrayBufferData]]: 142426

I don't know what to do with these ArrayBuffers.

How do we get a simple array with the values?

Next, I tried this:

const x = await store.getItem("obs/leiden_labels")
x
ArrayBuffer(3224)
byteLength: 3224
[[Prototype]]: ArrayBuffer
[[Int8Array]]: Int8Array(3224)
[[Uint8Array]]: Uint8Array(3224)
[[Int16Array]]: Int16Array(1612)
[[Int32Array]]: Int32Array(806)
[[ArrayBufferByteLength]]: 3224
[[ArrayBufferData]]: 142519

This is closer to what I want! But I still don't have an array with values...

I started to read the source code for zarr.js but I am quickly getting lost in the layers of abstraction. I can't find any tests or other code snippets that might help to figure out how to read a dataframe.

slowkow commented 2 years ago

I'm also confused why the length is 3224 when the true length is 7219:

$ cat myfile.zarr/obs/leiden_labels/.zarray
{
    "chunks": [
        7219
    ],
    "compressor": {
        "blocksize": 0,
        "clevel": 5,
        "cname": "lz4",
        "id": "blosc",
        "shuffle": 1
    },
    "dtype": "|i1",
    "fill_value": 0,
    "filters": null,
    "order": "C",
    "shape": [
        7219
    ],
    "zarr_format": 2
}

Any hints or tips would be very appreciated! Thank you in advance.
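
A note on the size mismatch: the `.zarray` above declares a Blosc compressor, so the 3224 bytes returned by the store are most likely the compressed chunk, while 7219 is the uncompressed length. A Blosc (version 1) frame starts with a 16-byte header that records both sizes. The sketch below reads that header; `readBloscHeader` is a hypothetical helper name, not part of zarr.js or numcodecs, and the header is built synthetically from the numbers in this thread rather than read from the real file.

```javascript
// Sketch: read the 16-byte Blosc (v1) frame header of a compressed chunk.
// `readBloscHeader` is a hypothetical helper, not a zarr.js or numcodecs API.
function readBloscHeader(arrayBuffer) {
  if (arrayBuffer.byteLength < 16) {
    throw new Error('buffer too short to contain a Blosc header')
  }
  const view = new DataView(arrayBuffer)
  return {
    typesize: view.getUint8(3),         // bytes per element
    nbytes: view.getUint32(4, true),    // uncompressed size in bytes
    blocksize: view.getUint32(8, true), // internal block size
    cbytes: view.getUint32(12, true),   // compressed size in bytes
  }
}

// Synthetic header using the sizes from this thread (no payload attached).
const header = new DataView(new ArrayBuffer(16))
header.setUint8(3, 1)            // "|i1" is one byte per element
header.setUint32(4, 7219, true)  // uncompressed length
header.setUint32(12, 3224, true) // compressed length
console.log(readBloscHeader(header.buffer))
```

If `nbytes` in a real chunk matches the `shape` times the dtype size, the chunk only needs to be decompressed before it can be viewed as a typed array.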

slowkow commented 2 years ago

I defined my own functions:

// Load the Blosc codec from numcodecs and create a decoder instance.
const { default: Blosc } = await import('https://cdn.skypack.dev/numcodecs/blosc')
const bloscCodec = new Blosc()

async function fetchColumn(url) {
  let response = await fetch(url)
  let arrayBuffer = await response.arrayBuffer()
  let compressedBytes = new Uint8Array(arrayBuffer)
  let bytes = await bloscCodec.decode(compressedBytes)
  return bytes.buffer
}

The function almost works as expected, but there are some strange surprises.

First, let's retrieve the sample_id column from the obs dataframe:

const zarr_url = "path/to/myfile.zarr"

let sample_id = new Uint8Array(await fetchColumn(`${zarr_url}/obs/sample_id/0`))
console.log(sample_id)
// Uint8Array(14438) [116, 1, 116, 1, 116, 1, ...

The sample_id array should have length 7219, not 14438 (twice as long).

It appears that the items at even indices (0, 2, 4, ...) hold the data I want (e.g. [116, 116, 116, ...]). I don't know what the items at odd indices are (e.g. [1, 1, 1, ...]), and it seems I need to discard them.
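
A note on the doubled length: an alternating pattern like `116, 1, 116, 1` is what a two-byte integer dtype looks like when viewed through a `Uint8Array`. The `.zarray` for `sample_id` is not shown in this thread, so the sketch below assumes it declares dtype `"<i2"` (little-endian int16); `bytesToInt16LE` is a hypothetical helper name. Under that assumption the second byte of each pair is the high byte of the value, not padding to discard.

```javascript
// Sketch: reinterpret decompressed bytes as little-endian int16 values.
// ASSUMPTION: the column's .zarray declares dtype "<i2" (not shown in this thread).
// `bytesToInt16LE` is a hypothetical helper name.
function bytesToInt16LE(bytes) {
  if (bytes.byteLength % 2 !== 0) {
    throw new Error('byte length must be a multiple of 2 for int16 data')
  }
  // DataView makes the byte order explicit instead of relying on the platform.
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength)
  const out = new Int16Array(bytes.byteLength / 2)
  for (let i = 0; i < out.length; i += 1) {
    out[i] = view.getInt16(i * 2, true)
  }
  return out
}

// The first six bytes printed above.
const raw = new Uint8Array([116, 1, 116, 1, 116, 1])
console.log(bytesToInt16LE(raw)) // three values, each 372 (116 + 1 * 256)
```

With two bytes per value, 14438 bytes correspond to 7219 values, which matches the expected column length.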

OK, now let's go ahead and convert those numbers to strings by fetching the __categories for sample_id:

let sample_id_cats = new TextDecoder("utf-8").decode(
  await fetchColumn(`${zarr_url}/obs/__categories/sample_id/0`)
).split("\t")
console.log(sample_id_cats)
// (748) ['�\x02\x00\x00', '\x00\x00\x00sample1', '\x00\x00\x00sample2', ...

The sample_id_cats array should have length 751, not 748.

The correct values appear in the array (e.g. sample1), but they are surrounded by other characters I do not recognize.

Where are the 3 missing items? Did I lose them by using the wrong encoding? I don't understand what is wrong here.

I'd appreciate any tips or help!

slowkow commented 2 years ago

I learned that \x02 and \x00 are control characters:

Control Characters
                    CTRL   (^D means to hold the CTRL key and hit d)
Oct  Dec Char  Hex  Key     Comments
\000   0  NUL  \x00  ^@ \0 (Null byte)
\001   1  SOH  \x01  ^A    (Start of heading)
\002   2  STX  \x02  ^B    (Start of text)
\003   3  ETX  \x03  ^C    (End of text)

I suppose my text was encoded and decoded correctly, considering the presence of the \x02 (start of text) control character? I'm not confident about this.
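
A note on these bytes: they are more likely binary length prefixes than control characters. In the numcodecs VLenUTF8 encoding, the buffer starts with a 4-byte little-endian item count, and each string is preceded by its own 4-byte length. The sketch below reads such a prefix. The first byte of the output above was not valid UTF-8 (it printed as `�`), so its real value is unknown; the sketch assumes it was `0xEF`, which is a guess chosen because it would make the header equal the expected 751 items. `readUint32LE` is a hypothetical helper name.

```javascript
// Sketch: read a 4-byte little-endian unsigned integer from a Uint8Array.
// `readUint32LE` is a hypothetical helper name.
function readUint32LE(bytes, offset = 0) {
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength)
  return view.getUint32(offset, true)
}

// ASSUMPTION: the undecodable first byte was 0xEF; the other three bytes
// (\x02, \x00, \x00) are taken from the console output above.
const headerBytes = new Uint8Array([0xef, 0x02, 0x00, 0x00])
console.log(readUint32LE(headerBytes)) // 751

// A 7-character string such as "sample1" would be preceded by this prefix.
const itemPrefix = new Uint8Array([0x07, 0x00, 0x00, 0x00])
console.log(readUint32LE(itemPrefix)) // 7
```

Splitting the decoded text on a separator character cannot recover the items reliably, because the prefixes are raw integers rather than delimiters.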

slowkow commented 2 years ago

The string decoding issue can be solved by using readFloat32FromUint8() and parseVlenUtf8() from vitessce:

(I learned about these functions from this issue: https://github.com/manzt/numcodecs.js/issues/28)

const zarr_url = "data/pseudobulk/acute_bcell_pb_log1p_cpm.zarr"

import Blosc from 'https://cdn.skypack.dev/numcodecs/blosc'
const codec = new Blosc()

async function fetchColumn(url) {
  let response = await fetch(url)
  let arrayBuffer = await response.arrayBuffer()
  let compressedBytes = new Uint8Array(arrayBuffer)
  let bytes = await codec.decode(compressedBytes)
  return bytes
}

// Despite its name (kept from vitessce), this reads a 32-bit integer, not a float.
const readFloat32FromUint8 = (bytes) => {
  if (bytes.length !== 4) {
    throw new Error('readFloat32 only takes in length 4 byte buffers')
  }
  return new Int32Array(bytes.buffer)[0]
}

/**
 * Method for decoding text arrays from zarr.
 * Largely a port of https://github.com/zarr-developers/numcodecs/blob/2c1aff98e965c3c4747d9881d8b8d4aad91adb3a/numcodecs/vlen.pyx#L135-L178
 * @param {Uint8Array} buffer The decompressed chunk bytes.
 * @returns {string[]} An array of strings.
 */
function parseVlenUtf8(buffer) {
  const HEADER_LENGTH = 4
  const decoder = new TextDecoder()
  let data = 0
  const dataEnd = data + buffer.length
  // Validate the header before reading it.
  if (buffer.length < HEADER_LENGTH) {
    throw new Error('corrupt buffer, missing or truncated header')
  }
  // The header holds the number of strings stored in the buffer.
  const length = readFloat32FromUint8(buffer.slice(data, HEADER_LENGTH))
  data += HEADER_LENGTH
  const output = new Array(length)
  for (let i = 0; i < length; i += 1) {
    if (data + 4 > dataEnd) {
      throw new Error('corrupt buffer, data seem truncated')
    }
    // Each string is preceded by its byte length.
    const l = readFloat32FromUint8(buffer.slice(data, data + 4))
    data += 4
    if (data + l > dataEnd) {
      throw new Error('corrupt buffer, data seem truncated')
    }
    output[i] = decoder.decode(buffer.slice(data, data + l))
    data += l
  }
  return output
}
window.xCats = parseVlenUtf8(
  await fetchColumn(`${zarr_url}/obs/__categories/sample_id/0`)
)
console.log(xCats)
// (751) ['sample1', 'sample2', 'sample1', 'sample1', ...

slowkow commented 2 years ago

When I opened this issue, I had assumed that zarr.js contains classes like AnnDataSource. Sorry for my confusion! Now I understand that I was completely mistaken.

We can close this issue. For me, the code using AnnDataSource.js below works well:

import AnnDataSource from './AnnDataSource.js'

const zarr_path = "path/to/myfile.zarr"
const dataSource = new AnnDataSource({
    url: `http://127.0.0.1:8000/${zarr_path}`
})
let sample_id = await dataSource.loadObsColumns(['obs/sample_id'])
console.log(sample_id)
// (7219) ['sample1', 'sample2', 'sample1', 'sample1', ...
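
For readers who cannot use AnnDataSource.js, a categorical column in this layout can presumably be rebuilt by hand: `obs/sample_id` holds integer codes and `obs/__categories/sample_id` holds the labels those codes index into. The sketch below shows that lookup only. `decodeCategorical` is a hypothetical helper name, the inputs are made-up sample values, and treating negative codes as missing follows the usual pandas convention rather than anything confirmed in this thread.

```javascript
// Sketch: map integer category codes to their string labels.
// `decodeCategorical` is a hypothetical helper, not an AnnDataSource or zarr.js API.
// ASSUMPTION: negative codes (e.g. -1) mark missing values and become null.
function decodeCategorical(codes, categories) {
  return Array.from(codes, (code) => (code < 0 ? null : categories[code]))
}

// Made-up inputs standing in for the decoded codes and categories.
const codes = new Int16Array([0, 1, 0, -1])
const categories = ['sample1', 'sample2']
console.log(decodeCategorical(codes, categories))
// ['sample1', 'sample2', 'sample1', null]
```

In practice `codes` would come from the decompressed column chunk and `categories` from a decoder such as `parseVlenUtf8()` shown earlier in the thread.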