Closed slowkow closed 2 years ago
I'm also confused why the length is 3224 when the true length is 7219:
$ cat myfile.zarr/obs/leiden_labels/.zarray
{
"chunks": [
7219
],
"compressor": {
"blocksize": 0,
"clevel": 5,
"cname": "lz4",
"id": "blosc",
"shuffle": 1
},
"dtype": "|i1",
"fill_value": 0,
"filters": null,
"order": "C",
"shape": [
7219
],
"zarr_format": 2
}
Any hints or tips would be very appreciated! Thank you in advance.
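For what it's worth, the dtype field in .zarray says how the decoded bytes of a chunk should be viewed. Here is a minimal sketch, assuming Zarr v2 metadata like the above and a little-endian machine. The names DTYPES and viewChunk are made up for illustration and are not part of zarr.js, and only a few numeric dtypes are covered:

```javascript
// Hypothetical lookup from a Zarr v2 dtype string to a typed array constructor.
// Big-endian (">") and non-numeric dtypes are deliberately left out.
const DTYPES = {
  "|i1": Int8Array, "|u1": Uint8Array,
  "<i2": Int16Array, "<u2": Uint16Array,
  "<i4": Int32Array, "<u4": Uint32Array,
  "<f4": Float32Array, "<f8": Float64Array,
}

// Hypothetical helper: view decoded chunk bytes (a Uint8Array) as values.
function viewChunk(bytes, zarray) {
  const Ctor = DTYPES[zarray.dtype]
  if (!Ctor) throw new Error(`unsupported dtype: ${zarray.dtype}`)
  // Copy first so the data starts at byte offset 0 of its own buffer.
  const copy = bytes.slice()
  return new Ctor(copy.buffer, 0, copy.byteLength / Ctor.BYTES_PER_ELEMENT)
}

// With "|i1" every byte is one signed value, so 7219 bytes give 7219 items.
const example = viewChunk(Uint8Array.of(1, 255, 3), { dtype: "|i1" })
console.log(example.length) // 3
```

With "chunks": [7219] and "shape": [7219] there is a single chunk, which is why the URLs in this thread end in /0. If a length such as 3224 was measured on the raw response before Blosc decoding, it would be the compressed size rather than the element count, but that is a guess since the code that produced it is not shown.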
I defined my own functions:
let bloscCodec = await import('https://cdn.skypack.dev/numcodecs/blosc')
bloscCodec = new bloscCodec.default()
async function fetchColumn(url) {
  let response = await fetch(url)
  let arrayBuffer = await response.arrayBuffer()
  let compressedBytes = new Uint8Array(arrayBuffer)
  let bytes = await bloscCodec.decode(compressedBytes)
  return bytes.buffer
}
The function almost works as expected, but there are some strange surprises.
First, let's retrieve the sample_id column from the obs dataframe:
const zarr_url = "path/to/myfile.zarr"
let sample_id = new Uint8Array(await fetchColumn(`${zarr_url}/obs/sample_id/0`))
console.log(sample_id)
// Uint8Array(14438) [116, 1, 116, 1, 116, 1, ...
The sample_id array should have length 7219, not 14438 (two times longer).
It appears that every other item, starting with the first, has the data I want (e.g. [116, 116, 116, ...]). I don't know what the items in between are (e.g. [1, 1, 1, ...]), and it seems I need to discard them.
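A doubling like this is what we would see if the column is stored with a 2-byte dtype and the bytes are viewed one at a time. Here is a sketch, assuming obs/sample_id/.zarray declares "dtype": "<i2" (that file is not shown in this thread, so it is worth checking first). The name decodeInt16LE is made up for illustration:

```javascript
// Read pairs of bytes as little-endian signed 16-bit integers.
function decodeInt16LE(bytes) {
  if (bytes.byteLength % 2 !== 0) {
    throw new Error("byte length must be a multiple of 2")
  }
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength)
  const out = new Int16Array(bytes.byteLength / 2)
  for (let i = 0; i < out.length; i += 1) {
    out[i] = view.getInt16(i * 2, true) // true = little-endian
  }
  return out
}

// The first four bytes printed above, read as two values.
const codes = decodeInt16LE(Uint8Array.of(116, 1, 116, 1))
console.log(codes) // two values, each 116 + 1 * 256 = 372
```

Under that assumption the 1s are the high bytes of each value, 14438 bytes correspond to exactly 7219 values, and nothing needs to be discarded.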
OK, now let's go ahead and convert those numbers to strings by fetching the __categories for sample_id:
let sample_id_cats = new TextDecoder("utf-8").decode(
  await fetchColumn(`${zarr_url}/obs/__categories/sample_id/0`)
).split("\t")
console.log(sample_id_cats)
// (748) ['�\x02\x00\x00', '\x00\x00\x00sample1', '\x00\x00\x00sample2', ...
The sample_id_cats array should have length 751, not 748.
The correct values appear in the array (e.g. sample1), but they are surrounded by other characters I do not recognize.
Where are the 3 missing items? Did I lose them by using the wrong encoding? I don't understand what is wrong here.
I'd appreciate any tips or help!
I learned that \x02 and \x00 are control characters:
Control Characters
CTRL (^D means to hold the CTRL key and hit d)
Oct Dec Char Hex Key Comments
\000 0 NUL \x00 ^@ \0 (Null byte)
\001 1 SOH \x01 ^A (Start of heading)
\002 2 STX \x02 ^B (Start of text)
\003 3 ETX \x03 ^C (End of text)
I suppose my text was encoded and decoded correctly, considering the presence of the \x02 (start of text) control character? I'm not confident about this.
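Another way to read those leading bytes is as binary integers rather than text markers. This assumes the chunk uses the numcodecs VLenUTF8 layout: a 4-byte little-endian item count, then each string as a 4-byte little-endian byte length followed by its UTF-8 bytes. The byte values below are inferred from the printed output, not taken from the actual file:

```javascript
// 751 is 0x000002EF, so its little-endian bytes are EF 02 00 00.
const header = Uint8Array.of(0xef, 0x02, 0x00, 0x00)
const count = new DataView(header.buffer).getUint32(0, true)
console.log(count) // 751

// Decoded as UTF-8 text, the lone 0xEF is an invalid sequence and becomes
// U+FFFD, followed by the remaining three bytes as control characters.
const asText = new TextDecoder("utf-8").decode(header)
console.log(asText === "\uFFFD\x02\x00\x00") // true

// A string that is 9 bytes long gets the length prefix 09 00 00 00,
// and 0x09 is the tab character that split("\t") breaks on.
const prefix = new TextDecoder("utf-8").decode(Uint8Array.of(9, 0, 0, 0))
const pieces = prefix.split("\t")
console.log(pieces.length) // 2
```

If that reading is right, the number of pieces produced by split("\t") depends on how many length prefixes happen to contain a 0x09 byte, not on the number of items, which would explain a count that is close to 751 but not equal to it.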
The string decoding issue can be solved by using readFloat32FromUint8() and parseVlenUtf8() from vitessce:
(I learned about these functions from this issue: https://github.com/manzt/numcodecs.js/issues/28)
const zarr_url = "data/pseudobulk/acute_bcell_pb_log1p_cpm.zarr"
import Blosc from 'https://cdn.skypack.dev/numcodecs/blosc'
const codec = new Blosc()
async function fetchColumn(url) {
  let response = await fetch(url)
  let arrayBuffer = await response.arrayBuffer()
  let compressedBytes = new Uint8Array(arrayBuffer)
  let bytes = await codec.decode(compressedBytes)
  return bytes
}
// Despite its name, this reads a 32-bit integer from 4 bytes
// (in the platform's byte order, which is little-endian on typical hardware).
const readFloat32FromUint8 = (bytes) => {
  if (bytes.length !== 4) {
    throw new Error('readFloat32 only takes in length 4 byte buffers')
  }
  return new Int32Array(bytes.buffer)[0]
}
/**
 * Method for decoding text arrays from zarr.
 * Largely a port of https://github.com/zarr-developers/numcodecs/blob/2c1aff98e965c3c4747d9881d8b8d4aad91adb3a/numcodecs/vlen.pyx#L135-L178
 * @returns {string[]} An array of strings.
 */
function parseVlenUtf8(buffer) {
  const HEADER_LENGTH = 4
  const decoder = new TextDecoder()
  let data = 0
  const dataEnd = data + buffer.length
  // Check the header is present before trying to read it.
  if (buffer.length < HEADER_LENGTH) {
    throw new Error('corrupt buffer, missing or truncated header')
  }
  // The header holds the number of strings in the chunk.
  const length = readFloat32FromUint8(buffer.slice(data, HEADER_LENGTH))
  data += HEADER_LENGTH
  const output = new Array(length)
  for (let i = 0; i < length; i += 1) {
    if (data + 4 > dataEnd) {
      throw new Error('corrupt buffer, data seem truncated')
    }
    // Each string is prefixed with its length in bytes.
    const l = readFloat32FromUint8(buffer.slice(data, data + 4))
    data += 4
    if (data + l > dataEnd) {
      throw new Error('corrupt buffer, data seem truncated')
    }
    output[i] = decoder.decode(buffer.slice(data, data + l))
    data += l
  }
  return output
}
window.xCats = parseVlenUtf8(
  await fetchColumn(`${zarr_url}/obs/__categories/sample_id/0`)
)
console.log(xCats)
// (751) ['sample1', 'sample2', 'sample1', 'sample1', ...
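To finish the manual route, the integer codes from obs/sample_id index into this category list. Here is a sketch with made-up inputs. The name decodeCategorical is illustrative, and treating negative codes as missing values is an assumption based on the pandas convention of using -1 for a missing categorical value:

```javascript
// Hypothetical helper: turn integer codes into their category strings.
function decodeCategorical(codes, categories) {
  return Array.from(codes, (code) => {
    if (code < 0) return null // assumed to mean "missing"
    if (code >= categories.length) {
      throw new RangeError(`code ${code} has no matching category`)
    }
    return categories[code]
  })
}

const labels = decodeCategorical(
  Int16Array.of(0, 1, 0, -1),
  ["sample1", "sample2"]
)
console.log(labels) // ['sample1', 'sample2', 'sample1', null]
```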
When I opened this issue, I had assumed that zarr.js contains classes like AnnDataSource. Sorry for my confusion! Now I understand that I was completely mistaken.
We can close this issue. For me, the code using AnnDataSource.js below works well:
import AnnDataSource from './AnnDataSource.js'
const zarr_path = "path/to/myfile.zarr"
const dataSource = new AnnDataSource({
  url: `http://127.0.0.1:8000/${zarr_path}`
})
let sample_id = await dataSource.loadObsColumns(['obs/sample_id'])
console.log(sample_id)
// (7219) ['sample1', 'sample2', 'sample1', 'sample1', ...
Question
How do we read a dataframe?
Example
Suppose we have a structure like this:
Here is obs/.zattrs:
Suppose we would like to fetch the leiden_labels column. Here's my code:
I don't know what to do with these ArrayBuffers.
How do we get a simple array with the values?
Next, I tried this:
This is closer to what I want! But I still don't have an array with values...
I started to read the source code for zarr.js but I am quickly getting lost in the layers of abstraction. I can't find any tests or other code snippets that might help to figure out how to read a dataframe.